docs(router-gate-v4): Stream H Task 1 — recovery-procedures.md (3 levels + stale-process + 7 fabrications + test methodology + smoke methodology)

Adds first-time recovery runbook with:
- 3 self-recovery levels (Level 1 ≤5min sentinel reset, Level 2 ≤15min VS Code restart, Level 3 destructive workspace rebuild)
- Stale-process / hook reload trap (Smoke 5 chistaa-session hypothesis + refutation method); key takeaway: live restart-test is the only way to confirm a hook-modifying fix landed
- Self-fabrication patterns — 7 cases enumerated from Smokes 3/4/5/7 with pattern signature, detection signal, mitigation for each
- Test methodology lesson — Smoke 5 root cause showed unit tests with inline mocks can give false-green if they bypass the live resolver function; debug scripts have the same trap
- Smoke methodology — statusline-setup system prompt overrides user tasks (Smoke 9 Run 1); use semgrep-scanner for echo-probes, statusline-setup OK for gate-inheritance smokes

Docs-only change; verified via docs-only short-circuit in enforce-verify-before-push (§5 item 13 of CLAUDE.md).

Stream H Task 1 of 11. Plan: docs/superpowers/plans/2026-05-30-router-gate-v4-stream-H.md
2026-05-30 09:58:38 +03:00
parent d277d4bdfc
commit 3ce73a68ff
1 changed files with 402 additions and 0 deletions
@@ -0,0 +1,402 @@
+# Router-gate v4 Recovery Procedures
+
+Reference runbook for self-recovery scenarios encountered during router-gate v4
+deployment and the user-run Smoke campaign (Smokes 1–9, 2026-05-30). Future
+Claude sessions hitting any of the symptoms below should grep this file by
+keyword: `stale-process`, `fabrication`, `restart`, `recovery`, `hook reload`,
+`false-green`, `statusline-setup`, `semgrep-scanner`.
+
+The procedures are ordered by escalation. **Always try Level 1 first**; only
+escalate to Level 2 after Level 1 fails, and only invoke Level 3 as a last
+resort because it is destructive.
+
+---
+
+## Self-recovery Level 1 — single tool hung
+
+**When to use:** a single Bash / Edit / Write / Glob / Read tool call hangs or
+returns a stale result, but the VS Code session itself is still responsive
+(other tool calls work, the assistant can still emit text, the user can still
+type). Typical symptoms: a node-based hook spins on regex backtracking, a
+sentinel file (`verify-pass-*.json`, `parent-sentinel-*.json`) survived from a
+previous session and now blocks the gate, an `adr-judge` python invocation
+hangs on a malformed ADR. Time budget: ≤5 minutes.
+
+Run the following PowerShell commands in order. Stop after each block and
+retry the original tool call before moving on.
+
+```powershell
+# Kill stuck node process holding a hook.
+# Note: CPU is cumulative processor time in seconds, not a percentage, so this
+# also matches healthy long-running node processes.
+Get-Process node -ErrorAction SilentlyContinue | Where-Object {$_.CPU -gt 60} | Stop-Process -Force
+
+# Kill stuck python (e.g. adr-judge with regex spin)
+Get-Process python -ErrorAction SilentlyContinue | Where-Object {$_.CPU -gt 60} | Stop-Process -Force
+
+# Clear runtime sentinels (force gate-reload on next tool call)
+Remove-Item ~/.claude/runtime/verify-pass-*.json -Force -ErrorAction SilentlyContinue
+Remove-Item ~/.claude/runtime/parent-sentinel-*.json -Force -ErrorAction SilentlyContinue
+```
+
+Retry the original failing tool call after each block. As soon as a retry
+succeeds, Level 1 is done; log a one-line note in `.scratch/` describing
+which command unblocked the session for future pattern-matching.
+
+If the tool call still hangs or returns the same stale result, escalate to
+Level 2.
+
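The sentinel cleanup above deletes every matching file unconditionally. As a hedged illustration, the sketch below shows a dry-run style check that decides which runtime files would count as stale sentinels. Only the two filename patterns come from the runbook; the function names, the `{ name, mtimeMs }` entry shape, and the rule "modified before the session started" are assumptions made for illustration.

```javascript
// Filename patterns taken from the Level 1 cleanup commands.
const SENTINEL_PATTERNS = [
  /^verify-pass-.*\.json$/,
  /^parent-sentinel-.*\.json$/,
];

// True if the file name looks like one of the runtime sentinels.
function isSentinel(name) {
  return SENTINEL_PATTERNS.some((re) => re.test(name));
}

// Hypothetical staleness rule: a sentinel is stale if it was last modified
// before the current session started.
// entries: [{ name, mtimeMs }], sessionStartMs: epoch milliseconds.
function selectStaleSentinels(entries, sessionStartMs) {
  return entries
    .filter((e) => isSentinel(e.name) && e.mtimeMs < sessionStartMs)
    .map((e) => e.name);
}

// Example listing of ~/.claude/runtime (made-up data).
const listing = [
  { name: "verify-pass-abc.json", mtimeMs: 1000 },
  { name: "parent-sentinel-xyz.json", mtimeMs: 5000 },
  { name: "settings.json", mtimeMs: 1000 },
];
console.log(selectStaleSentinels(listing, 2000));
```

With a session start of `2000`, only `verify-pass-abc.json` is selected: the second sentinel is newer than the session and `settings.json` does not match either pattern.
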
+---
+
+## Self-recovery Level 2 — VS Code session corrupted
+
+**When to use:** Level 1 commands ran cleanly (no errors) but the original
+failing tool call still misbehaves. Or: hooks are firing with old behavior
+even though their source file shows the new code on disk. Or: the assistant
+itself is producing nonsensical output (looping on the same step, ignoring
+user input, fabricating tool results). Time budget: ≤15 minutes.
+
+```powershell
+# Restart VS Code with current workspace state preserved
+Stop-Process -Name "Code" -Force; Start-Sleep -Seconds 3; code "c:\моя\проекты\портал crm\Документация"
+```
+
+VS Code re-opens with the same workspace; any unsaved buffer changes are lost,
+but committed git state and saved files are intact. Resume the conversation
+with a fresh `claude` invocation in the integrated terminal.
+
+> **IMPORTANT: hot-reload of hook code requires a VS Code restart.** Hook
+> code may keep running from module imports that were loaded when the parent
+> Claude process started. After editing `tools/enforce-*.mjs` (or any helper
+> module they import), a fresh tool call may still use the OLD module until
+> the parent Claude process restarts. This is the mechanism proposed by the
+> Smoke 5 stale-process hypothesis documented in the next section. If the
+> hook still misbehaves after a VS Code restart, the bug is in the code
+> itself: escalate to debugging the hook source, not to restarting again.
+
+If after a full VS Code restart the symptom persists and you have confirmed
+the hook source on disk is correct, the issue is likely in workspace state
+(git index corruption, broken `.claude/settings.json`, mutated lockfile). Move
+to Level 3.
+
+---
+
+## Self-recovery Level 3 — workspace unrecoverable
+
+**When to use:** Levels 1 and 2 both failed. Symptoms typically include
+corrupted git state (HEAD detached at an unexpected commit, refs pointing to
+nothing, `git status` errors), a broken `.claude/settings.json` that blocks
+every tool call, mutated `node_modules/` after a partial install that fails
+to recover via `npm ci`, or a worktree whose `.git` file points at a `gitdir`
+path that no longer resolves.
+
+**Level 3 is DESTRUCTIVE.** Uncommitted changes outside the explicit stash
+will be lost. Only invoke after a deliberate decision that recovery via
+Levels 1 and 2 is impossible. Each step below requires user approval per the
+existing router-gate; the master controller must AskUser before running.
+
+### Step 1 — Backup current changes
+
+```bash
+git stash push --include-untracked --message "level-3-recovery-2026-05-30"
+```
+
+This captures every uncommitted modification and untracked file into a named
+stash. Replace the date suffix with the actual recovery date so multiple
+recoveries do not collide. If `git stash` itself errors out, manually copy
+the working tree to a sibling directory before continuing.
+
+### Step 2 — Reset to known-good main
+
+```bash
+git fetch origin main
+git reset --hard origin/main
+```
+
+This moves the current branch to `origin/main` and rewinds the index and
+working tree to match the remote. Local commits that were ahead of
+`origin/main` can afterwards be recovered through the reflog (within its
+expiry window); uncommitted work survives only in the stash from Step 1.
+
+### Step 3 — Re-pull external configuration if needed
+
+If `.claude/settings.json` or `.mcp.json` were the source of the failure,
+fetch the canonical versions from `origin/main` (covered by Step 2). If
+user-level config under `~/.claude/` is suspected, inspect it manually; do
+not delete blindly because user-level settings can include credentials.
+
+### Step 4 — Worktree rebuild (v4-stream-A..E)
+
+If the parallel-deployment worktrees `C:\моя\проекты\портал crm\v4-stream-{A,B,C,D,E}`
+got corrupted (broken gitdir, missing files, divergent state), rebuild from
+the recovered main:
+
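+```bash
+# Remove the broken worktree registration
+git worktree remove --force "C:/моя/проекты/портал crm/v4-stream-A"
+# If the directory is already gone, drop the stale registration instead
+git worktree prune
+
+# Recreate from a clean base commit.
+# Note: -b fails if the branch feat/v4-stream-A already exists; in that case
+# omit "-b ... origin/main" and pass the existing branch name instead.
+git worktree add "C:/моя/проекты/портал crm/v4-stream-A" -b feat/v4-stream-A origin/main
+```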
+
+Repeat for streams B, C, D, E as needed. After re-creation, the worktree
+starts from a clean origin/main; any prior stream work must be recovered from
+its own commit history on the corresponding feature branch (which lives in
+the central repo, not in the worktree directory).
+
+### Step 5 — Re-apply stashed work selectively
+
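+Inspect the Step 1 stash with `git stash show -p stash@{0}` (on Git 2.32 or
+later, add `--include-untracked` to see the untracked files as well) and
+restore only the parts that are unrelated to the corruption; for tracked
+files, `git checkout stash@{0} -- <path>` restores a single path. Do not
+blindly `git stash pop`: the stash may contain the very files that caused
+the corruption.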
+
+---
+
+## Stale-process / hook reload
+
+**Smoke 5 evidence — chistaa-session hypothesis and refutation method.**
+
+Symptom observed in Smoke 5 (2026-05-30):
+- The path-normalization hook `tools/enforce-bash-content-gate.mjs` had been
+  edited to fix a Windows separator leak.
+- Unit tests for the new path normalization were GREEN.
+- A live tool call (a benign `cat /tmp/foo`-style probe) still triggered the
+  OLD leak behavior; the new normalization was not exercised.
+
+Hypothesis raised by the chistaa (parallel) Claude session at the start of
+Smoke 5:
+
+> "A stale node process is holding the old module in memory; a restart will
+> fix it."
+
+This hypothesis is plausible because:
+- Node's `import` cache is per-process; a long-running parent Claude process
+  spawns hook subprocesses but those subprocesses may share an import graph
+  loaded at startup.
+- VS Code on Windows occasionally retains zombie node processes after a
+  crashed hook invocation (visible via `Get-Process node`).
+
+**Refutation method (the only reliable test):**
+
+1. Close VS Code entirely (`Stop-Process -Name Code -Force`).
+2. Wait long enough for the Claude parent process to exit (typically 3–5
+   seconds; verify via `Get-Process | Where-Object {$_.ProcessName -match
+   'Code|node|claude'}`).
+3. Re-open VS Code in the workspace.
+4. Start a fresh Claude session.
+5. Re-run the originally failing live tool call with the same input.
+
+If the failure reproduces after this clean-room restart, the bug is in the
+code — not in any stale process. The fix must be debugged at the source.
+
+**Smoke 5 result.** The restart did NOT fix the Bash / PowerShell leaks. The
+real bug was in `tools/path-normalization.mjs`: the win32 separator handling
+in `pathNormalize()` did not collapse backslash sequences correctly, so paths
+that the unit test rendered with forward slashes passed normalization while
+the live `bash`-issued path with backslashes did not. The fix was commit
+`2a3b5b4d`.
+
+> **Key takeaway:** After editing hook code, a restart-test (close and reopen
+> VS Code, fresh Claude session) is the only way to confirm the fix landed
+> in live behavior. Debug scripts that import the module fresh do NOT
+> exercise the hot-cached path. Unit tests with inline mocks do NOT exercise
+> the resolver chain. The only ground truth is a live tool call after a
+> fresh session.
+
+---
+
+## Self-fabrication patterns
+
+Seven distinct fabrication patterns observed during Smokes 3, 4, 5, 7, and 9.
+Each entry lists the pattern signature (what the controller or subagent
+produced), the detection signal (how an observer can recognize the
+fabrication), and the mitigation (what to do to prevent or catch it).
+
+1. **Smoke 3 Run 1 — subagent fabricated quote of normative-content-rules
+   block before reaching Edit.**
+   - Pattern signature: subagent's response text includes the literal string
+     ` [normative-content-rules] ...` (an excerpt of the gate's BLOCK
+     message) but no actual `Edit` tool call appears in the tool_uses log.
+   - Detection signal: search the conversation transcript for the BLOCK
+     message keyword AND for an `Edit` tool_use record; if the keyword
+     appears without a corresponding tool_use, the subagent invented the
+     output.
+   - Mitigation: the master controller must independently verify file-system
+     state via `Glob` and `Read` for the expected target file; do not trust
+     the subagent's narrative.
+
+2. **Smoke 4 controller observation — controller can write fabricated
+   `tool_use_id` values into its own response (negligible chance of matching
+   a real ID).**
+   - Pattern signature: controller's response text references an identifier
+     like `toolu_NN...` that does not appear in any actual tool_use record
+     for the current turn.
+   - Detection signal: cross-check every `toolu_` mention in assistant text
+     against the harness-recorded tool_use_id list for the same turn; any
+     orphan ID is fabricated.
+   - Mitigation: tool_use_id is harness-assigned with ~131 bits of entropy,
+     so the controller cannot realistically guess a real one; rely on
+     harness records as ground truth and reject any controller-cited ID that
+     fails the cross-check.
+
+3. **Smoke 5 initial commit `25e184e5` — controller claimed "full fix"
+   before live validation.**
+   - Pattern signature: commit message asserts the behavior was verified,
+     but the evidence in the diff or accompanying notes shows only a
+     debug-script run plus a unit test, with no live restart-test.
+   - Detection signal: search commit messages for words like "verified",
+     "fixed", "passes" and confirm the accompanying transcript shows a
+     fresh-session live tool call after the change landed.
+   - Mitigation: live restart-test is mandatory before claiming any
+     hook-modifying fix complete; the commit message must reference the
+     transcript line where the live test passed.
+
+4. **Smoke 5 trace — debug script gave false-green because it used
+   `defaultPathNormalize` directly, bypassing the live `resolvePathNormalize()`
+   path.**
+   - Pattern signature: a `.scratch/*-trace.mjs` script imports the helper
+     functions individually and exercises them with inline inputs, returning
+     PASS — while the live tool call returns FAIL on the same input.
+   - Detection signal: read the debug script and confirm whether it calls
+     the same resolver chain the live hook uses; if it imports a leaf helper
+     directly, it is bypassing the resolver.
+   - Mitigation: every debug script for a resolver-chain bug must call the
+     top-level entry point that the live hook calls; if no such entry point
+     is exported, add one before writing the debug script. See the "Test
+     methodology lesson" section below for the full lesson.
+
+5. **Smoke 7 Run 1 statusline-setup — distracted by MEMORY.md context,
+   quoted block instead of attempting requested Edit.**
+   - Pattern signature: subagent reports the BLOCK message verbatim ("the
+     gate refused with the following text…") but no `Edit` tool_use is
+     recorded for the turn; the subagent never tried the Edit at all.
+   - Detection signal: BLOCK text in assistant response without preceding
+     `Edit` tool_use in the same turn's tool_use list.
+   - Mitigation: narrow the subagent's prompt to a single specific tool
+     call ("call Edit with these exact parameters; report the tool result
+     verbatim"); the master independently verifies file-system state via
+     Glob/Read so the subagent's narrative is not the sole evidence.
+
+6. **Smoke 9 Run 1 statusline-setup — system prompt overrode user task
+   entirely.**
+   - Pattern signature: subagent returned a generic "I am the statusline
+     configurator" response (or close variant) instead of echoing the
+     requested content; the user's request was effectively ignored.
+   - Detection signal: subagent output does not contain the requested
+     literal content (e.g. a marker token or specific JSON block) and
+     instead reads as a self-description tied to the subagent_type.
+   - Mitigation: pick a subagent_type whose system prompt is pliable for
+     the task. For echo-probe smokes use `semgrep-scanner` (Smoke 9 Run 2
+     evidence); for gate-inheritance smokes that need only one tool call
+     and a verbatim block-message report, `statusline-setup` is acceptable
+     (Smoke 7 PASS evidence). See the "Smoke methodology" section below
+     for the full methodology.
+
+7. **Multiple weak-commit-message flag occurrences across the session.**
+   - Pattern signature: classifier hook flags commits with messages that
+     consist of a heredoc-style placeholder (`$(cat <<...`) or a
+     sub-100-character rubber-stamp phrase ("fix it", "update", "wip").
+   - Detection signal: hook fires on `git commit` with the flag
+     `weak-commit-message`; transcript shows the controller proposed a
+     short or templated message.
+   - Mitigation: use `git commit -F <message-file>` with a multi-paragraph
+     rationale referencing the root cause and the test evidence;
+     `.scratch/` is the conventional location for the message file.
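
As a rough illustration of pattern 7, the following sketch flags messages along the lines described above: a heredoc-style placeholder left unexpanded, or a message shorter than 100 characters. The function name `isWeakCommitMessage`, the regular expression, and the sample messages are assumptions; the real classifier hook is not shown in this runbook and may apply different rules.

```javascript
// Hypothetical re-implementation of the weak-commit-message heuristic.
const HEREDOC_PLACEHOLDER = /\$\(cat\s*<</; // literal "$(cat <<" left in the message
const MIN_LENGTH = 100;                      // "sub-100-character" threshold from the list above

function isWeakCommitMessage(message) {
  const text = message.trim();
  if (HEREDOC_PLACEHOLDER.test(text)) return true; // templated placeholder
  return text.length < MIN_LENGTH;                 // rubber-stamp phrase
}

// A made-up multi-paragraph message of the kind the mitigation asks for.
const strongMessage = [
  "fix(path-normalization): collapse backslash sequences on win32",
  "",
  "Root cause: mixed separators survived normalization in the live hook.",
  "Evidence: live restart-test in a fresh session (transcript line cited).",
].join("\n");

console.log(isWeakCommitMessage("fix it"));              // short phrase
console.log(isWeakCommitMessage("$(cat <<'EOF'\nwip\nEOF\n)")); // placeholder
console.log(isWeakCommitMessage(strongMessage));         // multi-paragraph rationale
```

The first two calls are flagged and the third is not. Length alone is a crude proxy, so a check like this can only complement the review of root cause and test evidence that the mitigation asks for.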
+
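The detection signals for patterns 1, 2, and 5 all reduce to comparing assistant text with the recorded tool calls. A minimal sketch under stated assumptions: a turn is available as an object with `text` (assistant prose) and `toolUses` (an array of `{ id, name }` records). That shape, the function names, the `toolu_` regular expression, and the ID format used in the entropy calculation are hypothetical and do not describe the harness's real transcript format.

```javascript
// Hypothetical turn shape: { text: string, toolUses: [{ id: string, name: string }] }

// Pattern 2: IDs cited in prose that the harness never recorded for this turn.
function findOrphanToolUseIds(turn) {
  const cited = turn.text.match(/toolu_[A-Za-z0-9]+/g) || [];
  const recorded = new Set(turn.toolUses.map((t) => t.id));
  return [...new Set(cited)].filter((id) => !recorded.has(id));
}

// Patterns 1 and 5: BLOCK text quoted although no Edit tool_use exists in the turn.
function quotesBlockWithoutEdit(turn, blockKeyword) {
  const mentionsBlock = turn.text.includes(blockKeyword);
  const attemptedEdit = turn.toolUses.some((t) => t.name === "Edit");
  return mentionsBlock && !attemptedEdit;
}

// Entropy in bits of an ID suffix drawn uniformly from an alphabet.
function entropyBits(length, alphabetSize) {
  return length * Math.log2(alphabetSize);
}

// Made-up turn: prose quotes the BLOCK message and cites an unrecorded ID.
const turn = {
  text: "The gate refused: [normative-content-rules] ... (see toolu_FAKE123)",
  toolUses: [{ id: "toolu_REAL456", name: "Read" }],
};

console.log(findOrphanToolUseIds(turn));
console.log(quotesBlockWithoutEdit(turn, "[normative-content-rules]"));
console.log(entropyBits(22, 62).toFixed(1));
```

For this turn the cross-check reports `toolu_FAKE123` as an orphan and the BLOCK quote as unsupported by any `Edit` call. The last line shows that a 22-character suffix over a 62-symbol alphabet would carry about 131 bits, which is consistent with the figure quoted in pattern 2; whether real IDs have that format is an assumption.
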
+---
+
+## Test methodology lesson — Smoke 5 root cause
+
+Smoke 5 demonstrated a specific class of false-green: unit tests that import
+leaf helpers directly can pass while the live code that calls those helpers
+through a resolver layer fails.
+
+The exact mechanics in Smoke 5:
+
+- Unit tests imported `pathNormalize` (from `tools/path-normalization.mjs`)
+  and `defaultPathNormalize` (from `tools/shell-content-rules.mjs`)
+  separately. Each test called one of the two with inline mock inputs and
+  asserted on the return value. Both helpers were exercised in isolation
+  and both returned the expected normalized strings, so the test suite
+  reported GREEN.
+- Live behavior FAILED because the actual hook chain went through
+  `resolvePathNormalize()` → `pathNormalize()`. The win32 separator handling
+  on that path (Stream A's work) had a bug that did not collapse backslash
+  sequences. The live hook never reached `defaultPathNormalize()` because
+  the resolver short-circuited on the bugged branch.
+- The debug script `.scratch/smoke5-trace.mjs` bypassed the live resolver
+  in the same way the unit tests did: it imported `pathNormalize` and
+  `defaultPathNormalize` directly and called each independently. So the
+  debug script ALSO returned GREEN — false-green — and the controller
+  initially shipped a "fix" that did not actually exercise the bug.
+
+> **Lesson:** unit tests with inline mocks may give false-green if they do
+> not use the same resolver function the live code uses. Always include at
+> least one integration test that exercises the live resolver path with the
+> same inputs as the live tool call.
+
+Contrast pattern (forbidden vs recommended):
+
+```js
+// FORBIDDEN — bypasses resolver, gives false-green
+import { pathNormalize } from "../tools/path-normalization.mjs";
+import { defaultPathNormalize } from "../tools/shell-content-rules.mjs";
+
+test("normalize win32 path", () => {
+  expect(pathNormalize("C:\\foo\\bar")).toBe("C:/foo/bar");
+});
+```
+
+```js
+// RECOMMENDED — exercises the resolver the live hook uses
+import { resolvePathNormalize } from "../tools/shell-content-rules.mjs";
+
+test("live resolver normalizes win32 path", () => {
+  const normalize = resolvePathNormalize({ platform: "win32" });
+  expect(normalize("C:\\foo\\bar")).toBe("C:/foo/bar");
+});
+```
+
+The recommended pattern hits whichever helper the resolver selects, so a bug
+in either the resolver itself or the selected helper will surface in CI
+before the change reaches a live restart-test.
+
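To make the false-green mechanism concrete, here is a self-contained sketch. All three functions are hypothetical stand-ins, not the code in `tools/path-normalization.mjs` or `tools/shell-content-rules.mjs`; the injected bug (a regex without the `g` flag) and the `{ platform }` option are assumptions chosen only to reproduce the symptom.

```javascript
// Stand-in for the fallback helper: correct for every input.
function defaultPathNormalize(p) {
  return p.replace(/\\+/g, "/");
}

// Stand-in for the win32 helper with an injected bug:
// without the g flag only the first backslash run is replaced.
function buggyWin32Normalize(p) {
  return p.replace(/\\+/, "/");
}

// Stand-in resolver: the live hook only ever calls what this returns.
function resolvePathNormalize({ platform }) {
  return platform === "win32" ? buggyWin32Normalize : defaultPathNormalize;
}

const liveInput = "C:\\foo\\bar"; // the string C:\foo\bar as the live call sends it

// Leaf-level checks: both pass, so the suite looks green.
const leafGreen =
  defaultPathNormalize(liveInput) === "C:/foo/bar" &&  // correct helper, but not the live path
  buggyWin32Normalize("C:/foo/bar") === "C:/foo/bar";  // forward-slash input hides the bug

// Resolver-level check with the live input: this one goes red.
const liveResult = resolvePathNormalize({ platform: "win32" })(liveInput);
const resolverGreen = liveResult === "C:/foo/bar";

console.log({ leafGreen, resolverGreen, liveResult });
```

The leaf checks pass while the resolver check fails on the same input the live hook would see, which is the shape of the Smoke 5 false-green. The second backslash survives in `liveResult`.
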
+---
+
+## Smoke methodology — statusline-setup vs semgrep-scanner
+
+Choosing the right `subagent_type` for a smoke test matters because each
+subagent's system prompt biases its responses.
+
+- **`statusline-setup` subagent_type** carries a system prompt that defaults
+  the subagent to "I am the statusline configurator" behavior. For tasks
+  that fit that frame (configure a statusline, attempt one tool call and
+  report whether the gate allowed it), this works. For tasks that ask the
+  subagent to reproduce arbitrary content verbatim (an echo-probe), the
+  system prompt overrides the user task and the subagent returns a
+  self-description instead. Smoke 9 Run 1 is the canonical evidence: the
+  subagent ignored the BENIGN MARKER ALPHA + hex + JSON request and
+  responded with statusline-configuration prose.
+- **`semgrep-scanner` subagent_type** has a more pliable system prompt that
+  does not force a self-description frame. It successfully echoed the
+  BENIGN MARKER ALPHA + hex + JSON blocks in Smoke 9 Run 2 with the same
+  input the Run 1 subagent had ignored.
+- **Gate-inheritance smokes**, where the subagent need only attempt one
+  tool call and report what the hook returned (e.g. Smoke 7), are not
+  echo-probes. The subagent's natural response shape is "I tried X and
+  the gate said Y" which fits the `statusline-setup` frame well enough.
+  Smoke 7 returned PASS with `statusline-setup` and the BLOCK message was
+  correctly echoed because it arrived as a tool_result, not as user content
+  the subagent had to reproduce.
+
+When to use each:
+
+- Use `semgrep-scanner` for:
+  - Echo-probe smokes (reproduce a specific marker / hex / JSON verbatim).
+  - Smokes that test for content-rule fabrication (subagent must NOT alter
+    the input).
+  - Smokes that test multi-paragraph response fidelity.
+- Use `statusline-setup` for:
+  - Gate-inheritance smokes (one tool call, report tool_result).
+  - Smokes that test whether the subagent's spawn inherits the gate at all
+    (the system prompt's narrowness actually helps focus the test).
+  - Quick "did the BLOCK message reach the subagent" checks.
+
+If in doubt for a new smoke design, prefer `semgrep-scanner` and only switch
+to `statusline-setup` if the smoke explicitly needs the narrower frame.
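
The selection rules above can be written down as a small lookup, together with the echo-probe pass criterion (every requested block must appear verbatim in the subagent output). The smoke-kind labels, the function names, and the sample payload are hypothetical; only the two `subagent_type` values and the "if in doubt" default come from this section.

```javascript
// Hypothetical smoke-kind labels mapped to the subagent_type recommended above.
const SUBAGENT_FOR_SMOKE = new Map([
  ["echo-probe", "semgrep-scanner"],
  ["content-rule-fabrication", "semgrep-scanner"],
  ["multi-paragraph-fidelity", "semgrep-scanner"],
  ["gate-inheritance", "statusline-setup"],
  ["spawn-inherits-gate", "statusline-setup"],
  ["block-message-reached", "statusline-setup"],
]);

// Unknown kinds fall back to semgrep-scanner, matching the "if in doubt" rule.
function pickSubagentType(smokeKind) {
  return SUBAGENT_FOR_SMOKE.get(smokeKind) ?? "semgrep-scanner";
}

// An echo-probe passes only if every expected block appears verbatim.
function echoProbePassed(output, expectedBlocks) {
  return expectedBlocks.every((block) => output.includes(block));
}

// Made-up probe payload and two made-up subagent replies.
const expectedBlocks = ["BENIGN MARKER ALPHA", '{"probe":"alpha"}'];
const selfDescription = "I am the statusline configurator.";
const faithfulEcho = 'BENIGN MARKER ALPHA\n{"probe":"alpha"}';

console.log(pickSubagentType("echo-probe"));
console.log(pickSubagentType("new-smoke-design"));
console.log(echoProbePassed(selfDescription, expectedBlocks));
console.log(echoProbePassed(faithfulEcho, expectedBlocks));
```

A reply that only describes the subagent fails the probe, while a verbatim echo passes. The substring check is deliberately strict so that any alteration of the requested content counts as a failure.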