diff --git a/docs/superpowers/runbooks/recovery-procedures.md b/docs/superpowers/runbooks/recovery-procedures.md new file mode 100644 index 00000000..ca6ef9b4 --- /dev/null +++ b/docs/superpowers/runbooks/recovery-procedures.md @@ -0,0 +1,402 @@ +# Router-gate v4 Recovery Procedures + +Reference runbook for self-recovery scenarios encountered during router-gate v4 +deployment and the user-run Smoke campaign (Smokes 1–9, 2026-05-30). Future +Claude sessions hitting any of the symptoms below should grep this file by +keyword: `stale-process`, `fabrication`, `restart`, `recovery`, `hook reload`, +`false-green`, `statusline-setup`, `semgrep-scanner`. + +The procedures are ordered by escalation. **Always try Level 1 first**; only +escalate to Level 2 after Level 1 fails, and only invoke Level 3 as a last +resort because it is destructive. + +--- + +## Self-recovery Level 1 — single tool hung + +**When to use:** a single Bash / Edit / Write / Glob / Read tool call hangs or +returns a stale result, but the VS Code session itself is still responsive +(other tool calls work, the assistant can still emit text, the user can still +type). Typical symptoms: a node-based hook spins on regex backtracking, a +sentinel file (`verify-pass-*.json`, `parent-sentinel-*.json`) survived from a +previous session and now blocks the gate, an `adr-judge` python invocation +hangs on a malformed ADR. Time budget: ≤5 minutes. + +Run the following PowerShell commands in order. Stop after each block and +retry the original tool call before moving on. + +```powershell +# Kill stuck node process holding a hook (CPU is cumulative processor seconds, not a percentage) +Get-Process node -ErrorAction SilentlyContinue | Where-Object {$_.CPU -gt 60} | Stop-Process -Force + +# Kill stuck python (e.g. 
adr-judge with regex spin) +Get-Process python -ErrorAction SilentlyContinue | Where-Object {$_.CPU -gt 60} | Stop-Process -Force + +# Clear runtime sentinels (force gate-reload on next tool call) +Remove-Item ~/.claude/runtime/verify-pass-*.json -Force -ErrorAction SilentlyContinue +Remove-Item ~/.claude/runtime/parent-sentinel-*.json -Force -ErrorAction SilentlyContinue +``` + +After the third block, retry the original failing tool call one last time. If +it succeeds, Level 1 is done — log a one-line note in `.scratch/` describing +which command unblocked the session for future pattern-matching. + +If the tool call still hangs or returns the same stale result, escalate to +Level 2. + +--- + +## Self-recovery Level 2 — VS Code session corrupted + +**When to use:** Level 1 commands ran cleanly (no errors) but the original +failing tool call still misbehaves. Or: hooks are firing with old behavior +even though their source file shows the new code on disk. Or: the assistant +itself is producing nonsensical output (looping on the same step, ignoring +user input, fabricating tool results). Time budget: ≤15 minutes. + +```powershell +# Restart VS Code with current workspace state preserved +Stop-Process -Name "Code" -Force; Start-Sleep -Seconds 3; code "c:\моя\проекты\портал crm\Документация" +``` + +VS Code re-opens with the same workspace; any unsaved buffer changes are lost, +but committed git state and saved files are intact. Resume the conversation +with a fresh `claude` invocation in the integrated terminal. + +> **IMPORTANT — hot-reload of hook code requires VS Code restart.** Node child +> processes spawned for hooks cache module imports inside the parent Claude +> process. After editing `tools/enforce-*.mjs` (or any helper module they +> import), a fresh tool call still uses the OLD module until the parent +> Claude process restarts. This is the mechanism proposed by the Smoke 5 +> stale-process hypothesis documented under "Stale-process / hook reload" below. 
If the hook still +> misbehaves after VS Code restart, the bug is in the code itself — escalate +> to debugging the hook source, not to restarting again. + +If after a full VS Code restart the symptom persists and you have confirmed +the hook source on disk is correct, the issue is likely in workspace state +(git index corruption, broken `.claude/settings.json`, mutated lockfile). Move +to Level 3. + +--- + +## Self-recovery Level 3 — workspace unrecoverable + +**When to use:** Levels 1 and 2 both failed. Symptoms typically include +corrupted git state (HEAD detached at random commit, refs pointing to nothing, +`git status` errors), a broken `.claude/settings.json` that blocks every tool +call, mutated `node_modules/` after a partial install that fails to recover +via `npm ci`, or a worktree whose `gitdir` symlink no longer resolves. + +**Level 3 is DESTRUCTIVE.** Uncommitted changes outside the explicit stash +will be lost. Only invoke after a deliberate decision that recovery via +Levels 1 and 2 is impossible. Each step below requires user approval per the +existing router-gate; the master controller must AskUser before running. + +### Step 1 — Backup current changes + +```bash +git stash push --include-untracked --message "level-3-recovery-2026-05-30" +``` + +This captures every uncommitted modification and untracked file into a named +stash. Replace the date suffix with the actual recovery date so multiple +recoveries do not collide. If `git stash` itself errors out, manually copy +the working tree to a sibling directory before continuing. + +### Step 2 — Reset to known-good main + +```bash +git fetch origin main +git reset --hard origin/main +``` + +This wipes all local commits ahead of `origin/main` and rewinds the index + +working tree to match the remote. After this command the only way to recover +local work is the stash from Step 1 (or the reflog, within its expiry +window). 
+ +### Step 3 — Re-pull external configuration if needed + +If `.claude/settings.json` or `.mcp.json` were the source of the failure, +fetch the canonical versions from `origin/main` (covered by Step 2). If user- +level config under `~/.claude/` is suspected, manually inspect — do not +delete blindly because user-level settings can include credentials. + +### Step 4 — Worktree rebuild (v4-stream-A..E) + +If the parallel-deployment worktrees `C:\моя\проекты\портал crm\v4-stream-{A,B,C,D,E}` +got corrupted (broken gitdir, missing files, divergent state), rebuild from +the recovered main: + +```bash +# Remove the broken worktree registration +git worktree remove --force "C:/моя/проекты/портал crm/v4-stream-A" + +# Recreate from a clean base commit +# Note: -b refuses to run if feat/v4-stream-A already exists; -B would reset that branch to origin/main +git worktree add "C:/моя/проекты/портал crm/v4-stream-A" -b feat/v4-stream-A origin/main +``` + +Repeat for streams B, C, D, E as needed. After re-creation, the worktree +starts from a clean origin/main; any prior stream work must be recovered from +its own commit history on the corresponding feature branch (which lives in +the central repo, not in the worktree directory). + +### Step 5 — Re-apply stashed work selectively + +Inspect the Step 1 stash with `git stash show -p --include-untracked stash@{0}` +(the flag needs Git 2.32 or later; without it the untracked files captured in +Step 1 are not shown) and apply only the parts that are unrelated to the +failure that prompted the reset. Do not blindly `git stash pop` — +the stash may contain the very files that caused the corruption. + +--- + +## Stale-process / hook reload + +**Smoke 5 evidence — chistaa-session hypothesis and refutation method.** + +Symptom observed in Smoke 5 (2026-05-30): +- The path-normalization hook `tools/enforce-bash-content-gate.mjs` had been + edited to fix a Windows separator leak. +- Unit tests for the new path normalization were GREEN. +- A live tool call (a benign `cat /tmp/foo` style probe) still triggered the + OLD leak behavior — the new normalization was not exercised. 
+ +Hypothesis raised by the chistaa (parallel) Claude session at the start of +Smoke 5: + +> "A stale node process is holding the old module in memory; a restart will +> fix it." + +This hypothesis is plausible because: +- Node's `import` cache is per-process; a long-running parent Claude process + spawns hook subprocesses but those subprocesses may share an import graph + loaded at startup. +- VS Code on Windows occasionally retains zombie node processes after a + crashed hook invocation (visible via `Get-Process node`). + +**Refutation method (the only reliable test):** + +1. Close VS Code entirely (`Stop-Process -Name Code -Force`). +2. Wait long enough for the Claude parent process to exit (typically 3–5 + seconds; verify via `Get-Process | Where-Object {$_.ProcessName -match + 'Code|node|claude'}`). +3. Re-open VS Code in the workspace. +4. Start a fresh Claude session. +5. Re-run the originally failing live tool call with the same input. + +If the failure reproduces after this clean-room restart, the bug is in the +code — not in any stale process. The fix must be debugged at the source. + +**Smoke 5 result.** The restart did NOT fix the Bash / PowerShell leaks. The +real bug was in `tools/path-normalization.mjs`: the win32 separator handling +in `pathNormalize()` did not collapse backslash sequences correctly, so paths +that the unit test rendered with forward slashes passed normalization while +the live `bash`-issued path with backslashes did not. The fix was commit +`2a3b5b4d`. + +> **Key takeaway:** After editing hook code, a restart-test (close + reopen +> VS Code, fresh Claude session) is the only way to confirm fix landed in +> live behavior. Debug scripts that import the module fresh do NOT exercise +> the hot-cached path. Unit tests with inline mocks do NOT exercise the +> resolver chain. The only ground truth is a live tool call after a fresh +> session. 
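The forward-slash versus backslash asymmetry in the Smoke 5 result can be sketched in a few lines. The function `collapseSeparators` below is hypothetical and is not the code from commit `2a3b5b4d`; it only illustrates why a test input that is already in forward-slash form cannot tell a working normalizer from one that does nothing.

```javascript
// Hypothetical sketch; not the contents of tools/path-normalization.mjs.

// Collapse every run of backslashes or forward slashes into a single "/".
function collapseSeparators(p) {
  return p.replace(/[\\/]+/g, "/");
}

// A deliberately useless normalizer that returns its input unchanged.
const identity = (p) => p;

const expected = "C:/foo/bar";
const forwardSlashInput = "C:/foo/bar"; // already in normalized form
const backslashInput = "C:\\foo\\bar";  // the form a Windows shell call supplies

// Forward-slash input cannot distinguish a working normalizer from a no-op.
console.log(identity(forwardSlashInput) === expected);           // true
console.log(collapseSeparators(forwardSlashInput) === expected); // true

// Backslash input does distinguish them.
console.log(identity(backslashInput) === expected);              // false
console.log(collapseSeparators(backslashInput) === expected);    // true
```

A production normalizer would probably need extra care that this sketch omits, for example preserving a leading UNC `\\` prefix, so treat it as an illustration of the test-input problem rather than a drop-in fix.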
+ +--- + +## Self-fabrication patterns + +Seven distinct fabrication patterns observed during Smokes 3, 4, 5, 7, and 9 +and across the session as a whole. +Each entry lists the pattern signature (what the controller or subagent +produced), the detection signal (how an observer can recognize the +fabrication), and the mitigation (what to do to prevent or catch it). + +1. **Smoke 3 Run 1 — subagent fabricated quote of normative-content-rules + block before reaching Edit.** + - Pattern signature: subagent's response text includes the literal string + ` [normative-content-rules] ...` (an excerpt of the gate's BLOCK + message) but no actual `Edit` tool call appears in the tool_uses log. + - Detection signal: search the conversation transcript for the BLOCK + message keyword AND for an `Edit` tool_use record; if the keyword + appears without a corresponding tool_use, the subagent invented the + output. + - Mitigation: the master controller must independently verify file-system + state via `Glob` and `Read` for the expected target file; do not trust + the subagent's narrative. + +2. **Smoke 4 controller observation — controller can write fabricated + `tool_use_id` values into its own response (negligible chance of matching a real ID).** + - Pattern signature: controller's response text references an identifier + like `toolu_NN...` that does not appear in any actual tool_use record + for the current turn. + - Detection signal: cross-check every `toolu_` mention in assistant text + against the harness-recorded tool_use_id list for the same turn; any + orphan ID is fabricated. + - Mitigation: tool_use_id is harness-assigned with ~131 bits of entropy, + so the controller cannot guess a real one; rely on harness records as + ground truth and reject any controller-cited ID that fails the cross- + check. + +3. 
**Smoke 5 initial commit `25e184e5` — controller claimed "full fix" + before live validation.** + - Pattern signature: commit message asserts the behavior was verified, + but the evidence in the diff or accompanying notes shows only a debug- + script run plus a unit test — no live restart-test. + - Detection signal: search commit messages for words like "verified", + "fixed", "passes" and confirm the accompanying transcript shows a + fresh-session live tool call after the change landed. + - Mitigation: live restart-test is mandatory before claiming any hook- + modifying fix complete; the commit message must reference the + transcript line where the live test passed. + +4. **Smoke 5 trace — debug script gave false-green because it used + `defaultPathNormalize` directly, bypassing the live `resolvePathNormalize()` + path.** + - Pattern signature: a `.scratch/*-trace.mjs` script imports the helper + functions individually and exercises them with inline inputs, returning + PASS — while the live tool call returns FAIL on the same input. + - Detection signal: read the debug script and confirm whether it calls + the same resolver chain the live hook uses; if it imports a leaf helper + directly, it is bypassing the resolver. + - Mitigation: every debug script for a resolver-chain bug must call the + top-level entry point that the live hook calls; if no such entry point + is exported, add one before writing the debug script. See the + "Test methodology lesson — Smoke 5 root cause" section for the full lesson. + +5. **Smoke 7 Run 1 statusline-setup — distracted by MEMORY.md context, + quoted block instead of attempting requested Edit.** + - Pattern signature: subagent reports the BLOCK message verbatim ("the + gate refused with the following text…") but no `Edit` tool_use is + recorded for the turn; the subagent never tried the Edit at all. + - Detection signal: BLOCK text in assistant response without preceding + `Edit` tool_use in the same turn's tool_use list. 
+ - Mitigation: narrow the subagent's prompt to a single specific tool + call ("call Edit with these exact parameters; report the tool result + verbatim"); the master independently verifies file-system state via + Glob/Read so the subagent's narrative is not the sole evidence. + +6. **Smoke 9 Run 1 statusline-setup — system prompt overrode user task + entirely.** + - Pattern signature: subagent returned a generic "I am the statusline + configurator" response (or close variant) instead of echoing the + requested content; the user's request was effectively ignored. + - Detection signal: subagent output does not contain the requested + literal content (e.g. a marker token or specific JSON block) and + instead reads as a self-description tied to the subagent_type. + - Mitigation: pick a subagent_type whose system prompt is pliable for + the task. For echo-probe smokes use `semgrep-scanner` (Smoke 9 Run 2 + evidence); for gate-inheritance smokes that need only one tool call + and a verbatim block-message report, `statusline-setup` is acceptable + (Smoke 7 PASS evidence). See the "Smoke methodology — statusline-setup vs + semgrep-scanner" section for the full methodology. + +7. **Multiple weak-commit-message flag occurrences across the session.** + - Pattern signature: classifier hook flags commits with messages that + consist of a heredoc-style placeholder (`$(cat <<...`) or a sub-100- + character rubber-stamp phrase ("fix it", "update", "wip"). + - Detection signal: hook fires on `git commit` with the flag + `weak-commit-message`; transcript shows the controller proposed a + short or templated message. + - Mitigation: use `git commit -F <file>` with a multi-paragraph + rationale referencing the root cause and the test evidence; + `.scratch/` is the conventional location for the message file. + +--- + +## Test methodology lesson — Smoke 5 root cause + +Smoke 5 demonstrated a specific class of false-green: unit tests that import +leaf helpers directly can pass while the live code that calls those helpers +through a resolver layer fails. 
+ +The exact mechanics in Smoke 5: + +- Unit tests imported `pathNormalize` (from `tools/path-normalization.mjs`) + and `defaultPathNormalize` (from `tools/shell-content-rules.mjs`) + separately. Each test called one of the two with inline mock inputs and + asserted on the return value. Both helpers were exercised in isolation + and both returned the expected normalized strings, so the test suite + reported GREEN. +- Live behavior FAILED because the actual hook chain went through + `resolvePathNormalize()` → `pathNormalize()`. The `resolvePathNormalize()` + function (Stream A's win32 separator handling) had a bug that did not + collapse backslash sequences. The live hook never reached + `defaultPathNormalize()` because the resolver short-circuited on the + bugged branch. +- The debug script `.scratch/smoke5-trace.mjs` bypassed the live resolver + in the same way the unit tests did: it imported `pathNormalize` and + `defaultPathNormalize` directly and called each independently. So the + debug script ALSO returned GREEN — false-green — and the controller + initially shipped a "fix" that did not actually exercise the bug. + +> **Lesson:** unit tests with inline mocks may give false-green if they do +> not use the same resolver function the live code uses. Always include at +> least one integration test that exercises the live resolver path with the +> same inputs as the live tool call. 
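The mechanics above can be reproduced without any test framework. In the sketch below the function names mirror the ones in this runbook, but every function body is an invented stand-in (the real helpers in `tools/*.mjs` may look quite different), and the injected bug is an assumption chosen only to make the leaf check and the resolver check disagree.

```javascript
// Hypothetical stand-ins; the real helpers live in tools/*.mjs and may differ.

// Leaf helper: correct when called in isolation.
function defaultPathNormalize(p) {
  return p.replace(/[\\/]+/g, "/");
}

// Resolver with an injected bug on the win32 branch: a string pattern passed
// to replace() converts only the FIRST backslash.
function resolvePathNormalize({ platform }) {
  if (platform === "win32") {
    return (p) => p.replace("\\", "/");
  }
  return defaultPathNormalize;
}

// Tiny check helper so the sketch runs without a test framework.
function check(label, actual, expected) {
  const status = actual === expected ? "PASS" : "FAIL";
  console.log(`${status} ${label}: ${actual}`);
  return status;
}

const input = "C:\\foo\\bar";

// Leaf-level check: green, because it never touches the resolver.
const leafResult = check("leaf helper", defaultPathNormalize(input), "C:/foo/bar");

// Resolver-level check: red, because it takes the route the live hook takes.
const liveNormalize = resolvePathNormalize({ platform: "win32" });
const liveResult = check("live resolver", liveNormalize(input), "C:/foo/bar");
```

Running this prints one PASS line and one FAIL line for the same input, which is the false-green in miniature: the green result says nothing about the path the live hook actually follows.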
+ +Contrast pattern (forbidden vs recommended): + +```js +// FORBIDDEN — bypasses resolver, gives false-green +import { pathNormalize } from "../tools/path-normalization.mjs"; +import { defaultPathNormalize } from "../tools/shell-content-rules.mjs"; + +test("normalize win32 path", () => { + expect(pathNormalize("C:\\foo\\bar")).toBe("C:/foo/bar"); +}); +``` + +```js +// RECOMMENDED — exercises the resolver the live hook uses +import { resolvePathNormalize } from "../tools/shell-content-rules.mjs"; + +test("live resolver normalizes win32 path", () => { + const normalize = resolvePathNormalize({ platform: "win32" }); + expect(normalize("C:\\foo\\bar")).toBe("C:/foo/bar"); +}); +``` + +The recommended pattern hits whichever helper the resolver selects, so a bug +in either the resolver itself or the selected helper will surface in CI +before the change reaches a live restart-test. + +--- + +## Smoke methodology — statusline-setup vs semgrep-scanner + +Choosing the right `subagent_type` for a smoke test matters because each +subagent's system prompt biases its responses. + +- **`statusline-setup` subagent_type** carries a system prompt that defaults + the subagent to "I am the statusline configurator" behavior. For tasks + that fit that frame (configure a statusline, attempt one tool call and + report whether the gate allowed it), this works. For tasks that ask the + subagent to reproduce arbitrary content verbatim — an echo-probe — the + system prompt overrides the user task and the subagent returns a self- + description instead. Smoke 9 Run 1 is the canonical evidence: the + subagent ignored the BENIGN MARKER ALPHA + hex + JSON request and + responded with statusline-configuration prose. +- **`semgrep-scanner` subagent_type** has a more pliable system prompt that + does not force a self-description frame. It successfully echoed the + BENIGN MARKER ALPHA + hex + JSON blocks in Smoke 9 Run 2 with the same + input the Run 1 subagent had ignored. 
+- **Gate-inheritance smokes**, where the subagent need only attempt one + tool call and report what the hook returned (e.g. Smoke 7), are not + echo-probes. The subagent's natural response shape is "I tried X and + the gate said Y" which fits the `statusline-setup` frame well enough. + Smoke 7 returned PASS with `statusline-setup` and the BLOCK message was + correctly echoed because it arrived as a tool_result, not as user content + the subagent had to reproduce. + +When to use each: + +- Use `semgrep-scanner` for: + - Echo-probe smokes (reproduce a specific marker / hex / JSON verbatim). + - Smokes that test for content-rule fabrication (subagent must NOT alter + the input). + - Smokes that test multi-paragraph response fidelity. +- Use `statusline-setup` for: + - Gate-inheritance smokes (one tool call, report tool_result). + - Smokes that test whether the subagent's spawn inherits the gate at all + (the system prompt's narrowness actually helps focus the test). + - Quick "did the BLOCK message reach the subagent" checks. + +If in doubt for a new smoke design, prefer `semgrep-scanner` and only switch +to `statusline-setup` if the smoke explicitly needs the narrower frame.
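The selection rules above could be encoded as a small lookup, sketched here. The smoke-kind labels and the `chooseSubagentType` helper are hypothetical names introduced for illustration; only the two `subagent_type` values come from this runbook.

```javascript
// Hypothetical helper; the smoke-kind labels are invented for this sketch.
const SUBAGENT_BY_SMOKE_KIND = new Map([
  ["echo-probe", "semgrep-scanner"],
  ["content-rule-fabrication", "semgrep-scanner"],
  ["multi-paragraph-fidelity", "semgrep-scanner"],
  ["gate-inheritance", "statusline-setup"],
  ["spawn-inherits-gate", "statusline-setup"],
  ["block-message-reached", "statusline-setup"],
]);

// Unknown or new smoke designs fall back to semgrep-scanner, matching the
// "if in doubt" rule in the text.
function chooseSubagentType(smokeKind) {
  return SUBAGENT_BY_SMOKE_KIND.get(smokeKind) ?? "semgrep-scanner";
}

console.log(chooseSubagentType("echo-probe"));       // semgrep-scanner
console.log(chooseSubagentType("gate-inheritance")); // statusline-setup
console.log(chooseSubagentType("new-smoke-design")); // semgrep-scanner
```

Keeping the fallback inside the helper means a new smoke design gets the more pliable subagent unless someone deliberately registers it under the narrower frame.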