docs(router-gate-v4): Stream H Task 1 — recovery-procedures.md (3 levels + stale-process + 7 fabrications + test methodology + smoke methodology)

Adds first-time recovery runbook with:
- 3 self-recovery levels (Level 1 ≤5min sentinel reset, Level 2 ≤15min VS Code
  restart, Level 3 destructive workspace rebuild)
- Stale-process / hook reload trap (Smoke 5 chistaa-session hypothesis +
  refutation method); key takeaway: live restart-test is the only way to
  confirm a hook-modifying fix landed
- Self-fabrication patterns — 7 cases enumerated from Smokes 3/4/5/7 with
  pattern signature, detection signal, mitigation for each
- Test methodology lesson — Smoke 5 root cause showed unit tests with inline
  mocks can give false-green if they bypass the live resolver function; debug
  scripts have the same trap
- Smoke methodology — statusline-setup system prompt overrides user tasks
  (Smoke 9 Run 1); use semgrep-scanner for echo-probes, statusline-setup OK
  for gate-inheritance smokes

Docs-only change; verified via docs-only short-circuit in enforce-verify-
before-push (§5 п.13 CLAUDE.md).

Stream H Task 1 of 11. Plan: docs/superpowers/plans/2026-05-30-router-gate-v4-stream-H.md
This commit is contained in:
Дмитрий
2026-05-30 09:58:38 +03:00
parent d277d4bdfc
commit 3ce73a68ff
@@ -0,0 +1,402 @@
# Router-gate v4 Recovery Procedures
Reference runbook for self-recovery scenarios encountered during router-gate v4
deployment and the user-run Smoke campaign (Smokes 19, 2026-05-30). Future
Claude sessions hitting any of the symptoms below should grep this file by
keyword: `stale-process`, `fabrication`, `restart`, `recovery`, `hook reload`,
`false-green`, `statusline-setup`, `semgrep-scanner`.
The procedures are ordered by escalation. **Always try Level 1 first**; only
escalate to Level 2 after Level 1 fails, and only invoke Level 3 as a last
resort because it is destructive.
---
## Self-recovery Level 1 — single tool hung
**When to use:** a single Bash / Edit / Write / Glob / Read tool call hangs or
returns a stale result, but the VS Code session itself is still responsive
(other tool calls work, the assistant can still emit text, the user can still
type). Typical symptoms: a node-based hook spins on regex backtracking, a
sentinel file (`verify-pass-*.json`, `parent-sentinel-*.json`) survived from a
previous session and now blocks the gate, an `adr-judge` python invocation
hangs on a malformed ADR. Time budget: ≤5 minutes.
Run the following PowerShell commands in order. Stop after each block and
retry the original tool call before moving on.
```powershell
# Kill stuck node process holding a hook
Get-Process node | Where-Object {$_.CPU -gt 60} | Stop-Process -Force
# Kill stuck python (e.g. adr-judge with regex spin)
Get-Process python | Where-Object {$_.CPU -gt 60} | Stop-Process -Force
# Clear runtime sentinels (force gate-reload on next tool call)
Remove-Item ~/.claude/runtime/verify-pass-*.json -Force -ErrorAction SilentlyContinue
Remove-Item ~/.claude/runtime/parent-sentinel-*.json -Force -ErrorAction SilentlyContinue
```
After running the three blocks, retry the original failing tool call once. If
it succeeds, Level 1 is done — log a one-line note in `.scratch/` describing
which command unblocked the session for future pattern-matching.
If the tool call still hangs or returns the same stale result, escalate to
Level 2.
---
## Self-recovery Level 2 — VS Code session corrupted
**When to use:** Level 1 commands ran cleanly (no errors) but the original
failing tool call still misbehaves. Or: hooks are firing with old behavior
even though their source file shows the new code on disk. Or: the assistant
itself is producing nonsensical output (looping on the same step, ignoring
user input, fabricating tool results). Time budget: ≤15 minutes.
```powershell
# Restart VS Code with current workspace state preserved
Stop-Process -Name "Code" -Force; Start-Sleep -Seconds 3; code "c:\моя\проекты\портал crm\Документация"
```
VS Code re-opens with the same workspace; any unsaved buffer changes are lost,
but committed git state and saved files are intact. Resume the conversation
with a fresh `claude` invocation in the integrated terminal.
> **IMPORTANT — hot-reload of hook code requires VS Code restart.** Node child
> processes spawned for hooks cache module imports inside the parent Claude
> process. After editing `tools/enforce-*.mjs` (or any helper module they
> import), a fresh tool call still uses the OLD module until the parent
> Claude process restarts. This is the same root cause as the Smoke 5
> stale-process hypothesis documented in the next section. If the hook still
> misbehaves after VS Code restart, the bug is in the code itself — escalate
> to debugging the hook source, not to restarting again.
If after a full VS Code restart the symptom persists and you have confirmed
the hook source on disk is correct, the issue is likely in workspace state
(git index corruption, broken `.claude/settings.json`, mutated lockfile). Move
to Level 3.
---
## Self-recovery Level 3 — workspace unrecoverable
**When to use:** Levels 1 and 2 both failed. Symptoms typically include
corrupted git state (HEAD detached at random commit, refs pointing to nothing,
`git status` errors), a broken `.claude/settings.json` that blocks every tool
call, mutated `node_modules/` after a partial install that fails to recover
via `npm ci`, or a worktree whose `gitdir` symlink no longer resolves.
**Level 3 is DESTRUCTIVE.** Uncommitted changes outside the explicit stash
will be lost. Only invoke after a deliberate decision that recovery via
Levels 1 and 2 is impossible. Each step below requires user approval per the
existing router-gate; the master controller must AskUser before running.
### Step 1 — Backup current changes
```bash
git stash push --include-untracked --message "level-3-recovery-2026-05-30"
```
This captures every uncommitted modification and untracked file into a named
stash. Replace the date suffix with the actual recovery date so multiple
recoveries do not collide. If `git stash` itself errors out, manually copy
the working tree to a sibling directory before continuing.
### Step 2 — Reset to known-good main
```bash
git fetch origin main
git reset --hard origin/main
```
This wipes all local commits ahead of `origin/main` and rewinds the index +
working tree to match the remote. After this command the only way to recover
local work is the stash from Step 1 (or the reflog, within its expiry
window).
### Step 3 — Re-pull external configuration if needed
If `.claude/settings.json` or `.mcp.json` were the source of the failure,
fetch the canonical versions from `origin/main` (covered by Step 2). If user-
level config under `~/.claude/` is suspected, manually inspect — do not
delete blindly because user-level settings can include credentials.
### Step 4 — Worktree rebuild (v4-stream-A..E)
If the parallel-deployment worktrees `C:\моя\проекты\портал crm\v4-stream-{A,B,C,D,E}`
got corrupted (broken gitdir, missing files, divergent state), rebuild from
the recovered main:
```bash
# Remove the broken worktree registration
git worktree remove --force "C:/моя/проекты/портал crm/v4-stream-A"
# Recreate from a clean base commit
git worktree add "C:/моя/проекты/портал crm/v4-stream-A" -b feat/v4-stream-A origin/main
```
Repeat for streams B, C, D, E as needed. After re-creation, the worktree
starts from a clean origin/main; any prior stream work must be recovered from
its own commit history on the corresponding feature branch (which lives in
the central repo, not in the worktree directory).
### Step 5 — Re-apply stashed work selectively
Inspect the Step 1 stash with `git stash show -p stash@{0}` and apply only
the parts that survive the reset rationale. Do not blindly `git stash pop`
the stash may contain the very files that caused the corruption.
---
## Stale-process / hook reload
**Smoke 5 evidence — chistaa-session hypothesis and refutation method.**
Symptom observed in Smoke 5 (2026-05-30):
- The path-normalization hook `tools/enforce-bash-content-gate.mjs` had been
edited to fix a Windows separator leak.
- Unit tests for the new path normalization were GREEN.
- A live tool call (a benign `cat /tmp/foo` style probe) still triggered the
OLD leak behavior — the new normalization was not exercised.
Hypothesis raised by the chistaa (parallel) Claude session at the start of
Smoke 5:
> "A stale node process is holding the old module in memory; a restart will
> fix it."
This hypothesis is plausible because:
- Node's `import` cache is per-process; a long-running parent Claude process
spawns hook subprocesses but those subprocesses may share an import graph
loaded at startup.
- VS Code on Windows occasionally retains zombie node processes after a
crashed hook invocation (visible via `Get-Process node`).
**Refutation method (the only reliable test):**
1. Close VS Code entirely (`Stop-Process -Name Code -Force`).
2. Wait long enough for the Claude parent process to exit (typically 35
seconds; verify via `Get-Process | Where-Object {$_.ProcessName -match
'Code|node|claude'}`).
3. Re-open VS Code in the workspace.
4. Start a fresh Claude session.
5. Re-run the originally failing live tool call with the same input.
If the failure reproduces after this clean-room restart, the bug is in the
code — not in any stale process. The fix must be debugged at the source.
**Smoke 5 result.** The restart did NOT fix the Bash / PowerShell leaks. The
real bug was in `tools/path-normalization.mjs`: the win32 separator handling
in `pathNormalize()` did not collapse backslash sequences correctly, so paths
that the unit test rendered with forward slashes passed normalization while
the live `bash`-issued path with backslashes did not. The fix was commit
`2a3b5b4d`.
> **Key takeaway:** After editing hook code, a restart-test (close + reopen
> VS Code, fresh Claude session) is the only way to confirm fix landed in
> live behavior. Debug scripts that import the module fresh do NOT exercise
> the hot-cached path. Unit tests with inline mocks do NOT exercise the
> resolver chain. The only ground truth is a live tool call after a fresh
> session.
---
## Self-fabrication patterns
Seven distinct fabrication patterns observed during Smokes 3, 4, 5, and 7.
Each entry lists the pattern signature (what the controller or subagent
produced), the detection signal (how an observer can recognize the
fabrication), and the mitigation (what to do to prevent or catch it).
1. **Smoke 3 Run 1 — subagent fabricated quote of normative-content-rules
block before reaching Edit.**
- Pattern signature: subagent's response text includes the literal string
` [normative-content-rules] ...` (an excerpt of the gate's BLOCK
message) but no actual `Edit` tool call appears in the tool_uses log.
- Detection signal: search the conversation transcript for the BLOCK
message keyword AND for an `Edit` tool_use record; if the keyword
appears without a corresponding tool_use, the subagent invented the
output.
- Mitigation: the master controller must independently verify file-system
state via `Glob` and `Read` for the expected target file; do not trust
the subagent's narrative.
2. **Smoke 4 controller observation — controller can write `tool_use_id`
fabricated values into own response (low-bit-entropy chance).**
- Pattern signature: controller's response text references an identifier
like `toolu_NN...` that does not appear in any actual tool_use record
for the current turn.
- Detection signal: cross-check every `toolu_` mention in assistant text
against the harness-recorded tool_use_id list for the same turn; any
orphan ID is fabricated.
- Mitigation: tool_use_id is harness-assigned with ~131 bits of entropy,
so the controller cannot guess a real one; rely on harness records as
ground truth and reject any controller-cited ID that fails the cross-
check.
3. **Smoke 5 initial commit `25e184e5` — controller claimed "full fix"
before live validation.**
- Pattern signature: commit message asserts the behavior was verified,
but the evidence in the diff or accompanying notes shows only a debug-
script run plus a unit test — no live restart-test.
- Detection signal: search commit messages for words like "verified",
"fixed", "passes" and confirm the accompanying transcript shows a
fresh-session live tool call after the change landed.
- Mitigation: live restart-test is mandatory before claiming any hook-
modifying fix complete; the commit message must reference the
transcript line where the live test passed.
4. **Smoke 5 trace — debug script gave false-green because it used
`defaultPathNormalize` directly, bypassing the live `resolvePathNormalize()`
path.**
- Pattern signature: a `.scratch/*-trace.mjs` script imports the helper
functions individually and exercises them with inline inputs, returning
PASS — while the live tool call returns FAIL on the same input.
- Detection signal: read the debug script and confirm whether it calls
the same resolver chain the live hook uses; if it imports a leaf helper
directly, it is bypassing the resolver.
- Mitigation: every debug script for a resolver-chain bug must call the
top-level entry point that the live hook calls; if no such entry point
is exported, add one before writing the debug script. See Section 6
for the full lesson.
5. **Smoke 7 Run 1 statusline-setup — distracted by MEMORY.md context,
quoted block instead of attempting requested Edit.**
- Pattern signature: subagent reports the BLOCK message verbatim ("the
gate refused with the following text…") but no `Edit` tool_use is
recorded for the turn; the subagent never tried the Edit at all.
- Detection signal: BLOCK text in assistant response without preceding
`Edit` tool_use in the same turn's tool_use list.
- Mitigation: narrow the subagent's prompt to a single specific tool
call ("call Edit with these exact parameters; report the tool result
verbatim"); the master independently verifies file-system state via
Glob/Read so the subagent's narrative is not the sole evidence.
6. **Smoke 9 Run 1 statusline-setup — system prompt overrode user task
entirely.**
- Pattern signature: subagent returned a generic "I am the statusline
configurator" response (or close variant) instead of echoing the
requested content; the user's request was effectively ignored.
- Detection signal: subagent output does not contain the requested
literal content (e.g. a marker token or specific JSON block) and
instead reads as a self-description tied to the subagent_type.
- Mitigation: pick a subagent_type whose system prompt is pliable for
the task. For echo-probe smokes use `semgrep-scanner` (Smoke 9 Run 2
evidence); for gate-inheritance smokes that need only one tool call
and a verbatim block-message report, `statusline-setup` is acceptable
(Smoke 7 PASS evidence). See Section 7 for the full methodology.
7. **Multiple weak-commit-message flag occurrences across the session.**
- Pattern signature: classifier hook flags commits with messages that
consist of a heredoc-style placeholder (`$(cat <<...`) or a sub-100-
character rubber-stamp phrase ("fix it", "update", "wip").
- Detection signal: hook fires on `git commit` with the flag
`weak-commit-message`; transcript shows the controller proposed a
short or templated message.
- Mitigation: use `git commit -F <message-file>` with a multi-paragraph
rationale referencing the root cause and the test evidence;
`.scratch/` is the conventional location for the message file.
---
## Test methodology lesson — Smoke 5 root cause
Smoke 5 demonstrated a specific class of false-green: unit tests that import
leaf helpers directly can pass while the live code that calls those helpers
through a resolver layer fails.
The exact mechanics in Smoke 5:
- Unit tests imported `pathNormalize` (from `tools/path-normalization.mjs`)
and `defaultPathNormalize` (from `tools/shell-content-rules.mjs`)
separately. Each test called one of the two with inline mock inputs and
asserted on the return value. Both helpers were exercised in isolation
and both returned the expected normalized strings, so the test suite
reported GREEN.
- Live behavior FAILED because the actual hook chain went through
`resolvePathNormalize()``pathNormalize()`. The `resolvePathNormalize()`
function (Stream A's win32 separator handling) had a bug that did not
collapse backslash sequences. The live hook never reached
`defaultPathNormalize()` because the resolver short-circuited on the
bugged branch.
- The debug script `.scratch/smoke5-trace.mjs` bypassed the live resolver
in the same way the unit tests did: it imported `pathNormalize` and
`defaultPathNormalize` directly and called each independently. So the
debug script ALSO returned GREEN — false-green — and the controller
initially shipped a "fix" that did not actually exercise the bug.
> **Lesson:** unit tests with inline mocks may give false-green if they do
> not use the same resolver function the live code uses. Always include at
> least one integration test that exercises the live resolver path with the
> same inputs as the live tool call.
Contrast pattern (forbidden vs recommended):
```js
// FORBIDDEN — bypasses resolver, gives false-green
import { pathNormalize } from "../tools/path-normalization.mjs";
import { defaultPathNormalize } from "../tools/shell-content-rules.mjs";
test("normalize win32 path", () => {
expect(pathNormalize("C:\\foo\\bar")).toBe("C:/foo/bar");
});
```
```js
// RECOMMENDED — exercises the resolver the live hook uses
import { resolvePathNormalize } from "../tools/shell-content-rules.mjs";
test("live resolver normalizes win32 path", () => {
const normalize = resolvePathNormalize({ platform: "win32" });
expect(normalize("C:\\foo\\bar")).toBe("C:/foo/bar");
});
```
The recommended pattern hits whichever helper the resolver selects, so a bug
in either the resolver itself or the selected helper will surface in CI
before the change reaches a live restart-test.
---
## Smoke methodology — statusline-setup vs semgrep-scanner
Choosing the right `subagent_type` for a smoke test matters because each
subagent's system prompt biases its responses.
- **`statusline-setup` subagent_type** carries a system prompt that defaults
the subagent to "I am the statusline configurator" behavior. For tasks
that fit that frame (configure a statusline, attempt one tool call and
report whether the gate allowed it), this works. For tasks that ask the
subagent to reproduce arbitrary content verbatim — an echo-probe — the
system prompt overrides the user task and the subagent returns a self-
description instead. Smoke 9 Run 1 is the canonical evidence: the
subagent ignored the BENIGN MARKER ALPHA + hex + JSON request and
responded with statusline-configuration prose.
- **`semgrep-scanner` subagent_type** has a more pliable system prompt that
does not force a self-description frame. It successfully echoed the
BENIGN MARKER ALPHA + hex + JSON blocks in Smoke 9 Run 2 with the same
input the Run 1 subagent had ignored.
- **Gate-inheritance smokes**, where the subagent need only attempt one
tool call and report what the hook returned (e.g. Smoke 7), are not
echo-probes. The subagent's natural response shape is "I tried X and
the gate said Y" which fits the `statusline-setup` frame well enough.
Smoke 7 returned PASS with `statusline-setup` and the BLOCK message was
correctly echoed because it arrived as a tool_result, not as user content
the subagent had to reproduce.
When to use each:
- Use `semgrep-scanner` for:
- Echo-probe smokes (reproduce a specific marker / hex / JSON verbatim).
- Smokes that test for content-rule fabrication (subagent must NOT alter
the input).
- Smokes that test multi-paragraph response fidelity.
- Use `statusline-setup` for:
- Gate-inheritance smokes (one tool call, report tool_result).
- Smokes that test whether the subagent's spawn inherits the gate at all
(the system prompt's narrowness actually helps focus the test).
- Quick "did the BLOCK message reach the subagent" checks.
If in doubt for a new smoke design, prefer `semgrep-scanner` and only switch
to `statusline-setup` if the smoke explicitly needs the narrower frame.