docs(router-gate-v4): Stream H Task 1 — recovery-procedures.md (3 levels + stale-process + 7 fabrications + test methodology + smoke methodology)
Adds a first-time recovery runbook with:

- 3 self-recovery levels (Level 1 ≤5 min sentinel reset, Level 2 ≤15 min VS Code restart, Level 3 destructive workspace rebuild)
- Stale-process / hook reload trap (Smoke 5 chistaa-session hypothesis + refutation method); key takeaway: a live restart-test is the only way to confirm that a hook-modifying fix landed
- Self-fabrication patterns: 7 cases enumerated from Smokes 3/4/5/7 with pattern signature, detection signal, and mitigation for each
- Test methodology lesson: the Smoke 5 root cause showed that unit tests with inline mocks can give false-green if they bypass the live resolver function; debug scripts have the same trap
- Smoke methodology: the statusline-setup system prompt overrides user tasks (Smoke 9 Run 1); use semgrep-scanner for echo-probes, statusline-setup is OK for gate-inheritance smokes

Docs-only change; verified via the docs-only short-circuit in enforce-verify-before-push (§5 item 13 of CLAUDE.md).

Stream H Task 1 of 11.
Plan: docs/superpowers/plans/2026-05-30-router-gate-v4-stream-H.md
@@ -0,0 +1,402 @@
# Router-gate v4 Recovery Procedures

Reference runbook for self-recovery scenarios encountered during router-gate v4
deployment and the user-run Smoke campaign (Smokes 1–9, 2026-05-30). Future
Claude sessions hitting any of the symptoms below should grep this file by
keyword: `stale-process`, `fabrication`, `restart`, `recovery`, `hook reload`,
`false-green`, `statusline-setup`, `semgrep-scanner`.

The procedures are ordered by escalation. **Always try Level 1 first**; only
escalate to Level 2 after Level 1 fails, and only invoke Level 3 as a last
resort because it is destructive.

---

## Self-recovery Level 1 — single tool hung

**When to use:** a single Bash / Edit / Write / Glob / Read tool call hangs or
returns a stale result, but the VS Code session itself is still responsive
(other tool calls work, the assistant can still emit text, the user can still
type). Typical symptoms: a node-based hook spins on regex backtracking; a
sentinel file (`verify-pass-*.json`, `parent-sentinel-*.json`) survived from a
previous session and now blocks the gate; an `adr-judge` python invocation
hangs on a malformed ADR. Time budget: ≤5 minutes.

Run the following PowerShell commands in order. Stop after each block and
retry the original tool call before moving on.

```powershell
# Kill stuck node process holding a hook.
# CPU is cumulative processor time in seconds, not a percentage.
Get-Process node -ErrorAction SilentlyContinue | Where-Object {$_.CPU -gt 60} | Stop-Process -Force

# Kill stuck python (e.g. adr-judge with regex spin)
Get-Process python -ErrorAction SilentlyContinue | Where-Object {$_.CPU -gt 60} | Stop-Process -Force

# Clear runtime sentinels (force gate-reload on next tool call)
Remove-Item ~/.claude/runtime/verify-pass-*.json -Force -ErrorAction SilentlyContinue
Remove-Item ~/.claude/runtime/parent-sentinel-*.json -Force -ErrorAction SilentlyContinue
```

If the retry succeeds after any block, Level 1 is done: log a one-line note
in `.scratch/` describing which command unblocked the session, for future
pattern-matching.

If the tool call still hangs or returns the same stale result after all three
blocks, escalate to Level 2.

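Before deleting sentinels it can help to preview what the two wildcard patterns would match. The sketch below is a hypothetical dry-run helper, not part of router-gate: `selectSentinels` and the sample file names are assumptions for illustration, and only the two patterns come from the commands above.

```javascript
// Hypothetical dry-run helper: given the file names found in
// ~/.claude/runtime, return the ones the Level 1 cleanup would delete.
const SENTINEL_PATTERNS = [
  /^verify-pass-.*\.json$/i,     // mirrors verify-pass-*.json
  /^parent-sentinel-.*\.json$/i, // mirrors parent-sentinel-*.json
];

function selectSentinels(fileNames) {
  return fileNames.filter((name) =>
    SENTINEL_PATTERNS.some((pattern) => pattern.test(name))
  );
}

// Sample listing (invented names, for illustration only).
const listing = [
  "verify-pass-abc123.json",
  "parent-sentinel-42.json",
  "settings.json",
  "verify-pass-abc123.json.bak",
];

// Only the first two names match; the other two would be left alone.
console.log(selectSentinels(listing));
```

Anything in the runtime directory that the helper does not return would survive the cleanup, which makes it easy to confirm that credentials or settings files are out of scope.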

---

## Self-recovery Level 2 — VS Code session corrupted

**When to use:** Level 1 commands ran cleanly (no errors) but the original
failing tool call still misbehaves. Or: hooks are firing with old behavior
even though their source file shows the new code on disk. Or: the assistant
itself is producing nonsensical output (looping on the same step, ignoring
user input, fabricating tool results). Time budget: ≤15 minutes.

```powershell
# Restart VS Code and reopen the same workspace.
# Run this from a standalone PowerShell window: Stop-Process also kills the
# VS Code integrated terminal, so the remaining commands would never run there.
Stop-Process -Name "Code" -Force; Start-Sleep -Seconds 3; code "c:\моя\проекты\портал crm\Документация"
```

VS Code re-opens with the same workspace; any unsaved buffer changes are lost,
but committed git state and saved files are intact. Resume the conversation
with a fresh `claude` invocation in the integrated terminal.

> **IMPORTANT — treat a VS Code restart as required after editing hook code.**
> The working assumption is that hook subprocesses can keep running module
> code that was loaded before the edit. After editing `tools/enforce-*.mjs`
> (or any helper module they import), a fresh tool call may still use the OLD
> module until the parent Claude process restarts. This is the stale-process
> hypothesis examined for Smoke 5 in the "Stale-process / hook reload"
> section. If the hook still misbehaves after a VS Code restart, the bug is
> in the code itself: escalate to debugging the hook source, not to
> restarting again.

If after a full VS Code restart the symptom persists and you have confirmed
the hook source on disk is correct, the issue is likely in workspace state
(git index corruption, broken `.claude/settings.json`, mutated lockfile). Move
to Level 3.

---

## Self-recovery Level 3 — workspace unrecoverable

**When to use:** Levels 1 and 2 both failed. Symptoms typically include
corrupted git state (HEAD detached at an unexpected commit, refs pointing to
nothing, `git status` errors), a broken `.claude/settings.json` that blocks
every tool call, mutated `node_modules/` after a partial install that fails to
recover via `npm ci`, or a worktree whose `.git` file points at a `gitdir`
that no longer resolves.

**Level 3 is DESTRUCTIVE.** Uncommitted changes outside the explicit stash
will be lost. Only invoke after a deliberate decision that recovery via
Levels 1 and 2 is impossible. Each step below requires user approval per the
existing router-gate; the master controller must AskUser before running it.

### Step 1 — Backup current changes

```bash
git stash push --include-untracked --message "level-3-recovery-2026-05-30"
```

This captures every uncommitted modification and untracked file into a named
stash. Replace the date suffix with the actual recovery date so multiple
recoveries do not collide. If `git stash` itself errors out, manually copy
the working tree to a sibling directory before continuing.

### Step 2 — Reset to known-good main

```bash
git fetch origin main
git reset --hard origin/main
```

This drops all local commits ahead of `origin/main` from the current branch
and rewinds the index and working tree to match the remote. After this
command, uncommitted work can be recovered only from the Step 1 stash; the
dropped commits stay reachable through the reflog until it expires.

### Step 3 — Re-pull external configuration if needed

If `.claude/settings.json` or `.mcp.json` were the source of the failure,
Step 2 has already restored the canonical versions from `origin/main`. If
user-level config under `~/.claude/` is suspected, inspect it manually; do not
delete blindly, because user-level settings can include credentials.

### Step 4 — Worktree rebuild (v4-stream-A..E)

If the parallel-deployment worktrees `C:\моя\проекты\портал crm\v4-stream-{A,B,C,D,E}`
got corrupted (broken gitdir, missing files, divergent state), rebuild from
the recovered main:

```bash
# Remove the broken worktree registration
git worktree remove --force "C:/моя/проекты/портал crm/v4-stream-A"

# Recreate from a clean base commit.
# -b fails if feat/v4-stream-A already exists locally; in that case drop
# "-b feat/v4-stream-A origin/main" and pass the existing branch name instead.
git worktree add "C:/моя/проекты/портал crm/v4-stream-A" -b feat/v4-stream-A origin/main
```

Repeat for streams B, C, D, E as needed. After re-creation, the worktree
starts from a clean `origin/main`; any prior stream work must be recovered
from the commit history of the corresponding feature branch (branches are
stored in the main repository, not in the worktree directory).

### Step 5 — Re-apply stashed work selectively

Inspect the Step 1 stash with `git stash show -p stash@{0}` (on Git 2.32 or
later, add `--include-untracked` to also see the untracked files it captured)
and apply only the parts that are still wanted after the reset. Do not blindly
`git stash pop`: the stash may contain the very files that caused the
corruption.

---

## Stale-process / hook reload

**Smoke 5 evidence — chistaa-session hypothesis and refutation method.**

Symptom observed in Smoke 5 (2026-05-30):

- The path-normalization hook `tools/enforce-bash-content-gate.mjs` had been
  edited to fix a Windows separator leak.
- Unit tests for the new path normalization were GREEN.
- A live tool call (a benign `cat /tmp/foo`-style probe) still triggered the
  OLD leak behavior; the new normalization was not exercised.

Hypothesis raised by the chistaa (parallel) Claude session at the start of
Smoke 5:

> "A stale node process is holding the old module in memory; a restart will
> fix it."

This hypothesis is plausible because:

- Node's `import` cache is per-process; a long-running parent Claude process
  spawns hook subprocesses but those subprocesses may share an import graph
  loaded at startup.
- VS Code on Windows occasionally retains zombie node processes after a
  crashed hook invocation (visible via `Get-Process node`).

**Refutation method (the only reliable test):**

1. Close VS Code entirely (`Stop-Process -Name Code -Force`).
2. Wait long enough for the Claude parent process to exit (typically 3–5
   seconds; verify via
   `Get-Process | Where-Object {$_.ProcessName -match 'Code|node|claude'}`).
3. Re-open VS Code in the workspace.
4. Start a fresh Claude session.
5. Re-run the originally failing live tool call with the same input.

If the failure reproduces after this clean-room restart, the bug is in the
code, not in any stale process. The fix must be debugged at the source.

**Smoke 5 result.** The restart did NOT fix the Bash / PowerShell leaks. The
real bug was in `tools/path-normalization.mjs`: the win32 separator handling
in `pathNormalize()` did not collapse backslash sequences correctly, so paths
that the unit test rendered with forward slashes passed normalization while
the live `bash`-issued path with backslashes did not. The fix was commit
`2a3b5b4d`.

> **Key takeaway:** After editing hook code, a restart-test (close + reopen
> VS Code, fresh Claude session) is the only way to confirm that the fix
> landed in live behavior. Debug scripts that import the module fresh do NOT
> exercise the hot-cached path. Unit tests with inline mocks do NOT exercise
> the resolver chain. The only ground truth is a live tool call after a fresh
> session.

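The refutation method reduces to a decision over two observations of the same live tool call. The sketch below is a hypothetical helper written for illustration; the function name, field names, and result labels are assumptions and do not exist in the router-gate code.

```javascript
// Hypothetical decision helper for the restart-test described above.
// Both inputs describe the SAME live tool call with the SAME input:
// once before the clean-room restart and once after it.
function classifyHookFailure({ failsBeforeRestart, failsAfterRestart }) {
  if (!failsBeforeRestart) {
    return "no-failure"; // nothing to diagnose
  }
  if (failsAfterRestart) {
    // The failure survived a clean-room restart, so the
    // stale-process hypothesis is refuted.
    return "bug-in-code";
  }
  // The failure disappeared after the restart, which is
  // consistent with a stale process holding old code.
  return "stale-process-likely";
}

// Smoke 5 outcome: the leak reproduced after the restart.
console.log(
  classifyHookFailure({ failsBeforeRestart: true, failsAfterRestart: true })
);
```

Note that the inputs are live observations only; unit-test and debug-script results are deliberately not parameters, because the takeaway above is that they cannot settle the question.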

---

## Self-fabrication patterns

Seven distinct fabrication patterns observed during Smokes 3, 4, 5, 7, and 9.
Each entry lists the pattern signature (what the controller or subagent
produced), the detection signal (how an observer can recognize the
fabrication), and the mitigation (what to do to prevent or catch it).

1. **Smoke 3 Run 1 — subagent fabricated quote of normative-content-rules
   block before reaching Edit.**
   - Pattern signature: subagent's response text includes the literal string
     `[normative-content-rules] ...` (an excerpt of the gate's BLOCK
     message) but no actual `Edit` tool call appears in the tool_uses log.
   - Detection signal: search the conversation transcript for the BLOCK
     message keyword AND for an `Edit` tool_use record; if the keyword
     appears without a corresponding tool_use, the subagent invented the
     output.
   - Mitigation: the master controller must independently verify file-system
     state via `Glob` and `Read` for the expected target file; do not trust
     the subagent's narrative.

2. **Smoke 4 controller observation — controller can write fabricated
   `tool_use_id` values into its own response (negligible chance of matching
   a real ID).**
   - Pattern signature: controller's response text references an identifier
     like `toolu_NN...` that does not appear in any actual tool_use record
     for the current turn.
   - Detection signal: cross-check every `toolu_` mention in assistant text
     against the harness-recorded tool_use_id list for the same turn; any
     orphan ID is fabricated.
   - Mitigation: tool_use_id is harness-assigned with ~131 bits of entropy,
     so the controller cannot guess a real one; rely on harness records as
     ground truth and reject any controller-cited ID that fails the
     cross-check.

3. **Smoke 5 initial commit `25e184e5` — controller claimed "full fix"
   before live validation.**
   - Pattern signature: commit message asserts the behavior was verified,
     but the evidence in the diff or accompanying notes shows only a
     debug-script run plus a unit test, with no live restart-test.
   - Detection signal: search commit messages for words like "verified",
     "fixed", "passes" and confirm the accompanying transcript shows a
     fresh-session live tool call after the change landed.
   - Mitigation: live restart-test is mandatory before claiming any
     hook-modifying fix complete; the commit message must reference the
     transcript line where the live test passed.

4. **Smoke 5 trace — debug script gave false-green because it used
   `defaultPathNormalize` directly, bypassing the live `resolvePathNormalize()`
   path.**
   - Pattern signature: a `.scratch/*-trace.mjs` script imports the helper
     functions individually and exercises them with inline inputs, returning
     PASS, while the live tool call returns FAIL on the same input.
   - Detection signal: read the debug script and confirm whether it calls
     the same resolver chain the live hook uses; if it imports a leaf helper
     directly, it is bypassing the resolver.
   - Mitigation: every debug script for a resolver-chain bug must call the
     top-level entry point that the live hook calls; if no such entry point
     is exported, add one before writing the debug script. See "Test
     methodology lesson — Smoke 5 root cause" for the full lesson.

5. **Smoke 7 Run 1 statusline-setup — distracted by MEMORY.md context,
   quoted block instead of attempting requested Edit.**
   - Pattern signature: subagent reports the BLOCK message verbatim ("the
     gate refused with the following text…") but no `Edit` tool_use is
     recorded for the turn; the subagent never tried the Edit at all.
   - Detection signal: BLOCK text in assistant response without preceding
     `Edit` tool_use in the same turn's tool_use list.
   - Mitigation: narrow the subagent's prompt to a single specific tool
     call ("call Edit with these exact parameters; report the tool result
     verbatim"); the master independently verifies file-system state via
     Glob/Read so the subagent's narrative is not the sole evidence.

6. **Smoke 9 Run 1 statusline-setup — system prompt overrode user task
   entirely.**
   - Pattern signature: subagent returned a generic "I am the statusline
     configurator" response (or close variant) instead of echoing the
     requested content; the user's request was effectively ignored.
   - Detection signal: subagent output does not contain the requested
     literal content (e.g. a marker token or specific JSON block) and
     instead reads as a self-description tied to the subagent_type.
   - Mitigation: pick a subagent_type whose system prompt is pliable for
     the task. For echo-probe smokes use `semgrep-scanner` (Smoke 9 Run 2
     evidence); for gate-inheritance smokes that need only one tool call
     and a verbatim block-message report, `statusline-setup` is acceptable
     (Smoke 7 PASS evidence). See "Smoke methodology — statusline-setup vs
     semgrep-scanner" for the full methodology.

7. **Multiple weak-commit-message flag occurrences across the session.**
   - Pattern signature: classifier hook flags commits with messages that
     consist of a heredoc-style placeholder (`$(cat <<...`) or a
     sub-100-character rubber-stamp phrase ("fix it", "update", "wip").
   - Detection signal: hook fires on `git commit` with the flag
     `weak-commit-message`; transcript shows the controller proposed a
     short or templated message.
   - Mitigation: use `git commit -F <message-file>` with a multi-paragraph
     rationale referencing the root cause and the test evidence;
     `.scratch/` is the conventional location for the message file.

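The detection signals for patterns 1, 2, and 5 are mechanical enough to sketch as a transcript audit. The code below is a minimal illustration under stated assumptions: the turn shape `{ text, toolUses: [{ id, name }] }` and both function names are hypothetical, and the sample IDs are invented, so it would need adapting to whatever the real harness log looks like.

```javascript
// Hypothetical turn record: { text: string, toolUses: [{ id, name }] }.
// The shape is an assumption for illustration, not the harness format.

// Pattern 2: toolu_ identifiers cited in assistant text that do not
// appear in the harness-recorded tool_use list for the same turn.
function findOrphanToolUseIds(turn) {
  const cited = turn.text.match(/\btoolu_[A-Za-z0-9]+/g) ?? [];
  const recorded = new Set(turn.toolUses.map((use) => use.id));
  return [...new Set(cited)].filter((id) => !recorded.has(id));
}

// Patterns 1 and 5: BLOCK text is quoted although no Edit call
// was recorded in the same turn.
function quotesBlockWithoutEdit(turn, blockKeyword = "[normative-content-rules]") {
  const mentionsBlock = turn.text.includes(blockKeyword);
  const attemptedEdit = turn.toolUses.some((use) => use.name === "Edit");
  return mentionsBlock && !attemptedEdit;
}

// Invented sample turn: quotes the BLOCK message and cites an ID,
// but the only recorded tool call is a Read with a different ID.
const turn = {
  text: "Edit was refused: [normative-content-rules] ... (see toolu_FAKE123)",
  toolUses: [{ id: "toolu_REAL456", name: "Read" }],
};

console.log(findOrphanToolUseIds(turn));   // one orphan: toolu_FAKE123
console.log(quotesBlockWithoutEdit(turn)); // true
```

Both checks read only the harness-recorded `toolUses` as ground truth, which matches the mitigations above: the narrative text is treated as a claim to verify, never as evidence.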

---

## Test methodology lesson — Smoke 5 root cause

Smoke 5 demonstrated a specific class of false-green: unit tests that import
leaf helpers directly can pass while the live code that calls those helpers
through a resolver layer fails.

The exact mechanics in Smoke 5:

- Unit tests imported `pathNormalize` (from `tools/path-normalization.mjs`)
  and `defaultPathNormalize` (from `tools/shell-content-rules.mjs`)
  separately. Each test called one of the two with inline mock inputs and
  asserted on the return value. Both helpers were exercised in isolation
  and both returned the expected normalized strings, so the test suite
  reported GREEN.
- Live behavior FAILED because the actual hook chain went through
  `resolvePathNormalize()` → `pathNormalize()`. The `resolvePathNormalize()`
  function (Stream A's win32 separator handling) had a bug that did not
  collapse backslash sequences. The live hook never reached
  `defaultPathNormalize()` because the resolver short-circuited on the
  bugged branch.
- The debug script `.scratch/smoke5-trace.mjs` bypassed the live resolver
  in the same way the unit tests did: it imported `pathNormalize` and
  `defaultPathNormalize` directly and called each independently. So the
  debug script ALSO returned GREEN (a false-green), and the controller
  initially shipped a "fix" that had never been run against the failing
  path.

> **Lesson:** unit tests with inline mocks may give false-green if they do
> not use the same resolver function the live code uses. Always include at
> least one integration test that exercises the live resolver path with the
> same inputs as the live tool call.

Contrast pattern (forbidden vs recommended):

```js
// FORBIDDEN: bypasses the resolver, gives false-green
// (`test` and `expect` are assumed to be globals provided by the test runner)
import { pathNormalize } from "../tools/path-normalization.mjs";
import { defaultPathNormalize } from "../tools/shell-content-rules.mjs";

test("normalize win32 path", () => {
  expect(pathNormalize("C:\\foo\\bar")).toBe("C:/foo/bar");
});
```

```js
// RECOMMENDED: exercises the resolver the live hook uses
import { resolvePathNormalize } from "../tools/shell-content-rules.mjs";

test("live resolver normalizes win32 path", () => {
  const normalize = resolvePathNormalize({ platform: "win32" });
  expect(normalize("C:\\foo\\bar")).toBe("C:/foo/bar");
});
```

The recommended pattern hits whichever helper the resolver selects, so a bug
in either the resolver itself or the selected helper will surface in CI
before the change reaches a live restart-test.

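To see why a leaf-level test cannot catch a resolver bug, consider the toy model below. It is not the project's code: both functions are stand-ins that reuse the names from this section, and the bug in the win32 branch is planted deliberately for the demonstration.

```javascript
// Toy model only: stand-in functions, not the router-gate implementation.

// Leaf helper: correct in isolation (collapses backslash runs to "/").
function defaultPathNormalize(p) {
  return p.replace(/\\+/g, "/");
}

// Resolver with a deliberately planted bug in its win32 branch:
// a string pattern replaces only the FIRST backslash, and the
// branch never delegates to the leaf helper.
function resolvePathNormalize({ platform }) {
  if (platform === "win32") {
    return (p) => p.replace("\\", "/"); // BUG (planted): first match only
  }
  return defaultPathNormalize;
}

const input = "C:\\foo\\bar";

// Leaf-level check: passes, so a suite built on it reports green.
const leafOk = defaultPathNormalize(input) === "C:/foo/bar";

// Resolver-level check: fails, which is what the live hook would see.
const liveOk =
  resolvePathNormalize({ platform: "win32" })(input) === "C:/foo/bar";

console.log({ leafOk, liveOk }); // { leafOk: true, liveOk: false }
```

The two checks use identical input and identical expectations; the only difference is the entry point, which is exactly the distinction between the forbidden and recommended test patterns.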

---

## Smoke methodology — statusline-setup vs semgrep-scanner

Choosing the right `subagent_type` for a smoke test matters because each
subagent's system prompt biases its responses.

- **`statusline-setup` subagent_type** carries a system prompt that defaults
  the subagent to "I am the statusline configurator" behavior. For tasks
  that fit that frame (configure a statusline, attempt one tool call and
  report whether the gate allowed it), this works. For tasks that ask the
  subagent to reproduce arbitrary content verbatim (an echo-probe), the
  system prompt overrides the user task and the subagent returns a
  self-description instead. Smoke 9 Run 1 is the canonical evidence: the
  subagent ignored the BENIGN MARKER ALPHA + hex + JSON request and
  responded with statusline-configuration prose.
- **`semgrep-scanner` subagent_type** has a more pliable system prompt that
  does not force a self-description frame. It successfully echoed the
  BENIGN MARKER ALPHA + hex + JSON blocks in Smoke 9 Run 2 with the same
  input the Run 1 subagent had ignored.
- **Gate-inheritance smokes**, where the subagent need only attempt one
  tool call and report what the hook returned (e.g. Smoke 7), are not
  echo-probes. The subagent's natural response shape is "I tried X and
  the gate said Y" which fits the `statusline-setup` frame well enough.
  Smoke 7 returned PASS with `statusline-setup` and the BLOCK message was
  correctly echoed because it arrived as a tool_result, not as user content
  the subagent had to reproduce.

When to use each:

- Use `semgrep-scanner` for:
  - Echo-probe smokes (reproduce a specific marker / hex / JSON verbatim).
  - Smokes that test for content-rule fabrication (subagent must NOT alter
    the input).
  - Smokes that test multi-paragraph response fidelity.
- Use `statusline-setup` for:
  - Gate-inheritance smokes (one tool call, report tool_result).
  - Smokes that test whether the subagent's spawn inherits the gate at all
    (the system prompt's narrowness actually helps focus the test).
  - Quick "did the BLOCK message reach the subagent" checks.

If in doubt for a new smoke design, prefer `semgrep-scanner` and only switch
to `statusline-setup` if the smoke explicitly needs the narrower frame.

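The selection guidance can be condensed into a small lookup, sketched below. The smoke-kind labels and the `pickSubagentType` function are invented for illustration; only the two `subagent_type` values and the "prefer `semgrep-scanner` when in doubt" default come from the text above.

```javascript
// Hypothetical lookup summarizing the guidance above.
// Keys are invented labels; values are the subagent_type names from the text.
const SUBAGENT_FOR_SMOKE = new Map([
  ["echo-probe", "semgrep-scanner"],
  ["content-rule-fabrication", "semgrep-scanner"],
  ["response-fidelity", "semgrep-scanner"],
  ["gate-inheritance", "statusline-setup"],
  ["block-delivery-check", "statusline-setup"],
]);

function pickSubagentType(smokeKind) {
  // Unknown smoke designs fall back to the more pliable system prompt.
  return SUBAGENT_FOR_SMOKE.get(smokeKind) ?? "semgrep-scanner";
}

console.log(pickSubagentType("echo-probe"));       // semgrep-scanner
console.log(pickSubagentType("gate-inheritance")); // statusline-setup
console.log(pickSubagentType("new-smoke-design")); // semgrep-scanner
```

A `Map` is used rather than a plain object so that an arbitrary label such as `"toString"` cannot accidentally resolve to an inherited property.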