From 9f84d9ef09ebe7fe742c5ccd8e8b447cd0bfdd81 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=D0=94=D0=BC=D0=B8=D1=82=D1=80=D0=B8=D0=B9?=
 <noreply@anthropic.com>
Date: Sat, 30 May 2026 20:31:23 +0300
Subject: [PATCH] =?UTF-8?q?docs(router-gate-v4):=20safe-baseline=20spec=20?=
 =?UTF-8?q?v2=20=E2=80=94=20close=20C1/C2/C3/H1=20from=20adversarial=20rev?=
 =?UTF-8?q?iew=20(item=201b)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...-05-30-safe-baseline-live-wiring-design.md | 87 +++++++++++++------
 1 file changed, 59 insertions(+), 28 deletions(-)

diff --git a/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md b/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md
index 0cc619d7..e99aa3e1 100644
--- a/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md
+++ b/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md
@@ -1,20 +1,31 @@
 # Safe-baseline metering — live wiring (router-gate v4 §3.1.2, item 1b)
 
 **Date:** 2026-05-30
-**Status:** Design approved (brainstorming, 4 sections) — ready for writing-plans.
-**Scope:** Turn the deferred no-op `main()` of `tools/enforce-safe-baseline-metering.mjs` into a live PreToolUse hook that actually meters safe-baseline tool usage per task and hard-blocks a mutating tool once a counter passes its hard threshold without a skill match. Plus the false-block escape (user approval via AskUserQuestion side-channel).
+**Status:** Design v2 — revised after an adversarial self-review (3 CRITICAL + 1 HIGH closed). Ready for writing-plans.
+**Scope:** Turn the deferred no-op `main()` of `tools/enforce-safe-baseline-metering.mjs` into a live PreToolUse hook that meters safe-baseline tool usage per task and hard-blocks a mutating tool once a counter passes its hard threshold without a real skill match. Also in scope: the false-block escape (user approval via a *write-protected* side-channel) and the write-protection hook the escape depends on.
 
-This is router-gate v4 Direction 1 ("skill-substitution laundering": много Read/Grep/Glob/LS/TodoWrite/AskUserQuestion как канал анализа ВМЕСТО invoke recommended Skill, затем mutating-инструмент без единого skill-match). The pure metering + threshold logic already exists in `tools/safe-baseline-metering.mjs` (thresholds, counters, `processEvent`, `decide`); only the live I/O composition and the escape mechanism are missing.
+This is router-gate v4 Direction 1 ("skill-substitution laundering": heavy use of Read/Grep/Glob/LS/TodoWrite/AskUserQuestion as an analysis channel INSTEAD of invoking the recommended Skill, followed by a mutating tool without a single skill-match). The pure metering + threshold logic already exists in `tools/safe-baseline-metering.mjs` (thresholds, counters, `processEvent`, `decide`); the live I/O composition, a corrected escape, and the side-channel write-protection are missing.
 
 ---
 
-## Owner decisions (brainstorming)
+## Owner decisions (brainstorming, revised)
 
 1. **Enforcement posture:** hard-block from day 1 (not observe-only).
-2. **False-block escape:** invoke the recommended skill (primary, the discipline goal) **plus** a one-shot user-approval override via AskUserQuestion side-channel (like git approval) for genuine false-positives where no skill applies. **Not** a self-writable controller phrase.
-3. **`skillMatched` definition:** any `Skill` tool_use **OR** `EnterPlanMode` **OR** a `docs/superpowers/plans/` reference within the task → matched. Sticky within the task.
+2. **False-block escape:** invoke the recommended skill (primary) **plus** a one-shot user-approval override via a write-protected AskUserQuestion side-channel for genuine false-positives where no skill applies. **Not** a self-writable controller signal.
+3. **`skillMatched` definition (revised — strict, discipline-first):** matched only by a **real `Skill` OR `EnterPlanMode` tool_use** in the task (an actual tool invocation visible in the transcript). A **text mention** of a plan path does NOT count (it was a self-writable hole — removed). Sticky within the task. Optional future strengthening — match the *recommended* node specifically — is deferred because it needs a node→skill registry mapping (see "Deferred strengthening").
+4. **Priority:** discipline protection and correctness are paramount; cost and speed are secondary.
 
-These three are fixed; everything below is engineering detail consistent with them.
+---
+
+## Honest discipline-protection assessment
+
+Read this before trusting the layer. Even fully fixed, safe-baseline is a **cheap deterministic tripwire**, not a strong scope-discipline guarantee:
+
+- It fires only when a single task accumulates a high count of safe-baseline tools (Read hard = 60, Grep = 30, …) **without any real skill/plan invocation**, then reaches for a mutating tool. Realistically counters accumulate mostly *within one assistant turn* (terse user confirmations reset task boundaries), and 60 reads in one turn is uncommon — so the trigger surface is genuinely small.
+- After the fixes it has **no self-bypass** (skill-match needs a real tool_use; the override is write-protected) and a **working escape** (skill/plan invocations are never blocked). That makes it *sound* — it does what it claims without a trivial dodge.
+- The **strong** scope-consistency check (is THIS tool call consistent with the declared task and recommended skill?) is **Layer 4** (`enforce-llm-judge-per-tool`), which is OFF until owner activation (item 2b). Safe-baseline is the cheap pre-filter beneath it.
+
+Verdict: as a hard guarantee — **LOW–MODERATE**; as an honest, non-bypassable tripwire for blatant laundering — **sound**. The discipline lever that matters most is Layer 4.
 
 ---
 
@@ -26,16 +37,20 @@ These three are fixed; everything below is engineering detail consistent with th
 2. Load the per-session ledger `~/.claude/runtime/safe-baseline-ledger-<sess>.json` = `{ state, lastKeywords }` (absent on first event → `null`).
 3. From the transcript extract:
    - `promptText` — the last user prompt (`lastUserPromptText`).
-   - `currentKeywords` — `extractKeywords(promptText, routerState)` (router-state classifier keywords if present, else deterministic tokenization).
-   - `skillMatchedThisTurn` — `detectSkillMatch(lastTurnEntries(transcript))`.
+   - `currentKeywords` — `extractKeywords(promptText)` (deterministic tokenization — see below; no classifier dependency).
+   - `skillMatchedThisTurn` — `detectSkillMatch(lastTurnEntries(transcript))` **OR** `event.tool_name ∈ {Skill, EnterPlanMode}` (the in-flight escape call counts — see C1 fix).
 4. Call the existing pure `processEvent({ event, priorLedger, currentKeywords, promptText, skillMatched, thresholds })` — task-boundary inference (`shouldInheritTaskId`: reset-marker / keyword-overlap ≥ 2 → continuation; else fresh task, counters from zero) then metering.
 5. Sticky skill-match: `skillMatched = (priorLedger?.state.skill_match_within_task) || skillMatchedThisTurn`. Once true within a task it stays true; persist it in the ledger state.
-6. Override check: before honoring a `hard_block`, look for a fresh unused `approve_safe_baseline_override` record for the current `task_id` (see Override section); if present → allow + consume it (one-shot).
+6. Override check: before honoring a `hard_block`, look for a fresh unused `approve_safe_baseline_override` record for the current `task_id`; if present → allow + consume it (one-shot).
 7. Persist the new ledger.
 8. `hard_block` (and no override) → `exitDecision({ block: true, message })`; `soft_flag` → append to the flags log and exit 0; `allow` → exit 0.
 
 `soft_flag` never blocks (observability only). Only a mutating tool past a hard threshold without skill-match (and without override) blocks.
 
+### C1 fix — the escape must never be blocked
+
+`Skill` and `Task` are in the pure module's MUTATING set (`safe-baseline-metering.mjs:31`), and `evaluateThresholds` hard-blocks any mutating tool past a hard threshold when `skillMatched` is false (`safe-baseline-metering.mjs:92-102`). Naively this blocks the very `Skill` call meant to escape (catch-22). The live head closes this by counting the **current event** in `skillMatchedThisTurn` when `event.tool_name ∈ {Skill, EnterPlanMode}` (step 3). Because `skillMatched` short-circuits `evaluateThresholds` to `allow` (`safe-baseline-metering.mjs:89`), a skill/plan invocation always passes — and then sets the sticky exemption for subsequent Edit/Write/Bash/Task. `Task` is intentionally NOT treated as an escape tool (subagent spawn can itself be a laundering channel) and remains blockable.
+
 ### Safety property of the boundary heuristic
 
 The dangerous direction is *wrongly inheriting* counters across two genuinely different tasks (carrying 60 reads into an unrelated task → false block); this needs keyword-overlap ≥ 2 AND no reset marker, which is uncommon. The opposite error — treating a continuation as a fresh task — *resets* counters to zero, which only *reduces* blocking (safe direction). So the heuristic errs toward fewer false blocks.
@@ -44,28 +59,42 @@ The dangerous direction is *wrongly inheriting* counters across two genuinely di
 
 ## Task-boundary & skill-match detection
 
-### `extractKeywords(promptText, routerState)` (pure)
+### `extractKeywords(promptText)` (pure) — H1 fix
 
-- Prefer `routerState.classification.keywords` (or equivalent) if the classifier writes them — keeps the boundary consistent with how the classifier itself sees the task.
-- Fallback: deterministic tokenization of `promptText` — lowercase, strip RU/EN stopwords, keep tokens length ≥ 4, unique. Deterministic so `lastKeywords` (stored from the prior event) and `currentKeywords` (now) are comparable.
+Deterministic tokenization only: lowercase, strip RU/EN stopwords, keep tokens of length ≥ 4, unique, sorted. **No classifier-keyword branch**: the classifier output schema (`router-classifier.mjs:420`) has no `keywords` field, and mixing extraction sources between events would break the overlap comparison. A single deterministic source guarantees `lastKeywords` (stored at event N) and `currentKeywords` (event N+1) are always comparable.
 
-Within one assistant turn the user prompt is constant → `currentKeywords == lastKeywords` → overlap high → counters accumulate across the turn's tool calls. A new user prompt with low overlap (< 2) and no reset marker → fresh task.
+### `detectSkillMatch(turnEntries)` (pure) — C2 fix
 
-### `detectSkillMatch(turnEntries)` (pure)
+Returns `true` only if the turn contains a real **tool_use** of `Skill` OR `EnterPlanMode`. The previous "text reference to `docs/superpowers/plans/`" path is **removed**: it was controller-authored text and therefore a self-writable bypass of the entire hard-block. A `Read` of a plan file no longer counts either (Read is itself a metered safe-baseline tool; counting it would let pure reading self-exempt). The result is sticky via ledger `state.skill_match_within_task` (OR-accumulated across events in the task) and is combined with the in-flight escape rule in step 3 / C1.
 
-Returns `true` if the current turn (`lastTurnEntries`) contains a tool_use of `Skill` OR `EnterPlanMode`, OR a `Read`/text reference to `docs/superpowers/plans/`. Sticky via ledger `state.skill_match_within_task` (OR-accumulated across events in the task).
+### Deferred strengthening — recommended-node match
+
+The strict ideal is "skill-match requires the *recommended* node, not just any skill," so the controller can't dodge by invoking an unrelated skill. This needs a node-ID (`#19`) → skill-name (`superpowers:writing-plans`) registry mapping (router-state stores `recommendedNode` per `router-classifier.mjs:163`, but not the skill name). Deferred to avoid a brittle mapping causing lockouts; the user-approval override and Layer 4 cover the gap meanwhile. Documented here so it is not silently dropped.
 
 ---
 
-## Override (false-block escape) — user approval, side-channel
+## Override (false-block escape) — user approval, write-protected side-channel
 
-Mirrors the proven git-approval side-channel, scoped to safe-baseline:
+Mirrors the git-approval side-channel, scoped to safe-baseline:
 
 1. On hard-block, the controller calls `AskUserQuestion` whose option = an exact approval phrase, e.g. `approve safe-baseline override: <short reason>`.
 2. A PostToolUse(AskUserQuestion) consumer (extend `enforce-askuser-answer-parser.mjs` or a paired hook) recognizes that pattern and appends a one-shot record `{ type: 'approve_safe_baseline_override', task_id, ts }` to `~/.claude/runtime/safe-baseline-overrides-<sess>.jsonl`. TTL 5 min, bound to the current `task_id`.
-3. The live `main()`, before blocking, checks for a fresh unused record matching the current `task_id`; if found → allow and mark it consumed (one-shot).
+   - **task_id source:** the parser reads the safe-baseline ledger `~/.claude/runtime/safe-baseline-ledger-<sess>.json` for `state.task_id` (the task the override applies to). If the ledger is absent (no prior metered event) there is nothing to override → the parser writes no record.
+3. The live `main()`, before blocking, checks for a fresh unused record matching the current `task_id`; if found → allow and mark it consumed (**one-shot**). Consume mechanism: the live head records consumed override `ts` values in the ledger (`state.consumed_overrides: number[]`); a record whose `ts` is already listed is skipped. This avoids rewriting the append-only jsonl.
 
-Properties: **not self-writable** (written by the parser from the *user's* AskUserQuestion selection, not from the controller's response text); **bound to `task_id`** (an approval does not leak into another task); **one-shot + TTL** (no lingering hole). The primary escape remains invoking a skill; the override is only for genuine false-positives with no applicable skill.
+**One-shot rationale (M1):** every mutating op past the threshold is re-checked, so each one needs its own approval. Genuine false-positives (heavy legit reading, no applicable skill, 60+ reads in one task) are rare (see assessment), so the per-op re-approval friction rarely materializes, and one-shot maximizes control: a socially-engineered single approval unlocks exactly one op, not an unlimited mutation spree.
+
+### C3 — the side-channel must be write-protected (new dependency)
+
+The "not self-writable" guarantee only holds if the controller cannot write `~/.claude/runtime/safe-baseline-overrides-<sess>.jsonl` directly with the Write/Edit **tool**. Today **no hook gates Write/Edit to `~/.claude/runtime`**: `enforce-read-path-deny.mjs:22` gates only the Read tool; `enforce-normative-content-rules.mjs:11-18` covers only normative docs; the `.claude/runtime` protection in `shell-content-rules.mjs:24` applies only to Bash/PowerShell, not the Write tool. So both the new override file **and the existing git-approval anchor** (`askuser-decisions-<sess>.jsonl`) are currently self-writable via the Write tool.
+
+**Fix (in scope for 1b):** a new fail-closed PreToolUse hook `enforce-runtime-write-deny.mjs` (matcher `Edit|Write|MultiEdit|NotebookEdit`) that blocks those tools from writing any path under `~/.claude/runtime/**`, reusing `isProtectedPath` / `DEFAULT_PROTECTED_PATTERNS`. Legitimate hooks write there via Node `fs` (not the Claude Write tool), so they are unaffected. This closes the override hole and retroactively hardens git-approval.
+
+**Owner verification:** before relying on this, the owner should check `.claude/settings.json` for any `permissions.deny` already covering Write to `~/.claude/**` (Claude cannot read settings.json — gate-blocked). The new hook is additive defense-in-depth regardless.
+
+### Social-engineering residual (M3, named)
+
+The controller authors the AskUserQuestion option text and reason; the user approves controller-framed text — the same residual as git approval. The override's integrity rests on the user actually scrutinizing the question, exactly as with git. The v4 Stop-hook `llm-judge-response-scan` flags this class of approval-social-engineering when Layer 4 is active. Accepted residual; named so it is not forgotten.
 
 ---
 
@@ -73,34 +102,36 @@ Properties: **not self-writable** (written by the parser from the *user's* AskUs
 
 ### Persistence
 
-- Ledger: `~/.claude/runtime/safe-baseline-ledger-<sess>.json` = `{ state, lastKeywords }`.
+- Ledger: `~/.claude/runtime/safe-baseline-ledger-<sess>.json` = `{ state, lastKeywords }`; `state` also carries `task_id`, `skill_match_within_task`, `consumed_overrides`.
 - Flags log: `~/.claude/runtime/safe-baseline-flags-<sess>.jsonl` (soft_flag observability).
 - Overrides: `~/.claude/runtime/safe-baseline-overrides-<sess>.jsonl`.
 - All file I/O is fail-quiet: any read/write error → treat as no-ledger / no-override and exit 0. The hook never crashes the session.
 
 ### Purity / testability
 
-All logic lives in pure functions (`extractKeywords`, `detectSkillMatch`, `loadOverride`/`consumeOverride`, plus the existing `processEvent`/`decide`). `main()` is only I/O composition. TDD: each new pure function RED→GREEN; an integration test drives `main()` via injected `runtimeDir` + a transcript fixture.
+All logic lives in pure functions (`extractKeywords`, `detectSkillMatch`, `loadOverride`/`consumeOverride`, plus the existing `processEvent`/`decide`). `main()` is only I/O composition. The new `enforce-runtime-write-deny.mjs` has a pure `decide({toolName, filePath})`. TDD: each new pure function RED→GREEN; an integration test drives `main()` via injected `runtimeDir` + a transcript fixture.
 
 ### Registration (owner-applied)
 
-- PreToolUse hook in `.claude/settings.json` (matcher covering all tools — counters grow on Read/Grep/… and block only on mutating tools).
-- PostToolUse(AskUserQuestion) for the override parser (if a separate paired hook).
-- **Claude does not edit `settings.json`** (gate-blocked). The plan produces an exact JSON block for the owner to paste manually. Until registered, the hook is inert (no behavior change).
+- `enforce-safe-baseline-metering` — PreToolUse, matcher scoped to the metered + mutating + escape tools (`Read|Grep|Glob|LS|TodoWrite|AskUserQuestion|Edit|Write|MultiEdit|NotebookEdit|Bash|Skill|Task|EnterPlanMode`), block mode.
+- `enforce-runtime-write-deny` — PreToolUse `Edit|Write|MultiEdit|NotebookEdit`, block mode.
+- override parser — PostToolUse(AskUserQuestion).
+- **Claude does not edit `settings.json`** (gate-blocked). The plan produces an exact JSON block for the owner to paste manually. Until registered, the hooks are inert (no behavior change).
 
 ### Rollout safety
 
-Despite "hard-block from day 1", the plan includes a **mandatory smoke test before live registration**: run the live `main()` against 2-3 real transcript fixtures (single task / task switch / skill-invocation) and confirm boundary + skillMatched fire correctly. This does not change the posture; it just catches gross detection bugs before the hook starts blocking.
+Despite "hard-block from day 1", the plan includes a **mandatory smoke test before live registration**: run the live `main()` against 3-4 real transcript fixtures (single task / task switch / skill-invocation escape / override-consume) and confirm boundary, skillMatched, escape, and override all fire correctly. A second smoke test covers `enforce-runtime-write-deny`: a Write to `~/.claude/runtime/x.jsonl` is blocked, and a Write to a normal project path passes. This does not change the posture; it catches gross detection bugs before the hooks start blocking.
 
 ### Scope
 
-~6-8 TDD tasks, estimate 4-6 h.
+~9-12 TDD tasks (added: runtime-write-deny hook, override parser + consume, stricter detection, escape fix), estimate 6-9 h. Cost/speed are secondary per owner priority.
 
 ---
 
 ## Out of scope
 
-- Layer 4 LLM-judge activation (separate owner step, item 2b).
+- Layer 4 LLM-judge activation (separate owner step, item 2b) — the strong scope-discipline lever.
+- Recommended-node skill matching (deferred strengthening — needs node→skill registry).
 - CLAUDE.md / Pravila / PSR / Tooling normative sync (blocked by a parallel session, item 4).
 - Layer 5 VM / biometric / YubiKey (item 6).
 - Any weakening of the router-gate whitelist.