From 9f84d9ef09ebe7fe742c5ccd8e8b447cd0bfdd81 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=D0=94=D0=BC=D0=B8=D1=82=D1=80=D0=B8=D0=B9?=
Date: Sat, 30 May 2026 20:31:23 +0300
Subject: [PATCH] =?UTF-8?q?docs(router-gate-v4):=20safe-baseline=20spec=20?= =?UTF-8?q?v2=20=E2=80=94=20close=20C1/C2/C3/H1=20from=20adversarial=20rev?= =?UTF-8?q?iew=20(item=201b)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...-05-30-safe-baseline-live-wiring-design.md | 87 +++++++++++++------
 1 file changed, 59 insertions(+), 28 deletions(-)

diff --git a/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md b/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md
index 0cc619d7..e99aa3e1 100644
--- a/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md
+++ b/docs/superpowers/specs/2026-05-30-safe-baseline-live-wiring-design.md
@@ -1,20 +1,31 @@
 # Safe-baseline metering — live wiring (router-gate v4 §3.1.2, item 1b)

 **Date:** 2026-05-30
-**Status:** Design approved (brainstorming, 4 sections) — ready for writing-plans.
-**Scope:** Turn the deferred no-op `main()` of `tools/enforce-safe-baseline-metering.mjs` into a live PreToolUse hook that actually meters safe-baseline tool usage per task and hard-blocks a mutating tool once a counter passes its hard threshold without a skill match. Plus the false-block escape (user approval via AskUserQuestion side-channel).
+**Status:** Design v2 — revised after an adversarial self-review (3 CRITICAL + 1 HIGH closed). Ready for writing-plans.
+**Scope:** Turn the deferred no-op `main()` of `tools/enforce-safe-baseline-metering.mjs` into a live PreToolUse hook that meters safe-baseline tool usage per task and hard-blocks a mutating tool once a counter passes its hard threshold without a real skill match. Plus the false-block escape (user approval via a *write-protected* side-channel) and the write-protection hook the escape depends on.

-This is router-gate v4 Direction 1 ("skill-substitution laundering": many Read/Grep/Glob/LS/TodoWrite/AskUserQuestion calls used as an analysis channel INSTEAD of invoking the recommended Skill, followed by a mutating tool without a single skill-match). The pure metering + threshold logic already exists in `tools/safe-baseline-metering.mjs` (thresholds, counters, `processEvent`, `decide`); only the live I/O composition and the escape mechanism are missing.
+This is router-gate v4 Direction 1 ("skill-substitution laundering": many Read/Grep/Glob/LS/TodoWrite/AskUserQuestion calls used as an analysis channel INSTEAD of invoking the recommended Skill, followed by a mutating tool without a single skill-match). The pure metering + threshold logic already exists in `tools/safe-baseline-metering.mjs` (thresholds, counters, `processEvent`, `decide`); the live I/O composition, a corrected escape, and the side-channel write-protection are missing.

 ---

-## Owner decisions (brainstorming)
+## Owner decisions (brainstorming, revised)

 1. **Enforcement posture:** hard-block from day 1 (not observe-only).
-2. **False-block escape:** invoke the recommended skill (primary, the discipline goal) **plus** a one-shot user-approval override via AskUserQuestion side-channel (like git approval) for genuine false-positives where no skill applies. **Not** a self-writable controller phrase.
-3. **`skillMatched` definition:** any `Skill` tool_use **OR** `EnterPlanMode` **OR** a `docs/superpowers/plans/` reference within the task → matched. Sticky within the task.
+2. **False-block escape:** invoke the recommended skill (primary) **plus** a one-shot user-approval override via a write-protected AskUserQuestion side-channel for genuine false-positives where no skill applies. **Not** a self-writable controller signal.
+3. **`skillMatched` definition (revised — strict, discipline-first):** matched only by a **real `Skill` OR `EnterPlanMode` tool_use** in the task (an actual tool invocation visible in the transcript). A **text mention** of a plan path does NOT count (it was a self-writable hole — removed). Sticky within the task. Optional future strengthening — match the *recommended* node specifically — is deferred because it needs a node→skill registry mapping (see "Deferred strengthening").
+4. **Priority:** discipline protection and correctness are paramount; cost and speed are secondary.

-These three are fixed; everything below is engineering detail consistent with them.
+---
+
+## Honest discipline-protection assessment
+
+Read this before trusting the layer. Even fully fixed, safe-baseline is a **cheap deterministic tripwire**, not a strong scope-discipline guarantee:
+
+- It fires only when a single task accumulates a high count of safe-baseline tools (Read hard = 60, Grep = 30, …) **without any real skill/plan invocation**, then reaches for a mutating tool. Realistically counters accumulate mostly *within one assistant turn* (terse user confirmations reset task boundaries), and 60 reads in one turn is uncommon — so the trigger surface is genuinely small.
+- After the fixes it has **no self-bypass** (skill-match needs a real tool_use; the override is write-protected) and a **working escape** (skill/plan invocations are never blocked). That makes it *sound* — it does what it claims without a trivial dodge.
+- The **strong** scope-consistency check (is THIS tool call consistent with the declared task and recommended skill?) is **Layer 4** (`enforce-llm-judge-per-tool`), which is OFF until owner activation (item 2b). Safe-baseline is the cheap pre-filter beneath it.
+
+Verdict: as a hard guarantee — **LOW–MODERATE**; as an honest, non-bypassable tripwire for blatant laundering — **sound**. The discipline lever that matters most is Layer 4.

 ---

@@ -26,16 +37,20 @@ These three are fixed; everything below is engineering detail consistent with th
 2. Load the per-session ledger `~/.claude/runtime/safe-baseline-ledger-.json` = `{ state, lastKeywords }` (absent on first event → `null`).
 3. From the transcript extract:
    - `promptText` — the last user prompt (`lastUserPromptText`).
-   - `currentKeywords` — `extractKeywords(promptText, routerState)` (router-state classifier keywords if present, else deterministic tokenization).
-   - `skillMatchedThisTurn` — `detectSkillMatch(lastTurnEntries(transcript))`.
+   - `currentKeywords` — `extractKeywords(promptText)` (deterministic tokenization — see below; no classifier dependency).
+   - `skillMatchedThisTurn` — `detectSkillMatch(lastTurnEntries(transcript))` **OR** `event.tool_name ∈ {Skill, EnterPlanMode}` (the in-flight escape call counts — see C1 fix).
 4. Call the existing pure `processEvent({ event, priorLedger, currentKeywords, promptText, skillMatched, thresholds })` — task-boundary inference (`shouldInheritTaskId`: reset-marker / keyword-overlap ≥ 2 → continuation; else fresh task, counters from zero) then metering.
 5. Sticky skill-match: `skillMatched = (priorLedger?.state.skill_match_within_task) || skillMatchedThisTurn`. Once true within a task it stays true; persist it in the ledger state.
-6. Override check: before honoring a `hard_block`, look for a fresh unused `approve_safe_baseline_override` record for the current `task_id` (see Override section); if present → allow + consume it (one-shot).
+6. Override check: before honoring a `hard_block`, look for a fresh unused `approve_safe_baseline_override` record for the current `task_id`; if present → allow + consume it (one-shot).
 7. Persist the new ledger.
 8. `hard_block` (and no override) → `exitDecision({ block: true, message })`; `soft_flag` → append to the flags log and exit 0; `allow` → exit 0.

 `soft_flag` never blocks (observability only). Only a mutating tool past a hard threshold without skill-match (and without override) blocks.
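Steps 3 and 5 of the data flow can be sketched in a few pure functions. This is a minimal sketch under stated assumptions, not the actual implementation: the stopword list is a tiny illustrative sample, the transcript entry shape `{ type, name }` is assumed, and `keywordOverlap` and `resolveSkillMatched` are hypothetical helper names (the real boundary decision lives in the existing `shouldInheritTaskId`).

```javascript
// Illustrative stopwords only; the real RU/EN lists are not given in the spec.
const STOPWORDS = new Set([
  'this', 'that', 'with', 'from', 'have', 'please',
  'если', 'чтобы', 'тоже', 'этот',
]);
const ESCAPE_TOOLS = new Set(['Skill', 'EnterPlanMode']);

// Lowercase, drop stopwords, keep tokens of length >= 4, unique, sorted.
function extractKeywords(promptText) {
  const tokens = String(promptText ?? '').toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
  const kept = tokens.filter((t) => t.length >= 4 && !STOPWORDS.has(t));
  return [...new Set(kept)].sort();
}

// Hypothetical helper: how many current keywords also appear in the stored list.
function keywordOverlap(lastKeywords, currentKeywords) {
  const last = new Set(lastKeywords ?? []);
  return currentKeywords.filter((k) => last.has(k)).length;
}

// Assumed entry shape: { type: 'tool_use', name: 'Skill' }.
// Only a real tool_use counts; text and Read entries never match.
function detectSkillMatch(turnEntries) {
  return turnEntries.some((e) => e.type === 'tool_use' && ESCAPE_TOOLS.has(e.name));
}

// Hypothetical helper for steps 3 and 5: the in-flight escape call counts,
// and a match recorded earlier in the ledger stays sticky.
function resolveSkillMatched({ event, priorLedger, turnEntries }) {
  const thisTurn = detectSkillMatch(turnEntries) || ESCAPE_TOOLS.has(event.tool_name);
  return Boolean(priorLedger?.state?.skill_match_within_task) || thisTurn;
}
```

Because the tokenizer is the single source of keywords, the list stored at one event and the list computed at the next are directly comparable. Task-boundary resets of the sticky flag are left out of this sketch.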

+### C1 fix — the escape must never be blocked
+
+`Skill` and `Task` are in the pure module's MUTATING set (`safe-baseline-metering.mjs:31`), and `evaluateThresholds` hard-blocks any mutating tool past a hard threshold when `skillMatched` is false (`safe-baseline-metering.mjs:92-102`). Naively this blocks the very `Skill` call meant to escape (catch-22). The live head closes this by counting the **current event** in `skillMatchedThisTurn` when `event.tool_name ∈ {Skill, EnterPlanMode}` (step 3). Because `skillMatched` short-circuits `evaluateThresholds` to `allow` (`safe-baseline-metering.mjs:89`), a skill/plan invocation always passes — and then sets the sticky exemption for subsequent Edit/Write/Bash/Task. `Task` is intentionally NOT treated as an escape tool (subagent spawn can itself be a laundering channel) and remains blockable.
+
 ### Safety property of the boundary heuristic

 The dangerous direction is *wrongly inheriting* counters across two genuinely different tasks (carrying 60 reads into an unrelated task → false block); this needs keyword-overlap ≥ 2 AND no reset marker, which is uncommon. The opposite error — treating a continuation as a fresh task — *resets* counters to zero, which only *reduces* blocking (safe direction). So the heuristic errs toward fewer false blocks.
@@ -44,28 +59,42 @@ The dangerous direction is *wrongly inheriting* counters across two genuinely di

 ## Task-boundary & skill-match detection

-### `extractKeywords(promptText, routerState)` (pure)
+### `extractKeywords(promptText)` (pure) — H1 fix

-- Prefer `routerState.classification.keywords` (or equivalent) if the classifier writes them — keeps the boundary consistent with how the classifier itself sees the task.
-- Fallback: deterministic tokenization of `promptText` — lowercase, strip RU/EN stopwords, keep tokens length ≥ 4, unique. Deterministic so `lastKeywords` (stored from the prior event) and `currentKeywords` (now) are comparable.
+Deterministic tokenization only: lowercase, strip RU/EN stopwords, keep tokens length ≥ 4, unique, sorted. **No classifier-keyword branch** — the classifier output schema (`router-classifier.mjs:420`) has no `keywords` field, and mixing extraction sources between events would break the overlap comparison. A single deterministic source guarantees `lastKeywords` (stored at event N) and `currentKeywords` (event N+1) are always comparable.

-Within one assistant turn the user prompt is constant → `currentKeywords == lastKeywords` → overlap high → counters accumulate across the turn's tool calls. A new user prompt with low overlap (< 2) and no reset marker → fresh task.
+### `detectSkillMatch(turnEntries)` (pure) — C2 fix

-### `detectSkillMatch(turnEntries)` (pure)
+Returns `true` only if the turn contains a real **tool_use** of `Skill` OR `EnterPlanMode`. The previous "text reference to `docs/superpowers/plans/`" path is **removed** — it was controller-authored text and therefore a self-writable bypass of the entire hard-block. A `Read` of a plan file no longer counts either (Read is itself a metered safe-baseline tool; counting it would let pure reading self-exempt). Sticky via ledger `state.skill_match_within_task` (OR-accumulated across events in the task); combined with the in-flight escape rule in step 3 / C1.

-Returns `true` if the current turn (`lastTurnEntries`) contains a tool_use of `Skill` OR `EnterPlanMode`, OR a `Read`/text reference to `docs/superpowers/plans/`. Sticky via ledger `state.skill_match_within_task` (OR-accumulated across events in the task).
+### Deferred strengthening — recommended-node match
+
+The strict ideal is "skill-match requires the *recommended* node, not just any skill," so the controller can't dodge by invoking an unrelated skill. This needs a node-ID (`#19`) → skill-name (`superpowers:writing-plans`) registry mapping (router-state stores `recommendedNode` per `router-classifier.mjs:163`, but not the skill name). Deferred to avoid a brittle mapping causing lockouts; the user-approval override and Layer 4 cover the gap meanwhile. Documented here so it is not silently dropped.

 ---

-## Override (false-block escape) — user approval, side-channel
+## Override (false-block escape) — user approval, write-protected side-channel

-Mirrors the proven git-approval side-channel, scoped to safe-baseline:
+Mirrors the git-approval side-channel, scoped to safe-baseline:

 1. On hard-block, the controller calls `AskUserQuestion` whose option = an exact approval phrase, e.g. `approve safe-baseline override: `.
 2. A PostToolUse(AskUserQuestion) consumer (extend `enforce-askuser-answer-parser.mjs` or a paired hook) recognizes that pattern and appends a one-shot record `{ type: 'approve_safe_baseline_override', task_id, ts }` to `~/.claude/runtime/safe-baseline-overrides-.jsonl`. TTL 5 min, bound to the current `task_id`.
-3. The live `main()`, before blocking, checks for a fresh unused record matching the current `task_id`; if found → allow and mark it consumed (one-shot).
+   - **task_id source:** the parser reads the safe-baseline ledger `~/.claude/runtime/safe-baseline-ledger-.json` for `state.task_id` (the task the override applies to). If the ledger is absent (no prior metered event) there is nothing to override → the parser writes no record.
+3. The live `main()`, before blocking, checks for a fresh unused record matching the current `task_id`; if found → allow and mark it consumed (**one-shot**). Consume mechanism: the live head records consumed override `ts` values in the ledger (`state.consumed_overrides: number[]`); a record whose `ts` is already listed is skipped. This avoids rewriting the append-only jsonl.

-Properties: **not self-writable** (written by the parser from the *user's* AskUserQuestion selection, not from the controller's response text); **bound to `task_id`** (an approval does not leak into another task); **one-shot + TTL** (no lingering hole). The primary escape remains invoking a skill; the override is only for genuine false-positives with no applicable skill.
+**One-shot rationale (M1):** each mutating op past the threshold re-checks. Genuine false-positives (heavy legit reading, no applicable skill, 60+ in one task) are rare (see assessment), so the per-op re-approval friction rarely materializes, and one-shot maximizes control. A socially-engineered single approval therefore cannot unlock an unlimited mutation spree.
+
+### C3 — the side-channel must be write-protected (new dependency)
+
+The "not self-writable" guarantee only holds if the controller cannot write `~/.claude/runtime/safe-baseline-overrides-.jsonl` directly with the Write/Edit **tool**. Today **no hook gates Write/Edit to `~/.claude/runtime`**: `enforce-read-path-deny.mjs:22` is Read-only; `enforce-normative-content-rules.mjs:11-18` covers only normative docs; the `.claude/runtime` protection in `shell-content-rules.mjs:24` applies only to Bash/PowerShell, not the Write tool. So both the new override file **and the existing git-approval anchor** (`askuser-decisions-.jsonl`) are currently self-writable via the Write tool.
+
+**Fix (in scope for 1b):** a new PreToolUse hook `enforce-runtime-write-deny.mjs` (matcher `Edit|Write|MultiEdit|NotebookEdit`) that blocks the Write/Edit tools from writing any path under `~/.claude/runtime/**`, reusing `isProtectedPath` / `DEFAULT_PROTECTED_PATTERNS`, fail-CLOSE. Legitimate hooks write there via Node `fs` (not the Claude Write tool), so they are unaffected. This closes the override hole and retroactively hardens git-approval.
+
+**Owner verification:** before relying on this, the owner should check `.claude/settings.json` for any `permissions.deny` already covering Write to `~/.claude/**` (Claude cannot read settings.json — gate-blocked). The new hook is additive defense-in-depth regardless.
+
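The write-deny rule described in C3 can be illustrated with a pure decision function. This is a hedged sketch, not the planned hook: it does its own segment-based path check instead of reusing `isProtectedPath` / `DEFAULT_PROTECTED_PATTERNS` (whose signatures are not shown in this spec), and the `homeDir` parameter and the `toSegments` helper are assumptions added to keep the function pure and testable.

```javascript
const WRITE_TOOLS = new Set(['Edit', 'Write', 'MultiEdit', 'NotebookEdit']);

// Hypothetical helper: split on / or \ and resolve "." and ".." without touching the disk.
function toSegments(p) {
  const out = [];
  for (const seg of p.split(/[\\/]+/)) {
    if (seg === '' || seg === '.') continue;
    if (seg === '..') out.pop();
    else out.push(seg);
  }
  return out;
}

// Pure decision: block write-type tools whose target is under <home>/.claude/runtime.
function decide({ toolName, filePath, homeDir }) {
  if (!WRITE_TOOLS.has(toolName)) return { block: false };
  // Fail-closed: a write tool without usable path information is blocked.
  if (typeof filePath !== 'string' || filePath === '' ||
      typeof homeDir !== 'string' || homeDir === '') {
    return { block: true, reason: 'missing path information' };
  }
  const expanded = filePath === '~' || filePath.startsWith('~/')
    ? homeDir + filePath.slice(1)
    : filePath;
  const target = toSegments(expanded);
  const root = [...toSegments(homeDir), '.claude', 'runtime'];
  const underRuntime = root.every((seg, i) => target[i] === seg);
  return underRuntime
    ? { block: true, reason: 'writes under ~/.claude/runtime are denied' }
    : { block: false };
}
```

The sketch does not resolve symlinks, relative paths against a working directory, or case-insensitive file systems; a real hook would need to decide how to treat those, most likely in the fail-closed direction.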
+### Social-engineering residual (M3, named)
+
+The controller authors the AskUserQuestion option text and reason; the user approves controller-framed text — the same residual as git approval. The override's integrity rests on the user actually scrutinizing the question, exactly as with git. The v4 Stop-hook `llm-judge-response-scan` flags this class of approval-social-engineering when Layer 4 is active. Accepted residual; named so it is not forgotten.

 ---

@@ -73,34 +102,36 @@ Properties: **not self-writable** (written by the parser from the *user's* AskUs

 ### Persistence

-- Ledger: `~/.claude/runtime/safe-baseline-ledger-.json` = `{ state, lastKeywords }`.
+- Ledger: `~/.claude/runtime/safe-baseline-ledger-.json` = `{ state, lastKeywords }`; `state` also carries `task_id`, `skill_match_within_task`, `consumed_overrides`.
 - Flags log: `~/.claude/runtime/safe-baseline-flags-.jsonl` (soft_flag observability).
 - Overrides: `~/.claude/runtime/safe-baseline-overrides-.jsonl`.
 - All file I/O is fail-quiet: any read/write error → treat as no-ledger / no-override and exit 0. The hook never crashes the session.

 ### Purity / testability

-All logic lives in pure functions (`extractKeywords`, `detectSkillMatch`, `loadOverride`/`consumeOverride`, plus the existing `processEvent`/`decide`). `main()` is only I/O composition. TDD: each new pure function RED→GREEN; an integration test drives `main()` via injected `runtimeDir` + a transcript fixture.
+All logic lives in pure functions (`extractKeywords`, `detectSkillMatch`, `loadOverride`/`consumeOverride`, plus the existing `processEvent`/`decide`). `main()` is only I/O composition. The new `enforce-runtime-write-deny.mjs` has a pure `decide({toolName, filePath})`. TDD: each new pure function RED→GREEN; an integration test drives `main()` via injected `runtimeDir` + a transcript fixture.
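As one illustration of how the override logic stays pure, the one-shot lookup and consume step might look like the sketch below. It assumes `ts` is an epoch-millisecond number; `parseOverrides` and `findFreshOverride` are hypothetical names, and only the record shape, the 5-minute TTL, the `task_id` binding, and `state.consumed_overrides` come from the spec.

```javascript
const OVERRIDE_TTL_MS = 5 * 60 * 1000; // TTL 5 min, per the spec

// Parse append-only JSONL text, skipping malformed lines (fail-quiet).
function parseOverrides(jsonlText) {
  const records = [];
  for (const line of String(jsonlText ?? '').split('\n')) {
    if (line.trim() === '') continue;
    try {
      records.push(JSON.parse(line));
    } catch {
      // A corrupt line is ignored rather than crashing the hook.
    }
  }
  return records;
}

// Return the first record that is of the right type, bound to this task,
// within the TTL, and not already consumed; otherwise null.
function findFreshOverride(records, { taskId, now, consumed }) {
  const hit = records.find((r) =>
    r !== null && typeof r === 'object' &&
    r.type === 'approve_safe_baseline_override' &&
    r.task_id === taskId &&
    typeof r.ts === 'number' &&
    now - r.ts >= 0 && now - r.ts <= OVERRIDE_TTL_MS &&
    !consumed.includes(r.ts));
  return hit ?? null;
}

// Record the consumed ts in ledger state; the jsonl file itself is never rewritten.
function consumeOverride(state, record) {
  return {
    ...state,
    consumed_overrides: [...(state.consumed_overrides ?? []), record.ts],
  };
}
```

A caller would read the overrides file and the ledger, pass the parsed values in, and persist the returned state, so the test suite can cover expiry, task binding, and one-shot behavior without any file I/O.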

 ### Registration (owner-applied)

-- PreToolUse hook in `.claude/settings.json` (matcher covering all tools — counters grow on Read/Grep/… and block only on mutating tools).
-- PostToolUse(AskUserQuestion) for the override parser (if a separate paired hook).
-- **Claude does not edit `settings.json`** (gate-blocked). The plan produces an exact JSON block for the owner to paste manually. Until registered, the hook is inert (no behavior change).
+- `enforce-safe-baseline-metering` — PreToolUse, matcher scoped to the metered + mutating + escape tools (`Read|Grep|Glob|LS|TodoWrite|AskUserQuestion|Edit|Write|MultiEdit|NotebookEdit|Bash|Skill|Task|EnterPlanMode`), block mode.
+- `enforce-runtime-write-deny` — PreToolUse `Edit|Write|MultiEdit|NotebookEdit`, block mode.
+- override parser — PostToolUse(AskUserQuestion).
+- **Claude does not edit `settings.json`** (gate-blocked). The plan produces an exact JSON block for the owner to paste manually. Until registered, the hooks are inert (no behavior change).

 ### Rollout safety

-Despite "hard-block from day 1", the plan includes a **mandatory smoke test before live registration**: run the live `main()` against 2-3 real transcript fixtures (single task / task switch / skill-invocation) and confirm boundary + skillMatched fire correctly. This does not change the posture; it just catches gross detection bugs before the hook starts blocking.
+Despite "hard-block from day 1", the plan includes a **mandatory smoke test before live registration**: run the live `main()` against 3-4 real transcript fixtures (single task / task switch / skill-invocation escape / override-consume) and confirm boundary, skillMatched, escape, and override all fire correctly. Plus a smoke for `enforce-runtime-write-deny` (a Write to `~/.claude/runtime/x.jsonl` is blocked; a Write to a normal project path passes). This does not change the posture; it catches gross detection bugs before the hooks start blocking.

 ### Scope

-~6-8 TDD tasks, estimate 4-6 h.
+~9-12 TDD tasks (added: runtime-write-deny hook, override parser + consume, stricter detection, escape fix), estimate 6-9 h. Cost/speed are secondary per owner priority.

 ---

 ## Out of scope

-- Layer 4 LLM-judge activation (separate owner step, item 2b).
+- Layer 4 LLM-judge activation (separate owner step, item 2b) — the strong scope-discipline lever.
+- Recommended-node skill matching (deferred strengthening — needs node→skill registry).
 - CLAUDE.md / Pravila / PSR / Tooling normative sync (blocked by a parallel session, item 4).
 - Layer 5 VM / biometric / YubiKey (item 6).
 - Any weakening of the router-gate whitelist.