
Six rules every harness gets right

What I took from reading Codex, Pi, OpenClaw, Symphony, Claude Code, and Hermes. One rule, one example, one reason.

Jun 12, 2026 · 6 min read · harness · design

After reading six agent harnesses end to end, the rules they share end up being more interesting than the things they fight about.

These six show up in every one of them. One sentence each. One concrete example each. One reason it matters.

1. `stop_reason` is unreliable. Parse the stream.

The naive way to dispatch a tool call is to wait for the response's stop_reason to come back as tool_use, then dispatch. That signal lies, and both Codex and Claude Code ship code comments saying so.

The right contract: as soon as a tool_use block arrives in the streaming response, schedule the tool future onto an ordered in-flight queue. The Completed event is for flushing pending text, updating token counts, emitting diffs, and deciding on follow-up. Tool dispatch is not one of those jobs.

sequenceDiagram
  autonumber
  participant H as Harness
  participant M as Model
  participant T1 as Tool A
  participant T2 as Tool B

  H->>M: complete(prompt)
  M-->>H: text chunk
  M-->>H: tool_use block A
  H->>T1: dispatch A
  M-->>H: text chunk
  M-->>H: tool_use block B
  H->>T2: dispatch B
  M-->>H: Completed
  T1-->>H: result A
  T2-->>H: result B
  Note over H,M: A and B run while M is still writing

Why it matters: if you wait for stop_reason, every tool call is serialized behind the full length of the generation. Parse the stream instead, and a single turn can have three tools running in parallel while the model is still finishing its prose.
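
A minimal sketch of that contract in Python, not taken from any of the six codebases. The event names ("text", "tool_use", "completed"), fake_model_stream, and run_tool are my own stand-ins for a real streaming client and real tools. The only point is the ordering: tools are scheduled when their block arrives, and the completed event does bookkeeping only.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Event:
    kind: str          # "text", "tool_use", or "completed" (hypothetical names)
    payload: str = ""


async def fake_model_stream():
    """Stand-in for a streaming model response."""
    events = [
        Event("text", "Looking into it"),
        Event("tool_use", "A"),
        Event("text", " and also"),
        Event("tool_use", "B"),
        Event("completed"),
    ]
    for ev in events:
        await asyncio.sleep(0)  # yield control so scheduled tools can start
        yield ev


async def run_tool(name, log):
    log.append(f"start {name}")
    await asyncio.sleep(0.01)   # pretend to do work
    return f"result {name}"


async def run_turn():
    log = []        # what happened, in order
    in_flight = []  # ordered queue of tool tasks
    text = []
    async for ev in fake_model_stream():
        if ev.kind == "tool_use":
            # Dispatch immediately; do not wait for the end of the stream.
            in_flight.append(asyncio.create_task(run_tool(ev.payload, log)))
        elif ev.kind == "text":
            text.append(ev.payload)
        elif ev.kind == "completed":
            # Bookkeeping only: flush text, count tokens, decide follow-up.
            log.append("completed")
    # Collect results in dispatch order, regardless of finish order.
    results = [await task for task in in_flight]
    return log, results
```

Running asyncio.run(run_turn()) in this sketch starts both tools before the completed event is handled, and the results come back in dispatch order.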

2. Approval lives in the tool schema. Not in chat.

A popup that says "may I run rm -rf?" is unloggable, unauditable, and un-automatable. Every mature harness moves the question into the typed protocol.

Codex's on_request.md approval fragment teaches the model to request approval through tool parameters, not free-form text. The schema carries sandbox_permissions, justification, prefix_rule, and additional_permissions. The model emits a structured request and the harness routes it. Sandbox mode and approval policy are kept orthogonal and assembled per turn as a Cartesian product:

permissions/sandbox_mode/{read_only,workspace_write,danger_full_access}.md
                              x
permissions/approval_policy/{on_request,on_failure,never,unless_trusted}.md
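
To make the orthogonality concrete, here is a small sketch that picks one fragment from each axis per turn. The directory layout is the one shown above; fragment_paths and the validation are my own illustration, not Codex's loader.

```python
from itertools import product

SANDBOX_MODES = ["read_only", "workspace_write", "danger_full_access"]
APPROVAL_POLICIES = ["on_request", "on_failure", "never", "unless_trusted"]


def fragment_paths(sandbox_mode, approval_policy):
    """Return the two prompt fragments to assemble for this turn."""
    if sandbox_mode not in SANDBOX_MODES:
        raise ValueError(f"unknown sandbox mode: {sandbox_mode}")
    if approval_policy not in APPROVAL_POLICIES:
        raise ValueError(f"unknown approval policy: {approval_policy}")
    return [
        f"permissions/sandbox_mode/{sandbox_mode}.md",
        f"permissions/approval_policy/{approval_policy}.md",
    ]


# Every pairing is valid because the two axes never reference each other.
ALL_COMBINATIONS = list(product(SANDBOX_MODES, APPROVAL_POLICIES))
```

Three modes times four policies gives twelve combinations from seven files, and adding a mode or a policy means adding one file rather than one file per pairing.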

Pi makes the same point with one synchronous tool_call hook that can allow, block, or modify any call. Claude Code layers an auto-allowlist, then an LLM classifier, then a popup. Three shapes, one underlying rule. The policy is a contract, not a conversation.

Why it matters: anything you negotiate in chat is something you cannot replay, audit, or feature-flag.
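
A sketch of the single-hook shape, assuming a hook that sees every call and returns allow, block, or modify. ToolCall, Decision, the two example policies, and audit_log are hypothetical names of mine; this is not Pi's actual hook signature.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict
    justification: str = ""  # field name borrowed from the Codex schema above


@dataclass(frozen=True)
class Decision:
    action: str                      # "allow", "block", or "modify"
    call: Optional[ToolCall] = None  # the call to run, if any
    reason: str = ""


def tool_call_hook(call):
    """Hypothetical policy: block recursive deletes, dry-run every push."""
    command = call.args.get("command", "")
    if call.name == "bash" and command.startswith("rm -rf"):
        return Decision("block", reason="recursive delete needs human approval")
    if call.name == "bash" and command.startswith("git push"):
        safer = replace(call, args={**call.args, "command": command + " --dry-run"})
        return Decision("modify", call=safer, reason="pushes are dry-run by policy")
    return Decision("allow", call=call)


audit_log = []


def route(call):
    """Every decision is a record, which is what makes it replayable."""
    decision = tool_call_hook(call)
    audit_log.append(
        {"tool": call.name, "action": decision.action, "reason": decision.reason}
    )
    return decision
```

Because the decision is a value rather than a chat message, it can be logged, diffed between policy versions, and replayed against recorded calls.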

3. Compaction is not cleanup. It is compression plus a recovery scaffold.

Pi triggers compaction when the estimated token count crosses contextWindow - reserveTokens (default reserveTokens = 16384). The cut walks backward, snaps to turn boundaries so no tool result is orphaned from its assistant message, and produces a CompactionEntry { summary, firstKeptEntryId }. The firstKeptEntryId is a pointer, not a truncation. The raw JSONL stays on disk.
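
The trigger and the cut can be sketched roughly as follows. This is my reading of the description above, not Pi's code. In particular, keep_recent_tokens is a hypothetical knob for how much recent history to keep, I treat a user entry as the start of a turn, and the sketch assumes a non-empty entry list.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    id: str
    role: str    # "user", "assistant", or "toolResult"
    tokens: int  # estimated token count for this entry


def should_compact(entries, context_window, reserve_tokens=16384):
    """Trigger once the estimate crosses contextWindow - reserveTokens."""
    return sum(e.tokens for e in entries) > context_window - reserve_tokens


def find_first_kept(entries, keep_recent_tokens):
    """Walk backward until enough recent tokens are kept, then snap to a turn start."""
    kept = 0
    cut = len(entries)
    for i in range(len(entries) - 1, -1, -1):
        kept += entries[i].tokens
        cut = i
        if kept >= keep_recent_tokens:
            break
    # Snap backward to the user entry that opens the turn, so no tool result
    # is separated from the assistant message that requested it.
    while cut > 0 and entries[cut].role != "user":
        cut -= 1
    return cut


def compact(entries, keep_recent_tokens, summarize):
    """Build a compaction entry; the raw entries are left untouched."""
    cut = find_first_kept(entries, keep_recent_tokens)
    return {
        "type": "compaction",
        "summary": summarize(entries[:cut]),
        "firstKeptEntryId": entries[cut].id,  # a pointer, not a truncation
    }
```

Note that compact returns a new entry and never mutates or deletes anything, which mirrors the point that the raw JSONL stays on disk.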

OpenClaw goes further. Before compaction, the harness runs a silent turn that appends durable facts to memory/YYYY-MM-DD.md. Append only, never overwrite. After compaction, it re-injects the Session Startup and Red Lines sections from AGENTS.md, wrapped in a Post-compaction context refresh envelope that explicitly tells the agent to re-run its startup sequence.

flowchart LR
  A[Over budget] --> B[Silent memory flush turn]
  B --> C[Append memory/2026-06-12.md]
  C --> D[Compact transcript]
  D --> E[Re-inject AGENTS.md sections]
  E --> F[Continue turn]

Why it matters: if compaction is just "summarize and move on", every long-running agent quietly forgets why it was working on the task. With a recovery scaffold, identity and constraints survive the compression.
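
A sketch of the two bookends around compaction, under stated assumptions: I assume the AGENTS.md sections are "## " level headings, and the envelope wording below is invented for illustration (only its name comes from OpenClaw). The memory path format and the append-only rule are from the description above.

```python
import tempfile
from datetime import date
from pathlib import Path


def flush_memory(workspace, facts, today):
    """Append durable facts to memory/YYYY-MM-DD.md. Append only, never overwrite."""
    path = Path(workspace) / "memory" / f"{today.isoformat()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:  # "a" = append mode
        for fact in facts:
            f.write(f"- {fact}\n")
    return path


def extract_section(markdown, heading):
    """Return the body under '## heading', up to the next '## ' heading."""
    out, inside = [], False
    for line in markdown.splitlines():
        if line.startswith("## "):
            inside = line[3:].strip() == heading
            continue
        if inside:
            out.append(line)
    return "\n".join(out).strip()


def refresh_envelope(agents_md):
    """Wrap the two sections that must survive compaction (wording is hypothetical)."""
    sections = [
        f"## {heading}\n{extract_section(agents_md, heading)}"
        for heading in ("Session Startup", "Red Lines")
    ]
    header = "[Post-compaction context refresh]\nRe-run your startup sequence."
    return header + "\n\n" + "\n\n".join(sections)
```

The order matters: flush_memory runs before the transcript is compacted, and refresh_envelope is injected after, so facts are written down while they are still in context and constraints are restored once they are not.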

4. Skills live as an inventory. Bodies load on demand.

Pi advertises its skill catalog as an XML block of names and descriptions only. The full SKILL.md body never enters the system prompt. When the model decides a skill is relevant, it calls the read tool to fetch the body.

<skills>
  <skill name="commit" description="Stage, commit, push following repo conventions." />
  <skill name="debug"  description="Triage a failing test or unexpected behavior." />
  <skill name="land"   description="Open a PR, watch CI, merge when green."  />
</skills>

That is the entire surface. Every harness calls this progressive disclosure. The inventory is short, the cache stays warm, and the body is fetched only on turns that actually use it. The same pattern applies to memory (memory/YYYY-MM-DD.md loaded by date), to project rules (paths: glob in YAML frontmatter), and to documentation. The model is not told everything. The model is told where to look.
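
A sketch of the inventory side and the on-demand side. The skill names and descriptions are the ones above; the path fields, render_inventory, and read_skill are hypothetical and stand in for the catalog loader and the read tool.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical catalog: the paths are illustrative, not Pi's real layout.
SKILLS = {
    "commit": {
        "description": "Stage, commit, push following repo conventions.",
        "path": "skills/commit/SKILL.md",
    },
    "debug": {
        "description": "Triage a failing test or unexpected behavior.",
        "path": "skills/debug/SKILL.md",
    },
    "land": {
        "description": "Open a PR, watch CI, merge when green.",
        "path": "skills/land/SKILL.md",
    },
}


def render_inventory(skills):
    """Names and descriptions only. This is all that enters the system prompt."""
    lines = ["<skills>"]
    for name, meta in skills.items():
        # quoteattr escapes the value and wraps it in quotes.
        lines.append(
            f"  <skill name={quoteattr(name)} "
            f"description={quoteattr(meta['description'])} />"
        )
    lines.append("</skills>")
    return "\n".join(lines)


def read_skill(name, skills, read_file):
    """Fetch a body only when the model asks for it, via the read tool."""
    return read_file(skills[name]["path"])
```

The inventory string only changes when the catalog changes, which is what keeps the prompt prefix stable from turn to turn.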

Why it matters: the naive move is to inline every capability into the system prompt "just in case". Six months later your prompt is forty thousand tokens, the cache is busted, and most of the bytes were never relevant to the turn at hand.

5. The model's prose is not durable state.

OpenClaw ships a file at src/auto-reply/reply/agent-runner-reminder-guard.ts that regex-matches the model's output for claims like "I'll remind you", "I'll follow up", "I'll check back".

If a match fires AND no cron job was added this turn AND no cron job already exists for the session key, the harness appends this literal sentence to the transcript:

Note: I did not schedule a reminder in this turn, so this will not trigger automatically.

That is not natural-language understanding. It is a state reconciliation check between the model's prose and the durable scheduler. The model said it would do a thing. The harness asks the question the user actually cares about: did the durable system record the commitment? If not, the user gets told.
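
The check is small enough to sketch in full. This is a Python paraphrase, not the TypeScript in OpenClaw: the regex covers only the three phrases quoted above, the two boolean arguments stand in for real scheduler lookups, and appending the note to the reply text is my simplification of appending to the transcript.

```python
import re

# Only the three claims quoted above; the real pattern list may differ.
REMINDER_CLAIM = re.compile(
    r"\bI['’]ll (?:remind you|follow up|check back)\b", re.IGNORECASE
)

NOTE = (
    "Note: I did not schedule a reminder in this turn, "
    "so this will not trigger automatically."
)


def reconcile(reply, cron_added_this_turn, session_has_cron):
    """Compare what the model said against what the scheduler recorded."""
    claimed = REMINDER_CLAIM.search(reply) is not None
    if claimed and not cron_added_this_turn and not session_has_cron:
        return reply + "\n\n" + NOTE
    return reply
```

All three conditions must hold before the note is added, so a reply that promises a reminder and actually scheduled one passes through unchanged.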

The same shape shows up elsewhere. ESAA forbids agents from writing files directly; agents emit agent.result intentions and the orchestrator validates and applies. Codex's update_plan is a UX tool, not a reasoning tool. Claude Code workers return structured <task-notification> payloads instead of free prose.

Why it matters: compaction will eventually eat the model's prose. Whatever was only said, and never recorded, is gone.

6. The session is a tree. Not a flat log.

Pi's session is a JSONL tree of typed entries shaped like { id, parentId, timestamp, type }. Types include header, user, assistant, toolResult, bashExecution, custom, compaction, branchSummary.

Branching is first class. An abandoned exploration gets a branchSummary entry so the work survives navigation away. Compaction inserts a summary pointer without destroying raw history. Plugins write their own state into the session file as custom entries, no separate database needed.

A flat transcript represents none of this. You either lose abandoned work, lose the ability to fork from a past turn, or invent a second source of truth to track what the transcript cannot.

Why it matters: the moment you want to fork from turn 14 to try a different path, a flat log forces you to copy-edit a file. A tree handles it as one new parentId.
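
A toy version of the tree, to show why a fork is cheap. The id, parentId, and type fields follow the shape described above; the Session class, the sequential ids, and the omitted timestamp (left out to keep the example deterministic) are my own simplifications.

```python
import itertools
import json


class Session:
    """Entries keyed by id; each entry points at its parent."""

    def __init__(self):
        self.entries = {}
        self._ids = itertools.count(1)

    def append(self, parent_id, type_, **data):
        entry = {
            "id": f"e{next(self._ids)}",
            "parentId": parent_id,
            "type": type_,
            **data,
        }
        self.entries[entry["id"]] = entry
        return entry["id"]

    def path_to(self, leaf_id):
        """Follow parentId pointers to the root; return the branch oldest-first."""
        path = []
        current = leaf_id
        while current is not None:
            entry = self.entries[current]
            path.append(entry)
            current = entry["parentId"]
        return path[::-1]

    def to_jsonl(self):
        """One JSON object per line, in insertion order."""
        return "\n".join(json.dumps(e) for e in self.entries.values())


session = Session()
root = session.append(None, "header")
turn = session.append(root, "user", text="fix the flaky test")
first_try = session.append(turn, "assistant", text="trying approach A")
# Forking is one new entry with an existing parentId. Nothing is copied.
second_try = session.append(turn, "assistant", text="trying approach B")
```

Both branches share the header and the user turn, and the abandoned first attempt is still in the file: path_to(second_try) simply does not visit it.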

The pattern underneath

Every one of these rules has the same shape. The model is not the runtime.

The runtime parses the stream. Owns the policy. Transforms the state. Indexes the capabilities. Reconciles the prose against durable systems. Structures the history. The model is invited in for one turn at a time, on a compiled prompt, under a typed contract, against a state it does not directly own.

That is the only sentence I would underline twice. Every harness in the corpus, on every axis I looked at, is built around it. The mistakes I see in production code are almost always one team or another quietly trusting the model to be the runtime.

Related reading: "The system prompt is compiler output, not a string", a seventh thing those six harnesses agree on, up close.
