Pranay Varma

The system prompt is compiler output, not a string

Fri, 12 Jun 2026 00:00:00 GMT

I spent a few weekends reading the source of every open-source agent harness I could find. Codex. Pi. OpenClaw. Symphony. Claude Code. Hermes.

I went in expecting an architectural shouting match. Six teams, six harnesses, surely they fight about the basics.

They mostly do. There is one thing they do not fight about.

The thing they all agree on

The system prompt is not a string you write. It is the output of a build step that runs every turn.

Said differently. The naive way is to think of the system prompt as a fixed block of text you author once, paste into the SDK call, and forget. Every mature harness in the corpus rejects that view. They treat the system prompt as the product of a pipeline that runs before each model call, assembled from fragments owned by different parts of the codebase.

Pi calls it out by name. Their docs read: "The prompt is synthesized, not just stored." Codex puts it in a section titled Prompt lesson for builders:

A good harness should feel like it has a prompt compiler, not just a system prompt. Separate stable behavior, policy fragments, local instructions, environment state, and current turn input so each can evolve independently.

Once you start looking at the code, it is everywhere.

The shape

The fragments do not vary much across harnesses. The order, mostly, does not vary either. The most stable layers sit at the top and the most dynamic at the bottom, with a cache boundary somewhere in the middle.

flowchart TB
  subgraph stable[Stable / cached prefix]
    direction TB
    L1[Base instructions]
    L2[Tool inventory + policy fragments]
    L3[AGENTS.md walked root to cwd]
    L4[Skill inventory: names + descriptions in XML]
    L1 --> L2 --> L3 --> L4
  end

  SENT[__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ sentinel]

  subgraph dynamic[Dynamic / per-turn suffix]
    direction TB
    D1[Environment context as diffable XML]
    D2[Session history + tool outputs]
    D3[Queued steer / current user turn]
    D1 --> D2 --> D3
  end

  L4 --> SENT --> D1

That picture is Codex's Figure 2 with the labels lightly rephrased. It is also Pi's Figure 4. It is OpenClaw's system-prompt builder enumerating roughly fourteen sections. It is Claude Code's chain of getSimpleIntroSection, getSimpleSystemSection, getActionsSection, getUsingYourToolsSection, getSimpleToneAndStyleSection, getOutputEfficiencySection, followed by a literal token __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__, followed by getSessionSpecificGuidanceSection and computeSimpleEnvInfo.

The point of the order is not aesthetics. The point is the prompt cache.

Why the order is load-bearing

Every layer above the cache boundary is content the provider can cache, for five minutes (Anthropic's default) or longer. Every layer below is content that genuinely varies turn to turn. If you put a timestamp anywhere in the stable section, every turn busts the cache. If you emit your tool registry in a non-deterministic order, every turn busts the cache.
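A minimal sketch of that discipline, in Python. The function names and the prompt layout are hypothetical, not taken from any of the six harnesses; only the sentinel string comes from Claude Code. The point it illustrates: sort the tool registry, keep anything time-dependent after the boundary, and the bytes before the boundary stay identical from turn to turn.

```python
import hashlib
import json

BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

def compile_prompt(base, tools, env, user_input):
    """Join fragments in a fixed order: stable prefix, sentinel, dynamic suffix."""
    # Sort tools by name so dict iteration order can never change the prefix bytes.
    tool_lines = [f"- {name}: {tools[name]}" for name in sorted(tools)]
    stable = "\n".join([base, "Tools:", *tool_lines])
    # Anything that varies per turn (timestamps, user input) goes after the sentinel.
    dynamic = "\n".join([json.dumps(env, sort_keys=True), user_input])
    return f"{stable}\n{BOUNDARY}\n{dynamic}"

def prefix_hash(prompt):
    """Hash only the text before the sentinel (a stand-in for a prefix-cache key)."""
    stable, _, _ = prompt.partition(BOUNDARY)
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()

# Two turns: the tool dict is built in a different order and the clock has moved.
turn1 = compile_prompt(
    "You are a coding agent.",
    {"read": "Read a file", "bash": "Run a command"},
    {"now": "2026-06-12T10:00:00Z"},
    "list files",
)
turn2 = compile_prompt(
    "You are a coding agent.",
    {"bash": "Run a command", "read": "Read a file"},
    {"now": "2026-06-12T10:00:05Z"},
    "open README",
)
```

The two prompts differ, but `prefix_hash(turn1)` equals `prefix_hash(turn2)`. Move the `now` field above the sentinel and that equality breaks on every turn.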

Claude Code has the funniest version of this. Their tool-registry file (src/tools.ts:190) ships with this comment:

NOTE: This MUST stay in sync with https://console.statsig.com/.../claude_code_global_system_caching, in order to cache the system prompt across users.

Their tool ordering is pinned to a remote feature-flag config because if the order drifts, the prompt-prefix cache silently breaks across users. The system prompt has become so much of a build output that it has a CI-style invariant attached to it. That is not a hand-written string. That is compiler output.

Hermes goes one step further. The system prompt is built once per session and stored as a column in SQLite (sessions.system_prompt). Every continuation reads that column back verbatim. Mid-session mutations require an explicit --now flag. The whole design exists to keep the byte-for-byte prefix stable across continuations.
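A rough sketch of the build-once, read-back-verbatim idea, using Python's standard `sqlite3` module. The `sessions.system_prompt` column name comes from the description above; the function names and the rest of the schema are assumptions, and the `--now` escape hatch is not modeled.

```python
import sqlite3

def open_db():
    """In-memory database with a hypothetical minimal sessions table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE sessions (id TEXT PRIMARY KEY, system_prompt TEXT NOT NULL)"
    )
    return db

def system_prompt_for(db, session_id, build):
    """Return the stored prompt; call build() only the first time a session is seen."""
    row = db.execute(
        "SELECT system_prompt FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    if row is not None:
        return row[0]  # continuation: read back verbatim, never rebuild
    prompt = build()
    db.execute(
        "INSERT INTO sessions (id, system_prompt) VALUES (?, ?)", (session_id, prompt)
    )
    db.commit()
    return prompt

db = open_db()
calls = []

def build():
    # Stand-in for the real compile pass; records how often it runs.
    calls.append(1)
    return f"base prompt, build #{len(calls)}"

first = system_prompt_for(db, "s1", build)
second = system_prompt_for(db, "s1", build)
```

Both calls return the same string and `build` runs exactly once, which is the property that keeps the prefix byte-stable across continuations.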

What gets compiled in

Six things, every time, in roughly this order.

Base instructions. Codex splits this into two files: core/prompt.md (stable contract) plus core/gpt-5.2-codex_prompt.md (model-family overlay). Pi exposes the split as SYSTEM.md (replaces the base) versus APPEND_SYSTEM.md (appends to it). Different verbs, same surface area.

Tool inventory and policy fragments. Codex treats sandbox mode and approval policy as orthogonal fragment families, then assembles them as a Cartesian product:

permissions/sandbox_mode/{read_only,workspace_write,danger_full_access}.md
                              x
permissions/approval_policy/{on_request,on_failure,never,unless_trusted}.md
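The product above can be sketched in a few lines of Python. The fragment paths are the ones listed; the function name and the validation are illustrative assumptions, not Codex's actual code.

```python
from itertools import product

SANDBOX_MODES = ["read_only", "workspace_write", "danger_full_access"]
APPROVAL_POLICIES = ["on_request", "on_failure", "never", "unless_trusted"]

def permission_fragments(sandbox_mode, approval_policy):
    """Pick one fragment from each family; the two axes never reference each other."""
    if sandbox_mode not in SANDBOX_MODES:
        raise ValueError(f"unknown sandbox mode: {sandbox_mode}")
    if approval_policy not in APPROVAL_POLICIES:
        raise ValueError(f"unknown approval policy: {approval_policy}")
    return [
        f"permissions/sandbox_mode/{sandbox_mode}.md",
        f"permissions/approval_policy/{approval_policy}.md",
    ]

# Every valid combination, without anyone writing twelve separate prompt files.
all_combinations = [
    permission_fragments(s, a) for s, a in product(SANDBOX_MODES, APPROVAL_POLICIES)
]
```

Seven files cover twelve configurations. Adding a fifth approval policy costs one new file, not three.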

Approval rides on tool-schema fields like sandbox_permissions, justification, prefix_rule. It is part of the typed protocol. It is not a chat popup.

Project context. AGENTS.md or CLAUDE.md walked root-to-cwd, deeper files later so they override broader ones. Loaded fresh at session boot, often cached for the lifetime of the session.
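A sketch of the root-to-cwd walk, assuming a hypothetical `collect_project_context` helper and a simple header line per file. Real harnesses differ on file names, size limits, and how they label each section.

```python
from pathlib import Path
import tempfile

def collect_project_context(root, cwd, filename="AGENTS.md"):
    """Concatenate instruction files from root down to cwd, broadest first."""
    root, cwd = Path(root).resolve(), Path(cwd).resolve()
    rel = cwd.relative_to(root)  # raises ValueError if cwd is outside root
    dirs = [root]
    for part in rel.parts:
        dirs.append(dirs[-1] / part)
    sections = []
    for d in dirs:
        f = d / filename
        if f.is_file():
            label = f.relative_to(root).as_posix()
            sections.append(f"# {label}\n{f.read_text(encoding='utf-8').strip()}")
    # Deeper files come later, so they get the last word on conflicts.
    return "\n\n".join(sections)

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "pkg" / "api").mkdir(parents=True)
    (repo / "AGENTS.md").write_text("Use tabs.", encoding="utf-8")
    (repo / "pkg" / "api" / "AGENTS.md").write_text(
        "Use spaces in this package.", encoding="utf-8"
    )
    context = collect_project_context(repo, repo / "pkg" / "api")
```

In `context`, the root file's rule appears first and the package-level rule appears after it.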

Skill inventory. Names and descriptions only, usually wrapped in XML. Skill bodies (SKILL.md) load on demand via the Read tool when the model decides one is relevant. Inlining the bodies blows the cache and the context budget for capabilities the turn may never use. Pi's docs call this progressive disclosure.

Environment context. Date, cwd, shell, network, OS. Codex serializes it as XML and diffs it turn to turn, so only the delta gets re-emitted. Most harnesses re-render the whole block every turn. Codex's approach is the better one.
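A small sketch of the diffing idea. The `environment_context` tag and the helper names are assumptions for illustration, not Codex's actual format, and removed keys are not handled.

```python
from xml.sax.saxutils import escape

def render_env(env):
    """Serialize a flat dict as XML with keys in sorted order (deterministic output)."""
    lines = [f"  <{key}>{escape(str(value))}</{key}>" for key, value in sorted(env.items())]
    return "<environment_context>\n" + "\n".join(lines) + "\n</environment_context>"

def env_delta(previous, current):
    """Keep only the keys that are new or whose value changed since the last turn."""
    return {key: value for key, value in current.items() if previous.get(key) != value}

def render_turn_env(previous, current):
    """First turn: emit everything. Later turns: emit the delta, or nothing at all."""
    if previous is None:
        return render_env(current)
    delta = env_delta(previous, current)
    return render_env(delta) if delta else ""

before = {"cwd": "/repo", "shell": "zsh"}
after = {"cwd": "/repo/src", "shell": "zsh"}
update = render_turn_env(before, after)
```

Here `update` carries only the new `cwd`. A turn where nothing changed emits an empty string, so the history does not fill up with identical environment blocks.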

Session history and queued input. Whatever the model needs to continue.

Claude Code adds one more sentence to the memory layer that I think about a lot. When the harness loads CLAUDE.md, it wraps the contents with this:

Codebase and user instructions are shown below. Be sure to adhere to these instructions. IMPORTANT: These instructions OVERRIDE any default behavior and you MUST follow them exactly as written.

Without that wrapper, project memory is just more prose. With it, project memory genuinely outranks the system prompt. A single sentence reorders the precedence stack. It is the kind of thing you only write after watching an LLM ignore a CLAUDE.md rule three times.

The compile step, drawn

A turn does not start by calling the model. It starts by building the thing you are going to send.

sequenceDiagram
  autonumber
  participant U as User
  participant H as Harness
  participant C as Prompt compiler
  participant FS as Files
  participant M as Model

  U->>H: input
  H->>C: build_prompt(turn, history)
  C->>FS: walk AGENTS.md root to cwd
  C->>FS: scan skill inventory
  C->>C: assemble stable layers
  C->>C: insert __SYSTEM_PROMPT_DYNAMIC_BOUNDARY__
  C->>C: assemble dynamic suffix
  C-->>H: compiled prompt
  H->>M: complete(compiled prompt)
  M-->>H: stream tool_use blocks + text
  H->>FS: append entry to transcript

  Note over C,M: Stable prefix hits the cache. Suffix is fresh every turn.

The compile step has a few properties worth naming.

It is strict. Symphony renders its WORKFLOW.md body with a Liquid-like template engine that hard-fails on unknown variables or filters. Their stated reason: "it prevents accidental prompt drift or silent template bugs." A naive renderer that swaps unknown variables for empty strings silently drops prompt content for a week before anyone notices. Strict beats lenient at this scale.
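The difference is easy to see with Python's standard `string.Template`, used here as a stand-in for Symphony's Liquid-like engine (it has no filters, so only the unknown-variable case is shown). The wrapper names are hypothetical.

```python
from collections import defaultdict
from string import Template

def render_strict(template_text, variables):
    """Hard-fail on any variable the caller did not supply."""
    try:
        return Template(template_text).substitute(variables)
    except KeyError as exc:
        raise ValueError(f"unknown template variable: {exc.args[0]}") from exc

def render_lenient(template_text, variables):
    """The failure mode: unknown variables quietly become empty strings."""
    return Template(template_text).substitute(defaultdict(str, variables))

variables = {"title": "Fix login"}
ok = render_strict("Ticket: $title", variables)
# A typo in the template ("titel") drops content without any error.
blank = render_lenient("Ticket: $titel", variables)
```

`render_lenient` returns `"Ticket: "` for the misspelled variable and nobody is told. `render_strict` raises on the same input, which turns a week of silent drift into a failed deploy.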

It supports hot reload, with a last-known-good fallback. Symphony's workflow_store.ex polls the workflow file once a second and reloads on mtime change. If the new file fails to parse, the last valid copy stays mounted. Lenient renderers go silently blank. Strict renderers without a fallback brick the service on every typo. You want both.
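A sketch of the polling store, assuming a JSON file in place of Symphony's WORKFLOW.md and a hypothetical `WorkflowStore` class. The caller is expected to invoke `poll()` on a timer; the demo sets mtimes explicitly so it does not depend on filesystem timestamp resolution.

```python
import json
import os
import tempfile

class WorkflowStore:
    def __init__(self, path):
        self.path = path
        self.mtime = None
        self.current = None  # last valid parsed copy; None until the first good load

    def poll(self):
        """Reload when mtime changes; keep the previous copy if the new file is bad."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.mtime:
            return self.current
        self.mtime = mtime  # remember bad versions too, so they are not re-parsed every tick
        try:
            with open(self.path, encoding="utf-8") as f:
                self.current = json.load(f)
        except (OSError, json.JSONDecodeError):
            pass  # last-known-good stays mounted
        return self.current

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workflow.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"prompt": "v1"}')
    os.utime(path, ns=(1_000_000_000, 1_000_000_000))
    store = WorkflowStore(path)
    loaded = store.poll()

    with open(path, "w", encoding="utf-8") as f:
        f.write('{"prompt": ')  # a typo lands in production
    os.utime(path, ns=(2_000_000_000, 2_000_000_000))
    after_typo = store.poll()
```

After the broken write, `after_typo` is still the `v1` config. A real store would also log the parse failure loudly, since the fallback otherwise hides the typo.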

It is scoped to one turn. Codex rebuilds its tool registry on every sampling request, then builds the prompt around it. Tools are not a global runtime object. They are an artifact of this turn's compile pass. That is what makes feature-flagged tools, per-turn allowlists, and sub-agent role gates work at all.

The cache boundary is the contract

Here is the part I missed for the first six months I worked on harnesses. The boundary between stable and dynamic is not just where caching kicks in. It is the contract between two halves of the codebase.

Above the line is the part the harness team owns. The base prompt. The tool registry. The skill index. The memory wrapper. It changes when you ship the harness.

Below the line is the part that changes every turn. Environment context. Recent history. The user's current message.

Once you pick that line, every interesting design decision falls into one of two questions: "Is this stable enough to cache?" and "Does this need to be re-checked every turn?" You can argue about whether AGENTS.md belongs above the line. (Most harnesses cache it. They re-read on session boot, not per turn.) You can argue about whether environment context belongs above or below. (Codex puts it just below, and diffs it.) It is the same question each time.

If you do not have that line in your head, your prompt grows like a junk drawer.

Where it breaks

Three failure modes, all of them visible in someone's code.

The mega-string. One file. One prompt. Edited by hand. No layers, no order, no cache discipline. Works fine on day one. By month six the prompt has forty sections that contradict each other and nobody knows which one is load-bearing. The fix is not to write a better string. It is to break the string into fragments and write a compiler that joins them in a known order.

The lenient renderer. Templates that render unknown variables as empty strings. You rename a config key. The prompt silently goes blank for that section. The model degrades. Your evals move two points and you spend a week blaming the model upgrade.

The hot-reload with no last-known-good. Symphony's solution is the right one. If hot reload is allowed, the harness has to keep the last valid copy mounted, or every typo in production becomes an outage.

OpenClaw has a fourth one I keep thinking about. They have a regex-based reminder-honesty guard that watches the model's prose for claims like "I'll remind you" and "I'll follow up". If no cron job was added on this turn and none exists on the session key, the harness appends a literal note to the transcript:

Note: I did not schedule a reminder in this turn, so this will not trigger automatically.

They do not trust the model's text to reflect durable system state, because compaction will eat the prose and leave the user holding nothing.

That is the same shape of problem the prompt compiler exists to solve. The model's words are not the runtime. The compiled prompt is.

What I took away

I no longer think of "the system prompt" as a file you edit. I think of it as a stream of fragments owned by different parts of the codebase, joined by a build step that runs every turn, split by a cache boundary that is itself the design.

If your harness has one big prompt string and you edit it by hand, that is the first thing I would refactor. Not because the string is bad. Because the absence of structure makes every other improvement harder.

A prompt compiler is the same idea as any compiler. Source artifacts go in. A canonical output comes out. The interesting work is the layer stack in the middle: the sentinels, the cache discipline, the strict template engine, the last-known-good fallback, the wrapper sentences that re-rank precedence.

Six harnesses, one pattern. The prompt is not the source. The fragments are the source. The prompt is what you ship.

Related reading: six rules every harness gets right, more of what these same six codebases agree on.

Six rules every harness gets right

Fri, 12 Jun 2026 00:00:00 GMT

After reading six agent harnesses end to end, the rules they share end up being more interesting than the things they fight about.

These six show up in every one of them. One sentence each. One concrete example each. One reason it matters.

1. `stop_reason` is unreliable. Parse the stream.

The naive way to dispatch a tool call is to wait for the response's stop_reason to come back as tool_use, then dispatch. That signal lies, and both Codex and Claude Code ship code comments saying so.

The right contract: as soon as a tool_use block arrives in the streaming response, schedule the tool future onto an ordered in-flight queue. The Completed event is for flushing pending text, updating token counts, emitting diffs, and deciding follow-up. Tool dispatch is not one of those.

sequenceDiagram
  autonumber
  participant H as Harness
  participant M as Model
  participant T1 as Tool A
  participant T2 as Tool B

  H->>M: complete(prompt)
  M-->>H: text chunk
  M-->>H: tool_use block A
  H->>T1: dispatch A
  M-->>H: text chunk
  M-->>H: tool_use block B
  H->>T2: dispatch B
  M-->>H: Completed
  T1-->>H: result A
  T2-->>H: result B
  Note over H,M: A and B run while M is still writing

Why it matters: if you wait for stop_reason, you serialize every tool call behind generation length. Parse the stream instead, and a single turn can run several tools in parallel while the model is still finishing its prose.
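A toy version of that contract with `asyncio`. The event names and the fake stream are stand-ins, not any provider's SDK; the point is only the ordering: tools start when their block arrives, results are collected in dispatch order, and the completed event dispatches nothing.

```python
import asyncio

async def fake_stream():
    """Stand-in for a model stream: yields (kind, payload) events with small gaps."""
    events = [
        ("text", "Looking at the repo. "),
        ("tool_use", {"id": "A", "delay": 0.02}),
        ("text", "Also checking tests. "),
        ("tool_use", {"id": "B", "delay": 0.01}),
        ("completed", None),
    ]
    for event in events:
        await asyncio.sleep(0.005)  # simulated network gap between chunks
        yield event

async def run_tool(call, log):
    log.append(f"start {call['id']}")
    await asyncio.sleep(call["delay"])  # pretend to do work
    return f"result {call['id']}"

async def run_turn(stream):
    log, in_flight, text = [], [], []
    async for kind, payload in stream:
        if kind == "tool_use":
            # Dispatch immediately; do not wait for the end of generation.
            in_flight.append(asyncio.create_task(run_tool(payload, log)))
        elif kind == "text":
            text.append(payload)
        elif kind == "completed":
            log.append("completed")  # bookkeeping only, no dispatch here
    # gather() returns results in dispatch order, whichever tool finished first.
    results = await asyncio.gather(*in_flight)
    return "".join(text), list(results), log

text, results, log = asyncio.run(run_turn(fake_stream()))
```

In `log`, both tools start before `completed` is recorded, and `results` keeps dispatch order even though tool B finishes before tool A.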

2. Approval lives in the tool schema. Not in chat.

A popup that says "may I run rm -rf?" is unloggable, unauditable, and un-automatable. Every mature harness moves the question into the typed protocol.

Codex's on_request.md approval fragment teaches the model to request approval through tool parameters, not free-form text. The schema carries sandbox_permissions, justification, prefix_rule, additional_permissions. The model emits a structured request, the harness routes it. Sandbox mode and approval policy are kept orthogonal and assembled per turn as a Cartesian product:

permissions/sandbox_mode/{read_only,workspace_write,danger_full_access}.md
                              x
permissions/approval_policy/{on_request,on_failure,never,unless_trusted}.md

Pi makes the same point with one synchronous tool_call hook that can allow, block, or modify any call. Claude Code layers an auto-allowlist, then an LLM classifier, then a popup. Three shapes, one underlying rule. The policy is a contract, not a conversation.

Why it matters: anything you negotiate in chat is something you cannot replay, audit, or feature-flag.

3. Compaction is not cleanup. It is compression plus a recovery scaffold.

Pi triggers compaction when the estimated token count crosses contextWindow - reserveTokens (default reserveTokens = 16384). The cut walks backward, snaps to turn boundaries so no tool result is orphaned from its assistant message, and produces a CompactionEntry { summary, firstKeptEntryId }. The firstKeptEntryId is a pointer, not a truncation. The raw JSONL stays on disk.
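A sketch of the trigger and the backward walk, under simplifying assumptions: each entry carries a precomputed `tokens` count, and only `user` entries count as turn boundaries. The function names and the `keep_tokens` parameter are hypothetical; the threshold formula, the 16384 default, and the `summary` / `firstKeptEntryId` fields follow the description above.

```python
RESERVE_TOKENS = 16384

def should_compact(estimated_tokens, context_window, reserve_tokens=RESERVE_TOKENS):
    return estimated_tokens > context_window - reserve_tokens

def find_cut(entries, keep_tokens):
    """Walk backward within the budget; return the id of the first kept entry.

    The cut only ever lands on a user entry, so a toolResult is never kept
    without the assistant message that requested it.
    """
    total = 0
    cut = None
    for i in range(len(entries) - 1, -1, -1):
        total += entries[i]["tokens"]
        if total > keep_tokens:
            break
        if entries[i]["type"] == "user":
            cut = i
    return entries[cut]["id"] if cut is not None else None

def compact(entries, summary, keep_tokens):
    """Produce a pointer entry; the raw entries list is left untouched."""
    return {
        "type": "compaction",
        "summary": summary,
        "firstKeptEntryId": find_cut(entries, keep_tokens),
    }

entries = [
    {"id": "e1", "type": "user", "tokens": 100},
    {"id": "e2", "type": "assistant", "tokens": 200},
    {"id": "e3", "type": "toolResult", "tokens": 500},
    {"id": "e4", "type": "user", "tokens": 50},
    {"id": "e5", "type": "assistant", "tokens": 300},
    {"id": "e6", "type": "toolResult", "tokens": 400},
]
entry = compact(entries, "Investigated the login bug.", keep_tokens=1300)
```

With a budget of 1300 tokens, `e3` would fit on its own, but keeping it would orphan it from `e2`. The cut snaps forward to `e4`, and all six raw entries are still there afterwards.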

OpenClaw goes further. Before compaction, the harness runs a silent turn that appends durable facts to memory/YYYY-MM-DD.md. Append only, never overwrite. After compaction, it re-injects the Session Startup and Red Lines sections from AGENTS.md, wrapped in a Post-compaction context refresh envelope that explicitly tells the agent to re-run its startup sequence.

flowchart LR
  A[Over budget] --> B[Silent memory flush turn]
  B --> C[Append memory/2026-06-12.md]
  C --> D[Compact transcript]
  D --> E[Re-inject AGENTS.md sections]
  E --> F[Continue turn]

Why it matters: if compaction is just summarize and move on, every long-running agent quietly forgets why it was working on the task. With a recovery scaffold, identity and constraints survive the compress.

4. Skills live as an inventory. Bodies load on demand.

Pi advertises its skill catalog as an XML block of names and descriptions only. The full SKILL.md body never enters the system prompt. When the model decides a skill is relevant, it calls the read tool to fetch the body.

That is the entire surface. Pi's docs call this progressive disclosure. The inventory is short, the cache stays warm, and the body is fetched only on turns that actually use it. The same pattern applies to memory (memory/YYYY-MM-DD.md loaded by date), to project rules (paths: glob in YAML frontmatter), to documentation. The model is not told everything. The model is told where to look.
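A sketch of what an inventory renderer might look like. The tag names and the skill record shape are assumptions, not Pi's actual format; the `path` field is included so the model knows what to hand to its read tool.

```python
from xml.sax.saxutils import escape

def render_skill_inventory(skills):
    """Emit names, descriptions, and paths only. Bodies are never inlined."""
    lines = ["<available_skills>"]
    # Sort by name so the block is byte-stable and safe to keep in the cached prefix.
    for skill in sorted(skills, key=lambda s: s["name"]):
        lines.append("  <skill>")
        lines.append(f"    <name>{escape(skill['name'])}</name>")
        lines.append(f"    <description>{escape(skill['description'])}</description>")
        lines.append(f"    <path>{escape(skill['path'])}</path>")
        lines.append("  </skill>")
    lines.append("</available_skills>")
    return "\n".join(lines)

skills = [
    {
        "name": "pdf-report",
        "description": "Build & format PDF reports",
        "path": "skills/pdf-report/SKILL.md",
        "body": "Step 1: install the renderer. Step 2: ...",  # stays on disk
    },
    {
        "name": "db-migrate",
        "description": "Write and run schema migrations",
        "path": "skills/db-migrate/SKILL.md",
        "body": "Always create a down migration. ...",
    },
]
inventory = render_skill_inventory(skills)
```

The inventory grows by a few lines per skill regardless of how long each SKILL.md body gets.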

Why it matters: the naive move is to inline every capability into the system prompt "just in case". Six months later your prompt is forty thousand tokens, the cache is busted, and most of the bytes were never relevant to the turn at hand.

5. The model's prose is not durable state.

OpenClaw ships a file at src/auto-reply/reply/agent-runner-reminder-guard.ts that regex-matches the model's output for claims like "I'll remind you", "I'll follow up", "I'll check back".

If a match fires AND no cron job was added this turn AND no existing cron exists on the session key, the harness appends this literal sentence to the transcript:

Note: I did not schedule a reminder in this turn, so this will not trigger automatically.

That is not natural-language understanding. It is a state reconciliation check between the model's prose and the durable scheduler. The model said it would do a thing. The harness asks the question the user actually cares about: did the durable system record the commitment? If not, the user gets told.
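The check is small enough to sketch in full. The regex and function signature below are illustrative guesses, not OpenClaw's actual code; the three trigger phrases and the appended note are the ones quoted above, and the sketch returns the amended reply instead of writing to a transcript.

```python
import re

# Matches both straight and curly apostrophes.
REMINDER_CLAIM = re.compile(
    r"\bI['’]ll (?:remind you|follow up|check back)\b", re.IGNORECASE
)
NOTE = (
    "Note: I did not schedule a reminder in this turn, "
    "so this will not trigger automatically."
)

def reconcile_reminder(reply, cron_added_this_turn, session_has_cron):
    """Append the note only when the prose promises something the scheduler lacks."""
    claimed = REMINDER_CLAIM.search(reply) is not None
    if claimed and not cron_added_this_turn and not session_has_cron:
        return f"{reply}\n\n{NOTE}"
    return reply

unbacked = reconcile_reminder("Sure, I'll remind you tomorrow.", False, False)
backed = reconcile_reminder("Sure, I'll remind you tomorrow.", True, False)
```

`unbacked` ends with the note. `backed` is returned unchanged, because the scheduler already holds the commitment.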

The same shape shows up elsewhere. ESAA forbids agents from writing files directly; agents emit agent.result intentions and the orchestrator validates and applies. Codex's update_plan is a UX tool, not a reasoning tool. Claude Code workers return structured payloads instead of free prose.

Why it matters: compaction will eventually eat the model's prose. Whatever was only said, and never recorded, is gone.

6. The session is a tree. Not a flat log.

Pi's session is a JSONL tree of typed entries shaped like { id, parentId, timestamp, type }. Types include header, user, assistant, toolResult, bashExecution, custom, compaction, branchSummary.

Branching is first class. An abandoned exploration gets a branchSummary entry so the work survives navigation away. Compaction inserts a summary pointer without destroying raw history. Plugins write their own state into the session file as custom entries, no separate database needed.

A flat transcript represents none of this. You either lose abandoned work, lose the ability to fork from a past turn, or invent a second source of truth to track what the transcript cannot.

Why it matters: the moment you want to fork from turn 14 to try a different path, a flat log forces you to copy-edit a file. A tree handles it as one new parentId.
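A minimal sketch of the tree, assuming an in-memory dict in place of Pi's JSONL file and a hypothetical `SessionTree` class. The entry fields follow the `{ id, parentId, timestamp, type }` shape above.

```python
import itertools
import time

class SessionTree:
    def __init__(self):
        self.entries = {}
        self._ids = itertools.count(1)

    def append(self, parent_id, entry_type, **data):
        """Add an entry under parent_id. Forking is just reusing an older parent."""
        if parent_id is not None and parent_id not in self.entries:
            raise KeyError(f"unknown parent: {parent_id}")
        entry_id = f"e{next(self._ids)}"
        self.entries[entry_id] = {
            "id": entry_id,
            "parentId": parent_id,
            "timestamp": time.time(),
            "type": entry_type,
            **data,
        }
        return entry_id

    def path_to(self, leaf_id):
        """Follow parentId links back to the root; this is the transcript for one branch."""
        path = []
        cursor = leaf_id
        while cursor is not None:
            entry = self.entries[cursor]
            path.append(entry)
            cursor = entry["parentId"]
        return list(reversed(path))

tree = SessionTree()
u1 = tree.append(None, "user", text="fix the bug")
a1 = tree.append(u1, "assistant", text="two options")
u2 = tree.append(a1, "user", text="try option A")
fork = tree.append(a1, "user", text="try option B")  # fork from the same turn

branch_a = [e["id"] for e in tree.path_to(u2)]
branch_b = [e["id"] for e in tree.path_to(fork)]
```

Both branches share `e1` and `e2` and diverge after that. Nothing was copied, and the abandoned branch is still addressable by its leaf id.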

The pattern underneath

Every one of these rules has the same shape. The model is not the runtime.

The runtime parses the stream. Owns the policy. Transforms the state. Indexes the capabilities. Reconciles the prose against durable systems. Structures the history. The model is invited in for one turn at a time, on a compiled prompt, under a typed contract, against a state it does not directly own.

That is the only sentence I would underline twice. Every harness in the corpus, on every axis I looked at, is built around it. The mistakes I see in production code are almost always one team or another quietly trusting the model to be the runtime.

Related reading: the system prompt is compiler output, not a string, a seventh thing those six harnesses agree on, up close.