Methods

How I build production systems: harness engineering, evaluation, and multi-instance coordination.

The autonomous development pipeline

I built an end-to-end coding pipeline, Daedalus, that takes one tracked issue from claim to commit with no human turn in between: an orchestrator plans the change, an implementer writes it test-first, a review gate judges the diff, the quality gates run, and it commits with the issue reference. The controls that make leaving such a run unattended safe are below.

Test-first, reviewed cold

The implementer writes the failing test before the change and makes it pass. A separate review-gate agent then judges the finished diff with no edit tools and none of the implementer's context: it reruns the project's own test and lint commands, reads the issue body, and confirms the diff satisfies each acceptance criterion before returning APPROVE or REJECT with findings tied to file and line.

Red-green TDD: the test exists and fails before the code that satisfies it.
The reviewer did not write the code, so it judges the diff, not its own reasoning.
The reviewer runs the gates itself rather than trusting the implementer's report.

Controls for an unattended run

Removing the human from the loop means the guarantees have to be mechanical. Five controls do that work, each enforced by the harness rather than requested in a prompt.

Reviewability: every run emits test output, lint output, the verdict, the diff status, and each tracker change, so a human can audit it after the fact.
Scope honesty: hooks deny edits to env files, lockfiles, the .git directory, and CI config, and the review gate auto-rejects any file changed outside the plan.
Bounded parallelism: implementation stays in one agent that holds the whole change; the pipeline fans out only to run independent reviewers against a high-risk diff.
Cost caps: per-agent turn limits and an overall budget, with the expensive multi-reviewer pass reserved for changes flagged high-risk.
A clean stop: bounded retries, then a blocked-with-reason terminal state instead of a run that loops forever.

The decision about where to parallelize follows the dependency structure, not the agent count. A single code change is shared-context work (a signature and its callers, a schema and its migrations), so it belongs in one agent. Independent judgments of a finished diff fan out cleanly, which is the one place Daedalus runs agents in parallel.

I wrote up the full reasoning, with the published data behind the parallelization and cost decisions, in Controls for an autonomous coding pipeline. The pipeline itself is documented on the Daedalus project page.

Harness engineering

I treat my harness (the runtime coordinating agent work) as configurable infrastructure I can tune at every layer. Three patterns do most of the work.

Lifecycle hooks

I use hooks to catch unsafe tool calls before they run, hand off follow-up work to the right subagent, and reload project conventions after the runtime compacts a long session. The policy lives in config where I can audit and version it.

Pre-action hooks reject unsafe edits before they hit the filesystem or network.
Post-action hooks route the next step to the appropriate skill or subagent.
Post-compaction hooks reload the rules the agent needs after the runtime summarizes a session.

Scoped subagents

I split work across subagents scoped by role (reviewer, planner, implementer, auditor) and give each one only the tools its role needs. If the reviewer does not have write access, it cannot write. The boundary is declared in config and enforced by the harness.

Correctness: a subagent cannot invoke tools outside its mandate.
Speed: smaller tool surfaces produce faster, more focused outputs.
Auditability: role boundaries are explicit in configuration.

Portable skills

I write skills against the agentskills.io specification so they are not locked to one vendor's harness. I publish them through Backchain.

Context and model budgeting

Two things drive my cost and latency: which model does the work, and how much context loads before it starts. I tune both.

Layered configuration

My setup pulls from three layers at runtime: a user-level set of broadly useful skills (most disabled by default), a domain layer for shared standards and voice across a body of related work, and project-specific rules auto-loaded per repo. Loading all of it everywhere wastes tokens. Loading none of it means the agent forgets its conventions every session.

User layer: widely reusable skills, opt-in per project.
Domain layer: standards and voice guides shared across related work.
Project layer: repository-specific rules, auto-loaded.

Rules, standards, references

My rules are thin and always loaded: naming, git workflow, issue tracking, voice. Standards and references hold the depth (registries, specs, tables) and only load when a skill asks for them.

Rules: always loaded, one topic per file.
Standards: full registries and specifications, loaded on demand.
References: external URL lists, vocabulary, and reading material pulled when relevant.

Model-tier routing

Each skill and subagent declares the model appropriate to the role. Triage, summarization, and shallow lookups go to the fast tier. Implementation and editing go to the mid tier. Architecture, review, and planning go to the capable tier.

Fast tier: triage, summarization, and shallow lookups.
Mid tier: implementation and editing.
Capable tier: architecture, review, and planning.

Evaluation practice

Prompts have no ground truth, so "this feels better" is not evidence. I grade every skill change against a fixed rubric in a reproducible harness and compare runs.

Methods

I use LLM-as-judge grading against per-criterion rubrics, and I run trials in clean-room sessions (claude --bare) so my plugins and CLAUDE.md files do not leak into the evaluation. The full harness runs on every release.

Assertion-based LLM-as-judge grading against per-criterion rubrics.
Clean-room isolation (claude --bare) prevents state leakage between trials.
Python runner in advisors/evals/, reproducible and open source.

Worked example: Engram plugin

Engram's three skills were graded across nine scenarios, each run with the skill and against a bare-prompt baseline, with an LLM-as-judge scoring both the text output and the filesystem state the skill left behind. Mean pass rate was 90% with the skill versus 77% without, a 13-point gain. The lift concentrates in the scenarios where the skill adds structure (an explicit checkpoint format, staleness signals folded into the briefing); scenarios where the baseline already does the job show no gap, which the eval records rather than hides.

The whole run, eighteen executions plus eighteen grading calls, cost $1.07 and finished in about five minutes. A cheap, reproducible eval is what lets me treat a skill change as a measured delta instead of a hunch.

Evaluation framework and method (README)

Multi-instance ownership for parallel agents

I run multiple agent instances in parallel on the same project. Without coordination they overwrite each other's work. This pattern is how I keep them out of each other's way.

Each instance takes a branch-scoped identity and claims tasks atomically from a shared state store. A claim fails immediately when another instance already owns the task. One instance refactors a module, another fixes a bug, a third writes tests. The coordination layer is git worktrees plus a shared tracker. No central scheduler.

The approach builds on Steve Yegge's Beads and Dolt. Claims persist through context compaction and survive session restarts. An issue stays claimed until the branch is merged.

The pattern extends beyond any single harness. Any orchestrator running multiple workers against a shared state store can use the same claim-and-release lifecycle.