Methods
How I build production systems: harness engineering, evaluation, and multi-instance coordination.
Harness engineering
I treat my harness (the runtime coordinating agent work) as configurable infrastructure I can tune at every layer. Three patterns do most of the work.
Lifecycle hooks
I use hooks to catch unsafe tool calls before they run, hand off follow-up work to the right subagent, and reload project conventions after the runtime compacts a long session. The policy lives in config where I can audit and version it.
- Pre-action hooks reject unsafe edits before they hit the filesystem or network.
- Post-action hooks route the next step to the appropriate skill or subagent.
- Post-compaction hooks reload the rules the agent needs after the runtime summarizes a session.
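A minimal sketch of what a pre-action hook could look like, assuming the harness passes each pending tool call to a Python function and skips the call when the function rejects it. The `ToolCall` shape, the tool names, and the policy lists are hypothetical illustrations, not any particular harness's API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical policy: path components the agent must never edit,
# and command prefixes it must never run.
PROTECTED_PATHS = (".git", ".env", "secrets")
BLOCKED_COMMAND_PREFIXES = ("rm -rf", "git push --force")


@dataclass(frozen=True)
class ToolCall:
    tool: str    # illustrative tool names: "edit" or "shell"
    target: str  # file path for edits, command line for shell calls


def pre_action_hook(call: ToolCall) -> tuple[bool, str]:
    """Return (allowed, reason); a harness would skip the call when allowed is False."""
    if call.tool == "edit":
        # Reject the edit if any path component is on the protected list.
        parts = PurePosixPath(call.target).parts
        if any(part in PROTECTED_PATHS for part in parts):
            return False, f"edit touches protected path: {call.target}"
    if call.tool == "shell":
        command = call.target.strip()
        if command.startswith(BLOCKED_COMMAND_PREFIXES):
            return False, f"blocked command: {command}"
    return True, "ok"
```

Keeping the policy in plain data at the top of the file is what makes it easy to audit and version alongside the rest of the config.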
Scoped subagents
I split work across subagents scoped by role (reviewer, planner, implementer, auditor) and give each one only the tools its role needs. If the reviewer does not have write access, it cannot write. The boundary is declared in config and enforced by the harness.
- Correctness: a subagent cannot invoke tools outside its mandate.
- Speed: smaller tool surfaces produce faster, more focused outputs.
- Auditability: role boundaries are explicit in configuration.
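The role boundary can be sketched as a per-role tool allowlist checked before every tool call. The roles come from the text above; the tool names and the `authorize` function are hypothetical, and a real harness would read the allowlists from its own config format.

```python
# Hypothetical role-to-tool allowlists; real tool names depend on the harness.
ROLE_TOOLS: dict[str, frozenset[str]] = {
    "reviewer": frozenset({"read", "search"}),
    "planner": frozenset({"read", "search"}),
    "implementer": frozenset({"read", "search", "write", "shell"}),
    "auditor": frozenset({"read"}),
}


class ToolNotPermitted(Exception):
    """Raised when a subagent requests a tool outside its role's allowlist."""


def authorize(role: str, tool: str) -> None:
    # Unknown roles get an empty allowlist, so the check fails closed.
    allowed = ROLE_TOOLS.get(role, frozenset())
    if tool not in allowed:
        raise ToolNotPermitted(f"{role!r} may not call {tool!r}")
```

Because the reviewer's allowlist has no `write` entry, `authorize("reviewer", "write")` raises instead of relying on the prompt to discourage writes.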
Portable skills
I write skills against the agentskills.io specification so they are not locked to one vendor's harness. I publish them through Backchain.
Context and model budgeting
Two things drive my cost and latency: which model does the work, and how much context loads before it starts. I tune both.
Layered configuration
My setup pulls from three layers at runtime: a user-level set of broadly useful skills (most disabled by default), a domain layer for shared standards and voice across a body of related work, and project-specific rules auto-loaded per repo. Loading all of it everywhere wastes tokens. Loading none of it means the agent forgets its conventions every session.
- User layer: widely reusable skills, opt-in per project.
- Domain layer: standards and voice guides shared across related work.
- Project layer: repository-specific rules, auto-loaded.
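One way to picture the layering is as three dictionaries of skill toggles merged in order, with the more specific layer winning. This is a sketch under that assumption; the skill names and the `resolve_skills` function are made up for illustration.

```python
def resolve_skills(
    user: dict[str, bool],
    domain: dict[str, bool],
    project: dict[str, bool],
) -> list[str]:
    """Merge the three layers; later (more specific) layers override earlier ones."""
    merged = {**user, **domain, **project}
    # Only skills that end up enabled are loaded into context.
    return sorted(name for name, enabled in merged.items() if enabled)


# Hypothetical skill names: user-level skills are off until a project opts in.
user_layer = {"changelog": False, "commit-style": True}
domain_layer = {"voice-guide": True}
project_layer = {"changelog": True, "commit-style": False}

active = resolve_skills(user_layer, domain_layer, project_layer)
# active == ["changelog", "voice-guide"]
```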
Rules, standards, references
My rules are thin and always loaded: naming, git workflow, issue tracking, voice. Standards and references hold the depth (registries, specs, tables) and only load when a skill asks for them.
- Rules: always loaded, one topic per file.
- Standards: full registries and specifications, loaded on demand.
- References: external URL lists, vocabulary, and reading material pulled when relevant.
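The split between always-loaded rules and on-demand standards amounts to eager versus lazy loading. A minimal sketch, assuming documents are held as in-memory strings keyed by name; the `ContextLoader` class and the document names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ContextLoader:
    rules: dict[str, str]      # thin documents, loaded up front
    standards: dict[str, str]  # deep documents, loaded only when requested
    loaded: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Rules enter the context immediately.
        self.loaded.update(self.rules)

    def require(self, name: str) -> str:
        """Called by a skill that needs a standard; loads it on first use."""
        if name not in self.loaded:
            self.loaded[name] = self.standards[name]
        return self.loaded[name]


loader = ContextLoader(
    rules={"naming": "Use kebab-case for branch names."},
    standards={"label-registry": "Full table of issue labels and meanings."},
)
```

Until a skill calls `loader.require("label-registry")`, only the `naming` rule occupies context.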
Model-tier routing
Each skill and subagent declares the model appropriate to the role. Triage, summarization, and shallow lookups go to the fast tier. Implementation and editing go to the mid tier. Architecture, review, and planning go to the capable tier.
- Fast tier: triage, summarization, and shallow lookups.
- Mid tier: implementation and editing.
- Capable tier: architecture, review, and planning.
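The routing above reduces to a lookup table from task kind to tier. The sketch below uses the tiers and task kinds named in the text; the `route` function and the fallback to the capable tier for unlisted work are assumptions, not something the text specifies.

```python
from enum import Enum


class Tier(Enum):
    FAST = "fast"
    MID = "mid"
    CAPABLE = "capable"


# Task kinds taken from the tiers described above.
TASK_TIERS: dict[str, Tier] = {
    "triage": Tier.FAST,
    "summarization": Tier.FAST,
    "lookup": Tier.FAST,
    "implementation": Tier.MID,
    "editing": Tier.MID,
    "architecture": Tier.CAPABLE,
    "review": Tier.CAPABLE,
    "planning": Tier.CAPABLE,
}


def route(task_kind: str) -> Tier:
    # Assumption: unlisted work falls back to the capable tier,
    # trading cost for safety when the task is unclassified.
    return TASK_TIERS.get(task_kind, Tier.CAPABLE)
```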
Evaluation practice
Prompts have no ground truth, so "this feels better" is not evidence. I grade every skill change against a fixed rubric in a reproducible harness and compare runs.
Methods
I use LLM-as-judge grading against per-criterion rubrics, and I run trials in clean-room sessions (claude --bare) so my plugins and CLAUDE.md files do not leak into the evaluation. The full harness runs on every release.
- Assertion-based LLM-as-judge grading against per-criterion rubrics.
- Clean-room isolation (claude --bare) prevents state leakage between trials.
- Python runner in advisors/evals/, reproducible and open source.
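To make the grading step concrete, here is a sketch of how per-criterion verdicts could be aggregated into a pass rate and a parse-failure count. It is not the runner in advisors/evals/; it assumes the judge replies with a JSON object mapping each criterion name to true or false, and the criterion names are invented.

```python
import json
from typing import Optional


def parse_verdict(raw: str) -> Optional[dict[str, bool]]:
    """Parse one judge reply; return None when it is not a criterion-to-bool object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not all(isinstance(value, bool) for value in data.values()):
        return None
    return data


def score_trials(raw_verdicts: list[str]) -> tuple[float, int]:
    """Return (fraction of criteria passed, number of unparseable replies)."""
    passed = total = parse_failures = 0
    for raw in raw_verdicts:
        verdict = parse_verdict(raw)
        if verdict is None:
            parse_failures += 1
            continue
        passed += sum(verdict.values())
        total += len(verdict)
    rate = passed / total if total else 0.0
    return rate, parse_failures


# Three hypothetical judge replies: two valid, one malformed.
replies = [
    '{"cites_sources": true, "stays_in_role": false}',
    "not json",
    '{"cites_sources": true, "stays_in_role": true}',
]
rate, failures = score_trials(replies)
# rate == 0.75, failures == 1
```

Counting parse failures separately keeps a malformed judge reply from silently lowering or raising the score.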
Worked example: Advisors plugin
Advisors hit 96% on the rubric. The bare-prompt baseline hit 45%. Sixty graded trials per release, zero parse failures. The runner and rubric are in advisors/evals/ if you want to reproduce them.
Multi-instance ownership for parallel agents
I run multiple agent instances in parallel on the same project. Without coordination they overwrite each other's work. This pattern is how I keep them out of each other's way.
Each instance takes a branch-scoped identity and claims tasks atomically from a shared state store. A claim fails immediately when another instance already owns the task. One instance refactors a module, another fixes a bug, a third writes tests. The coordination layer is git worktrees plus a shared tracker. No central scheduler.
The approach builds on Steve Yegge's Beads and on Dolt. Claims persist through context compaction and survive session restarts. An issue stays claimed until the branch is merged.
The pattern extends beyond any single harness. Any orchestrator running multiple workers against a shared state store can use the same claim-and-release lifecycle.
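The claim-and-release lifecycle can be sketched with any store that offers an atomic insert. The example below swaps in SQLite from the Python standard library purely for illustration; it is not how Beads or Dolt implement claims, and the table layout, function names, and branch names are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # The primary key on task_id is what makes a second claim fail.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims ("
        "task_id TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def claim(conn: sqlite3.Connection, task_id: str, owner: str) -> bool:
    """Try to claim a task; return False immediately if another owner holds it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO claims (task_id, owner) VALUES (?, ?)",
                (task_id, owner),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def release(conn: sqlite3.Connection, task_id: str, owner: str) -> bool:
    """Release a claim, e.g. after the branch merges; only the owner can release."""
    with conn:
        cursor = conn.execute(
            "DELETE FROM claims WHERE task_id = ? AND owner = ?",
            (task_id, owner),
        )
    return cursor.rowcount == 1


store = open_store()
first = claim(store, "task-1", "feature/refactor-module")  # succeeds
second = claim(store, "task-1", "fix/login-bug")           # fails: already owned
```

Pointing `open_store` at a file path instead of `:memory:` would let the claim outlive the process, which is the property that lets claims survive session restarts.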
Published work
- github.com/backchainai/backchain-plugins. Portable skills conforming to the agentskills.io specification.
- backchain-plugins/advisors. Multi-perspective evaluation plugin.
- advisors/evals/README.md. Open-source evaluation framework.