Notes

Working observations on building agentic workflows in production.

May 2026

  • Evaluating Design in Agentic Development

    A document-loader A/B test where the synthetic corpus said reject and a real-world rerun said accept. Notes on corpus shape, joint metrics, and the evaluation criteria I had wrong.

    • evaluation
    • methods

April 2026

  • Evaluating agent skill effectiveness

    A 60-trial A/B test of the Advisors plugin against a no-skill baseline: 96% vs. 45%, with zero parse failures. Open harness, clean-room isolation, and results reproducible from a public repo.

    • evaluation
    • claude-code
    • methods