Evaluating Design in Agentic Development
A document-loader A/B where the synthetic corpus said reject and a real-world rerun said accept. Notes on corpus shape, joint metrics, and the evaluation criteria I had wrong.
- evaluation
- methods
Working observations on building agentic workflows in production.
A 60-trial A/B of the Advisors plugin against a no-skill baseline: 96% vs 45%, zero parse failures. Open harness, clean-room isolation, reproducible from a public repo.