Advisors

A panel of role-specific skills that argue from focused perspectives and disagree on purpose.

Problem

Assistants default to agreement. Most strategic decisions do not need another voice confirming the user's instinct; they need a panel that will push back. Advisors is a set of five Claude skills modeled on Lincoln's cabinet pattern: appoint your sharpest opponents, not loyalists, and let them disagree in the room.

Four role-specific advisors (Visionary, Skeptic, Operator, Ethicist) each analyze the same decision through a fixed lens. An orchestrator runs them in parallel and compresses their outputs into a single recommendation with explicit areas of agreement and disagreement.

Architecture

Five skills conforming to the agentskills.io specification: four individual advisors plus an orchestrator (advisory-panel). Each advisor is a self-contained skill with its own system prompt, evaluation rubric, and example set. They are portable across any harness that implements the spec.

                  [user prompt]
                        │
                        ▼
              ┌──advisory-panel──┐
              │   orchestrator   │
              └────────┬─────────┘
                       │
        ┌──────┬───────┼───────┐
        ▼      ▼       ▼       ▼
   visionary skeptic operator ethicist
        │      │       │       │
        └──────┴──┬────┴───────┘
                  ▼
           synthesis pass
                  │
                  ▼
     [agreement]  [disagreement]
     [risk-adjusted recommendation]

The panel launches the four advisors as independent subagents, waits for all of them to finish, then runs a synthesis pass over the collected outputs. The synthesis pass is where the hard work happens: compressing four long analyses into the three output sections without flattening the disagreements that make the panel useful.
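
The fan-out and fan-in described above can be sketched in a few lines of Python. This is a minimal illustration, not the panel's actual implementation: run_advisor and synthesize are hypothetical stand-ins for the real subagent launch and the model-driven synthesis pass, and the three output keys simply mirror the sections in the diagram.

```python
import asyncio

ADVISORS = ("visionary", "skeptic", "operator", "ethicist")


async def run_advisor(role: str, decision: str) -> str:
    """Hypothetical stand-in for launching one advisor as a subagent."""
    await asyncio.sleep(0)  # placeholder for the real subagent call
    return f"[{role}] analysis of: {decision}"


def synthesize(outputs: dict[str, str]) -> dict[str, object]:
    """Placeholder synthesis pass (the real one would be a model call).

    It carries each advisor's position through unmerged, so the sketch
    never collapses four voices into one.
    """
    return {
        "agreement": [],
        "disagreement": [f"{role}: {text}" for role, text in outputs.items()],
        "recommendation": "",
    }


async def run_panel(decision: str) -> dict[str, object]:
    # Fan out: all four advisors start together; gather waits for every one
    # and returns results in the same order as ADVISORS.
    results = await asyncio.gather(*(run_advisor(r, decision) for r in ADVISORS))
    # Fan in: a single synthesis pass over the collected outputs.
    return synthesize(dict(zip(ADVISORS, results)))


if __name__ == "__main__":
    print(asyncio.run(run_panel("Should we pivot to enterprise?")))
```

Keeping synthesis as a separate step after gather makes it possible to iterate on the synthesis prompt without touching the advisors.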

Each advisor can also be invoked on its own (/advisor-skeptic, /advisor-operator, and so on) for focused analysis from a single perspective.

Evaluation

Every skill ships with an evals.json defining per-assertion pass/fail criteria against three shared scenarios (personal career decision, small-business technology adoption, B2B SaaS strategic pivot). An LLM-as-judge grades each output and returns its verdict as structured JSON.
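
To make per-assertion grading concrete, here is a small Python sketch of how one judge response might be scored. The JSON shape (a "results" list with "id" and "passed" fields) is an assumption for illustration; the actual evals.json and judge schemas are defined in the repo and are not reproduced here.

```python
import json


def score_trial(judge_raw: str, assertion_ids: list[str]) -> dict:
    """Parse one judge response and compute the per-assertion pass rate.

    Assumed (hypothetical) judge format:
        {"results": [{"id": "...", "passed": true, "evidence": "..."}]}
    """
    total = len(assertion_ids)
    try:
        payload = json.loads(judge_raw)
        verdicts = {r["id"]: bool(r["passed"]) for r in payload["results"]}
    except (json.JSONDecodeError, KeyError, TypeError):
        # Unparseable or malformed judge output is recorded, not guessed at.
        return {"parse_failure": True, "passed": 0, "total": total, "rate": 0.0}

    # An assertion the judge did not mention counts as a failure.
    passed = sum(verdicts.get(a, False) for a in assertion_ids)
    rate = passed / total if total else 0.0
    return {"parse_failure": False, "passed": passed, "total": total, "rate": rate}
```

Tracking parse failures separately from assertion failures keeps a malformed judge response from being mistaken for a bad skill output.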

Every trial runs twice per scenario: once with the skill injected as the system prompt, once against a baseline prompt. Both run through claude --bare so hooks, plugins, and CLAUDE.md auto-discovery do not leak into the measurement. Iterations are keyed by git shorthash, with a -dirty suffix when skill files have uncommitted changes.
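
The iteration key can be derived with two standard git commands. The sketch below is one plausible way to do it, not the runner's actual code: the skills directory name is a hypothetical default, and the pure iteration_key function is split out so the naming rule can be checked without a repository.

```python
import subprocess


def iteration_key(shorthash: str, status_porcelain: str) -> str:
    """Build the key: the short hash, plus -dirty if git reported any changes."""
    key = shorthash.strip()
    return f"{key}-dirty" if status_porcelain.strip() else key


def current_iteration_key(skill_dir: str = "skills") -> str:
    """Read the key from the repo in the current working directory.

    skill_dir is a hypothetical path; point it at wherever the skill files live.
    """
    shorthash = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    # --porcelain prints one line per changed or untracked path under skill_dir,
    # and nothing at all when that subtree is clean.
    status = subprocess.run(
        ["git", "status", "--porcelain", "--", skill_dir],
        capture_output=True, text=True, check=True,
    ).stdout
    return iteration_key(shorthash, status)
```

Restricting the status check to the skill directory means unrelated edits elsewhere in the repo do not mark an iteration as dirty.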

Published result

The advisory panel scored 96% against the rubric. The bare-prompt baseline scored 45%. 60 graded trials per release, zero parse failures. The runner, rubric, and per-assertion evidence are in the public repo.

Evaluation framework and method (README)

Three learnings

  1. The single-skill evaluation pattern in the agentskills.io spec is not enough for a multi-skill plugin. I extended it with two cross-cutting analyses: perspective diversity (are the four advisors actually producing distinct takes, or converging?) and panel coherence (does orchestration degrade the quality of individual advisor outputs?). Both are gaps the spec does not cover, and both needed their own file formats under analysis/.
  2. Clean-room isolation is the difference between measuring the skill and measuring the harness. Running trials through claude --bare strips hooks, plugins, and CLAUDE.md auto-discovery so only the skill's contribution shows up in the grade. Any eval that runs inside the same runtime configuration the developer uses is measuring both.
  3. Synthesis is harder than the advisors. Any decent role-scoped prompt produces a plausible take. The difficult work is compressing four long analyses into agreement, disagreement, and a risk-adjusted verdict without softening the disagreements. Most synthesis iterations failed by collapsing the four voices into consensus that did not exist.
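
As a rough illustration of the perspective-diversity question in the first learning, the Python sketch below scores each pair of advisor outputs by word-overlap (Jaccard) distance. This metric is an assumption chosen for simplicity: the text above does not say how diversity is actually measured, and the real analysis may well use a judge model rather than token overlap.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercased word set; a deliberately crude proxy for content."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard_distance(a: str, b: str) -> float:
    """0.0 means identical vocabularies, 1.0 means no shared words."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def perspective_diversity(outputs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise distance for every pair of advisors (six pairs for four roles)."""
    return {
        (r1, r2): jaccard_distance(outputs[r1], outputs[r2])
        for r1, r2 in combinations(sorted(outputs), 2)
    }
```

Pairs whose distance drifts toward zero across iterations would be a signal that two advisors are converging on the same take.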

Links