Drover

AI-powered document classification that herds files into organized folders.

Problem

Scanned PDFs, downloaded invoices, medical records, and email attachments accumulate in Downloads. Filing them by hand is tedious; filing them inconsistently is worse, because every later search depends on the naming convention holding up.

Drover classifies each document by domain, category, and type against a controlled taxonomy, extracts the vendor and relevant date, and suggests a standardized filename and destination path. The name comes from the herding dogs that drive livestock, because the job is the same: take scattered documents and move them into a consistent structure.

Architecture

A linear pipeline with a plugin point at each processing stage.

[document] → [loader] → [sampler] → [classifier] → [pathbuilder] → [output]
                 │          │              │               │
                 ▼          ▼              ▼               ▼
              [format]  [strategy]     [taxonomy]     [naming policy]

The loader handles PDFs, Office formats, images, email, and plain text. The sampler picks pages by strategy: full, first-n, bookends, or adaptive for long documents, so the classifier is not handed a 200-page PDF verbatim. The classifier runs a 7-step chain-of-thought prompt: extract key information, evaluate dates by priority, determine doctype, analyze candidate categories, determine domain, extract vendor, synthesize subject.
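The sampling strategies can be sketched as a function from page count to page indices. This is a minimal illustration, not Drover's code: the function name, the parameters n and long_threshold, and in particular the behavior of "adaptive" (read short documents in full, fall back to bookends for long ones) are assumptions.

```python
from typing import List


def sample_pages(page_count: int, strategy: str = "adaptive",
                 n: int = 3, long_threshold: int = 10) -> List[int]:
    """Return the zero-based page indices to hand to the classifier."""
    if page_count < 0 or n < 1:
        raise ValueError("page_count must be >= 0 and n must be >= 1")
    pages = list(range(page_count))

    if strategy == "full":
        return pages
    if strategy == "first-n":
        return pages[:n]
    if strategy == "bookends":
        # First n and last n pages; short documents are returned whole
        # so that no page is listed twice.
        if page_count <= 2 * n:
            return pages
        return pages[:n] + pages[-n:]
    if strategy == "adaptive":
        # Assumption: "adaptive" means full for short documents and
        # bookends once the document exceeds long_threshold pages.
        if page_count <= long_threshold:
            return pages
        return sample_pages(page_count, "bookends", n)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Under these assumptions a 200-page PDF sampled with bookends and n=2 yields only pages 0, 1, 198, and 199, which keeps the prompt small while still covering the cover page and the closing pages.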

The pathbuilder applies a naming policy (NARA-compliant by default: {doctype}-{vendor}-{subject}-{date}.pdf) to the classifier's structured output and produces a suggested path. Output is either JSON per document or a JSONL batch for pipelines. On macOS, a separate tag command writes the classification as native filesystem tags.
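A naming policy of this shape can be sketched in a few lines. Only the {doctype}-{vendor}-{subject}-{date}.pdf pattern comes from the description above; the slug rule (underscores inside a component so hyphens stay reserved as field separators), the YYYYMMDD date format, and the domain/category folder layout are assumptions made for illustration.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def slug(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def build_path(domain: str, category: str, doctype: str, vendor: str,
               subject: str, doc_date: date, ext: str = ".pdf") -> PurePosixPath:
    """Build a suggested destination path from classifier fields."""
    # Pattern from the README: {doctype}-{vendor}-{subject}-{date}.pdf
    # Assumption: the date is rendered as YYYYMMDD.
    filename = (
        f"{slug(doctype)}-{slug(vendor)}-{slug(subject)}-{doc_date:%Y%m%d}{ext}"
    )
    # Assumption: documents are filed under domain/category/.
    return PurePosixPath(slug(domain)) / slug(category) / filename


suggested = build_path("Financial", "Banking", "statement", "Chase Bank",
                       "Checking Account", date(2024, 3, 15))
print(suggested)
```

Because the path is derived only from structured fields, the same classification always produces the same filename, which is what makes later searches predictable.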

LLM providers are swappable: Ollama (local, the default), OpenAI, Anthropic, OpenRouter. Structured output comes from LangChain's with_structured_output over a Pydantic schema, so parsing is never a regex against model prose.
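As a rough sketch of what schema-driven parsing looks like, the model's reply is validated against a Pydantic class instead of being picked apart with regular expressions. The class name, field names, and helper below are hypothetical and follow the fields named in this README, not Drover's actual schema.

```python
import json
from datetime import date

from pydantic import BaseModel, ValidationError


class Classification(BaseModel):
    """Hypothetical schema mirroring the fields the README describes."""
    domain: str
    category: str
    doctype: str
    vendor: str
    document_date: date  # ISO strings such as "2024-03-15" are coerced to date
    subject: str


def parse_classification(raw_json: str) -> Classification:
    """Validate a JSON reply; raises ValidationError on missing or bad fields."""
    return Classification(**json.loads(raw_json))


# With a LangChain chat model, the same class would be passed as
#   structured_llm = chat_model.with_structured_output(Classification)
# so that structured_llm.invoke(prompt) returns a Classification instance.

reply = (
    '{"domain": "financial", "category": "banking", "doctype": "statement", '
    '"vendor": "Chase", "document_date": "2024-03-15", "subject": "checking"}'
)
result = parse_classification(reply)
print(result.vendor, result.document_date)
```

A reply that omits a field or carries an unparseable date fails loudly with a ValidationError, so a malformed classification never reaches the pathbuilder.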

Evaluation

Drover ships an evaluation harness: a labeled ground-truth file (eval/ground_truth.jsonl) and an evaluate command that scores classifier output against those labels.
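The scoring step amounts to per-field accuracy over labeled rows. The sketch below assumes each JSONL row carries a "file" key plus "domain", "category", and "doctype" labels; those key names and the function names are guesses for illustration, and the real ground-truth format may differ.

```python
import json
from typing import Dict, Iterable, List, Sequence

# Assumption: these are the labeled fields in each ground-truth row.
FIELDS = ("domain", "category", "doctype")


def load_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL content, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def field_accuracy(truth: Sequence[dict], predictions: Sequence[dict],
                   fields: Sequence[str] = FIELDS) -> Dict[str, float]:
    """Fraction of ground-truth rows whose predicted field matches the label."""
    predicted_by_file = {p["file"]: p for p in predictions}
    correct = {field: 0 for field in fields}
    for row in truth:
        # A document with no prediction counts as wrong for every field.
        predicted = predicted_by_file.get(row["file"], {})
        for field in fields:
            if predicted.get(field) == row[field]:
                correct[field] += 1
    total = len(truth)
    return {f: (correct[f] / total if total else 0.0) for f in fields}


truth = load_jsonl([
    '{"file": "a.pdf", "domain": "financial", "category": "banking", "doctype": "statement"}',
    '{"file": "b.pdf", "domain": "medical", "category": "records", "doctype": "report"}',
])
predictions = load_jsonl([
    '{"file": "a.pdf", "domain": "financial", "category": "banking", "doctype": "invoice"}',
    '{"file": "b.pdf", "domain": "medical", "category": "records", "doctype": "report"}',
])
print(field_accuracy(truth, predictions))
```

Reporting accuracy per field rather than as a single number makes it visible when a prompt change improves domain accuracy while leaving doctype errors untouched.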

Recorded result

On household documents, domain accuracy moved from ~75% with a naive single-shot prompt to ~92% after introducing the 7-step chain-of-thought template. Remaining misclassifications shifted from "wrong domain" to edge cases, which I treat as the acceptable failure mode.

Full rationale and evidence in ADR-001: Chain-of-Thought Prompting.

Three learnings