Drover
AI-powered document classification that herds files into organized folders.
Problem
Scanned PDFs, downloaded invoices, medical records, and email attachments accumulate in Downloads. Filing them by hand is tedious; filing them inconsistently is worse, because the next search depends on the naming convention holding up.
Drover classifies each document by domain, category, and type, extracts the vendor and relevant date, and suggests a standardized filename and destination path against a controlled taxonomy. It is named after herding dogs that drive livestock, because that is the job: take scattered documents and move them into a consistent structure.
Architecture
A linear pipeline with extensible plugin points at every stage.
[document] → [loader] → [sampler] → [classifier] → [pathbuilder] → [output]
                │           │            │               │
                ▼           ▼            ▼               ▼
             [format]  [strategy]   [taxonomy]    [naming policy]
The loader handles PDFs, Office formats, images, email, and plain text. The sampler picks pages by strategy: full, first-n, bookends, or adaptive for long documents, so the classifier is not handed a 200-page PDF verbatim. The classifier runs a 7-step chain-of-thought prompt: extract key information, evaluate dates by priority, determine doctype, analyze candidate categories, determine domain, extract vendor, synthesize subject.
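The four sampling strategies could be sketched roughly as below. This is a minimal illustration, not Drover's actual sampler: the function name `sample_pages`, the parameters `n` and `long_threshold`, and the rule that "adaptive" reads short documents in full and falls back to bookends for long ones are all assumptions made for the example.

```python
from typing import List


def sample_pages(num_pages: int, strategy: str = "adaptive", n: int = 3,
                 long_threshold: int = 10) -> List[int]:
    """Return zero-based page indices to hand to the classifier (hypothetical helper)."""
    if num_pages <= 0:
        return []
    n = max(1, n)
    all_pages = list(range(num_pages))

    if strategy == "full":
        return all_pages
    if strategy == "first-n":
        return all_pages[:n]
    if strategy == "bookends":
        # First n and last n pages, deduplicated and kept in reading order.
        return sorted(set(all_pages[:n]) | set(all_pages[-n:]))
    if strategy == "adaptive":
        # Assumption: short documents are read in full, long ones get bookends.
        if num_pages <= long_threshold:
            return all_pages
        return sample_pages(num_pages, "bookends", n)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    # A 200-page PDF sampled adaptively yields six pages instead of 200.
    print(sample_pages(200, "adaptive", n=3))
```

Under these assumptions the classifier sees at most `2 * n` pages of a long document, which keeps prompt size bounded regardless of document length.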
The pathbuilder applies a naming policy (NARA-compliant by default: {doctype}-{vendor}-{subject}-{date}.pdf) to the classifier's structured output and produces a suggested path. Output is either JSON per document or a JSONL batch for pipelines. On macOS, a separate tag command writes the classification as native filesystem tags.
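As a rough sketch of how a naming policy of the form {doctype}-{vendor}-{subject}-{date}.pdf might be applied, the example below builds a suggested path from classifier fields. The helper names (`slugify`, `build_path`), the underscore-within-field convention, the YYYYMMDD date format, and the domain/category directory layout are assumptions for illustration; the source does not specify them.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def slugify(value: str) -> str:
    """Lowercase and collapse runs of non-alphanumerics into underscores,
    so hyphens stay reserved as field separators (assumed convention)."""
    return re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_")


def build_path(domain: str, category: str, doctype: str, vendor: str,
               subject: str, doc_date: date, ext: str = ".pdf") -> str:
    """Return a suggested relative path: domain/category/filename (hypothetical layout)."""
    filename = "-".join([
        slugify(doctype),
        slugify(vendor),
        slugify(subject),
        doc_date.strftime("%Y%m%d"),  # assumed date format
    ]) + ext
    return str(PurePosixPath(slugify(domain), slugify(category), filename))


if __name__ == "__main__":
    print(build_path("Financial", "Banking", "statement",
                     "Acme Bank", "Checking Account", date(2024, 3, 31)))
```

Keeping the policy in one pure function means the same classifier output can be re-rendered under a different naming policy without re-running the model.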
LLM providers are swappable: Ollama (local, the default), OpenAI, Anthropic, OpenRouter. Structured output comes from LangChain's with_structured_output over a Pydantic schema, so parsing is never a regex against model prose.
Evaluation
Drover ships an evaluation harness: a labeled ground-truth set (eval/ground_truth.jsonl) and an evaluate command that scores classifier output against it.
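Scoring predictions against labeled samples can be illustrated with a short sketch. The record keys used here (`file`, `domain`, `category`, `doctype`) and the function names are assumptions; the actual schema of eval/ground_truth.jsonl and the behavior of the evaluate command are not shown in this document.

```python
import json
from collections import Counter
from typing import Dict, List

# Assumed label fields; the real ground-truth schema may differ.
FIELDS = ("domain", "category", "doctype")


def load_jsonl(text: str) -> List[dict]:
    """Parse JSONL content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def score(truth: List[dict], predictions: List[dict]) -> Dict[str, float]:
    """Per-field accuracy. Predictions are matched to truth rows by 'file';
    a missing prediction counts as wrong for every field."""
    pred_by_file = {p["file"]: p for p in predictions}
    correct: Counter = Counter()
    for row in truth:
        pred = pred_by_file.get(row["file"], {})
        for field in FIELDS:
            if pred.get(field) == row[field]:
                correct[field] += 1
    total = len(truth)
    return {f: (correct[f] / total if total else 0.0) for f in FIELDS}
```

Reporting accuracy per field rather than as a single number makes it visible whether errors sit at the domain level or further down the taxonomy.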
Recorded result
On household documents, domain accuracy moved from ~75% with a naive single-shot prompt to ~92% after introducing the 7-step chain-of-thought template. Remaining misclassifications shifted from "wrong domain" to edge cases, which I treat as the acceptable failure mode.
Full rationale and evidence in ADR-001: Chain-of-Thought Prompting.
Three learnings
1. Naive "classify this document" prompts hit a ceiling. Explicit numbered reasoning (extract, date priority, doctype, domain, vendor, subject) inside <classification_analysis> tags pushed accuracy roughly seventeen points and, more importantly, made failures debuggable. The reasoning output is the debug log.
2. The hardest classification call is distinguishing fundamental purpose from transactional use. A bank statement used for taxes is still a bank statement. The taxonomy rules had to spell that out explicitly ("classify by fundamental purpose, not how the document is used"). Without the rule, the classifier drifts toward whatever labels the user's recent folders suggest.
3. Privacy-first was non-negotiable for the use case. Financial and medical documents should not need to leave the device to get a filing suggestion. Multi-provider support exists so users can trade off accuracy against privacy on purpose. The default keeps data on the device.
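The first two learnings can be made concrete with a sketch of how such a prompt might be assembled. The step names and the purpose rule come from the text above; the surrounding wording, the `<document>` wrapper, and the function name `build_prompt` are hypothetical and are not taken from Drover's actual template (see ADR-001 for that).

```python
# The seven reasoning steps as named in the Architecture section.
STEPS = [
    "Extract key information",
    "Evaluate dates by priority",
    "Determine doctype",
    "Analyze candidate categories",
    "Determine domain",
    "Extract vendor",
    "Synthesize subject",
]

# Taxonomy rule quoted in the second learning.
PURPOSE_RULE = "Classify by fundamental purpose, not how the document is used."


def build_prompt(document_text: str) -> str:
    """Assemble a chain-of-thought classification prompt (illustrative wording)."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(STEPS, start=1))
    return (
        "Classify the document below.\n"
        f"Rule: {PURPOSE_RULE}\n"
        "Work through these steps inside <classification_analysis> tags "
        "before giving the final answer:\n"
        f"{numbered}\n\n"
        f"<document>\n{document_text}\n</document>"
    )


if __name__ == "__main__":
    print(build_prompt("Statement period: 2024-03-01 to 2024-03-31 ..."))
```

Because each step is numbered, a misclassification can be traced to the step where the reasoning went wrong, which is what makes the reasoning output usable as a debug log.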
Links
- github.com/ckrough/drover. Source, CLI, and configuration reference.
- ADR-001: Chain-of-Thought Prompting. The reasoning template and its evidence.
- ADR-002: Privacy-First Design. The local-first rationale.