Retriever

RAG-powered Q&A with source citations. Hybrid search, FastAPI backend, SvelteKit frontend, pgvector store.

Problem

Policy and procedure documents live scattered across wikis, shared drives, and PDFs. Keyword search returns noise; reading end-to-end takes hours. Retriever answers natural-language questions against a document corpus and returns answers with clickable source citations back into the original text.

The target user is an operational contributor (a shelter volunteer in the first deployment, generalizable to any organization with documentation) who needs a correct answer and a way to verify it in context.

Architecture

Two pipelines share a single pgvector store. The document pipeline ingests, chunks, and embeds. The query pipeline retrieves, reranks, and synthesizes.

DOCUMENT PIPELINE
[markdown/text] → [chunker] → [embeddings] → [pgvector]

QUERY FLOW
[question] → [hybrid search] → [rerank] → [LLM] → [answer + citations]

SERVICES
[SvelteKit @ Cloudflare Pages]
          │
          ▼
[FastAPI @ Cloud Run] ───▶ [Supabase: Postgres + pgvector + Auth]
          │
          ▼
[Cloudflare AI Gateway] ───▶ [OpenRouter LLM] + [OpenAI embeddings]

Retrieval is hybrid: HNSW cosine over embeddings for semantic match, GIN full-text for lexical match, then a rerank pass. Embeddings are OpenAI's text-embedding-3-small. The LLM is routed through OpenRouter so the generation provider is a config change.

Every LLM and embedding call goes through Cloudflare AI Gateway for rate limiting, caching, and request logging. Application code talks to the gateway; the gateway handles the vendor SDKs. Auth is Supabase with row-level security. Observability is structlog JSON plus OpenTelemetry plus Langfuse for LLM tracing.

Evaluation

Retriever does not publish a benchmark of the scale that Advisors does. Retrieval quality for a specific corpus is measured against that corpus, and the serious evaluation work lives alongside the deployed instance rather than the open-source repo.

What is in the repo: citation-grounded answers (every response renders the exact passage the model drew from), built-in content moderation, and hallucination detection on the synthesis step. The eval posture for this class of system is per-deployment answer-quality review against a gold set of questions drawn from the real document corpus.

Three learnings

  1. 1. Lexical-only search misses paraphrases; semantic-only search returns plausible but wrong passages on policy documents that share vocabulary across unrelated topics. Hybrid retrieval (HNSW cosine plus GIN full-text, both scored into the rerank) is the minimum viable stack for this content type. Dropping either side degrades recall or precision depending on how the question is phrased.
  2. 2. Citations are the contract between the system and the user. Users trust the answer only when they can click into the original document and verify the passage in context. Any RAG system without visible, verifiable citations is a demo.
  3. 3. Routing every LLM and embedding call through Cloudflare AI Gateway rather than direct vendor SDKs added an extra network hop. It also made model swaps, provider fallbacks, rate limiting, and request-level observability a config change. I traded a small amount of latency for operational leverage, and the tradeoff has not been close.

Links