Knowledge Base RAG#

Hybrid keyword + semantic search over a realistic engineering knowledge base, with answer grounding verification so every claim maps to a real source.

What it does#

Answers "How do I deploy to staging?" against a knowledge base of 15+ markdown runbook docs. Every sentence in the response is traceable to a specific chunk of a specific doc — no hallucinations.

Architecture#

  • Ch 4 (State & Collaboration) — chunks and indexes the KB; retrieves at query time
  • Ch 3b (Tools & MCP) — search_kb, fetch_doc, list_kb_topics, check_staleness as tools
  • Ch 5 (Evaluation) — grounding_checker.py verifies every claim in the answer has supporting evidence in the retrieved chunks
  • Appendix: Vector Memory — optional sentence-transformers backend for the semantic half of hybrid search
Question ──► search_kb (keyword + semantic reranked) ──► top-K chunks
                       ┌────────────────────────────────────┤
                       ▼                                    ▼
                  synthesize answer                      grounding_checker
                       │                                    │
                       └──► verified answer + sources ◄─────┘
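The hybrid retrieval step above can be approximated in a few lines of stdlib Python. This is a minimal sketch of blending keyword overlap with a bag-of-words cosine stand-in for the semantic score — the actual semantic half uses embeddings, and `alpha` plus the helper names here are illustrative, not the project's API:

```python
import math
from collections import Counter

def keyword_score(query, chunk):
    """Fraction of query terms that appear verbatim in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, chunk, alpha=0.5):
    """Blend keyword overlap with a bag-of-words cosine (embedding stand-in)."""
    qv, cv = Counter(query.lower().split()), Counter(chunk.lower().split())
    return alpha * keyword_score(query, chunk) + (1 - alpha) * cosine(qv, cv)

def search_kb(query, chunks, k=5):
    """Return the top-k chunks by hybrid score."""
    return sorted(chunks, key=lambda c: hybrid_score(query, c), reverse=True)[:k]
```

Swapping the cosine half for real embedding similarity changes only `hybrid_score`; the ranking interface stays the same.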

Tools#

Tool                                       Purpose
search_kb(query, k=5)                      Ranked chunks by hybrid score
fetch_doc(doc_id)                          Full markdown of a doc
list_kb_topics()                           Top-level categories
check_staleness(doc_id, max_age_days=30)   Flag out-of-date content

Knowledge base#

knowledge_base/ contains 15+ realistic markdown docs on a mock SaaS company's engineering operations:

  • Onboarding (dev setup, IDE config, repo access)
  • Deploy runbooks (staging, production, rollback)
  • Incident response (on-call, escalation paths)
  • Coding standards (style, review process)
  • FAQ docs (billing, features, support)
  • Release process, secrets management, feature flags

Enough to test retrieval on a real-sized corpus without licensing issues.
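Indexing a KB like this starts with chunking. A simple chunker splits on markdown headings first, then caps chunk size; this sketch uses a word cap as a rough stand-in for a token budget, and the function name is illustrative:

```python
def chunk_markdown(text, max_words=400):
    """Split a markdown doc into chunks: break on headings, then cap size (sketch)."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for sec in sections:
        words = sec.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i : i + max_words]))
    return chunks
```

Keeping heading boundaries intact matters more than the exact size cap: a chunk that straddles two runbook sections retrieves poorly for both.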

Grounding checker#

grounding_checker.py runs after the answer is generated. For each claim in the answer, it verifies:

  1. The claim appears (lexically or semantically) in one of the retrieved chunks
  2. Each citation ([1], [2], …) resolves to a real chunk id
  3. No claim is "unsupported" (i.e., invented by the model with no source)

Returns (grounded: bool, ungrounded_claims: list[str]). The agent refuses to output ungrounded answers in production mode; in mock mode it prints a warning so you can see the failure pattern.
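A lexical-only version of the checker fits in one function. This is a sketch, assuming each sentence is a claim and [n] citations index 1-based into the retrieved chunk list; the overlap threshold is illustrative, and the real checker also matches semantically:

```python
import re

def check_grounding(answer, chunks, overlap_threshold=0.5):
    """Return (grounded, ungrounded_claims) via lexical overlap (sketch).

    A sentence fails if any [n] citation is out of range, or if too few of
    its words appear in at least one retrieved chunk.
    """
    chunk_words = [set(c.lower().split()) for c in chunks]
    ungrounded = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sent:
            continue
        for n in re.findall(r"\[(\d+)\]", sent):
            if not (1 <= int(n) <= len(chunks)):  # citation points nowhere
                ungrounded.append(sent)
                break
        else:  # citations all resolve; check lexical support
            words = set(re.sub(r"\[\d+\]", "", sent).lower().split())
            support = max(
                (len(words & cw) / len(words) for cw in chunk_words), default=0
            ) if words else 1.0
            if support < overlap_threshold:
                ungrounded.append(sent)
    return (not ungrounded, ungrounded)
```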

Cost estimate#

~$0.005 per query on Haiku (retrieval is cheap; only synthesis calls the LLM).

Run it#

SWARM_MOCK=true python -m projects.knowledge_base_rag.agent \
    "How do I deploy to staging?"

Expected: cited answer pointing at knowledge_base/deploy_staging.md (or similar), grounding check passes, response includes numbered source citations.

SWARM_MOCK=true .venv/bin/pytest projects/knowledge_base_rag/ -v

Expected: 10+ tests passing (exact-match, paraphrase, out-of-scope, stale-doc warning, multi-doc synthesis, grounding pass/fail detection).

Extending for production#

  • Swap the backend: default is hand-rolled TF-IDF + cosine similarity. For >10k docs, move to sentence-transformers + FAISS (see Appendix: Vector Memory). For multilingual or domain-specific, use Anthropic or OpenAI embeddings.
  • Add reranking: retrieve 20, rerank with a cross-encoder, keep top 5. Dramatic precision boost.
  • Chunk tuning: current default is ~500 tokens per chunk. Domain-specific KBs (legal, medical) often do better with larger chunks and more context.
  • Freshness alerts: wire check_staleness to Slack so doc owners get pinged when their content ages past the retention window.
  • Audit logging: log every query, retrieval, and grounding result so you can answer "why did the agent say X" after the fact.