Knowledge Base RAG#

Hybrid keyword + semantic search over a realistic engineering knowledge base, with answer grounding verification so every claim maps to a real source.

What it does#

Answers "How do I deploy to staging?" against a knowledge base of 15+ markdown runbook docs. Every sentence in the response is traceable to a specific chunk of a specific doc — no hallucinations.

Architecture#

  • Ch 4 (State & Collaboration) — chunks and indexes the KB; retrieves at query time
  • Ch 3b (Tools & MCP) — search_kb, fetch_doc, list_kb_topics, check_staleness as tools
  • Ch 5 (Evaluation) — grounding_checker.py verifies every claim in the answer has supporting evidence in the retrieved chunks
  • Appendix: Vector Memory — optional sentence-transformers backend for the semantic half of hybrid search
Question ──► search_kb (keyword + semantic reranked) ──► top-K chunks
                       ┌────────────────────────────────────┤
                       ▼                                    ▼
                  synthesize answer                      grounding_checker
                       │                                    │
                       └──► verified answer + sources ◄─────┘
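The hybrid retrieval step above can be approximated in a few lines of stdlib Python. This is a minimal sketch of blending keyword overlap with a bag-of-words cosine stand-in for the semantic score — the actual semantic half uses embeddings, and `alpha` plus the helper names here are illustrative, not the project's API:

```python
import math
from collections import Counter

def keyword_score(query, chunk):
    """Fraction of query terms that appear verbatim in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query, chunk, alpha=0.5):
    """Blend keyword overlap with a bag-of-words cosine (embedding stand-in)."""
    qv, cv = Counter(query.lower().split()), Counter(chunk.lower().split())
    return alpha * keyword_score(query, chunk) + (1 - alpha) * cosine(qv, cv)

def search_kb(query, chunks, k=5):
    """Return the top-k chunks by hybrid score."""
    return sorted(chunks, key=lambda c: hybrid_score(query, c), reverse=True)[:k]
```

Swapping the cosine half for real embedding similarity changes only `hybrid_score`; the ranking interface stays the same.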

Tools#

Tool                                       Purpose
search_kb(query, k=5)                      Ranked chunks by hybrid score
fetch_doc(doc_id)                          Full markdown of a doc
list_kb_topics()                           Top-level categories
check_staleness(doc_id, max_age_days=30)   Flag out-of-date content

Knowledge base#

knowledge_base/ contains 15+ realistic markdown docs on a mock SaaS company's engineering operations:

  • Onboarding (dev setup, IDE config, repo access)
  • Deploy runbooks (staging, production, rollback)
  • Incident response (on-call, escalation paths)
  • Coding standards (style, review process)
  • FAQ docs (billing, features, support)
  • Release process, secrets management, feature flags

Enough to test retrieval on a real-sized corpus without licensing issues.
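Indexing a KB like this starts with chunking. A simple chunker splits on markdown headings first, then caps chunk size; this sketch uses a word cap as a rough stand-in for a token budget, and the function name is illustrative:

```python
def chunk_markdown(text, max_words=400):
    """Split a markdown doc into chunks: break on headings, then cap size (sketch)."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for sec in sections:
        words = sec.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i : i + max_words]))
    return chunks
```

Keeping heading boundaries intact matters more than the exact size cap: a chunk that straddles two runbook sections retrieves poorly for both.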

Grounding checker#

grounding_checker.py runs after the answer is generated. For each claim in the answer, it verifies:

  1. The claim appears (lexically or semantically) in one of the retrieved chunks
  2. Each citation ([1], [2], …) resolves to a real chunk id
  3. No claim is "unsupported" (i.e., invented by the model with no source)

Returns (grounded: bool, ungrounded_claims: list[str]). The agent refuses to output ungrounded answers in production mode; in mock mode it prints a warning so you can see the failure pattern.
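A lexical-only version of the checker fits in one function. This is a sketch, assuming each sentence is a claim and [n] citations index 1-based into the retrieved chunk list; the overlap threshold is illustrative, and the real checker also matches semantically:

```python
import re

def check_grounding(answer, chunks, overlap_threshold=0.5):
    """Return (grounded, ungrounded_claims) via lexical overlap (sketch).

    A sentence fails if any [n] citation is out of range, or if too few of
    its words appear in at least one retrieved chunk.
    """
    chunk_words = [set(c.lower().split()) for c in chunks]
    ungrounded = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sent:
            continue
        for n in re.findall(r"\[(\d+)\]", sent):
            if not (1 <= int(n) <= len(chunks)):  # citation points nowhere
                ungrounded.append(sent)
                break
        else:  # citations all resolve; check lexical support
            words = set(re.sub(r"\[\d+\]", "", sent).lower().split())
            support = max(
                (len(words & cw) / len(words) for cw in chunk_words), default=0
            ) if words else 1.0
            if support < overlap_threshold:
                ungrounded.append(sent)
    return (not ungrounded, ungrounded)
```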

Cost estimate#

~$0.005 per query on Haiku (retrieval is cheap; only synthesis calls the LLM).

Run it#

SWARM_MOCK=true python -m projects.knowledge_base_rag.agent \
    "How do I deploy to staging?"

Expected: cited answer pointing at knowledge_base/deploy_staging.md (or similar), grounding check passes, response includes numbered source citations.

SWARM_MOCK=true .venv/bin/pytest projects/knowledge_base_rag/ -v

Expected: 10+ tests passing (exact-match, paraphrase, out-of-scope, stale-doc warning, multi-doc synthesis, grounding pass/fail detection).

Extending for production#

  • Swap the backend: default is hand-rolled TF-IDF + cosine similarity. For >10k docs, move to sentence-transformers + FAISS (see Appendix: Vector Memory). For multilingual or domain-specific, use Anthropic or OpenAI embeddings.
  • Add reranking: retrieve 20, rerank with a cross-encoder, keep top 5. Dramatic precision boost.
  • Chunk tuning: current default is ~500 tokens per chunk. Domain-specific KBs (legal, medical) often do better with larger chunks and more context.
  • Freshness alerts: wire check_staleness to Slack so doc owners get pinged when their content ages past the retention window.
  • Audit logging: log every query, retrieval, and grounding result so you can answer "why did the agent say X" after the fact.