Research Assistant#

A corpus-grounded research agent that searches a local document collection, takes notes, synthesizes a cited answer, and verifies that every citation maps back to a real note.

Problem#

LLMs are confident prose machines. Ask one a factual question and it will answer — often correctly, sometimes confidently wrong. For research tasks the "sometimes" is the whole problem: you cannot trust the output unless you can trace every claim back to a source.

This agent enforces that trace. It:

  • Retrieves sources from a local corpus (swap for a real browser MCP in production — see the integration section).
  • Records every fact it plans to cite as a note, with a pointer to the source URL.
  • Writes a final answer that must use [N] markers, where each N is a real note ID.
  • Verifies every citation after the fact, flagging unknown IDs, unsupported claims, and factual sentences that forgot a citation.

Architecture#

  question
     |
     v
  plan_sub_questions (deterministic)
     |
     v
  search_web -> fetch_document (per sub-question)
     |
     v
  harvest_notes -> ResearchState
     |
     v
  call_agent (synthesize with [N] citations)
     |
     v
  citation_checker.check
     |
     v
  ResearchResult(answer, notes, citations)

The LLM is only called once — for the final synthesis. Retrieval, note harvesting, and verification are Python. That split keeps the cost predictable, makes the citation guarantee auditable, and means the failure modes are debuggable by reading code.
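That split can be sketched as a plain Python pipeline. The names below are illustrative stand-ins, not the project's actual API; the point is that every step except the injected `synthesize` callable is deterministic code:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    notes: list = field(default_factory=list)

def plan_sub_questions(question):
    # Deterministic stand-in: the real planner expands into sub-queries.
    return [question]

def harvest_notes(doc):
    # Stand-in: one note per fetched document.
    return [doc]

def research_pipeline(question, search, fetch, synthesize, check):
    state = ResearchState(question)
    for sub_q in plan_sub_questions(question):
        for hit in search(sub_q):
            state.notes.extend(harvest_notes(fetch(hit)))
    answer = synthesize(state)              # the single LLM call
    citations = check(answer, state.notes)  # deterministic verification
    return answer, state.notes, citations
```

Because retrieval and verification are ordinary functions, each can be unit-tested with lambdas in place of real backends.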

Working memory#

Notes are the agent's working memory for a single query. We chose notes rather than full-document context because:

  • The model only sees the facts it will cite. Less text in the prompt = less opportunity for the model to invent claims from stale context.
  • Each note carries a source URL, so citation verification is a lookup, not a reasoning step.
  • Notes are cheap to re-use if the agent needs to revise the answer.
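A note needs only a handful of fields to make verification a lookup. A minimal sketch, with field names chosen to match the programmatic example later on (the exact shape in the project may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    id: int            # what a [N] marker refers to
    claim: str         # the fact the model may cite
    source_url: str    # where the fact came from
    section: str = ""  # optional locator within the source

# Citation verification becomes a dictionary lookup, not a reasoning step.
notes = [Note(1, "Pipes landed in Unix in 1973.", "corpus://unix-history")]
by_id = {n.id: n for n in notes}
```

With `by_id` in hand, checking a `[1]` marker is `1 in by_id`, and tracing it to a source is `by_id[1].source_url`.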

Cost estimate#

At default settings (Claude Haiku, 4 sources, ~12 notes, 700 max output tokens):

  • Single call_agent with ~450 input / 60-100 output tokens.
  • Haiku 4.5: ~$0.0004 per query.
  • Sonnet 4.6: ~$0.005 per query.
  • Opus 4.7: ~$0.10 per query.

Retrieval cost is near-zero because it's local regex. With a real web search, expect to add the cost of 1-2 search API calls and 3-5 fetch-document calls per query.
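The per-query numbers above are just input tokens times input price plus output tokens times output price. A tiny estimator; the prices in the example are placeholders, not current list prices, so check your provider's pricing page:

```python
def query_cost(input_tokens, output_tokens, in_per_mtok, out_per_mtok):
    """Dollar cost of one call, given per-million-token prices."""
    return (input_tokens * in_per_mtok + output_tokens * out_per_mtok) / 1_000_000

# Placeholder prices in $/MTok -- substitute real ones for your model.
cost = query_cost(450, 100, in_per_mtok=1.00, out_per_mtok=5.00)
```

Output tokens dominate whenever the output price is several times the input price, which is why capping max output tokens matters more than trimming the prompt.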

Tools#

Tool            Purpose
search_web      Keyword search over the corpus.
fetch_document  Full body + metadata of a corpus URL.
take_note       Record a claim + source for later citation.
list_notes      Dump the current working-memory notes.
verify_claim    Lexical overlap check: is this claim supported?

In production search_web and fetch_document are the only two you would replace.
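verify_claim's lexical overlap check can be as simple as token-set overlap between the claim and the cited source's text. A sketch; the token-length filter and the 0.5 threshold are assumptions, not the project's tuned values:

```python
import re

def content_tokens(text):
    # Lowercased word tokens, dropping short function-word-ish tokens.
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 3}

def claim_supported(claim, source_text, threshold=0.5):
    """True if enough of the claim's content tokens appear in the source."""
    claim_toks = content_tokens(claim)
    if not claim_toks:
        return True  # nothing substantive to check
    overlap = len(claim_toks & content_tokens(source_text))
    return overlap / len(claim_toks) >= threshold
```

Lexical overlap is deliberately dumb: it gives false negatives on paraphrase, but it never lets a claim with no textual support in the source pass silently.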

Running#

Mock mode is the default when there's no API key.

SWARM_MOCK=true python -m projects.research_assistant.agent \
    "When did Unix get the pipe operator?"

Programmatic:

import asyncio

from projects.research_assistant.agent import research

async def main():
    result = await research("What is ReAct?")
    print(result.answer)
    for n in result.notes:
        print(n.id, n.source_url, n.section)
    if not result.citations.all_valid:
        print("CITATION ISSUES:", result.citations.as_dict())

asyncio.run(main())

Integration: real web browsing in production#

The mock corpus is just a stand-in behind search_web and fetch_document. To wire real browsing, replace those two tools with an MCP client call or a direct HTTP fetch.

Option A: MCP browser server#

  1. Run an MCP server that exposes search_web and fetch_document (Firecrawl, Tavily, Brave Search — all have MCP wrappers).
  2. Connect via swarm.tools.mcp_client.MCPClient.
  3. Register the MCP tools under the same names (search_web, fetch_document). Nothing else in the agent changes — the registry lookup is by name.
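The reason nothing else changes is the name-keyed registry. A generic sketch of that pattern; the registry shape here is an assumption for illustration, not swarm's actual API:

```python
TOOLS = {}

def register(name, fn):
    TOOLS[name] = fn

# Local mock backend.
register("search_web", lambda query: [f"corpus://{query}"])

# Later: swap in an MCP-backed implementation under the same name.
def mcp_search_web(query):
    # Would forward to the MCP server here; stubbed for illustration.
    return [f"mcp://{query}"]

register("search_web", mcp_search_web)

# Callers never change -- they always resolve the tool by name.
results = TOOLS["search_web"]("pipes")
```

Re-registering under the same key is the whole integration: the agent loop, prompts, and citation checker are oblivious to which backend answers.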

Option B: direct HTTP#

  1. Replace search_web with a call to your search provider's API.
  2. Replace fetch_document with httpx.get(url) + HTML-to-text.
  3. Add a robots.txt check before fetching. Rate limit at 2-3 req/s per domain.
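A minimal fetch_document replacement along those lines, using only the standard library (httpx would slot into the fetch step the same way). The HTML-to-text step is deliberately crude, and the per-domain rate limit is a simple monotonic-clock interval:

```python
import re
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

def html_to_text(html):
    """Crude HTML-to-text: drop scripts/styles, strip tags, collapse space."""
    html = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

class DomainRateLimiter:
    """At most requests_per_second fetches per domain."""
    def __init__(self, requests_per_second=2.0):
        self.min_interval = 1.0 / requests_per_second
        self.last = {}

    def wait(self, domain):
        elapsed = time.monotonic() - self.last.get(domain, float("-inf"))
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last[domain] = time.monotonic()

_limiter = DomainRateLimiter()  # shared across all fetches

def fetch_document(url, user_agent="research-assistant"):
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser(
        f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(user_agent, url):
        raise PermissionError(f"robots.txt disallows {url}")
    _limiter.wait(parsed.netloc)
    with urlopen(url) as resp:
        return html_to_text(resp.read().decode("utf-8", "replace"))
```

For production quality, swap html_to_text for a real extractor (readability-style boilerplate removal) and cache robots.txt per domain instead of re-reading it on every fetch.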

Either way, take_note, list_notes, and verify_claim stay local — they do not depend on the source being a local file.

Failure modes#

  • Hallucinated citations. Model emits [4] when only three notes exist. Caught by the citation checker's unknown_note_ids.
  • Unsupported citation. Model emits a real note ID but the claim's content tokens don't overlap the cited source. Caught by the overlap check (unsupported_claims).
  • Uncited factual claim. A sentence with a year or proper noun that doesn't carry a [N] marker. Caught by uncited_claims.
  • Incomplete research. The retrieval stage missed a source the answer needed. Rare symptom: final answer is shorter than expected or says "I do not have enough information" when the corpus does cover it. Mitigation: the planner expands the question into multiple sub-queries before searching.
  • Contradictory sources. Two corpus docs disagree. The agent is instructed (system prompt rule 4) to cite both sides rather than pick one. Test test_research_contradictory_sources_are_both_retrieved exercises the retrieval side of this.
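The first and third failure modes reduce to regex work over the answer text. A compact sketch of those two checks, with names mirroring the issue lists above; the "factual sentence" heuristic is an assumption, and the overlap-based unsupported_claims check is omitted here:

```python
import re

def looks_factual(sentence):
    # Heuristic: a 4-digit year, or a capitalized word past the first character.
    return bool(re.search(r"\b\d{4}\b", sentence) or
                re.search(r"\w[^.!?]*\s[A-Z][a-z]+", sentence))

def check_citations(answer, notes_by_id):
    """Return the regex-detectable citation issue lists."""
    unknown, uncited = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        unknown.extend(n for n in cited if n not in notes_by_id)
        if not cited and looks_factual(sentence):
            uncited.append(sentence)
    return {"unknown_note_ids": unknown, "uncited_claims": uncited}
```

The heuristic errs toward false positives: an uncited sentence flagged wrongly costs a human a glance, while a hallucinated citation that slips through costs trust in the whole answer.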

Tests#

SWARM_MOCK=true .venv/bin/pytest projects/research_assistant/ -v

Covers corpus loading, search, fetch, note-taking, claim verification, the citation checker (all four issue types), and end-to-end pipeline behaviour on a handful of demo questions.