# Appendix: Vector Memory
The three-layer memory system in Chapter 4 does retrieval by regex. That is enough for an audit trail and for small semantic stores, and it is wrong the moment your agent starts searching for ideas instead of strings. This appendix walks the delta: why keyword search fails on paraphrase, what a vector store gives you in exchange, how to choose a backend, and how to wire a `VectorStore` into `MemoryStore` without rewriting the world.
## Why keyword search isn't enough
Keyword search, whether grep over transcripts or `LIKE '%foo%'` over a database, scores on exact token overlap. The user types "database costs exploded" and the store holds a transcript titled "unusual AWS bill on Sunday". Those two phrases share zero tokens. Keyword retrieval returns nothing. The right document exists, lives one directory away, and the agent still acts like it has amnesia.
Semantic retrieval collapses that gap. Embeddings map both phrases to points in a continuous space where nearby points mean similar things. "Database costs exploded" and "unusual AWS bill" land close enough that cosine similarity ranks them as a match, without a single shared token. The same trick works for synonyms, inflections, and idiomatic rewordings, which is most of the queries a user actually types.
The inverse failure is worth naming too. Pure semantic search can match synonyms that mean the wrong thing. "Python slither constrictor" sounds like a snake and lands near a zoology paragraph, even if your user meant the language. Production systems hedge by combining both signals: a vector store for recall, a keyword score on top for precision, the two summed or passed through a reranker. We cover only the vector half here; the hybrid is a one-line extension once you have the primitive.
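The hybrid really is a small extension once both scores exist. A minimal sketch of the summed variant, assuming you already have cosine scores from the vector side; `keyword_score`, `hybrid_rank`, and the `alpha` weight are illustrative names, not part of any library API:

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query tokens that literally appear in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(vector_hits, docs, query, alpha=0.7):
    """Blend cosine scores with exact-token overlap.

    vector_hits: list of (doc_id, cosine_score); docs: doc_id -> text.
    alpha weights the vector side; (1 - alpha) weights the keyword side.
    """
    scored = [
        (doc_id, alpha * cos + (1 - alpha) * keyword_score(query, docs[doc_id]))
        for doc_id, cos in vector_hits
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

The keyword term breaks ties in favor of documents that share literal tokens with the query, which is exactly the precision signal pure semantic search loses.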
## The `VectorStore` API
`swarm.memory.vector_store.VectorStore` is the primitive. Two methods you will use daily:
```python
from swarm.memory.vector_store import VectorStore

store = VectorStore()  # defaults to the tfidf backend
await store.add("doc_42", "felines purr when content", metadata={"doc_type": "note"})
await store.add("doc_43", "dogs bark at strangers", metadata={"doc_type": "note"})

hits = await store.search("cat purrs", k=3)
# [("doc_42", 0.71, {"doc_type": "note"}),
#  ("doc_43", 0.02, {"doc_type": "note"})]
```
Each hit is a tuple of `(doc_id, score, metadata)`. The score is cosine similarity, always in the `[0, 1]` range. Values above 0.7 are usually strong matches. Values between 0.3 and 0.7 are the grey zone where you want a reranker or a keyword tiebreaker. Values below 0.3 are almost always noise and should be dropped before you show them to the agent.
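Those bands are worth encoding once rather than re-deriving at every call site. A small helper, with cutoffs matching the rules of thumb above; `triage` and the constant names are illustrative, not library API:

```python
STRONG, FLOOR = 0.7, 0.3

def triage(hits):
    """Split ranked hits into confident matches and the grey zone.

    Everything below FLOOR is dropped as noise before the agent sees it.
    """
    strong = [h for h in hits if h[1] >= STRONG]
    grey = [h for h in hits if FLOOR <= h[1] < STRONG]
    return strong, grey

strong, grey = triage([("doc_42", 0.71, {}), ("doc_7", 0.45, {}), ("doc_43", 0.02, {})])
```

Here `doc_42` lands in `strong`, `doc_7` in `grey` for a second-pass reranker, and `doc_43` is silently discarded.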
Two behaviors worth knowing. First, `add()` is O(1) and marks the index dirty; the next `search()` auto-rebuilds. If you are batch-loading a hundred docs, prefer an explicit `rebuild()` once at the end so search latency stays predictable. Second, `filter` is applied after ranking, which means you should over-fetch when the filter is selective: ask for `k=20` if you only expect 5 `doc_type=faq` entries in the top window.
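The over-fetch pattern is easy to get wrong, so here it is spelled out, with a stand-in store so the snippet runs on its own. `_StubStore`, `filtered_search`, and the `overfetch` multiplier are hypothetical helpers for illustration, not part of the library; the sketch applies the metadata filter client-side:

```python
import asyncio

class _StubStore:
    """Stand-in with the same search() shape as VectorStore."""
    def __init__(self, rows):
        self.rows = rows  # pre-ranked list of (doc_id, score, metadata)

    async def search(self, query, k):
        return self.rows[:k]

async def filtered_search(store, query, k, doc_type, overfetch=4):
    # The filter runs after ranking, so fetch a wider window to keep
    # k survivors likely even when the filter is selective.
    hits = await store.search(query, k=k * overfetch)
    return [h for h in hits if h[2].get("doc_type") == doc_type][:k]

rows = [("a", 0.9, {"doc_type": "faq"}),
        ("b", 0.8, {"doc_type": "note"}),
        ("c", 0.7, {"doc_type": "faq"})]
hits = asyncio.run(filtered_search(_StubStore(rows), "refund policy", k=2, doc_type="faq"))
```

The multiplier is a heuristic; tune it to how selective your filters actually are.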
## Choosing a backend
The store ships with three backends, two of them optional. Pick by corpus size, quality floor, and what is installed.
| Backend | When to pick | Latency (10K docs) | Notes |
|---|---|---|---|
| `tfidf` (sklearn) | <10K docs, English, low memory budget | <10 ms / query | Default. No model, no network, no GPU. Lossy on synonyms. |
| `sentence-transformers` | Queries and docs use different vocabulary | ~30 ms / query (CPU) | ~80 MB model download. Better recall on paraphrase. |
| OpenAI / Anthropic embed | Multilingual or domain-specific content | ~100 ms + network | Highest quality. Per-token cost. Requires API key. |
| FAISS + dense | >1M docs, latency-sensitive | ~5 ms ANN / query | Approximate nearest neighbor with HNSW. Exact scan still works for <100K. |
The default is deliberate. `tfidf` holds up surprisingly well when your corpus is English agent transcripts, topic summaries, or Markdown notes with consistent vocabulary. Start there and measure recall@k on a small held-out set of queries before moving. The indicator to upgrade is that the store consistently misses paraphrases a human would call matches, even though the right document is in the corpus. That is the signal that your queries and docs are using different words for the same concept, and `sentence-transformers` will help.
Move to API embeddings only when the vocabulary mismatch is structural: multilingual content, heavy jargon, cross-domain retrieval. The quality gain over sentence-transformers is real but small, and the network round-trip and per-token cost are not free. Measure first.
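"Measure first" means holding out a handful of queries with known relevant docs and computing recall@k under each backend. A minimal sketch; the function and variable names are illustrative:

```python
def recall_at_k(ranked: dict, gold: dict, k: int = 5) -> float:
    """Mean fraction of relevant docs appearing in each query's top-k.

    ranked: query -> ranked list of doc_ids returned by the store.
    gold:   query -> set of doc_ids a human judged relevant.
    """
    fractions = []
    for query, relevant in gold.items():
        top = set(ranked.get(query, [])[:k])
        fractions.append(len(top & relevant) / len(relevant))
    return sum(fractions) / len(fractions)

gold = {"database costs exploded": {"aws_bill_note"}}
ranked = {"database costs exploded": ["aws_bill_note", "unrelated_note"]}
score = recall_at_k(ranked, gold, k=5)
```

Twenty to fifty labeled queries is usually enough to see whether a backend swap moves the number at all.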
FAISS is an indexing library, not an embedding model. You layer it on top of a dense backend when the corpus crosses the million-document threshold and exact cosine over the full matrix starts eating your budget. Below that, a numpy dot product is faster than the FAISS overhead.
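For scale-setting, the "exact scan" FAISS replaces is just one cosine per document. A pure-Python sketch of that baseline (a real implementation would use a numpy dot product over a pre-normalized matrix, which is what makes it fast below the threshold):

```python
import math

def exact_top_k(query_vec, doc_vecs, k=5):
    """Exact cosine over every doc vector; O(n * d) per query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(((i, cos(query_vec, v)) for i, v in enumerate(doc_vecs)),
                    key=lambda t: t[1], reverse=True)
    return scored[:k]
```

Because this is exact, it is the ground truth you benchmark an ANN index's recall@k against.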
## Integrating with `MemoryStore`
`MemoryStore.search_transcripts()` greps transcripts with a regex. To add a semantic path, wire a `VectorStore` alongside and pick a retrieval strategy at the call site. The idea is to index every logged turn once, then let the search method decide which backend to hit:
```python
import uuid

from swarm.memory.store import MemoryStore
from swarm.memory.vector_store import VectorStore


class SemanticMemoryStore(MemoryStore):
    def __init__(self, *args, retrieval_strategy: str = "regex", **kwargs):
        super().__init__(*args, **kwargs)
        self.retrieval_strategy = retrieval_strategy
        self._vector = VectorStore() if retrieval_strategy == "semantic" else None

    async def log_turn(self, agent_id: str, role: str, content: str) -> None:
        await super().log_turn(agent_id, role, content)
        if self._vector is not None:
            # A uuid is stable and collision-free; id(content) would be
            # reused after garbage collection and silently overwrite docs.
            doc_id = f"{agent_id}:{role}:{uuid.uuid4().hex}"
            await self._vector.add(doc_id, content, {"agent_id": agent_id, "role": role})

    async def search_semantic(self, query: str, k: int = 5):
        if self._vector is None:
            raise RuntimeError("retrieval_strategy='semantic' required")
        return await self._vector.search(query, k=k)
```
Transcripts still land in the JSONL episodic layer, grep still works, and the vector store is a parallel index you query when the agent needs semantic recall. Two layers, one source of truth. The cost is a rebuild on first search per session, which with `tfidf` stays under a second for the kind of corpora a single agent produces.
## Production gotchas
Stale index is the first trap. Every `add()` sets `_dirty = True`, and the next `search()` rebuilds from scratch. That is fine for a daemon logging ten turns a minute; it is a disaster when a batch job adds ten thousand entries back-to-back. Call `rebuild()` explicitly at the end of a batch, or debounce `add()` calls into a queue and flush on a timer.
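One way to debounce, assuming a synchronous `rebuild()` as described above. The `BufferedAdds` wrapper and `_CountingStore` stand-in are illustrative, not part of the library:

```python
import asyncio

class _CountingStore:
    """Stand-in with the add()/rebuild() shape described above."""
    def __init__(self):
        self.docs, self.rebuilds = [], 0

    async def add(self, doc_id, text, metadata=None):
        self.docs.append(doc_id)

    def rebuild(self):
        self.rebuilds += 1

class BufferedAdds:
    """Queue add() calls; flush in batches so rebuild runs once per batch."""
    def __init__(self, store, flush_every=100):
        self.store, self.flush_every = store, flush_every
        self.pending = []

    async def add(self, doc_id, text, metadata=None):
        self.pending.append((doc_id, text, metadata))
        if len(self.pending) >= self.flush_every:
            await self.flush()

    async def flush(self):
        for doc_id, text, meta in self.pending:
            await self.store.add(doc_id, text, metadata=meta)
        self.pending.clear()
        self.store.rebuild()  # one rebuild per batch, not per add

async def demo():
    store = _CountingStore()
    buf = BufferedAdds(store, flush_every=100)
    for i in range(250):
        await buf.add(f"doc_{i}", "text")
    await buf.flush()  # flush the 50-entry tail
    return store.rebuilds

rebuilds = asyncio.run(demo())  # 3 rebuilds for 250 adds
```

A production version would also flush on a timer so a quiet period does not strand a partial batch, but the counting logic is the same.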
Cold start is the second. On daemon restart the in-memory index is empty, and the first search rebuilds the full corpus. With `tfidf` this is seconds; with `sentence-transformers` it is tens of seconds because the model also loads. Persist the matrix alongside the entries if cold-start latency matters, or pre-warm on boot by calling `rebuild()` before accepting traffic.
Approximate nearest neighbor is the third. FAISS HNSW is not exact cosine. At 1M docs and k=10 it typically recovers 95% of the true top-k with ten-times-lower latency. That trade is worth it at scale and pointless below it. Benchmark recall@k on your corpus before flipping the switch, because the lost 5% is usually your long-tail queries.
Memory growth is the fourth. Dense embeddings cost about 1.5 KB per doc at 384 float32 dimensions. A million entries is 1.5 GB of RAM before you add any overhead. Plan for it, or page to disk with a memory-mapped index.
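The arithmetic is worth writing down once, because it scales linearly and linearly gets expensive. A one-liner, assuming float32 (4 bytes per dimension); the function name is illustrative:

```python
def embedding_ram_bytes(n_docs: int, dims: int = 384, bytes_per_dim: int = 4) -> int:
    """Raw embedding-matrix size only; index structures and
    Python object overhead come on top of this floor."""
    return n_docs * dims * bytes_per_dim

embedding_ram_bytes(1)          # 1536 bytes, i.e. ~1.5 KB per doc
embedding_ram_bytes(1_000_000)  # 1_536_000_000 bytes, i.e. ~1.5 GB
```

Run the same arithmetic at 768 or 1536 dimensions before picking an API embedding model; the RAM bill doubles and quadruples accordingly.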
Re-indexing during ETL is the last. When you change the embedding model you must re-embed every doc; there is no partial upgrade. Budget wall-clock time and API spend accordingly, and keep the old index live until the new one passes your recall tests.