Agent memory patterns: vector stores, knowledge graphs, and context engineering

An agent’s context window is working memory: fast, expensive, and wiped between sessions. Everything else has to be engineered: what it learned yesterday, what the user prefers, what happened in the last nine steps of a ten-step task. That engineering tends to fall into four recurring patterns.

Step 1 — The four patterns

Pattern	Answers	Typical store
Scratchpad	“Where was I in this task?”	Plain file / task state object
Retrieval memory	“What do I know that’s relevant right now?”	Vector store
Knowledge graph	“How do these entities relate?”	Graph / triples
Profile memory	“What’s durably true about this user/project?”	Small curated documents

Most teams jump straight to a vector database. Wrong order: scratchpad and profile memory are cheaper, more debuggable, and cover most real failures. A NOTES.md the agent reads at session start is profile memory. A todo list it updates mid-task is a scratchpad. Ship those first.
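A minimal sketch of those two patterns, assuming plain files on disk. NOTES.md comes from the paragraph above; the JSON todo file, its "step"/"done" fields, and every helper name here are hypothetical choices for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def load_profile(path: Path) -> str:
    """Profile memory: durable notes read once at session start."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def load_todos(path: Path) -> list:
    """Scratchpad: task state kept outside the context window."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else []


def mark_done(path: Path, step: str) -> list:
    """Flip one step to done and write the whole list back right away."""
    todos = load_todos(path)
    for item in todos:
        if item["step"] == step:
            item["done"] = True
    path.write_text(json.dumps(todos, indent=2), encoding="utf-8")
    return todos


def next_step(todos: list) -> Optional[str]:
    """Answer "where was I in this task?" from the scratchpad alone."""
    return next((t["step"] for t in todos if not t["done"]), None)
```

Because both stores are ordinary files, debugging them means opening them in an editor, which is the main argument for shipping them before anything with an index.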

Step 2 — Build retrieval memory you can actually inspect

The retrieval loop — store, score, recall, inject — is identical whether the scorer is bag-of-words or a 3,072-dimension embedding. So build it with a scorer you can debug by eye. Create memory.py:

import math
import re
from collections import Counter

def _tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class MemoryStore:
    """Retrieval memory with a transparent scorer.

    Swap _cosine-over-token-counts for embedding similarity in production;
    the interface — and every bug you'll meet — stays the same.
    """

    def __init__(self):
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, _tokens(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        q = _tokens(query)
        scored = [(_cosine(q, vec), text) for text, vec in self._items]
        return sorted(scored, reverse=True)[:k]


if __name__ == "__main__":
    memory = MemoryStore()
    memory.add("User prefers TypeScript over Python for new services.")
    memory.add("The staging database refresh runs Sundays at 02:00 UTC.")
    memory.add("Deploys require two approvals since the March incident.")

    for score, text in memory.search("should we pick Python or TypeScript for new services?"):
        print(f"{score:.2f}  {text}")

Run python memory.py — the preference memory scores on top by a wide margin (0.59 vs 0.00 for the unrelated entries on this toy corpus). Before the model sees anything, you can print exactly why a memory was recalled. That transparency is worth keeping until scale forces you off it.
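One way to make "why was this recalled" concrete is to list the tokens a query and a memory share. The sketch below assumes the same tokenizer as memory.py (repeated here so the snippet runs on its own); the explain helper is a hypothetical addition, not part of the listing above.

```python
import re
from collections import Counter


def _tokens(text: str) -> Counter:
    # Same tokenizer as memory.py, duplicated so this snippet is standalone.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def explain(query: str, memory_text: str) -> list:
    """Return the tokens shared by query and memory, sorted for stable output."""
    shared = _tokens(query).keys() & _tokens(memory_text).keys()
    return sorted(shared)


if __name__ == "__main__":
    print(explain(
        "should we pick Python or TypeScript for new services?",
        "User prefers TypeScript over Python for new services.",
    ))
```

For the toy query this prints ['for', 'new', 'python', 'services', 'typescript']. Five shared tokens, divided by the two vector norms (3 for the nine-token query, about 2.83 for the eight-token memory), gives the 0.59 reported above.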

Step 3 — Inject memories as context, not commands

How recalled text enters the prompt matters as much as retrieval quality. Injected memories should be labeled as background, with provenance:

Relevant notes from memory (may be stale, verify before acting):
- [2026-05-12] User prefers TypeScript over Python for new services.

Unlabeled injection is how agents end up obeying a note from three months ago over the user in front of them. Memories inform; the current conversation commands.
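A small formatter can enforce that labeling on every prompt. This is a sketch that assumes each memory is stored alongside the date it was recorded; the format_memories name and the (date, text) tuple shape are illustrative.

```python
from datetime import date


def format_memories(memories: list) -> str:
    """Render recalled (recorded_on, text) pairs as labeled, dated background notes."""
    if not memories:
        return ""  # inject nothing rather than an empty header
    lines = ["Relevant notes from memory (may be stale, verify before acting):"]
    for recorded_on, text in memories:
        lines.append(f"- [{recorded_on.isoformat()}] {text}")
    return "\n".join(lines)
```

Keeping the header and date prefix in one function means no code path can inject a memory without its provenance.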

Step 4 — When to graduate each component

To embeddings: when token matching starts missing synonyms (“DB” never matches “database”). Use any embedding API or a local model; keep the MemoryStore interface.
To a real vector DB (pgvector first if you already run Postgres): when memories outgrow RAM or need sharing across processes.
To a knowledge graph: only when your questions become relational — “which services depend on the thing that changed?” Retrieval memory can’t answer joins; graphs exist for that.
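To see why the relational question needs a different structure, here is a sketch of a triple store small enough to live in a list. The service names and the depends_on relation are invented for illustration; a production graph would sit in a graph database or a relational table.

```python
from collections import defaultdict, deque

# Hypothetical triples: (subject, relation, object).
TRIPLES = [
    ("checkout", "depends_on", "payments"),
    ("payments", "depends_on", "postgres"),
    ("search", "depends_on", "elasticsearch"),
]


def dependents_of(target: str, triples: list) -> set:
    """Every service that depends on target, directly or transitively."""
    # Index the edges in reverse so we can walk from a dependency to its users.
    reverse = defaultdict(set)
    for subject, relation, obj in triples:
        if relation == "depends_on":
            reverse[obj].add(subject)
    found = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in reverse[node] - found:  # skip already-visited nodes
            found.add(parent)
            queue.append(parent)
    return found
```

Asking dependents_of("postgres", TRIPLES) returns both payments and checkout, even though no single stored fact mentions checkout and postgres together. A similarity search over the same three sentences has no way to chain them.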

Step 5 — Forgetting is a feature

Memory that only grows becomes noise that costs tokens. Production memory needs an eviction story: expire by age, deduplicate near-identical entries, and — most effective — have the agent periodically rewrite its memory files, compressing ten observations into one conclusion. Curation beats accumulation.
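The first two eviction rules can be sketched in a few lines. The 90-day cutoff, the 0.8 Jaccard threshold for "near-identical", and the prune name are all assumptions to tune, not recommendations; the periodic rewrite step needs a model call and is left out.

```python
import re
from datetime import date, timedelta


def _token_set(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a: set, b: set) -> float:
    """Shared tokens divided by all tokens; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prune(entries: list, today: date, max_age_days: int = 90,
          dup_threshold: float = 0.8) -> list:
    """Drop (recorded_on, text) entries that are too old or near-duplicates.

    Entries are visited newest first, so the most recent copy of a
    duplicated fact is the one that survives.
    """
    cutoff = today - timedelta(days=max_age_days)
    fresh = sorted((e for e in entries if e[0] >= cutoff),
                   key=lambda e: e[0], reverse=True)
    kept = []
    for recorded_on, text in fresh:
        tokens = _token_set(text)
        if not any(_jaccard(tokens, _token_set(k)) >= dup_threshold for _, k in kept):
            kept.append((recorded_on, text))
    return kept
```

Running this on a schedule, or at session start, keeps the store bounded without any model involvement.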

Troubleshooting

The agent keeps acting on stale memories

Add timestamps to every entry, label injected memories as potentially stale (Step 3), and prefer recency-weighted scoring: score * decay(age). Many “the agent is confused” reports turn out to be “the memory is old.”
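One common choice for decay(age) is an exponential half-life. The sketch below assumes age is measured in days and uses a 30-day half-life purely as a placeholder; both function names are illustrative.

```python
from datetime import date


def decay(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a brand-new memory, 0.5 after one half-life."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def recency_weighted(score: float, recorded_on: date, today: date,
                     half_life_days: float = 30.0) -> float:
    """Scale a similarity score by how old the memory is."""
    age_days = (today - recorded_on).days
    return score * decay(age_days, half_life_days)
```

With these defaults a memory recorded 30 days ago needs twice the raw similarity of one recorded today to rank equally.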

Retrieval returns plausible-looking but irrelevant memories

Top-k always returns something — even when nothing is relevant. Add a minimum-score threshold and inject nothing below it. An empty memory section beats a misleading one.
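The threshold is a one-line filter over the (score, text) pairs that search returns. The 0.2 default below is an arbitrary starting point for the bag-of-words scorer, not a measured value; embedding scores sit on a different scale and need their own cutoff.

```python
def filter_by_score(results: list, min_score: float = 0.2) -> list:
    """Keep only (score, text) pairs at or above min_score; may return []."""
    return [(score, text) for score, text in results if score >= min_score]
```

Applied to the toy corpus from Step 2, this keeps the 0.59 preference memory and drops the two 0.00 entries that top-k would otherwise pass along.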
