When AI Treats Your Codebase as Memory: Three Routes and the Price of Each

It started with a paper, Code as Agent Harness (arXiv 2605.18747), and one line in it: semantic memory turns an external codebase into a queryable evidence space that the agent can retrieve from and inject into its active context. The same paper has a second line that ended up running underneath the whole discussion: ungoverned historical records introduce semantic noise, error propagation, and false retrievals, whereas only curated, quality-controlled experiential memory is likely to become a useful asset.

I threw those sentences at the coding agent I had open, and one question led to another until we'd taken the whole idea of "how does an agent actually remember a project" apart. Writing it down.

My first question was a dumb one: how do I turn this project, the one in front of me, into that kind of queryable evidence space?

First, separate the two kinds of "memory"

People lump RAG and agent memory together. Pull them apart and they're two different things. One is an evidence index: chunk the source, embed it, drop it in a vector store, and pull it back by semantic similarity when you need it. The other is distilled facts — one-line conclusions like "the lint command for this repo is task lint."

Lay the two side by side and the difference gets clearer. An evidence-index entry looks like this (fictional example):

{
  "id": "src/billing/parser.py::parse_invoice",
  "path": "src/billing/parser.py",
  "span": [142, 168],
  "symbol": "parse_invoice",
  "kind": "function",
  "text": "def parse_invoice(raw): ...",
  "embedding": [0.013, -0.224, ...],
  "git_blob_sha": "a1b2c3...",
  "indexed_at_commit": "c6397c7"
}

The git_blob_sha field is what makes incremental sync possible later on. A distilled fact is far plainer:

subject:  tooling
fact:     "lint runs task lint, format runs task format"
citation: Taskfile.yaml
scope:    repository

One is "raw evidence, traceable line numbers, changes with the code"; the other is "a human-readable conclusion, slow to change but prone to drift." The line in the paper is mostly about the former, but the thing you use every day tends to be the latter.

Three routes, and scale decides which one

I assumed I'd have to build a vector store. After the conversation I realized that for a project of two hundred-odd files, a vector store is a waste.

The dump-it-all route is the crudest and often the most effective: dump the entire codebase, docs and all, into the context window at once. No retrieval step filters anything out, so recall is effectively 100%, maintenance is zero, and it can never go stale. As long as it fits, it beats every fancier approach. It only breaks in three cases: it doesn't fit, paying for the whole bundle every time gets too expensive, or the context is so long the model "sees but doesn't catch" the few lines that matter (the lost-in-the-middle effect).

The live-query route is on-demand retrieval: grep, glob, and tree-sitter over the current files whenever you need something. Its defining trait is that it produces no intermediate artifacts — you query, you toss the result, no index, no vectors, no background process to babysit. The only thing that lands on disk is the entry doc you wrote yourself (AGENTS.md, an architecture doc, that kind of thing), and that's source, not a derived artifact.
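To make "no intermediate artifacts" concrete, here is a minimal sketch of a live query in Python. It assumes a plain regex scan is enough; the name live_grep and its parameters are made up for illustration and don't come from any particular tool. Every call re-reads the files as they are right now and writes nothing to disk.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def live_grep(root: str, pattern: str, glob: str = "*.py") -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line_number, stripped_line) for every line matching `pattern`.

    Nothing is cached or indexed: each call scans the current files from scratch.
    """
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield str(path), lineno, line.strip()
```

Something like list(live_grep(".", r"def parse_invoice")) gives you path and line number on demand, and the result can be thrown away as soon as it has been used.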

The vector-store route is the vector RAG the paper describes. tree-sitter slices the source by function or class (keeping imports and docstrings for context); an embedding model that handles code computes the vectors (the general-purpose text-embedding-3-large, or code-tuned families like voyage-code and jina-code); the vectors go into a local store like sqlite-vec or LanceDB that lives in a single file or directory you can gitignore; queries blend dense vectors with BM25; and the whole thing gets wrapped in an MCP server that exposes a search_code tool for any agent. The price is explicit: it spawns an .index/ bundle, it goes stale, and it needs continuous syncing.
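The conversation never got into how the dense and BM25 results are blended. One common option is reciprocal rank fusion (RRF), sketched below. This is my own stand-in, not something the paper or the agent specified. It assumes you already have two ranked lists of chunk ids, one from the vector search and one from BM25; the function name and the default k=60 (a conventional choice for RRF) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 5
) -> List[str]:
    """Fuse several ranked lists of chunk ids into one list.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in both the dense and the BM25 list rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep output deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]
```

RRF only looks at ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.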

That's when one thing clicked: there's no free lunch called "instant vector retrieval." Vectors always require building an index first, and that step is the source of all their maintenance cost. The live-query route gets its immediacy by settling for grep-level matching; the vector-store route gets its semantics at the cost of an index that can go stale.

The code keeps changing — how do you maintain memory?

This was the part I cared about most. I change code every day; doesn't the memory rot every day too?

The key answer is incremental, not rewrite. On the vector side, store a git blob SHA per chunk and diff against it on rebuild: skip the unchanged, re-embed the changed, drop the deleted. Trigger it on pre-push or in CI, not pre-commit — otherwise every commit stalls waiting on embeddings. Add a nightly full reconciliation as a safety net.
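A minimal sketch of that diff step, under a few assumptions: chunks are identified by ids like the one in the example entry, each chunk's bytes are hashed the same way git hashes a blob (SHA-1 over a "blob <size>\0" header plus the content), and the actual embedding and deletion happen elsewhere. The names plan_sync and SyncPlan are hypothetical.

```python
import hashlib
from typing import Dict, List, NamedTuple


def git_blob_sha(data: bytes) -> str:
    """Hash bytes the way `git hash-object` hashes a blob."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class SyncPlan(NamedTuple):
    unchanged: List[str]   # skip these
    to_embed: List[str]    # new or changed chunks: re-embed
    to_delete: List[str]   # chunks that no longer exist: drop from the index


def plan_sync(indexed: Dict[str, str], current: Dict[str, bytes]) -> SyncPlan:
    """Compare stored SHAs (chunk id -> sha) with current content (chunk id -> bytes)."""
    unchanged: List[str] = []
    to_embed: List[str] = []
    for chunk_id, data in sorted(current.items()):
        if indexed.get(chunk_id) == git_blob_sha(data):
            unchanged.append(chunk_id)
        else:
            to_embed.append(chunk_id)
    to_delete = sorted(set(indexed) - set(current))
    return SyncPlan(unchanged, to_embed, to_delete)
```

Only the to_embed list costs embedding calls, which is why this is cheap enough to run on pre-push or in CI. The nightly full reconciliation would simply call the same function with an empty indexed dict.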

The distilled-facts side is subtler, because it drifts the fastest. I learned a counterintuitive point here: the cure for churn is writing facts more abstractly, not rewriting them often. "Lint runs via task lint" survives hundreds of lines of edits without breaking; "the decision logic is at line 142 of parser.py" points at the wrong place the moment a line moves. The first lives long, the second rots in three days. There's a more general principle behind it: prefer instructions that discover current state over hardcoded paths and line numbers.

One more thing worth noting: cross-session memory is cumulative — a new session inherits and updates incrementally, it doesn't wipe and rewrite. This is the same durable-state discipline behind multi-context agent workflows: external state you can resume from beats trusting the context window. Rewriting wastes tokens and, worse, throws away the quality signal accumulated through upvotes and downvotes.

Do five layers of memory fight each other?

I pressed the agent on where its own memory lives. It listed five layers — I've stripped out the actual locations and kept only the nature of each:

  1. Distilled facts — the short-sentence knowledge it stores and votes on; persists long-term across sessions. High risk.
  2. Structural map — a codebase map maintained in the background that gets actively pushed into context. High risk.
  3. Session state — this task's plan, checkpoints, and artifacts; thrown away when done. Low risk.
  4. Queryable history — past sessions' turns, events, and file changes, queryable with SQL, traceable long-term. Medium risk.
  5. Scratch — todos and queues, used-once-and-discarded. Low risk.

Five layers sounds easy to get tangled. But take them apart and the risk that needs governing lives only in the first two. There are two tests: is the content a claim rather than a factual record, and does it get actively injected into context? Only what trips both is dangerous: the distilled facts and structural map that surface on their own and might be wrong. Layers 3 and 5 are throwaway, and layer 4 is an objective record that is only consulted on a deliberate query, so none of them can pollute the work in progress. The layering itself is the governance: the raw log never wanders over and contaminates the assertion space, and only the curated layer gets taken up automatically.
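The two tests fit in a few lines of Python. This is only a sketch: the class name is made up, and the flag values for each layer are my reading of the list above, not something the agent reported in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryLayer:
    name: str
    is_claim: bool       # an assertion that could be wrong, as opposed to a factual record
    auto_injected: bool  # surfaces in context on its own, without a deliberate query

    @property
    def needs_governance(self) -> bool:
        # Only content that trips both tests can silently mislead the current task.
        return self.is_claim and self.auto_injected


# Flag values are an interpretation of the five layers described in the text.
LAYERS = [
    MemoryLayer("distilled facts", is_claim=True, auto_injected=True),
    MemoryLayer("structural map", is_claim=True, auto_injected=True),
    MemoryLayer("session state", is_claim=False, auto_injected=False),
    MemoryLayer("queryable history", is_claim=False, auto_injected=False),
    MemoryLayer("scratch", is_claim=False, auto_injected=False),
]

risky = [layer.name for layer in LAYERS if layer.needs_governance]
```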

Verify before use — won't that burn tokens?

I raised what I thought was the question that would break the whole story: if every memory has to be verified before use, then with enough memories you'd be verifying forever and torching tokens.

The agent's answer reframed it for me. Cost isn't a function of total memory count; it's "how many did this task actually base a decision on, and how bad is it if they're wrong." Inject five at once and I might decide on the strength of one — the other four were never used, so they don't get verified.

And most verification is a free ride that costs almost nothing extra. Take "Lint is task lint": I just run the command, and if it's wrong it errors out on the spot, verified. Or "Which line is this function on": I was going to read that file anyway, so I confirm it on the way past. The only ones worth spinning up a separate, token-spending round for are the few that are high-consequence, easily stale, and not verifiable in passing. Those are rare.
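As a sketch, the triage looks like this. The three flags are judgment calls the agent (or you) would have to make per fact; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class UsedFact:
    text: str
    high_consequence: bool       # being wrong would be costly
    easily_stale: bool           # likely to have changed since it was stored
    verifiable_in_passing: bool  # the task itself will confirm or refute it


def needs_dedicated_check(fact: UsedFact) -> bool:
    """A separate verification round is only worth it when all three conditions hold."""
    return fact.high_consequence and fact.easily_stale and not fact.verifiable_in_passing


def facts_to_verify(injected: Iterable[UsedFact], decided_on: Set[str]) -> List[UsedFact]:
    """Facts that were injected but never drove a decision are ignored entirely."""
    return [f for f in injected if f.text in decided_on and needs_dedicated_check(f)]
```

The cost scales with the length of the list facts_to_verify returns, not with the size of the store or even with how many facts were injected.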

So how do you cap the count itself? At the write gate and the eviction gate, not in after-the-fact cleanup. Set the bar high: to be storable, a fact has to be useful in the future, independent of the current change, unlikely to shift, and not directly readable from a small code snippet. There's an explicit blacklist alongside it: anything ephemeral, task-specific, or "for now," any secrets, anything that changes. None of it gets stored.
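Part of that write gate can be mechanical. The sketch below is a heuristic of my own, not the agent's actual rules: a few regexes catch the obvious blacklist cases (temporary wording, things that look like secrets) plus hardcoded line numbers as a proxy for "likely to shift". Whether a fact is useful in the future or independent of the current change still takes judgment that no regex provides.

```python
import re
from typing import List

# Hypothetical rejection rules: (reason, pattern). Tune or extend for your own project.
_REJECT_RULES = [
    ("hardcoded line number", re.compile(r"\bline\s+\d+\b", re.IGNORECASE)),
    ("temporary wording", re.compile(r"\b(for now|temporar(?:y|ily)|todo|wip)\b", re.IGNORECASE)),
    ("possible secret", re.compile(r"\b(password|api[_-]?key|secret|token)\b\s*[:=]", re.IGNORECASE)),
]


def write_gate(fact: str) -> List[str]:
    """Return the reasons to reject `fact`; an empty list means it may be stored."""
    return [reason for reason, pattern in _REJECT_RULES if pattern.search(fact)]
```

With these rules, "lint runs task lint, format runs task format" passes, while "the decision logic is at line 142 of parser.py" is rejected for its hardcoded line number.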

Then the rule I think matters most: when you hit a fact you've already stored, the right move is to vote, not store a second copy. Upvote to strengthen, downvote to lower the weight; "learning the same thing again" should raise the weight on the existing row, not grow a new one. Converge by voting, don't bloat by adding. And there's a bound on the read side too: each retrieval only surfaces the top-N relevant entries, so the amount actually injected is decoupled from how big the total store gets.
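Here is a toy version of "vote, don't duplicate" with a bounded read side. It makes two simplifying assumptions that a real memory tool would not: duplicates are detected by exact match after normalizing case and whitespace, and retrieval ranks by weight alone rather than by relevance to the current task. The class name FactStore is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FactStore:
    weights: Dict[str, int] = field(default_factory=dict)

    @staticmethod
    def _key(fact: str) -> str:
        # Normalize case and whitespace so trivially different spellings collide.
        return " ".join(fact.lower().split())

    def learn(self, fact: str) -> None:
        """Storing a known fact upvotes the existing row instead of adding a new one."""
        key = self._key(fact)
        self.weights[key] = self.weights.get(key, 0) + 1

    def downvote(self, fact: str) -> None:
        """Lower the weight; the row stays in the store."""
        key = self._key(fact)
        if key in self.weights:
            self.weights[key] -= 1

    def top(self, n: int) -> List[str]:
        """Bounded read side: at most n entries are surfaced, however large the store."""
        ranked = sorted(self.weights.items(), key=lambda kv: (-kv[1], kv[0]))
        return [key for key, _ in ranked[:n]]
```

Learning "Lint runs via task lint" twice leaves one row with weight 2, and top(n) never returns more than n entries no matter how many facts pile up behind it.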

Repo map: the map at the start of every big task

Late in the conversation it converged on a concrete practice. For any sizable task, first have the agent scan the relevant files and produce a map — "file → class → line → who calls whom" — and drive the rest of the task off that map.
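For a sense of what such a map contains, here is a minimal sketch for Python sources. It uses the standard library's ast module as a stand-in for tree-sitter, covers only top-level functions and methods of top-level classes, and records called names without resolving them to definitions. The function name map_source is hypothetical.

```python
import ast
from typing import Dict


def map_source(source: str) -> Dict[str, dict]:
    """Map each top-level function and class method to its line and the names it calls."""
    tree = ast.parse(source)
    result: Dict[str, dict] = {}

    def record(node: ast.AST, qualname: str) -> None:
        # Collect the name of every call inside this function body (unresolved).
        calls = sorted({
            child.func.id if isinstance(child.func, ast.Name) else child.func.attr
            for child in ast.walk(node)
            if isinstance(child, ast.Call)
            and isinstance(child.func, (ast.Name, ast.Attribute))
        })
        result[qualname] = {"line": node.lineno, "calls": calls}

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            record(node, node.name)
        elif isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    record(item, f"{node.name}.{item.name}")
    return result
```

Run over the relevant files and keyed by path, the result is the "file → class → line → who calls whom" structure in miniature.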

How long you keep that map decides whether it rots. Treat it as a per-task scratchpad, throw it away when done, regenerate each time, and it keeps the "never stale" property. The moment you save it for cross-task reuse, it becomes a derived file that goes stale and needs the same maintenance discipline as a vector index. My pick is the former: put it in the session's plan file, keep it out of version control, regenerate it once per big task.

Does generating the map burn tokens and require a pile of tool calls? I asked that too. The map itself is tiny — three hundred tokens or so. The real cost is the scan, but those files were ones you'd read for the task anyway; the scan just front-loads and compresses the "have to read it regardless" work. Over a long task it usually nets out cheaper. There's a trick to cap the scan cost too: hand it to a separate explore sub-agent, let the pile of greps and noisy output stay in its context, and you get back only a clean map.

There's an angle I hadn't considered: this kind of mechanical extraction doesn't need the strongest model at all. Finding symbols, tagging lines, drawing call graphs is low-reasoning, high-recall work; a Haiku- or mini-class model does it well, and plenty of agent frameworks default to routing explore to a cheap model like that. Keep the strong model for judgment: which files are relevant to this task, which relationships are hidden, what to touch first. Cheap model scans, expensive model judges, each in its lane.

Put it together and the flow at the start of a big task looks roughly like this:

  1. Decide whether a map is even worth it (a one-file change just greps, skip it).
  2. Check whether an existing background map already covers the task, and if so use that directly.
  3. If not, dispatch an explore sub-agent on a cheap model to scan, in parallel, for a coarse panorama.
  4. Have the conductor (strong model) combine "task + coarse map" to decide which files to read and which to edit.
  5. Zoom into only the chosen few for a deep read, yielding a focused working-set map.

That last step is "filter + magnify": a wide survey first, then a zoom-in, not just carving out a subset. Within a session the map grows iteratively; across sessions it's thrown away and regenerated.

The holes it didn't tell you about

The whole thing sounds smooth, but when I asked the agent to be honest about the limits, those answers were the real point.

Citation line numbers drift silently. A memory says line 142 of parser.py, the code changes, and it now points somewhere else — with no error — and I only catch it if I happen to read that file again. Maintenance is reactive: a stale fact only gets fixed the next time someone uses it and happens to verify it; until then it sits there misleading. There's no active mechanism that says "the code changed, so invalidate the related memory."

Deduplication is best-effort too. A near-duplicate that never gets surfaced for comparison can still slip in as a second entry. And a downvote lowers the weight, it doesn't delete: to truly remove a memory you have to go into the tool's memory-management UI by hand, because downvoting alone will never make it disappear. The weak-model extraction comes with a caveat of its own: line-number accuracy is lower and the relationship-miss rate is higher, so don't skip the extra review pass on high-risk changes.

One page to take away

  • Small repo: dump-everything or grep plus entry docs is enough — a vector store is over-engineering.
  • Semantic memory: persist across sessions, maintain incrementally, never rewrite each time; control volume with a high write bar + dedup-into-voting + prune + bounded injection.
  • Anti-churn: write memory at an abstract level; instruction-style beats hardcoded line numbers.
  • Repo map: generate-and-use, put it in the plan file, regenerate per task to guarantee it's never stale; delegate the scan to an explore sub-agent on a cheap model and only judge yourself.
  • Verification discipline > how much you store: memory's value is in verify-before-use and traceable citations, not in count.

So the point isn't how perfect the memory system is. Its value rides on verification discipline and traceable citations, not on how many entries you've stored. That's the through-line of the whole conversation: from dumping everything, to grep, to vectors, every step forward buys you capability that you pay for with a matching share of maintenance responsibility. Which one you pick comes down to how big your codebase is right now, and how much maintenance you're willing to pay.