Context engineering · Field note

Load late, load little:
just-in-time context for conversation history

Most agents drag their entire past into every turn. A better default: keep a thin index of what was said hot, and fetch only the few turns you actually need — intact, on demand.

There is a quiet assumption baked into how most agents handle memory: that more context is safer than less. If the model might need something, put it in the window. The conversation grows, every prior turn rides along on every new request, and we trust the model to find the part that matters.

That assumption breaks twice. It breaks on cost, because an agent loop re-sends its whole window on every step — a hundred stale turns aren't paid for once, they're paid for on turn 101, 102, and every step after. And it breaks on quality, because models don't read a long window evenly. Relevant facts buried in the middle get underweighted; irrelevant bulk competes for attention with the thing that actually answers the question. Past a point, a bigger context produces a worse answer, not just a costlier one.
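The cost half is plain arithmetic. As a rough sketch, assume every turn costs the same number of tokens and add up what each step re-sends. The 50-token turn and the 10-turn bounded window below are illustrative assumptions, not measurements:

```python
from typing import Optional

def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            window_turns: Optional[int] = None) -> int:
    """Total input tokens billed over a whole conversation.

    At step t the request carries every turn so far (full context), or at
    most `window_turns` of them when a bound is given.
    """
    total = 0
    for step in range(1, turns + 1):
        resident = step if window_turns is None else min(step, window_turns)
        total += resident * tokens_per_turn
    return total

full = cumulative_input_tokens(turns=100, tokens_per_turn=50)
bounded = cumulative_input_tokens(turns=100, tokens_per_turn=50, window_turns=10)
print(full, bounded)  # 252500 47750
```

Under those assumptions the full-context loop bills 252,500 input tokens over 100 steps against 47,750 for the bounded window. One total grows with the square of conversation length, the other linearly.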

So the interesting question isn't "how do we fit more in?" It's "how do we keep the window small and dense without losing the one old turn that matters?" This post is the design we built around that question — for the specific case of long conversation history — plus the benchmark we used to keep ourselves honest.

[Figure: two panels. FULL CONTEXT: everything hot, diluted, billed every step; the one needed turn sits among all the rest; tokens 100%. JIT CONTEXT: a thin index (#1…#40, one line each) plus only the needed turn, retrieved intact; the other 36 turns stay in the cold store, fetchable by #id if a later turn needs them; tokens ~40%.]
The inversion. Full context keeps every turn resident, so the signal (one turn) is diluted by noise (the rest) and the whole thing is re-billed each step. JIT keeps a compact index always visible and pulls only the few relevant turns back to full heat.

01 / the mechanism · A hot index over a cold store

The design borrows directly from how computers have always managed memory that doesn't fit: a small fast tier that's always present, a large slow tier that holds the bulk, and a rule for moving things between them. Virtual memory pages between RAM and disk. We page between the context window and an external store — for attention instead of address space.

Concretely there are two tiers. The cold store holds every turn at full fidelity, keyed by id — nothing is thrown away. The hot index holds one compact entry per turn: a short summary, a little metadata (entities, whether the turn recorded a decision), and an embedding of that summary. The index is cheap enough to keep in the window permanently; the payloads are not.

That permanence matters more than it looks. Because the index — a table of contents — is always present, the model can always see that something exists even when it hasn't loaded the detail. "There was a decision about the deploy window at turn 14" stays visible as a one-liner. The worst failure mode of any retrieval system is silently dropping a relevant fact so the model never knows to ask for it. An always-on index is the guard against that.
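A minimal sketch of the two tiers, to make the shape concrete. The class and field names here are hypothetical, and the one-line rendering is modeled on the index lines shown in this post rather than on the repository's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    """One hot-index line: a summary plus a little metadata."""
    turn_id: int
    role: str
    summary: str
    entities: list = field(default_factory=list)
    is_decision: bool = False
    embedding: list = field(default_factory=list)  # embedding of the summary

    def line(self) -> str:
        flag = " [decision]" if self.is_decision else ""
        ents = f" ({', '.join(self.entities)})" if self.entities else ""
        return f'#{self.turn_id} {self.role}{flag} "{self.summary}"{ents}'

class TieredHistory:
    def __init__(self):
        self.cold = {}  # turn_id -> full text, never discarded
        self.hot = {}   # turn_id -> IndexEntry, always rendered into the window

    def append(self, turn_id: int, full_text: str) -> None:
        self.cold[turn_id] = full_text

    def index(self, entry: IndexEntry) -> None:
        self.hot[entry.turn_id] = entry

    def fetch(self, turn_ids):
        """Return full-fidelity turns for the ids that exist in the cold store."""
        return [self.cold[i] for i in turn_ids if i in self.cold]

    def render_index(self) -> str:
        return "\n".join(self.hot[i].line() for i in sorted(self.hot))

history = TieredHistory()
history.append(12, "Full text of turn 12, possibly thousands of tokens long.")
history.index(IndexEntry(12, "user", "lock deploy window to Tuesday 02:00 UTC",
                         ["Deploy", "UTC"], is_decision=True))
print(history.render_index())
```

The index line is what the model always sees; `fetch` is the only path back to the full text.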

[Figure: two tiers. HOT, in the context window: one index line per turn, the table of contents the model always sees, e.g. #12 user [decision] "lock deploy window to Tuesday 02:00 UTC" (Deploy, UTC) and #13 assistant "confirmed; updating the runbook" (Runbook). COLD, external store: full turns keyed by id (#12 full text, 2.1k tokens; #13, 0.9k tokens; #14, 1.4k tokens; … #40, the long tail). Arrows: summarize + embed (async); lazy fetch by #id.]
Two tiers, one rule. Turns are summarized and embedded into the hot index after each response (off the critical path, so it never blocks a reply). The full text lives in the cold store and is pulled back only when retrieval says a specific turn is needed.
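One way to keep indexing off the critical path is to schedule it as a background task and return the reply without awaiting it. This is a sketch using `asyncio`; the function names are hypothetical, and the sleep stands in for the real summarizer and embedder calls:

```python
import asyncio

async def summarize_and_embed(turn_id: int, text: str, index: dict) -> None:
    # Stand-in for the real summarizer and embedder calls (hypothetical).
    await asyncio.sleep(0.01)
    index[turn_id] = text[:40]

async def handle_turn(turn_id: int, text: str, index: dict, pending: set) -> str:
    reply = f"reply to #{turn_id}"  # stand-in for the real model response
    # Schedule indexing in the background; do not await it here.
    task = asyncio.create_task(summarize_and_embed(turn_id, text, index))
    pending.add(task)  # hold a reference so the task is not garbage-collected
    task.add_done_callback(pending.discard)
    return reply

async def main():
    index, pending = {}, set()
    reply = await handle_turn(1, "Decision: the deploy window is Tuesday 02:00 UTC.",
                              index, pending)
    indexed_at_reply = 1 in index   # indexing has not run yet
    await asyncio.gather(*pending)  # drain background work before exiting
    return reply, indexed_at_reply, 1 in index

result = asyncio.run(main())
print(result)  # ('reply to #1', False, True)
```

The reply is available before the index entry exists, which is the point: the newest turn is covered by the verbatim recent window until its index line lands.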

02 / retrieval · Shortlist by similarity, then let a model pick

When a new turn arrives and the conversation is long enough to bother, we have to decide which old turns to reheat. We do it in two stages — cheap-and-broad, then precise-and-narrow.

First, a semantic shortlist: embed the current request, rank every index entry by similarity, take the top dozen. This is fast, recall-oriented, and costs no model call. Second, a model picks: a small, fast model sees the request and those dozen candidate summaries — never the full text — and returns the handful of turn ids whose full detail is actually worth loading. Similarity finds plausible candidates; the picker applies judgment the embeddings can't.

[Figure: retrieval pipeline, from cheap and broad to precise and narrow. New turn (the query) → semantic shortlist (top-12 by cosine) → model picks (summaries only, e.g. [ #12, #14 ]) → assemble window (index + picked turns + recent N verbatim).]
Hybrid retrieval. The shortlist keeps recall high without a model call; the picker raises precision on a short list. Feeding the picker summaries, not full turns, keeps that sub-call small and fast.
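The two stages can be sketched in a few lines. This is an illustration under assumptions, not the repository's retriever: the index is a plain dict of `turn_id -> (summary, embedding)`, the embeddings are toy 2-d vectors, and `keyword_picker` is a hypothetical stand-in for the small model call.

```python
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def shortlist(query_vec, index, k=12):
    """Stage 1: rank every index entry by similarity, keep the top k ids."""
    ranked = sorted(index, key=lambda tid: cosine(query_vec, index[tid][1]),
                    reverse=True)
    return ranked[:k]

def retrieve(query, query_vec, index, picker, k=12, max_picks=6):
    """Stage 2: the picker sees only candidate summaries, never full turns."""
    candidates = shortlist(query_vec, index, k)
    summaries = {tid: index[tid][0] for tid in candidates}
    picked = picker(query, summaries)
    # Guard: drop any id the picker invented, then cap the result.
    return [tid for tid in picked if tid in summaries][:max_picks]

def keyword_picker(query, summaries):
    # Hypothetical stand-in for the model: pick summaries mentioning "window".
    return [tid for tid, text in summaries.items() if "window" in text]

toy_index = {
    9:  ("decision: deploy window is Tuesday 02:00 UTC", [1.0, 0.0]),
    10: ("lunch plans for Friday",                       [0.0, 1.0]),
    11: ("fixed lint errors in the deploy script",       [0.8, 0.6]),
}
picked = retrieve("what did we settle on for the deploy window?",
                  [1.0, 0.1], toy_index, keyword_picker, k=2)
print(picked)  # [9]
```

Filtering the picker's output against the shortlist is a cheap defence against a model returning ids it was never shown.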

The window we finally send is assembled from four parts: the system prompt; the scoped index (relevant and decision-flagged lines only, so the table of contents doesn't itself grow without bound); the handful of retrieved turns at full fidelity; and the most recent few turns kept verbatim. Recency is free and usually relevant, so there's no reason to make the retriever re-derive it.
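As a sketch, assembly is mostly concatenation plus one piece of bookkeeping: a retrieved turn that is already in the recent window should not be sent twice. The section labels and the `(turn_id, text)` tuple shape below are assumptions made for illustration:

```python
def assemble(system, index_lines, retrieved, recent, query):
    """Build the window text from its parts.

    `retrieved` and `recent` are lists of (turn_id, full_text) tuples.
    """
    recent_ids = {tid for tid, _ in recent}
    # Skip retrieved turns that the recent window already carries verbatim.
    older = [(tid, text) for tid, text in sorted(retrieved)
             if tid not in recent_ids]
    sections = [
        system,
        "## Index\n" + "\n".join(index_lines),
        "## Retrieved turns\n" + "\n".join(f"#{tid}: {text}" for tid, text in older),
        "## Recent turns\n" + "\n".join(f"#{tid}: {text}" for tid, text in recent),
        "## Request\n" + query,
    ]
    return "\n\n".join(sections)

window = assemble(
    system="You are a helpful assistant.",
    index_lines=['#9 assistant [decision] "deploy window is Tuesday 02:00 UTC"'],
    retrieved=[(60, "Sounds good."),
               (9, "Decision: the deploy window is Tuesday 02:00 UTC.")],
    recent=[(59, "Anything else?"), (60, "Sounds good.")],
    query="What did we settle on for the deploy window?",
)
print(window)
```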

03 / a worked example · The needle at turn 9

Make it concrete. A sixty-turn conversation; back at turn 9 the assistant said "Decision: the deploy window is Tuesday 02:00 UTC. Let's lock that in." Fifty turns of unrelated chatter follow. Now the user asks: "Remind me, what did we settle on for the deploy window?"

A naive recency window keeps the last four turns — all chatter — and the answer is simply gone. Full context keeps everything, answer included, but pays for sixty turns and dilutes the needle among them. Here's what JIT does instead:

# the request comes in; conversation is over threshold → JIT engages
query  = "what did we settle on for the deploy window?"

# 1 · semantic shortlist over the index (summaries only)
shortlist = top_k(embed(query), index, k=12)
#   → #9 ranks high: its summary shares "deploy / window / settle"

# 2 · model picks from the shortlist's summaries
picked = picker(query, shortlist, max=6)   # → [9]

# 3 · assemble: system + scoped index + full #9 + recent 4 + query
window = assemble(system, index, fetch(picked), recent_4, query)   # picked == [9]

input turns: 61  →  sent turns: 11   ·   tokens 1066 → 360  (66% saved)

The needle is retrieved intact, the answer is exact, and the window is a third of the size. The point isn't the percentage — it's that JIT beat both baselines at once: more accurate than recency (which lost the turn) and far cheaper than full (which kept all sixty).

04 / the honest part · Three-way comparison, including the result that didn't work

The comparison that matters is never "JIT vs. the giant window." It's three-way, because each baseline fails differently:

| approach | accuracy | tokens | the catch |
| --- | --- | --- | --- |
| full (everything) | high | 100% | complete but diluted; billed every step |
| recency (last N turns) | ~0% | ~10% | drops anything old, however important |
| jit (index + fetch) | high | ~40–50% | only as good as retrieval recall |

That last catch is the whole game, and it cuts both ways. When retrieval is good, JIT gives you a small dense window that is both cheaper and higher quality. When retrieval misses, JIT has removed a turn that the full window would at least have kept somewhere. So JIT only wins when retrieval recall clears a bar; below it, you've built a slow leak.

A negative result, kept on the record. Our first version injected the entire index on every turn. On long conversations that table of contents grew so large it cost more than the turns it was saving: token "savings" went negative, and got worse the longer the chat. The fix was to scope the injected index to the shortlist plus decision-flagged turns. Only after that did savings turn positive and grow with length. The benchmark existed precisely to catch this; an eval that only ever shows wins isn't measuring anything.
[Figure: token savings vs. conversation length. x-axis: turns in conversation (20, 40, 80, 160, 320); y-axis: token savings (0% to 100%). Series: scoped index, reaching 94% at 320 turns; full index, negative (clipped).]
Why scoping mattered. With the full index injected every turn (red, dashed) savings were negative and worsened with length. Scoping the index to relevant + important lines (amber) flips it: savings climb from break-even at 20 turns to ~94% at 320. Numbers are from our offline harness.
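The scoping rule itself is small. A sketch, assuming each index entry is a dict with `turn_id`, `line`, and `is_decision` keys (a hypothetical shape chosen for this example):

```python
def scope_index(entries, shortlist_ids):
    """Keep only index lines that are shortlisted or flagged as decisions."""
    keep = set(shortlist_ids)
    return [e["line"] for e in entries
            if e["turn_id"] in keep or e["is_decision"]]

# Toy 320-turn conversation in which every 40th turn recorded a decision.
entries = [
    {"turn_id": t, "line": f"#{t} summary", "is_decision": t % 40 == 0}
    for t in range(1, 321)
]
scoped = scope_index(entries, shortlist_ids=range(100, 112))
print(len(entries), len(scoped))  # 320 20
```

In this toy case the injected index is 12 shortlisted lines plus 8 decision lines, and it stays that size however long the conversation gets, as long as decisions remain sparse. That last condition is the assumption to watch.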

05 / proving it properly · The LoCoMo benchmark

Self-measured wins on a synthetic generator are a smell test, not proof. To make a claim anyone should believe, you have to run on a public benchmark other systems also report on. For long-term conversational memory, that benchmark is LoCoMo (Maharana et al., presented at ACL 2024, released by snap-research).

LoCoMo is built for exactly the regime JIT targets: very long, multi-session conversations where the answer to a question is some specific thing said far earlier, buried in noise. Its scale is the point: the dataset is ten long conversations averaging on the order of 600 turns and 26 sessions each, with roughly 1,500 question–answer pairs annotated across the set. A short-context trick won't survive it.

What makes it a genuinely useful diagnostic — rather than a single score — is that its questions come in distinct types, and they stress a retrieval-based design very differently:

| question type | what it asks | stresses… |
| --- | --- | --- |
| single-hop | one fact from one turn | pure retrieval (the JIT sweet spot) |
| multi-hop | join facts across disjoint turns | pulling several related turns, not just the top one |
| temporal | ordering & timelines | whether the index kept timestamps |
| open-domain | inference beyond the stated text | reasoning, not just recall |
| adversarial | unanswerable / trap | excluded from scoring by convention |

Following the standard protocol, the adversarial category is excluded so results stay comparable to published numbers. The run itself mirrors the architecture: a write phase ingests the whole conversation into the index and cold store, then a read phase answers each question by assembling a window and handing it to a real model. Two scorers are reported together for defensibility — word-overlap F1 against the reference, and an LLM-as-judge for semantic correctness — and crucially, accuracy is broken out per category.
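For readers who haven't met it, word-overlap F1 can be sketched as below. The normalization (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA scoring practice; the benchmark's own scoring script may normalize differently, so treat this as an illustration rather than the harness's scorer.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list:
    """Lowercase, drop punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("Tuesday at 02:00 UTC", "Tuesday 02:00 UTC")
print(round(score, 3))  # 0.857
```

The example shows why F1 is reported next to a judge: a correct answer with one extra word scores 0.857, not 1.0.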

Why the per-category breakdown earns trust. A single aggregate number hides the story. The honest pattern a retrieval-first design shows on LoCoMo is strong single-hop, weaker temporal, and the breakdown tells you why: temporal questions can have perfect evidence recall (you fetched the right turns) yet still miss, because answering needs ordering information the summaries may have stripped. That cleanly separates the two failure modes worth distinguishing, "didn't retrieve it" versus "retrieved it but couldn't use it", and points straight at the fix (timestamp-aware indexing) instead of a vague "improve retrieval."
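That separation is straightforward to compute once each result records the gold evidence turns and the turns actually retrieved. A sketch with hypothetical field names and made-up toy records (not benchmark output):

```python
from collections import Counter, defaultdict

def evidence_recall(gold_ids, retrieved_ids) -> float:
    """Fraction of gold evidence turns that made it into the window."""
    gold = set(gold_ids)
    if not gold:
        return 1.0  # assumption: no annotated evidence means nothing to miss
    return len(gold & set(retrieved_ids)) / len(gold)

def classify(result, recall_bar=1.0) -> str:
    if result["correct"]:
        return "ok"
    recall = evidence_recall(result["gold_ids"], result["retrieved_ids"])
    return "retrieved_but_unused" if recall >= recall_bar else "not_retrieved"

def breakdown(results):
    """Count outcomes per question category."""
    table = defaultdict(Counter)
    for r in results:
        table[r["category"]][classify(r)] += 1
    return table

toy_results = [
    {"category": "single-hop", "correct": True,
     "gold_ids": [9], "retrieved_ids": [9]},
    {"category": "temporal", "correct": False,
     "gold_ids": [3, 7], "retrieved_ids": [3, 7, 12]},
    {"category": "open-domain", "correct": False,
     "gold_ids": [5], "retrieved_ids": [8]},
]
table = breakdown(toy_results)
print({category: dict(counts) for category, counts in table.items()})
```

A wrong answer with full evidence recall lands in `retrieved_but_unused`, which points at the index or the answerer. A wrong answer with missing evidence lands in `not_retrieved`, which points at the retriever.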

The actual numbers

Run across all ten conversations (1,540 non-adversarial QA pairs) with real embeddings (gemini-embedding-001) and a real answerer/judge (gemini-2.5-flash-lite):

| mode | F1 | judge | ret. recall | token cost |
| --- | --- | --- | --- | --- |
| full | 0.41 | 0.43 | | 100% |
| recency | 0.00 | 0.01 | | ~10% |
| jit | 0.37 | 0.36 | 0.71 | ~50% |

The shape is the thesis made real: JIT reaches about 90% of full-context F1 at roughly half the tokens, while recency collapses to near-zero because the answer is almost never in the most recent turns. That gap between JIT and recency — same token budget order, wildly different accuracy — is the entire argument for indexing over truncation.

One caution on reading these: the absolute scores are low because the answerer/judge is a small fast model; full-context F1 is only 0.41. All three modes would rise with a stronger model, so the trustworthy signal is the relative gap (JIT ≈ full, recency ≈ 0), not the 90% ratio in isolation — a ratio of two small numbers is noisy. Lead with the structure, not the percentage.

And the per-category breakdown delivers exactly the diagnostic promised above:

| category (jit) | F1 | judge | ret. recall |
| --- | --- | --- | --- |
| single-hop | 0.48 | 0.51 | 0.74 |
| multi-hop | 0.28 | 0.26 | 0.73 |
| temporal | 0.22 | 0.10 | 0.69 |
| open-domain | 0.13 | 0.20 | 0.43 |

Single-hop is strongest: pure retrieval, the design's home ground. Temporal is the tell: retrieval recall is a healthy 0.69 (the right turns were fetched), yet judge accuracy is just 0.10. The evidence was present, but the ordering information had been stripped from the summaries. Open-domain is weakest on both F1 and retrieval recall, because diffuse inferential context is genuinely hard to surface by similarity. None of this is hidden; it's the map of what to fix next.

The rule held throughout: only place another system's figure beside these if it ran through the identical harness; otherwise it's "reported, not directly comparable." Different judge models and answer-matching make cross-paper numbers a trap.

06 / where this sits · One slice of a bigger pattern

"Just-in-time context" as a phrase isn't ours — it traces to Anthropic's 2025 context-engineering guidance, which argued for keeping lightweight identifiers in the window and resolving them to full content at runtime. That broader pattern shows up in several places: skills loaded on activation, large tool results offloaded to handles, sub-agents that take a heavy subtask out of the parent's window.

Most of those are forward-looking — the agent decides what to load as it works. The slice described here is retrospective: indexing the conversation's own past and pulling old turns back when they become relevant. And where the common shortcut for long history is compaction — summarize the transcript and discard the detail — the bet here is the opposite: keep every turn intact in cold storage and fetch it whole, so nothing is lost to a lossy summary made before anyone knew which detail would matter. Selection, not compression; conversation history, not all content. That's the corner of the map this occupies.

07 / the parts that are still hard · Caveats worth stating plainly

Retrieval recall is the whole ballgame. A missed fetch is a silent quality loss the model often can't detect. The always-on index mitigates the worst case but doesn't erase it.

Summarizing at index time is lossy and early. You compress a turn before you know what future question will be asked of it — a real circular dependency. Keeping the full turn fetchable is the hedge, but the index summary still gates whether you ever go fetch it.

Every lazy fetch and picker call is latency. Push indexing off the response path, keep the picker on a small model, cap the shortlist — and measure whether the sub-call's cost eats the savings it's chasing.

Dynamic windows fight prompt caches. A context that changes every turn invalidates a provider's cached prefix. Keep the stable parts first and the freshly retrieved parts last to preserve what cache you can.
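A toy illustration of the ordering point. It uses shared character prefix as a rough proxy for a cacheable prefix (real provider caches match on token prefixes, and their rules vary), and it holds the index constant between requests for simplicity:

```python
from os.path import commonprefix

def build(parts):
    return "\n".join(parts)

system = "SYSTEM: you are a deploy assistant."
index = '#9 assistant [decision] "deploy window is Tuesday 02:00 UTC"'
turn_a = "retrieved turn #9: Decision: the deploy window is Tuesday 02:00 UTC."
turn_b = "retrieved turn #31: We moved the standup to 10:00."

# Stable parts first, freshly retrieved parts last.
stable_first = [build([system, index, turn_a]), build([system, index, turn_b])]
# Freshly retrieved parts first: the prefix changes on every request.
volatile_first = [build([turn_a, system, index]), build([turn_b, system, index])]

shared_stable = len(commonprefix(stable_first))
shared_volatile = len(commonprefix(volatile_first))
print(shared_stable, shared_volatile)
```

With the stable parts first, two consecutive requests share the whole system prompt and index. With the retrieved turn first, they share only the few characters the two retrieved turns happen to have in common.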

None of these sink the idea. They define where it pays off: long-horizon, low-locality conversations, past a length threshold, where the relevant past is neither recent nor predictable. In that regime — and the benchmark exists to tell you whether you're in it — loading late and loading little is simply the better default.

The code. The implementation (the proxy, the hybrid retriever, and the LoCoMo harness described here) lives at github.com/NirajPandey05/jit_context. It runs offline with no keys for a quick look, and takes real embedder / picker / answerer backends when you want comparable numbers.