Why Your AI Agent Gets Worse Over Time — Context Rot Explained

Table of Contents

Have you noticed this pattern: an AI agent starts a long task performing well, but by halfway through it’s repeating mistakes, undoing fixed code, or getting confused by its own earlier instructions? This isn’t the model “getting dumb” — and it’s not your prompts getting worse. It has a name: Context Rot.

TL;DR

Context Rot: the gradual degradation in AI agent output quality as the context window fills with accumulated session history
Root cause: Transformers process all tokens equally — old failures compete with current instructions for attention
Research finding: below 50% context fill, models lose tokens in the middle; above 50%, they lose the earliest tokens
Databricks research: correctness drops measurably after 32K tokens; agents start repeating historical actions
A bigger context window delays the problem but doesn’t fix it — the issue is noise accumulation, not size
Fixes: Memory Pointer Pattern, external memory stores, periodic context compaction, clean context resets

What Is It

Context Rot is what happens when an AI agent’s working memory — its context window — accumulates noise over the course of a task: failed attempts at fixes, outdated versions of code, error messages, contradictory instructions from earlier in the session.

The model doesn’t “lose access” to this old content — it’s all still in the context. The problem is that it all competes for the model’s attention alongside the current, relevant instructions. As noise accumulates, the signal-to-noise ratio drops and output reliability falls.

Why It Matters

As AI agents move from answering single questions to executing multi-step engineering tasks, Context Rot scales from “occasional wrong answer” to “task failure.” A 100-step task where the agent degrades at step 60 means everything after that runs on flawed assumptions — often producing output that’s harder to fix than doing the work manually.

How It Works

The Transformer Attention Problem

Transformers process context without distinguishing “current instruction” from “record of a failed attempt from 30 turns ago” — both are just tokens in the self-attention calculation. This means:

graph LR
    A[Correct Instructions] --> D[Attention Mechanism]
    B[Failed Attempt 1] --> D
    C[Failed Attempt 2] --> D
    E[Outdated Code Version] --> D
    F[Contradictory Earlier Instruction] --> D
    D --> G[Signal-to-noise ratio drops]
    G --> H[Output quality degrades]

Research Findings

Chroma (2025): Tested models at different fill levels and found a consistent pattern:

Context window below 50% full → model tends to lose tokens in the middle
Context window above 50% full → model tends to lose the earliest tokens
In both cases, critical context gets deprioritized before the window is ever full

Databricks Mosaic: Model correctness starts dropping after 32K tokens, with agents increasingly favoring repetitive actions drawn from their growing history rather than novel solutions.

The Bigger Context Window Fallacy

The intuitive response is “just use a bigger context window.” This is a common misconception.

	Larger Context Window	Active Context Management
Addresses root cause	No	Yes
Effect	Delays the problem	Prevents noise accumulation
Cost	Higher (token costs scale linearly)	Engineering effort, but lower inference cost
Best for	Single-pass long document processing	Multi-turn tasks, long-running agents

Context Rot can start when the window is only 30% full. More space doesn’t stop noise from degrading signal quality.

Fixes

1. The Memory Pointer Pattern

Core idea: don’t push data into context; push only a reference (pointer) to where the data lives externally.

AWS demonstrated this with a materials science workflow:

Traditional approach: tool outputs placed directly in context → 20,822,181 tokens consumed → workflow failed
Memory Pointer approach: tool outputs stored externally, context holds only an index reference → 1,234 tokens → succeeded
Efficiency gain: over 16,000× in this case

graph LR
    A[Tool Execution Result] --> B[External Storage DB / Vector Store]
    B --> C[Returns Pointer ID]
    C --> D[Context holds only the pointer]
    D --> E[Data retrieved on demand]

2. Treat Context Like RAM, Not Disk

The framing from mem0.ai and MindStudio:

Context Window is RAM: fast, immediately accessible, and structurally unsuited for anything that needs to survive beyond the current task. Treating it as persistent storage is asking RAM to do the job of a database.

Correct mental model:

Working memory (context window): only what this task needs right now
Long-term memory (external database): project conventions, preferences, past decisions
Knowledge base (vector store): technical documentation, API references

3. Context Compaction

Similar to OS garbage collection: periodically compress conversation history into a summary, retaining key facts and discarding execution details.

Claude Code’s built-in Compaction mechanism and LangChain’s ConversationSummaryMemory both implement this. The agent loses fine-grained history but retains the signal that matters.

4. Clean Context Resets with Persistent Rules Files

StackOne’s recommendation: reset context between tasks rather than carrying all history forward. Keep critical conventions and architecture decisions in a permanent PROJECT_RULES.md or ARCHITECTURE.md — load it fresh at the start of each task rather than inheriting it from conversation history.

Summary

Context Rot is the most underappreciated failure mode in production AI agents. It doesn’t produce error messages or crashes — it just quietly makes outputs worse until you realize you’ve accumulated a pile of problems to fix manually.

Treating the context window as RAM — a finite resource to be budgeted, compacted, and managed deliberately — is one of the core principles of AI engineering in 2025. The agents that stay sharp in production are the ones where someone thought carefully about what goes in the context and when it gets cleaned out.

References

🇺🇸 English

Here's a pattern you've probably run into if you've spent any real time with AI coding agents. You kick off a long task — the agent starts strong, makes good decisions, writes clean code. Then somewhere around the halfway point, things start going sideways. It repeats a mistake it already fixed. It undoes working code. It seems confused by instructions it wrote itself twenty turns ago. And your first instinct is to think the model is getting dumb, or that your prompts are degrading somehow.

Neither of those is what's happening. There's a name for this: Context Rot.

Context Rot is the gradual degradation in AI agent output quality as the context window fills up with accumulated session history. Failed fix attempts. Outdated versions of code. Error messages. Contradictory instructions from earlier in the conversation. All of it is still sitting there in the context — the model hasn't "forgotten" it. The problem is subtler than that. Every single token in context competes for the model's attention. So failed attempt number three from forty turns ago is competing for attention alongside your current, relevant instructions. As the noise piles up, the signal-to-noise ratio drops, and output reliability follows.

This is a structural property of how Transformers work. The attention mechanism doesn't distinguish between "current instruction" and "record of a failed approach from half an hour ago." Both are just tokens. There's no concept of relevance or recency baked in at a fundamental level.

And the research backs this up pretty specifically. Chroma ran tests at different context fill levels and found something interesting: when the context window is less than half full, models tend to lose track of tokens in the middle. When it's more than half full, they start losing the earliest tokens. In both cases, critical context gets deprioritized before the window is even close to full. Databricks found that correctness starts measurably dropping after around 32,000 tokens, and agents begin favoring repetitive actions drawn from their growing history rather than generating novel solutions to the problem in front of them.

Now here's where a lot of people go wrong: the intuitive fix is to just use a bigger context window. More space, more room, problem solved. This is a fallacy. Context Rot can start when the window is only thirty percent full. The issue isn't running out of space — it's that noise accumulates and degrades signal quality regardless of how much total space you have. A bigger window just delays the problem. It doesn't address the root cause.

So what does actually fix it?

The most dramatic solution is what AWS demonstrated with something called the Memory Pointer Pattern. The core idea: instead of pushing data directly into the context, you push only a reference to where that data lives externally. Think of it like a variable pointing to memory rather than a copy of the data inline. AWS tested this on a materials science workflow. The traditional approach — tool outputs placed directly in context — consumed over twenty million tokens and the workflow failed. The Memory Pointer approach stored outputs externally and kept only an index reference in context. That same workflow completed successfully on just over a thousand tokens. That's a sixteen-thousand-times efficiency difference.

The broader mental model here comes from mem0 and MindStudio, and it's the one I think is most clarifying: treat the context window like RAM, not like a filing cabinet. RAM is fast and immediately accessible, but it's structurally the wrong tool for anything that needs to persist. You wouldn't store your database in RAM. The context window is the same — it should hold only what this specific task needs right now. Long-term memory, project conventions, past decisions — those belong in an external database. Technical documentation and API references belong in a vector store. The context window is for working memory, full stop.

Two more practical techniques worth knowing: Context Compaction, which is essentially garbage collection for conversation history — you periodically compress old turns into a summary, keeping the key facts and discarding the execution details. Both Claude Code and LangChain have built-in mechanisms for this. And Clean Context Resets, recommended by StackOne — instead of carrying all session history forward between tasks, you reset context and load a fresh copy of a permanent rules file. Critical architecture decisions live in something like `PROJECT_RULES.md` and get loaded at the start of each new task rather than inherited from a long, noisy conversation history.

So let me leave you with three things to take away from this.

First, when an AI agent degrades mid-task, that's not a model quality problem — it's a context management problem. The architecture is working as designed; the context just wasn't managed.

Second, bigger context windows are not the answer. They're a delay, not a fix. The engineering work is in controlling what enters the context and when it gets cleaned out.

And third, the mental model shift that actually changes how you build agents: context window is RAM. Manage it like a finite, precious resource — budget it deliberately, compact it regularly, and don't treat it like a place to store things.

The agents that stay sharp in production are the ones where someone thought carefully about context hygiene from the start. Everyone else ends up debugging a pile of quietly accumulated garbage.

🇹🇼 中文

你有沒有遇過這種情況：用 AI 代理跑一個長任務，前幾十步執行得很漂亮，但越到後面越離譜——把已經修好的地方改壞、重複同樣的錯誤、甚至跟自己的指令互相矛盾？這不是模型偷懶，也不是你 prompt 寫得不好。這個現象有個名字，叫做 **Context Rot**，情境腐敗。

今天我們來拆解它的原理，還有怎麼真正解決它。

---

先說問題的本質。AI 代理在執行任務的過程中，Context Window 會不斷累積東西：失敗的嘗試、舊版的程式碼、重複的錯誤訊息、互相矛盾的指令。Transformer 在處理這些內容的時候，不會區分「這是有用的指令」還是「這是三步前的錯誤記錄」——它們都是 Token，在注意力機制裡平等競爭。訊號被噪音稀釋，輸出品質就跟著下滑。

Chroma 的研究發現了一個有趣的規律：Context 還沒到一半的時候，模型容易遺失中間位置的內容；超過一半之後，最早的內容開始消失。換句話說，不管你的 Context Window 有多大，關鍵資訊都有可能在某個時間點悄悄蒸發。Databricks 的研究則更直接：超過 32K Token 之後，代理的正確率明顯掉下來，而且會開始重複歷史中出現過的操作，而不是嘗試新的解法。

這個問題在單次問答的場景影響還不大，但在「執行 100 個步驟的工程任務」裡，如果第 60 步開始退化，後面所有的工作都建立在錯誤假設上，最後的結果比沒用 AI 更難收拾。

---

很多工程師聽到這裡的第一反應是：「那就換更大的 Context Window 不就好了？」但這是個常見的誤解。Context Rot 在 Context 只有 30% 滿的時候就可能開始發生。更大的空間只是推遲問題爆發的時間點，不是解法。

真正的解法是**主動管理 Context**，像管理記憶體一樣。有四個方向：

**第一個，Memory Pointer Pattern。** 核心思想是：不要把資料本身塞進 Context，只放資料的位址。工具的執行結果存到外部資料庫，Context 裡只放一個索引 ID，需要的時候再去讀取。AWS 在材料科學工作流程的測試裡，傳統做法消耗了兩千多萬 Token 然後失敗，Memory Pointer 的做法只用了一千多個 Token 就成功了。效率差距超過一萬六千倍。

**第二個，把 Context 當 RAM，不是磁碟。** 這是一個很好的比喻：RAM 快速、可即時存取，但任務結束就清空，你不會把公司所有資料都塞進 RAM。正確的做法是分層：Context Window 只放當前任務必要的內容；長期記憶——專案規範、歷史決策——放外部資料庫；技術文件和 API 參考放向量庫，查的時候再撈。

**第三個，定期壓縮。** 類似作業系統的垃圾回收：把對話歷史壓縮成摘要，保留關鍵事實，丟掉執行細節。Claude Code 內建的 Compaction 機制、LangChain 的 ConversationSummaryMemory，都在做這件事。

**第四個，乾淨的起點。** 任務之間重置 Context，不要試圖把所有歷史帶進下一個任務。重要的架構決策和規範存成一個固定的文件，每次任務開始時讀取，而不是從對話歷史繼承。

---

總結三個核心帶走：

一，Context Rot 是 AI 代理最容易被忽略的工程問題——它不會崩潰，不會報錯，只是悄悄讓品質下滑，直到問題已經積累到難以收拾。

二，更大的 Context Window 不是解法，因為問題的根源是訊號雜訊比，不是空間大小。

三，把 Context 當 RAM 管理：有限資源，需要主動預算、壓縮、分層存放，而不是無限往裡面加東西。

← Previous What Is a Data Lakehouse? From Data Warehouses to Open Table Formats

Next → My Take on Apple's 2025 M4 Lineup

AI Agent Bills Exploding? A Practical Guide to Model and Tool Selection

AI agent billing spikes come from three places: using a stronger model than the task requires, no depth limit on tool call loops, and context window waste from passing full history every round. The correct cost control strategy is matching model capability to task complexity, not using the strongest model for everything.

#ai #llm #cost-optimization #agent #engineering

tech

June 6, 2026

How AI Reshapes How You Think: The Cognitive Shift Beyond the Tool

AI tools change more than your speed — they change how you think. The shift from 'how to do it' to 'what to do' and 'is this right?' has real long-term implications for engineers.

#ai #cognitive-change #llm #productivity #thinking #knowledge-work