Over 60,000 GitHub repositories now contain some variant of an AGENTS.md, CLAUDE.md, or COPILOT-INSTRUCTIONS.md file. The major AI coding vendors have been recommending them as a standard practice — give the agent a map of the territory before it starts exploring, and it’ll navigate more efficiently. The premise is intuitive. The empirical record, it turns out, is something else.
A paper published in February 2026 by researchers at ETH Zurich and DeepMind, “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?”, is the first rigorous benchmark evaluation of whether these files actually improve agent task completion rates. They constructed a new benchmark — AGENTbench — comprising 138 tasks drawn from 12 real Python repositories where developers had already written context files. They tested four coding agents (Claude Code, Codex, GPT models, and Qwen) across three conditions: no context file, LLM-generated context file, and developer-written context file. The result is uncomfortable for anyone who has been following vendor guidance.
LLM-generated context files reduce task success rates by roughly 2–3% and increase inference costs by more than 20%. Developer-written files do better — a ~4% improvement — but impose a similar cost penalty of up to 19%. Neither delivers the productivity gain the current industry narrative implies, and one actively makes things worse.
Why the Files Underperform
The tempting interpretation is that agents aren’t reading the context files carefully. The study data says the opposite. Agents follow context file instructions precisely. When a context file mentions a specific tool, agents use that tool 2.5 times more frequently than when it’s absent. Grep operations, test executions, and file traversals all increase measurably when context is provided. Reasoning token usage in GPT models jumps 14–22% — the agents are thinking harder about the guidance they’ve been given. The failure mode isn’t that the files are being ignored. It’s what the files are telling agents to do, and the cost of following those instructions faithfully.
The core problem with LLM-generated context files is redundancy. When researchers removed markdown and documentation files from the evaluation environment, those same LLM-generated context files improved performance by 2.7% — actually outperforming developer-written files. The information in a generated context file isn’t useless; it’s a duplicate of information the agent can discover by reading the repository’s own documentation. You’re not giving the agent a map — you’re handing it a summary of a map it already has access to, and then charging it to read both.
Developer-written context files avoid the redundancy problem, which is why they show a modest performance gain. But the behavioral analysis reveals the mechanism behind their cost: they shift agents toward more thorough exploration patterns. More testing, more file traversal, more repository-specific tooling. On AGENTbench, context files added an average of 3.92 steps to agent execution paths. The agent is following the instructions, and more instructions mean more steps, which means a longer and more expensive path to the same destination.
The Architectural Implication
This matters well beyond the specific case of coding agents, because the AGENTS.md pattern is an instance of a much broader design question that enterprise teams are actively working through: how much pre-context do you give an agent at task initialization, in what form, and with what discipline around relevance?
The current default assumption in context engineering is that more information is better — agents are sophisticated enough to filter out what they don’t need. The evidence here says that assumption is wrong. Agents don’t filter gracefully. They process and act on the guidance they’re given, even when that guidance adds steps that don’t improve outcomes. In a coding agent executing a single task, that costs you a few extra tokens. In enterprise deployments where agents run at scale — across hundreds of repositories, automated pipelines, continuous integration workflows — a 20% inference cost premium is a real line item, and the performance degradation compounds it.
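To see why a 20% premium becomes a real line item, a back-of-envelope model helps. The figures below (run volume, tokens per run, price per million tokens) are illustrative assumptions, not numbers from the paper; only the ~20% overhead figure comes from the study.

```python
# Back-of-envelope cost model for a fractional inference premium at scale.
# All inputs except the overhead fraction are illustrative assumptions.

def annual_agent_cost(runs_per_day: float, tokens_per_run: float,
                      usd_per_million_tokens: float, overhead: float = 0.0) -> float:
    """Yearly inference spend; `overhead` is a fractional premium
    (e.g. 0.20 for a ~20% context-file cost increase)."""
    daily = runs_per_day * tokens_per_run * (1 + overhead) * usd_per_million_tokens / 1e6
    return daily * 365

baseline = annual_agent_cost(runs_per_day=500, tokens_per_run=200_000,
                             usd_per_million_tokens=10.0)
with_context = annual_agent_cost(runs_per_day=500, tokens_per_run=200_000,
                                 usd_per_million_tokens=10.0, overhead=0.20)
print(f"baseline: ${baseline:,.0f}/yr  with context files: ${with_context:,.0f}/yr "
      f"(+${with_context - baseline:,.0f})")
```

Under these assumed volumes the premium alone is tens of thousands of dollars a year, and that is before accounting for the performance degradation compounding it.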
The failure mode the study is documenting has a name in systems design: it’s a form of instruction bloat. The agent’s behavior space is being constrained and directed by context that doesn’t reduce ambiguity — it increases cognitive load without reducing the search space. A context file that says “always run the full test suite before committing” is an example of a constraint that sounds reasonable but costs more than it’s worth when the agent would have run the relevant tests anyway and now has to execute the entire suite at each step. The instruction is being followed exactly as intended. The outcome is not what was intended.
Here’s what that actually means architecturally: context files are an ambient, always-present input that agents apply globally to their task execution. That makes them an appropriate location for genuinely minimal, high-signal guidance — the kind of information the agent cannot discover from existing repository documentation and that materially changes the correct approach to the task. It makes them an inappropriate location for general engineering hygiene the agent already knows, repository overviews that duplicate the README, and LLM-generated summaries of documentation that already exists in the codebase.
The paper’s recommendation — that human-written context files “should describe only minimal requirements” — is a call for precision in a practice that has drifted toward comprehensiveness. The average context file in the AGENTbench benchmark ran 641 words. That’s not a context file. That’s an instruction manual, and the evidence says agents treat it like one.
What Practitioners Should Actually Do
The practical implications are specific enough to act on immediately.
If your context files were generated by an LLM tool as part of your development environment setup, the appropriate first step is to measure whether they’re helping before assuming they are. The study suggests the most likely outcome of disabling them is that agent performance holds approximately flat and your inference costs drop. That’s a test worth running.
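That test can be as simple as a paired comparison: run the same task set with and without the context files, log outcomes, and compare success rate and cost. The sketch below assumes you can extract per-run pass/fail and cost from your agent's telemetry; the `Run` records shown are hypothetical placeholders for that data.

```python
# Minimal A/B comparison of agent runs with vs. without context files.
# The run records are hypothetical; in practice they come from agent telemetry.

from dataclasses import dataclass

@dataclass
class Run:
    task_id: str
    passed: bool
    cost_usd: float

def summarize(runs: list[Run]) -> tuple[float, float]:
    """Return (success rate, mean cost per run)."""
    n = len(runs)
    return sum(r.passed for r in runs) / n, sum(r.cost_usd for r in runs) / n

with_ctx = [Run("t1", True, 0.62), Run("t2", False, 0.80), Run("t3", True, 0.71)]
no_ctx = [Run("t1", True, 0.50), Run("t2", True, 0.55), Run("t3", True, 0.58)]

(sr_ctx, cost_ctx), (sr_no, cost_no) = summarize(with_ctx), summarize(no_ctx)
print(f"success: {sr_ctx:.0%} vs {sr_no:.0%}; cost: ${cost_ctx:.2f} vs ${cost_no:.2f}")
```

If the no-context condition holds flat on success and comes in cheaper, as the study predicts it often will, disabling the generated files is an easy win.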
If you’re maintaining developer-written context files, apply a strict minimalism discipline. Include only information the agent cannot discover from existing repository documentation, and only information that changes the correct approach to a task in a meaningful way. The bar should be: “Would an experienced engineer unfamiliar with this specific repository get this wrong without being told?” If the answer is yes, it belongs in the context file. Coding standards, testing philosophy, and general documentation guidance almost certainly don’t clear that bar. Non-obvious toolchain configurations, custom test runners, and repository-specific deployment constraints might.
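Concretely, a context file that clears that bar might look like the following. This is a hypothetical example, not taken from any repository in the study; every path and command in it stands in for the kind of non-discoverable, approach-changing fact the bar demands.

```markdown
# AGENTS.md

- Tests run through our wrapper, not pytest directly:
  `./scripts/test.sh <path>` (it provisions a local fixture database first).
- The `legacy/` tree is frozen; make changes to the `core/` equivalents instead.
- Any schema change requires re-running `make manifest` before committing.
```

Three lines, each one something an experienced engineer would plausibly get wrong without being told, and nothing the README already says.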
The deeper shift is in how to think about context engineering as a practice. The intuition that richer context makes agents more effective is not categorically wrong — it’s wrong in the specific, common case where the context duplicates what the agent can already access. The study’s redundancy finding is the key result to carry forward: when the agent can read the documentation directly, a context-file summary of that documentation adds cost without subtracting ambiguity. Effective context files are the ones that provide information the agent genuinely doesn’t have, not the ones that aggregate information it can find elsewhere.
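A crude way to operationalize the redundancy finding is to measure lexical overlap between a context file and the documentation the agent can already read. The sketch below uses Jaccard similarity over word sets as a heuristic; the sample texts and any threshold you'd apply are illustrative, not from the paper.

```python
# Heuristic redundancy check: lexical overlap between a context file and
# existing docs. A high score suggests the context file mostly restates
# documentation the agent could read directly. Sample texts are illustrative.

import re

def word_set(text: str) -> set[str]:
    # Lowercased word tokens of length >= 2.
    return set(re.findall(r"[a-z][a-z0-9_]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    wa, wb = word_set(a), word_set(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

readme = "Run pytest from the repo root. Use black for formatting."
agents_md = "Always run pytest from the repo root and format with black."

score = jaccard(agents_md, readme)
print(f"overlap: {score:.2f}")  # higher => more redundant with the README
```

A real check would compare against all repository documentation and use something better than word overlap, but even this crude version flags the pathological case: a generated context file that is mostly a paraphrase of the README.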
Reading the Limitations Honestly
The benchmark covers 138 tasks across 12 Python repositories. That’s a credible foundation, and the consistency of results across four different agents — Claude Code, Codex, GPT models, and Qwen — makes the directional findings more robust than they would be from a single-agent evaluation. But it’s still a relatively narrow window. The study evaluated task completion rate; it didn’t measure code quality, security properties, or the maintainability of agent-produced patches. Context files may function differently in languages with unusual or niche conventions where agent training coverage is sparse, or in domains where the gap between what the agent knows and what the codebase requires is larger than in the standard Python repositories studied here.
The right response to this paper isn’t to delete every context file you have and never think about them again. It’s to stop assuming they’re helping without evidence that they are. The vendors will adjust their guidance eventually — the empirical record on AGENTS.md is now accumulating, and the story it’s telling isn’t the one that led to 60,000 repository adoptions. Until vendor guidance catches up to the data, the safest position for enterprise teams is skepticism toward comprehensive context files, investment in measuring their actual effect on your agents, and a deliberate shift toward the minimal-and-precise end of the spectrum when you do use them.
The principle the study is pointing at is one that good systems engineers recognize from other contexts: in a complex adaptive system, instructions that get followed are constraints on behavior. Add more constraints than the system needs, and you don’t get a more efficient system — you get a more expensive one that’s no more correct.