Continuous Improvement of Cursor Agent Harness: A Measurement-Driven Methodology

Core Argument: The article makes one central engineering claim: the quality of the Harness is not designed but measured. Through the evolution of its context window, a dual-layer measurement system (Keep Rate plus LLM semantic evaluation), and anomaly detection alerts, Cursor has built a data-driven closed loop for improving agent quality. I find this methodology more practical than any configuration guide on building agents that I have seen.

Background: Harness is a Product, Not a Configuration

Cursor opens with a key shift in understanding:

“We approach building the Cursor agent harness the way we’d approach any ambitious software product.”

This statement elevates the Harness from a simple configuration of “a system prompt + a few tool definitions” to a product-level engineering effort that runs on a loop of measurement → hypothesis → experimentation → iteration.

Most agent developers still view the Harness as mere “parameter tuning” — adjusting temperature, changing the system prompt, or trying different tool descriptions. However, Cursor demonstrates a completely different engineering paradigm: viewing the Harness as an independent product with its own quality metrics and improving it using production-level software engineering measurement methods.

I. Evolution of the Context Window: From Guardrails to Dynamic Context

1.1 Early Stage: Extensive Static Context Filling

At the end of 2024, when Cursor first launched its coding agent, models were weaker and poor at selecting their own context. Cursor therefore did extensive “context engineering”, pre-filling the context window with:

Folder Layout: Static parsing of codebase structure as the initial session context.
Semantic Matching Fragments: Matching relevant code snippets from the codebase based on query semantics.
Manually Attached Files: Files the user attached by hand were compressed and pre-filled into the context.

These practices shared a common feature: assuming the model cannot autonomously choose context and requires an external system to determine what is important.

1.2 Current Stage: Removing Guardrails + Dynamic Retrieval

Most of this static filling has since been removed. In Cursor’s words:

“That is mostly long gone.”

The current approach is to:

Retain a few useful static contexts (operating system, git status, current and recently viewed files).
Change all others to dynamic retrieval, allowing the agent to autonomously decide what context is needed during work.

This is an inevitable result of improved model capabilities: when the model can accurately determine “which piece of code I need,” having an external system make that decision introduces noise.

I believe: Cursor’s evolutionary path reveals an important design principle — the agent’s context selection ability is a critical watershed. When the model is strong enough, the Harness should shift from “proactive filling” to “responsive provision.” This aligns with Anthropic’s concept of “attention budget”: the model’s own context selection ability determines where we should perform “context engineering” and where we can let the model autonomously decide.

1.3 Remaining Static Context

Cursor retains a small amount of static context because it holds stable value across all models and scenarios:

# Cursor's retained static context
current_context = [
    "operating_system",     # Operating system type affecting tool behavior
    "git_status",          # Current branch/modification state affecting code decisions
    "current_file",        # The file the user is currently editing
    "recently_viewed"      # List of recently opened files (implicit intent signal)
]

This retention list itself is a valuable reference — it indicates what context is consistently useful in any situation. Everything else can be dynamic.
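A minimal sketch of how such a static context block could be assembled at session start. The function name `collect_static_context`, its parameters, and the choice of `git status --short --branch` are my own assumptions for illustration; the source does not say how Cursor gathers these values.

```python
import platform
import subprocess


def collect_static_context(current_file, recently_viewed, git_status=None):
    """Build the small, always-present context block (hypothetical helper)."""
    if git_status is None:
        try:
            # `git status --short --branch` prints the branch and changed files.
            result = subprocess.run(
                ["git", "status", "--short", "--branch"],
                capture_output=True, text=True, timeout=5, check=False,
            )
            git_status = result.stdout.strip() if result.returncode == 0 else "unavailable"
        except (OSError, subprocess.TimeoutExpired):
            # git is missing or too slow: degrade gracefully instead of failing.
            git_status = "unavailable"
    return {
        "operating_system": platform.system(),
        "git_status": git_status,
        "current_file": current_file,
        "recently_viewed": list(recently_viewed),
    }
```

Everything outside this dictionary would be fetched on demand by the agent through its tools, which is the “dynamic retrieval” half of the design.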

II. Measurement System: Dual-Layer Evaluation Framework

This is the core part of the article. Cursor has established two levels of quality measurement:

2.1 First Level: Offline Evaluation (CursorBench)

Cursor maintains its own evaluation benchmark, CursorBench, combined with publicly available evaluation sets, providing quick standardized quality readings that support quality comparisons over time.

However, Cursor explicitly points out the limitations of public evaluations:

“Even the best benchmarks only approximate real usage.”

Benchmarks can only approximate real usage scenarios, which is an eternal pain point for evaluation designers. Cursor chooses not to rely on a single evaluation but instead establishes multi-layered measurements.

2.2 Second Level: Online Experiments (A/B Testing)

Cursor conducts A/B testing on real users, deploying two or more Harness variants simultaneously to measure quality differences in real usage.
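The source does not describe how users are split between variants. One common approach, shown here as a sketch under that assumption, is deterministic hash-based bucketing so that a user always sees the same Harness variant within an experiment. The names `assign_variant`, `user_id`, and `experiment` are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants):
    """Deterministically map a user to one harness variant (illustrative)."""
    # Including the experiment name gives each experiment an independent split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Because the assignment depends only on the experiment name and the user identifier, no assignment table has to be stored, and metrics such as Keep Rate can later be grouped by variant.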

The specific metrics used are divided into two categories:

Quantifiable Engineering Metrics (directionally useful but cannot fully answer “how well does the agent perform”):

Latency
Token efficiency
Tool call count
Cache hit rate

Deeper Quality Metrics (truly answer “how is the agent’s quality”):

Keep Rate: Code Retention Rate

“For a given set of code changes that the agent proposed, we track what fraction of those remain in the user’s codebase after fixed intervals of time.”

This design is extremely clever. It does not require manual labeling; instead, it utilizes code retention itself as a proxy for user satisfaction — if the user’s code retains the changes generated by the agent, it indicates those changes were accepted.

Key Insight: The effectiveness of Keep Rate lies in the “stickiness” of code — users do not retain changes just because they are “okay”; they must be “truly useful” to be kept. This naturally filters out noise from “just usable” changes.
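How Cursor computes Keep Rate is not described beyond the quote above. As a rough sketch, assuming retention is measured by exact line matches after whitespace stripping (my assumption, as are the function and parameter names), the metric could look like this:

```python
from collections import Counter


def keep_rate(proposed_lines, codebase_lines_later):
    """Fraction of agent-proposed lines still present after the interval."""
    proposed = [line.strip() for line in proposed_lines if line.strip()]
    if not proposed:
        return None  # nothing was proposed, so the rate is undefined
    remaining = Counter(line.strip() for line in codebase_lines_later)
    kept = 0
    for line in proposed:
        # Consume one occurrence per match so duplicates are not double-counted.
        if remaining[line] > 0:
            remaining[line] -= 1
            kept += 1
    return kept / len(proposed)
```

Running the same computation at several fixed intervals (for example after an hour and after a week, both hypothetical values) would show whether changes survive or get rewritten shortly after being accepted.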

LLM Semantic Evaluation: User Response Sentiment Analysis

Cursor employs an LLM to read user responses to the agent’s initial outputs, assessing semantic satisfaction:

“A user moving on to the next feature is a strong signal the agent did its job, while a user pasting a stack trace is a reliable signal that it didn’t.”

This method transforms “user satisfaction” into a scalable, automatically evaluable metric — no manual scoring is needed; the model directly assesses the sentiment of user responses.
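A sketch of what such an evaluation loop might look like. The prompt wording, the three labels, and the `judge` callable (any function that sends a prompt to an LLM and returns its text reply) are all assumptions for illustration; the source does not publish Cursor’s actual prompt or labels.

```python
JUDGE_PROMPT = (
    "You are grading a coding agent. Read the user's follow-up message and "
    "answer with exactly one word: satisfied, dissatisfied, or unclear.\n\n"
    "Follow-up message:\n{message}"
)
LABELS = ("satisfied", "dissatisfied", "unclear")


def satisfaction_rate(follow_ups, judge):
    """Share of decided follow-ups that the judge labels as satisfied."""
    counts = {label: 0 for label in LABELS}
    for message in follow_ups:
        reply = judge(JUDGE_PROMPT.format(message=message)).strip().lower()
        # Any reply outside the label set is counted as unclear.
        counts[reply if reply in counts else "unclear"] += 1
    decided = counts["satisfied"] + counts["dissatisfied"]
    return counts["satisfied"] / decided if decided else None
```

In this sketch a follow-up such as “now add pagination” would ideally be labeled satisfied, while a pasted stack trace would be labeled dissatisfied, mirroring the two signals in the quote.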

I believe: The combination of Keep Rate + LLM semantic evaluation provides a practical framework for measuring agent quality. The former measures objective code behavior outcomes, while the latter assesses subjective user experience. Together, they are closer to a true evaluation of “agent actual quality” than any single metric.

2.3 An Abandoned Experiment

Cursor shares an abandoned experiment: attempting to use a more expensive model for context summarization to observe its impact on agent quality.

Result: The improvement in agent quality was negligible, while the cost was higher. Conclusion: not worth it.

“In one experiment, we tried a more expensive model for context summarization and observed it made a negligible difference in agent quality that wasn’t worth the higher cost.”

The true value of this case is not to say “do not use expensive models for summarization,” but rather to demonstrate how Cursor uses experimental data to terminate seemingly promising but ultimately unworthy directions. This is a counterintuitive yet extremely important engineering discipline — not all “seemingly correct” optimizations are worth deploying.

III. Degradation Tracking and Repair: Anomaly Detection Alert System

3.1 Nature of the Problem: Complexity of Harness and Surface Bugs

As functionality increases, the state space of the Harness grows dramatically. Any complex software system faces the same issue: the larger the state space, the more potential bugs there are, many of which can only be detected at scale.

Cursor specifically points out that tool call errors can lead to severe consequences:

“Tool call errors can be extremely harmful to a session in Cursor. While the agent can often self-correct, errors remain in context, wasting tokens and causing ‘context rot,’ where accumulated mistakes degrade the quality of the model’s subsequent decisions.”

Here, the concept of “context rot” is introduced — errors accumulate and contaminate subsequent decisions. This is more dangerous than merely “tool call failures” because it can lead the agent to stray further in the wrong direction without realizing it.

3.2 Error Classification System

Cursor classifies tool errors into two categories:

Unknown Errors → Always treated as bugs, triggering alerts without exception.

“Any unknown error represents a bug in the harness, and we treat it accordingly.”

Expected Errors → Require further analysis; they cannot be classified as bugs outright.

These errors can arise from various reasons:

InvalidArguments: The model proposed incorrect parameters
UnexpectedEnvironment: Conflicting information in the context window
ProviderError: An outage at a tool provider (e.g., GenerateImage, WebSearch)
UserAborted: The user actively interrupted the call
Timeout: The tool call exceeded its time limit

Expected errors may be bugs or may be normal behavior. For example, a grep timeout may point to a tool performance issue, or it may simply mean that the codebase is very large and the model wrote an inefficient query.
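The two-tier classification can be sketched as a small enum. The category names come from the list above; the enum itself, the `UNKNOWN` member, and the helper functions are my own illustrative choices, not Cursor’s code.

```python
from enum import Enum


class ToolErrorKind(Enum):
    INVALID_ARGUMENTS = "InvalidArguments"
    UNEXPECTED_ENVIRONMENT = "UnexpectedEnvironment"
    PROVIDER_ERROR = "ProviderError"
    USER_ABORTED = "UserAborted"
    TIMEOUT = "Timeout"
    UNKNOWN = "Unknown"  # anything the harness did not anticipate


def classify(error_name):
    """Map a raw error name to a known category, falling back to UNKNOWN."""
    try:
        return ToolErrorKind(error_name)
    except ValueError:
        return ToolErrorKind.UNKNOWN


def is_definite_bug(kind):
    """Unknown errors are always bugs; expected ones need further analysis."""
    return kind is ToolErrorKind.UNKNOWN
```

The useful property is the fallback: an error that matches no expected category is never silently dropped, it lands in the bucket that always alerts.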

3.3 Alert Strategy: Two Mechanisms

Threshold Alerts: Triggered immediately when the unknown error rate exceeds a fixed threshold (because unknown errors = bugs).

Anomaly Detection Alerts: Triggered when the expected error rate significantly deviates from the baseline (baseline calculated per tool and model).

“We compute baselines per-tool and per-model, because different models may mess up tool calls at different rates.”

Calculating baselines per tool and model is a key detail — different models exhibit significant differences in error patterns during tool calls, and a unified baseline cannot be used.
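The two alert mechanisms could be sketched as follows. The fixed threshold value, the z-score test, and the `min_sigma` floor are assumptions chosen for illustration; the source only states that unknown errors use a fixed threshold and that expected errors are compared with a per-tool, per-model baseline.

```python
from statistics import mean, stdev

UNKNOWN_ERROR_THRESHOLD = 0.001  # hypothetical fixed threshold


def unknown_error_alert(unknown_errors, total_calls, threshold=UNKNOWN_ERROR_THRESHOLD):
    """Threshold alert: fire when the unknown error rate exceeds a fixed value."""
    return total_calls > 0 and unknown_errors / total_calls > threshold


def expected_error_anomalies(history, current, z_cutoff=3.0, min_sigma=0.001):
    """Anomaly alert: compare each (tool, model) rate with its own baseline.

    history maps (tool, model) to a list of past error rates;
    current maps (tool, model) to the latest error rate.
    """
    alerts = []
    for key, rate in current.items():
        past = history.get(key, [])
        if len(past) < 2:
            continue  # not enough data to form a baseline yet
        # Floor the spread so a perfectly flat history cannot divide by zero.
        sigma = max(stdev(past), min_sigma)
        if abs(rate - mean(past)) / sigma > z_cutoff:
            alerts.append(key)
    return alerts
```

With separate baselines, a 10% error rate can be normal for one model on a given tool and a clear anomaly for another model on the same tool, which a single shared baseline would not distinguish.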

3.4 Automated Repair Loop

Cursor also has a Weekly Automation that uses a special Skill, teaching the model how to:

Search logs
Identify new or suddenly increasing errors
Create or update tickets in the backlog

“We lean heavily on Cloud Agents to kick off fixes for many issues at once, and can even trigger them directly from Linear.”

The core insight here is: using agents to fix bugs in the agent harness. This is the highest form of “measurement-driven improvement” — not manually analyzing logs or creating issues, but allowing AI to read logs, discover anomalies, and create repair tasks autonomously.

In a focused sprint, this automation process reduced “unexpected tool call errors” by an order of magnitude.
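The log-triage step of the weekly automation (“identify new or suddenly increasing errors”) can be approximated with a simple week-over-week comparison. This is only a sketch: the function name, the `growth_factor` and `min_count` parameters, and the idea of representing each logged error as a signature string are my assumptions.

```python
from collections import Counter


def triage(last_week, this_week, growth_factor=3.0, min_count=5):
    """Return (signature, reason) pairs for new or sharply increasing errors."""
    previous = Counter(last_week)
    current = Counter(this_week)
    findings = []
    for signature, count in sorted(current.items()):
        if count < min_count:
            continue  # ignore one-off noise
        before = previous[signature]
        if before == 0:
            findings.append((signature, "new"))
        elif count >= growth_factor * before:
            findings.append((signature, "spiking"))
    return findings
```

Each finding would then become a ticket to create or update in the backlog, which is the step the source says is handed to agents.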

IV. Model Customization: Different Harnesses for Different Models

4.1 Tool Format Matching

“OpenAI’s models are trained to edit files using a patch-based format, while Anthropic’s models are trained on string replacement. Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes.”

This is a practical engineering observation — mismatched tool formats directly increase token consumption and error rates. Cursor’s approach is to configure each model with the tool format it is accustomed to during training:

OpenAI models → patch-based editing
Anthropic models → string replacement editing

This is not just about “providing the right tool”; it is about customizing Harness behavior based on the model’s training data distribution to reduce reasoning costs and error rates.
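A sketch of per-provider tool selection, together with a simple string replacement edit. The tool labels `patch_edit` and `string_replace_edit` are placeholders of mine rather than real tool names, and requiring exactly one match is an assumed safety rule, not something the source specifies.

```python
EDIT_TOOL_BY_PROVIDER = {
    "openai": "patch_edit",              # placeholder label for a patch-based tool
    "anthropic": "string_replace_edit",  # placeholder label for string replacement
}


def select_edit_tool(provider):
    """Pick the edit tool format that matches the provider's training."""
    try:
        return EDIT_TOOL_BY_PROVIDER[provider.lower()]
    except KeyError:
        raise ValueError(f"no edit tool configured for provider {provider!r}") from None


def apply_string_replace(text, old, new):
    """Replace `old` with `new`, refusing edits that are missing or ambiguous."""
    occurrences = text.count(old)
    if occurrences != 1:
        raise ValueError(f"expected exactly one match, found {occurrences}")
    return text.replace(old, new, 1)
```

Keeping the mapping in one table makes the customization explicit: adding a model family means adding a row, not scattering conditionals through the Harness.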

4.2 Deep Customization: Prompt Adjustments by Model

The depth of customization includes:

Custom prompts for different providers
Custom prompts for different model versions

OpenAI models are “more literal and precise, strong in instruction following,” while Claude is “more intuitive, with a higher tolerance for imprecise instructions.”

4.3 Mitigating Model Quirks with Harness

Cursor shares a case of “context anxiety”:

“We observed one model develop what we came to call context anxiety: As its context window filled up, it would start refusing work, hedging that the task seemed too big. We were able to reduce the behavior through prompt adjustments.”

This is a typical case of model behavior quirks being mitigated by the Harness. The model’s exhibited “context anxiety” is not a bug but an inherent behavior under specific state distributions. Through prompt adjustments (possibly informing the model that “context filling is normal, do not lose confidence because of it”), the Harness successfully reduced this behavior.

I believe: This indicates that even with strong model capabilities, there will still be model-level quirks that the Harness needs to address. Models are not perfect, and one of the important responsibilities of the Harness is to build stable behavior on top of the model’s imperfections.

V. Challenges of Mid-Conversation Model Switching

This is one of the most technically deep parts of the article. Cursor describes the technical challenges of allowing users to switch models mid-conversation:

5.1 Core Issue: Different Models Have Different Behaviors, Prompts, Tool Shapes

When users switch models, Cursor automatically switches to the corresponding model’s Harness — including that model’s customized prompts and tools. However, the conversation history was generated by another model, making it out of distribution for the current model.

5.2 Solution: Custom Instructions to Guide Takeover

Cursor adds custom instructions to inform the model:

That it is taking over a session mid-chat (not starting from scratch)
That it should not call tools that appear in the conversation history but are missing from its own toolset

This is the Harness layer’s handling of model transitions — not changing the model’s behavior itself but guiding the model’s correct responses at the system prompt level.
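A sketch of how such a takeover instruction might be generated. The wording of the instruction and the function name are hypothetical; only the two goals (signal the takeover, steer away from unavailable tools) come from the source.

```python
def build_handoff_instruction(history_tool_names, current_tool_names):
    """Compose a system-level note for a model that takes over mid-conversation."""
    # Tools used earlier in the chat that the incoming model does not have.
    unavailable = sorted(set(history_tool_names) - set(current_tool_names))
    lines = [
        "You are taking over a conversation that another model started.",
        "Continue from the existing history instead of starting over.",
    ]
    if unavailable:
        lines.append(
            "The history contains calls to tools you do not have: "
            + ", ".join(unavailable)
            + ". Do not call them; use only the tools listed for you."
        )
    return "\n".join(lines)
```

For example, if the earlier model edited files with a patch tool and the new model only has a string replacement tool, the patch tool would be named explicitly in the note.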

5.3 Cache Miss Issues and Mitigation Attempts

Another challenge is cache hit rate:

“Caches are provider- and model-specific, so switching means a cache miss and a slower, more expensive first turn.”

Cursor attempts to summarize the conversation during switching to provide the model with a clean summary, reducing cache penalties. However, it finds that if the user is engaged in a complex task, the summary may lose important details.

Conclusion: It is recommended that users stick to the same model within a conversation unless there is a clear reason to switch.

I believe: This conclusion is crucial. It indicates that in advanced agent systems, model switching still incurs non-negligible costs. In most scenarios, “choosing one model to use throughout” remains the better choice. Model switching should be a thoughtful decision rather than a casual switch.

VI. Subagent: An Alternative to Bypass Model Switching Challenges

Cursor also mentions an alternative solution: using subagents to handle subtasks that require specific model capabilities.

“Another way to sidestep the challenges of mid-conversation model switching is to instead use a subagent, which starts from a fresh context window.”

Subagents start from a brand-new context window, which avoids the problem of a history that is out of distribution for the new model. This aligns with the “Initializer Agent + Coding Agent” two-part architecture that Anthropic has described: complex tasks are split across multiple agents, each starting from a clean context, instead of managing switching costs on a single agent whose context keeps accumulating.

VII. Harness and the Future: Multi-Agent Systems

Cursor concludes by looking to the future:

“The future of AI-assisted software engineering will be multi-agent. Instead of running every subtask through a single agent, the system routes tasks to specialized agents.”

This aligns perfectly with Anthropic’s agent evolution path — when agent capabilities are strong enough, a single agent can handle complex tasks; in the next stage, collaboration among multiple specialized agents will bring greater capability enhancements.

Cursor emphasizes the architectural challenges in multi-agent scenarios:

Different agents require different toolsets (specialized tools)
A correct routing mechanism is needed to determine which agent handles which task
Agents need communication and coordination mechanisms
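To make the first two points concrete, here is a deliberately simple routing sketch. The agent names, tool names, and task kinds are all invented for illustration; the source describes the direction but not an implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    name: str
    tools: tuple        # the specialized toolset this agent is given
    handles: frozenset  # task kinds this agent is responsible for


# Hypothetical registry of specialized agents.
AGENTS = (
    AgentSpec("search_agent", ("grep", "read_file"), frozenset({"explore"})),
    AgentSpec("edit_agent", ("read_file", "edit_file"), frozenset({"implement", "refactor"})),
    AgentSpec("test_agent", ("run_tests", "read_file"), frozenset({"verify"})),
)


def route(task_kind, agents=AGENTS, fallback="edit_agent"):
    """Return the agent responsible for a task kind, or the fallback agent."""
    for agent in agents:
        if task_kind in agent.handles:
            return agent
    return next(agent for agent in agents if agent.name == fallback)
```

A static lookup like this is the simplest possible router; a real system would presumably need measurement of routing quality in the same spirit as the rest of the article.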

Core Insights Summary

About Harness Engineering

Measurement-driven improvement > design-driven improvement. Cursor’s core methodology is to guide Harness optimization with data (Keep Rate, LLM semantic evaluation, anomaly detection) rather than intuition. A good Harness is not designed up front; it comes out of a loop of measuring, hypothesizing, experimenting, and iterating.
Context windows are evolving from “static filling” to “dynamic retrieval”. When the model is capable enough, having an external system select its context introduces noise. Retain a small amount of stable, useful static context and make everything else dynamic.
“Context rot” is the most dangerous byproduct of tool call errors. Errors do not occur in isolation; they accumulate and contaminate subsequent model decisions. A good Harness needs error isolation and recovery mechanisms.

About the Measurement System

Keep Rate + LLM semantic evaluation provide scalable quality measurement. The former measures objective code behavior outcomes (retention rate), while the latter measures subjective user experience (response sentiment). Together, they are closer to assessing “true quality” than any single metric.
“Not worth it” conclusions are as important as “worth it” conclusions. Cursor used experimental data to terminate seemingly promising directions (the more expensive context summarization model). This is a counterintuitive yet extremely important engineering discipline.
Baselines must be calculated separately per tool and per model. Different models make tool call errors at very different rates, so a single shared baseline leads to missed alerts or false alarms.

About Model Customization

Tool format matching directly affects token consumption and error rates. Providing the model with the tool format it is accustomed to during training, rather than a “usable” format, can significantly reduce reasoning costs and errors.
Model quirks can be mitigated through Harness. The “context anxiety” case illustrates that even with strong model capabilities, Harness is needed to build stable behavior on top of model imperfections.