Your AI Patches Still Break Production? The Fix Isn’t a Better Model, It’s a Better Workflow.

By Haggai Shachar

In our previous post, we showed how general-purpose AI coding assistants produce vulnerability patches that compile, pass tests, and still break production. The data was clear: a persistent ~20% defect rate across compatibility, correctness, and precision – not because the models can’t code, but because they skip the critical step of understanding the API deltas.

We ended that post with a promise: “the benchmark also tested a workflow that incorporates structured remediation planning… that workflow scored 89.4 on the same benchmark, a +9% improvement over the best general-purpose tool. The models didn’t change. The plan did.”

This is that post. Here’s what the planning-first approach looks like in practice, how it works under the hood, and the code-level evidence that context – not capability – is the differentiator.

The Missing Step: What Changes Between Versions?

Let’s restate the root cause. When you prompt an AI coding assistant to “upgrade package X to version Y,” the model receives two things: the code as it exists today, and a target version number. What it doesn’t receive is the semantic delta – the structured understanding of what changed between version A and version B at the API level.

That’s the gap. And it’s not a gap you can close with a better prompt.

Consider what a human developer does when upgrading a library with breaking changes. They don’t just bump the version and see what breaks. They read the changelog. They identify which APIs were removed, renamed, or restructured. They scan the codebase for every call site that uses the affected functions. They build a mental model of what needs to change and why – and then they write code.

General-purpose coding tools skip all of that. They pattern-match against the code snapshot, hallucinate what the new API probably looks like, and generate. Sometimes they get lucky. Often – about 20% of the time – they don’t.

Backline’s Approach: Analyze → Plan → Code → Verify → Evaluate

Backline is a vulnerability management platform where remediation is one stage in a structured security workflow. The critical architectural distinction is what happens before code generation begins.

When Backline encounters an SCA finding that requires a dependency upgrade with breaking changes, it runs a multi-stage orchestration pipeline:

Stage 1: Analyze – What Broke?

Before touching a single line of code, Backline gathers intelligence from multiple sources – the package itself, official documentation, community resources, and patterns learned from previous upgrades – to build a comprehensive picture of what could break.

Stage 2: Plan – Where Does It Matter?

Having identified what changed, the system maps where those changes affect the codebase. For each breaking symbol, Backline runs a targeted usage search across the repository:

Every call site that references a removed function
Every error comparison that uses a changed sentinel
Every type assertion against a restructured interface

The result is a per-symbol Remediation Plan – a structured document that tells the code generator exactly what to do at each location.
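As a rough illustration of the usage search, the sketch below scans an in-memory map of file contents for a list of breaking symbols and groups the hits per symbol. The function name findUsages, the shape of the returned plan, and the sample file are all hypothetical; Backline’s actual plan format is not shown in this post, and a real search would likely be syntax-aware rather than a plain substring match.

```javascript
// Hypothetical sketch: map each breaking symbol to the lines that reference it.
// `files` maps a file path to its source text; `symbols` lists breaking symbols.
function findUsages(files, symbols) {
  const plan = {};
  for (const symbol of symbols) {
    plan[symbol] = [];
  }
  for (const [path, source] of Object.entries(files)) {
    source.split('\n').forEach((text, index) => {
      for (const symbol of symbols) {
        if (text.includes(symbol)) {
          // Line numbers are 1-based, as an editor would show them.
          plan[symbol].push({ path, line: index + 1, text: text.trim() });
        }
      }
    });
  }
  return plan;
}

// Made-up sample input for illustration only.
const files = {
  'server.go': [
    'func serve(err error) bool {',
    '    return err == quic.ErrServerClosed',
    '}',
  ].join('\n'),
};

const plan = findUsages(files, ['quic.ErrServerClosed']);
console.log(JSON.stringify(plan, null, 2));
```

Grouping hits by symbol rather than by file keeps each migration rule next to every place it applies, which is what lets the later stages treat each site as a small, explicit task.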

Stage 3: Code – Generate With Context

Only now does code generation begin. The foundation model receives:

The codebase
The specific files and line numbers that need changes
Per-symbol migration instructions with examples
The API diff as grounding context

The model’s coding ability stays exactly the same. What changes is the quality of the instructions it receives. Instead of receiving a vague directive like “upgrade this package” – which treats a complex architectural planning problem as a simple text-overwrite task – the model gets deterministic instructions: “at line 47 of server.go, replace the == comparison with errors.Is() because the error type now wraps net.ErrClosed” (an actual coding task with full context).
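To make the contrast concrete, here is a minimal sketch of how a single plan entry could be rendered into that kind of explicit instruction. The entry fields (path, line, action, reason) and the renderInstruction helper are assumptions made for illustration, not Backline’s schema.

```javascript
// Hypothetical sketch: turn one plan entry into an explicit, site-level instruction.
function renderInstruction(entry) {
  return `At line ${entry.line} of ${entry.path}, ${entry.action} because ${entry.reason}.`;
}

// Illustrative entry; values mirror the example quoted in the text.
const entry = {
  path: 'server.go',
  line: 47,
  action: 'replace the == comparison with errors.Is()',
  reason: 'the error type now wraps net.ErrClosed',
};

const instruction = renderInstruction(entry);
console.log(instruction);
```

Every field in the instruction comes from the analysis and planning stages, so the model is never asked to guess a location or a rationale.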

Stage 4: Verify – Did It Work?

After generation, Backline runs the project’s test suite, validates compilation, and confirms the vulnerability is resolved. If CI fails, it can iterate – but with the same structured context, not blind retry loops.

Stage 5: Evaluate – Did It Follow the Plan?

Verification tells you the code compiles and tests pass. It doesn’t tell you whether the agent did everything the plan specified, or whether it introduced changes the plan never asked for. Those are different questions – and they require evaluating the patch against the plan itself, not just against CI. We cover this in detail in the next post in this series.
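One plausible way to frame those two questions is as set comparisons between the sites the plan listed and the sites the patch actually touched. The sketch below is an assumption about what such a check could look like; the names evaluatePatch, coverage, missed, and unplanned are hypothetical and are not taken from Backline’s evaluation framework.

```javascript
// Hypothetical sketch: compare planned sites with changed sites.
// Sites are identified by "path:line" strings for simplicity.
function evaluatePatch(plannedSites, changedSites) {
  const planned = new Set(plannedSites);
  const changed = new Set(changedSites);
  // Planned sites the patch never touched.
  const missed = [...planned].filter((site) => !changed.has(site));
  // Changes the plan never asked for.
  const unplanned = [...changed].filter((site) => !planned.has(site));
  const coverage = planned.size === 0
    ? 1
    : (planned.size - missed.length) / planned.size;
  return { coverage, missed, unplanned };
}

// Made-up sites for illustration only.
const report = evaluatePatch(
  ['server.go:47', 'client.go:12', 'client.go:88'],
  ['server.go:47', 'client.go:12', 'utils.go:5']
);
console.log(report);
```

A patch like this one can pass CI while still leaving one planned site unmigrated and touching a file the plan never mentioned, which is exactly the gap that verification alone does not catch.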

The Scoreboard: Planning vs. No Planning

We ran both approaches through the same SCA RemBench benchmark – 25 repositories, each with a real CVE requiring a breaking-change upgrade. Same evaluation criteria. Same LLM-judge ensemble. Same zero-intervention conditions.

Agent	Compatibility (50%)	Correctness (30%)	Precision (20%)	Weighted Total
Backline	88.7	94.3	84.1	89.4
Gemini 3 Pro	82.2	84.3	78.5	82.1
Claude Code (Opus 4.6)	79.9	79.7	77.0	79.3
Codex (GPT-5.2)	77.1	78.3	71.5	76.3

Backline outperformed the nearest competitor by +7.3 weighted points. But the dimension-level breakdown tells a more interesting story:

Correctness (+13.5 vs. average): The largest gap. When the code generator knows exactly which API to use and how, it uses it correctly. This is the planning step paying off directly.
Compatibility (+9.0 vs. average): Fewer silent functionality losses because the system identifies every affected call site before generation, rather than hoping the model notices them all.
Precision (+8.4 vs. average): Less unnecessary code because the instructions are scoped to exactly what changed – no hallucinated wrapper classes, no reimplemented utilities, no speculative refactors.
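For readers who want to check the arithmetic, the sketch below recomputes the weighted totals and the “vs. average” gaps from the table above. The scores and weights are copied from the table; the helper names are ours. Because the published dimension scores are already rounded, the recomputed totals can differ from the published ones by a few hundredths.

```javascript
// Weights come from the column headers; scores are copied from the table.
const weights = { compatibility: 0.5, correctness: 0.3, precision: 0.2 };
const scores = {
  'Backline': { compatibility: 88.7, correctness: 94.3, precision: 84.1 },
  'Gemini 3 Pro': { compatibility: 82.2, correctness: 84.3, precision: 78.5 },
  'Claude Code (Opus 4.6)': { compatibility: 79.9, correctness: 79.7, precision: 77.0 },
  'Codex (GPT-5.2)': { compatibility: 77.1, correctness: 78.3, precision: 71.5 },
};

// Weighted sum of the three dimension scores for one agent.
function weightedTotal(agentScores) {
  return Object.keys(weights)
    .reduce((sum, key) => sum + weights[key] * agentScores[key], 0);
}

// Backline's score minus the mean of the three general-purpose tools.
function gapVsAverage(dimension) {
  const others = Object.keys(scores).filter((name) => name !== 'Backline');
  const average = others
    .reduce((sum, name) => sum + scores[name][dimension], 0) / others.length;
  return scores['Backline'][dimension] - average;
}

for (const name of Object.keys(scores)) {
  console.log(name, weightedTotal(scores[name]).toFixed(2));
}
for (const dimension of Object.keys(weights)) {
  console.log(dimension, '+' + gapVsAverage(dimension).toFixed(1));
}
```

The “average” here is the mean of the three general-purpose tools only, which is the reading that reproduces the +13.5, +9.0, and +8.4 figures.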

War Stories: Same Upgrade, Different Outcomes

The averages tell the statistical story. The code shows what it actually looks like.

Express 3.21.2 → 4.20.0 – When Planning Prevents Dead Code

In our previous post, we showed how Gemini 3 Pro stubbed utils = {} and let five functions silently return null. Here’s what the planning-first approach produced for the same upgrade.

Backline’s Remediation Plan identified that utils.accepts in Express 3 maps to req.accepts() in Express 4 – a method now available directly on the request object. The instruction was specific: “Replace utils.accepts(req.headers.accept) with req.accepts(['text/html', 'application/json', 'text/plain']).”

The generated fix:

 function demonstrateAccepts(req) {
-  if (utils.accepts && req) {
-    var acceptedTypes = utils.accepts(req.headers.accept || 'text/html');
+  if (req && req.accepts) {
+    var acceptedTypes = req.accepts(['text/html', 'application/json', 'text/plain']);
     return acceptedTypes;
   }
   return null;
 }

No stub. No dead code. The function works exactly as it did before – it just uses the Express 4 API now.

For comparison, Codex took its own wrong turn on the same upgrade. Rather than migrating to Express 4’s built-in replacements, it hand-rolled 115 lines of reimplemented Express 3 internals in a new utils.js file. Technically compatible – but a permanent maintenance burden that fights every future Express upgrade.

quic-go v0.42.0 → v0.49.1 – Getting the Error Semantics Right

This is the case we showed in part one, where Codex used http.ErrServerClosed (wrong package) and Claude Code kept == (wrong comparison method). Both compiled. Both passed tests. Both were wrong.

Backline’s breaking-change analysis identified the specific change: ErrServerClosed now wraps net.ErrClosed, making direct equality comparison silently fail. The Remediation Plan mapped all six comparison sites in the codebase and specified errors.Is() as the required migration.

func CheckServerClosedError(err error) bool {
    return errors.Is(err, quic.ErrServerClosed)
}

Right sentinel. Right comparison method. Every site migrated. This isn’t a harder coding task – it’s an easier one, because the instructions are explicit.

axios 0.21.1 → 1.13.5 – The Upgrade Nobody Else Attempted

This case exposes the starkest difference in the benchmark. Axios 1.x removed axios.all and axios.spread, replacing them with native JavaScript equivalents (Promise.all and destructuring). Every usage in the codebase needed to migrate.

Backline replaced them cleanly:

-axios
-  .all([Promise.resolve('John Doe'), Promise.resolve('Dark Mode')])
-  .then(
-    axios.spread((name, theme) => {
-      setUserInfo(`User: ${name}, Theme: ${theme}`);
-    })
-  )
+Promise.all([Promise.resolve('John Doe'), Promise.resolve('Dark Mode')])
+  .then(([name, theme]) => {
+    setUserInfo(`User: ${name}, Theme: ${theme}`);
+  })
   .catch(() => setUserInfo('Error loading profile'));
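The reason this rewrite preserves behavior is small enough to show directly. In the sketch below, spread is a local stand-in written for illustration (it mimics what a spread-style helper does and is not imported from axios); the point is that destructuring the array in the callback’s parameter list produces the same result.

```javascript
// Local stand-in for a spread-style helper: passes array items as positional arguments.
function spread(callback) {
  return (values) => callback(...values);
}

const format = (name, theme) => `User: ${name}, Theme: ${theme}`;
const results = ['John Doe', 'Dark Mode'];

// Old style: wrap the callback so it receives positional arguments.
const viaSpread = spread(format)(results);

// New style: destructure the array in the parameter list.
const viaDestructuring = (([name, theme]) => format(name, theme))(results);

console.log(viaSpread === viaDestructuring);

// The same pattern with Promise.all, which resolves to an array of results.
Promise.all([Promise.resolve('John Doe'), Promise.resolve('Dark Mode')])
  .then(([name, theme]) => console.log(format(name, theme)));
```

Since Promise.all already resolves to an array in input order, no helper is needed once the callback destructures its argument.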

What did the general-purpose tools do?

Codex forced an older version via package.json overrides – zero source changes, deprecated APIs still in place
Claude Code bumped the version in package.json only – axios.all and axios.spread remain untouched and will throw at runtime
Gemini added new tests for existing functions without modifying any deprecated patterns

These attempts represent three distinct execution failures. The root cause remains identical: none of the general-purpose assistants mapped the breaking symbols to their specific call sites before writing code.

beego v2.0.5 → v2.3.6 – Precision Under Pressure

In beego v2.3.6, redis.DefaultKey changed from a var to a const – the library’s signal to remove shared mutable state. The correct migration: move the key from package-level to instance-level on CacheManager.

Backline did exactly that – a simple defaultKey string field initialized from redis.DefaultKey:

return &CacheManager{
    cache:      c,
    defaultKey: redis.DefaultKey,
}, nil

The competitors each hallucinated additional complexity:

Claude Code added JSON config parsing to extract the key – a new code path and failure mode that didn’t exist before
Codex added a boolean defaultKeySet flag and a getter with conditional logic – unnecessary branching for something simple initialization handles
Gemini created a package-level var CurrentDefaultKey – preserving exactly the mutable global state the library was signaling to remove

All three work. None were necessary. Each adds lines that future developers must understand, maintain, and potentially debug on the next upgrade cycle.

Why the Same Models Produce Better Code

It’s worth emphasizing: Backline uses foundation models for code generation. The same class of models that power the coding assistants it outperformed. The difference isn’t a better model – it’s a better workflow.

When a model receives “upgrade quic-go to v0.49.1,” it has to:

Figure out what changed (from training data, internet search, or hallucination)
Find every affected location in the code
Determine the correct migration for each location
Generate the fix

Steps 1-3 are planning tasks, not coding tasks. General-purpose coding tools attempt all four in a single generation pass. Backline completes steps 1-3 through structured analysis, then hands step 4 to the model with full context. The model only does what it’s actually good at: writing code to a specification.

This is why the Correctness gap (+13.5 points) is the largest. When the specification says “use errors.Is() with quic.ErrServerClosed,” the model uses it. When the specification says “replace utils.accepts with req.accepts(),” the model does it. The coding task becomes trivial once the planning is done.

When Does Planning Help Most?

The benchmark data reveals a pattern: the advantage of structured planning scales with migration complexity.

On simple upgrades – a renamed constant, a single-site API change – general-purpose tools often match Backline’s scores.

But on complex upgrades – multiple affected symbols, behavioral changes that aren’t syntactically visible, APIs that moved between packages – the gap widens dramatically. The axios case and the Express case show what happens when planning is essential but absent.

For teams managing hundreds of dependencies, the distribution of upgrades matters. Most patch-level bumps require no code changes. But the ones that do – the major version upgrades, the framework migrations, the breaking-change paths – are exactly where AI patches most frequently fail, and where structured planning provides the largest benefit.

Conclusion: The Context Is the Product

The SCA RemBench results confirm a simple thesis: vulnerability remediation is a planning problem that ends with code generation. The models are capable. The missing ingredient is structured analysis of what changed, where it matters, and how to migrate – delivered to the model as explicit instructions rather than left as an implicit inference task.

The improvement from planning isn’t magic. It’s the difference between asking a contractor to “renovate the kitchen” and handing them architectural drawings. The skill is the same. The outcome depends entirely on the quality of the instructions.

For organizations running remediation at scale – hundreds of repositories, thousands of dependencies, quarterly upgrade cycles – that difference compounds. A ~20% defect rate across hundreds of patches means dozens of silent regressions per cycle. Backline’s +9% improvement means fewer bridge calls, fewer rollbacks, and more trust in the automated pipeline.

The models will keep improving. But better models with bad context still produce bad patches. The leverage is in the workflow – in the structured analysis that turns “upgrade this package” into a set of precise, grounded, verifiable migration instructions.

While this post focused on dependency upgrades as a concrete example, the same analyze-plan-execute workflow powers Backline’s remediation across every finding type: container image vulnerabilities, IaC and cloud misconfigurations, exposed secrets, identity risks, and more. The principle is universal – structured context before code generation produces better outcomes regardless of the security domain.

That’s what Backline does.


Next in this series: we’ll discuss the evaluation framework behind Stage 5 – measuring adherence and coverage against the plan itself, not just CI.