The “Successful” Patch That Broke Production
This is the first entry in a series examining the structural failures of automated vulnerability remediation. A green CI pipeline is often a false comfort; in our recent benchmarking of leading AI coding assistants, including Claude Code, Gemini 3 Pro, and Codex, we found that patches frequently compile and pass unit tests while silently breaking core application logic. Using the SCA RemBench framework, which evaluates real-world repositories with breaking API changes, the data reveals a persistent 20% defect rate in patch quality. These models aren’t failing because they lack coding capability, but because they treat a complex architectural migration as a simple text-completion task.
The Context Gap: Task vs. Semantic Understanding
The core of the issue is the “context gap”: most general-purpose models focus on the code as it exists now, failing to handle the version-specific delta between a library’s current state and its secure target. They perform “naive” upgrades – bumping the version and fixing obvious syntax errors while ignoring behavioral contracts. For example, we’ve seen models stub out removed APIs with null returns, allowing the build to pass while the application’s ability to parse data or handle errors effectively vanishes. Treating a library upgrade like a syntax fix is like replacing a car’s engine with a newer model and only checking if the hood still closes, without verifying if the transmission actually connects to the new block. This confirms that vulnerability remediation is not a coding problem, but a planning problem that ends with code generation.
What Makes a Good Patch?
The standard checks – “does it compile?”, “do the tests pass?”, “is the vulnerability resolved?” – are necessary but insufficient. They tell you the patch runs; they don’t tell you whether it’s right. To evaluate AI-generated remediation patches rigorously, we need to measure the semantic quality of the change itself.
Through our work on the SCA RemBench benchmark (described below), we developed three evaluation dimensions that capture the distinct ways a patch can fail – even when it appears to succeed:
Compatibility (50% weight)
The question: Does the patch preserve the original application behavior?
This carries the highest weight because a fix that resolves a CVE but breaks a core feature is an unacceptable regression risk. Compatibility failures are the most dangerous class of patch defect because they often pass automated tests while altering runtime semantics.
Common compatibility failures include:
- Stubbing removed APIs with empty objects or `null` returns, so functions silently stop working
- Dropping middleware or error-handling behavior during framework upgrades
- Simplifying multi-step initialization patterns into single calls that lose error granularity
A patch scores high on Compatibility when every function that worked before the upgrade continues to produce the same outputs, handle the same edge cases, and propagate the same errors after the upgrade.
Correctness (30% weight)
The question: Does the tool use the new APIs as the library authors intended?
Correctness measures whether the patch correctly adopts the upgraded library’s API – right method signatures, right types, right error sentinels, right field names. A tool can preserve compatibility by accident (e.g., a dead code path that never executes) while still being incorrect in its API usage. Correctness failures are particularly insidious because they often look right – the code uses a function that exists, references an error type that compiles – but the specific choice is wrong for the migration at hand.
A patch scores high on Correctness when every call site uses the new API exactly as the library’s documentation and changelog prescribe.
Precision (20% weight)
The question: Does the tool avoid “code noise”?
Precision measures whether the AI introduced unnecessary complexity, unrelated refactors, or “hallucinated” extra features that increase the long-term maintenance burden. It carries the lowest weight because unnecessary changes are a maintenance concern rather than a safety concern – but they still matter. Every unnecessary line is a line that future developers must understand, maintain, and potentially debug.
Common precision failures include:
- Adding boolean flags or wrapper subclasses for cases that simple initialization already handles
- Introducing JSON config parsing where none existed before
- Re-implementing removed internal utilities from scratch instead of using the library’s official replacements
- Leaving stale migration comments and version-specific annotations in the code
A patch scores high on Precision when it touches only what the migration requires – no more, no less.
The Weighted Total
These three dimensions combine into a single score:
Weighted Total = 0.5 × Compatibility + 0.3 × Correctness + 0.2 × Precision
The weights reflect a priority ordering: breaking existing behavior is worse than misusing a new API, which is worse than adding unnecessary code. Together, the three dimensions give teams a vocabulary for discussing patch quality that goes beyond “it works” or “it doesn’t.”
SCA RemBench: Putting AI to the Test
To quantify the gap between “compiles successfully” and “correctly remediated,” we utilized the SCA RemBench framework – a rigorous benchmark designed to evaluate AI-driven remediation under real-world conditions.
The Dataset: The benchmark consists of 25 repositories, each built around a verified CVE in the OSS ecosystem. The distribution reflects modern enterprise stacks: 56% Go, 28% JavaScript, and 16% Python.
Rigorous Testing Conditions:
- Targeting “Non-Naive” Upgrades: Every case was selected because the patched version introduces breaking API changes. A simple version bump would result in compilation failure or runtime crashes.
- Isolated Upgrade Surface: Repositories have a median size of ~450 LOC. This is a deliberate design choice to isolate the upgrade surface (API calls, type references, and error handling) and remove “noise” like context window pressure or navigation difficulty.
- Zero Intervention: All tools were tested with internet access (to fetch changelogs) but with no retries, no manual prompt tuning, and no human edits.
Patches were scored by a multi-provider LLM-judge ensemble — Claude Opus 4.6, GPT-5.2, and Gemini 3 Pro. Because each of these models also powers one of the tools being evaluated, using a single judge would risk the LLM favoring its own output. Averaging scores across all three providers ensures no model judges itself unchecked. Judges also receive verified test results and vulnerability metadata as grounding signals.
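The averaging step can be sketched as follows, assuming each judge returns per-dimension scores on a 0–100 scale. The type and function names here are illustrative, not the benchmark’s actual code, and the scores in `main` are hypothetical:

```go
package main

import "fmt"

// DimensionScores holds one judge's Compatibility, Correctness, and Precision ratings.
type DimensionScores struct {
	Compatibility, Correctness, Precision float64
}

// averageJudges averages per-dimension scores across the judge ensemble so that
// no single provider's stylistic preference dominates the final rating.
func averageJudges(judges []DimensionScores) DimensionScores {
	var sum DimensionScores
	for _, j := range judges {
		sum.Compatibility += j.Compatibility
		sum.Correctness += j.Correctness
		sum.Precision += j.Precision
	}
	n := float64(len(judges))
	return DimensionScores{sum.Compatibility / n, sum.Correctness / n, sum.Precision / n}
}

func main() {
	// Hypothetical ratings of one patch by the three judge providers.
	scores := []DimensionScores{
		{80, 75, 70}, // e.g. Claude Opus 4.6
		{85, 80, 75}, // e.g. GPT-5.2
		{75, 85, 80}, // e.g. Gemini 3 Pro
	}
	fmt.Printf("%+v\n", averageJudges(scores)) // {Compatibility:80 Correctness:80 Precision:75}
}
```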
The Scoreboard: How General-Purpose AI Coding Tools Perform
Here’s how the three tools performed across all three semantic dimensions:
| Agent | Compatibility (50%) | Correctness (30%) | Precision (20%) | Weighted Total |
|---|---|---|---|---|
| Gemini 3 Pro | 82.2 | 84.3 | 78.5 | 82.1 |
| Claude Code (Opus 4.6) | 79.9 | 79.7 | 77.0 | 79.3 |
| Codex (GPT-5.2) | 77.1 | 78.3 | 71.5 | 76.3 |
All three tools land in the 76–82 range. The scores reflect a structural gap, not a capability gap: the models can write the code, but they lack the upfront analysis needed to write the right code.
War Stories: When General AI Hallucinates a Fix
The scores above are averages. The individual cases show what those numbers actually look like in code.
Case 1: Express 3.21.2 → 4.20.0 – CVE-2024-29041 (Compatibility)
CVE-2024-29041 is an open-redirect vulnerability affecting all Express versions before 4.19.2.
Upgrading to v4.20.0 fixes the CVE but requires migrating away from two Express 3 patterns that no longer exist in v4: the bundled express.bodyParser() and express.cookieParser() middleware were extracted to standalone packages, and several utility functions previously exported from express/lib/utils (such as parseParams) were removed.
The correct fix replaces utils.parseParams with the qs package – the same query-string parser Express 4 uses internally:
```javascript
var qs = require('qs');

function demonstrateParseParams() {
  var paramString = 'id=123&name=john&category=admin';
  var params = qs.parse(paramString);
  console.log('parseParams result:', params);
  return params; // { id: '123', name: 'john', category: 'admin' }
}
```
Here’s what Gemini 3 Pro produced instead:
```javascript
// Gemini 3 Pro - stubs utils = {}, all utility functions silently return null
var utils = {}; // express/lib/utils is not available in Express 4

function demonstrateParseParams() {
  if (utils.parseParams) { // always false
    var params = utils.parseParams('id=123&name=john&category=admin');
    return params;
  }
  return null; // always reached
}
```
Five functions now silently return null – and the tests still pass only because they check the same dead guards. The code compiles, the tests are green, but the application can no longer parse query parameters, match route patterns, or negotiate content types. A clean CI pipeline hiding broken core functionality.
Case 2: quic-go v0.42.0 → v0.48.2 – CVE-2024-53259 (Correctness)
CVE-2024-53259 is an ICMP injection vulnerability affecting all quic-go versions before 0.48.2.
Upgrading to v0.48.2 fixes the CVE but introduces a breaking change in error handling: server and connection errors now wrap net.ErrClosed, which means direct == comparison against quic.ErrServerClosed silently fails – code must switch to errors.Is() for correct matching.
The correct fix is straightforward:
```go
func CheckServerClosedError(err error) bool {
	return errors.Is(err, quic.ErrServerClosed)
}
```
Neither Codex nor Claude Code got this right:
```go
// Codex GPT-5.2 - wrong error sentinel entirely
func CheckServerClosedError(err error) bool {
	return errors.Is(err, http.ErrServerClosed) // http, not quic
}
```

```go
// Claude Code Opus 4.6 - right sentinel, wrong comparison method
if err == quic.ErrServerClosed {
	log.Println("Server was closed gracefully")
	return nil
}
```
Codex adopted errors.Is() but matched against http.ErrServerClosed – a different sentinel from a different package. Claude Code used the right sentinel but kept the == operator, which silently fails now that the error wraps net.ErrClosed. Both patches compile. Both pass tests. Both failed to use the new API correctly – a textbook Correctness failure that no automated check would catch.
Case 3: urllib3 2.2.1 → 2.6.3 – CVE-2026-21441 (Precision)
CVE-2026-21441 – urllib3 bypassed decompression safeguards when following redirects, allowing a crafted response to exhaust memory before read limits were enforced.
Upgrading to 2.6.3 fixes the CVE; the only code change required was a one-line test assertion update to reflect that the Retry class gained a new retry_after_max attribute in the patched version:
```python
assert hasattr(retry, 'retry_after_max')  # attribute exists in 2.6.3
```
Here’s what Codex GPT-5.2 produced instead:
```python
class RetryCompat(Retry):
    def __getattribute__(self, name):
        if name == "retry_after_max":
            raise AttributeError(name)
        return super().__getattribute__(name)
```
A five-line subclass that actively hides the new attribute. Any production code that later depends on retry_after_max will silently break without explanation. The tool didn’t just fail to make the minimal change – it introduced a mechanism that actively fights the upgrade.
The Root Cause: Code-First Is the Wrong Paradigm
The pattern across all three cases – and across the full benchmark – is consistent. These tools fail not because they can’t write code, but because they skip straight to writing code without first understanding what changed. There is no planning phase. No analysis of the delta between library versions. No mapping of which call sites are affected or how behavioral contracts shifted. The model sees the codebase as it is today and the target version number – and starts generating.
That missing step is the root cause. This is a context problem, not a capability problem. The same foundation models that produce flawed patches in a code-first workflow can produce excellent patches when given the right context upfront. The benchmark data suggests that the missing ingredient is a structured planning and analysis phase – understanding the breaking changes, mapping the affected call sites, and building a remediation plan before any code is generated.
Vulnerability remediation isn’t a coding problem. It’s a planning problem that ends with coding.
Conclusion: The Gap Is Real – Now What?
The SCA RemBench results quantify what many security teams already feel intuitively: AI coding assistants produce patches that look right but aren’t. Tests pass, code compiles, and the vulnerability scanner goes green – but the application has silently lost functionality, adopted the wrong error semantics, or accumulated unnecessary complexity that will compound across future upgrades.
The gap between “compiles successfully” and “correctly remediated” is not small. The best general-purpose tool in our benchmark scored 82 out of 100 – meaning roughly one in five aspects of the average patch had a defect in compatibility, correctness, or precision. For organizations managing hundreds of dependencies across dozens of repositories, that error rate translates into real production risk.
The good news: this is a solvable problem. The benchmark also tested a workflow that incorporates structured remediation planning – analyzing breaking changes, mapping call sites, and building a migration roadmap before code generation begins. That workflow scored 89.5 on the same benchmark, a +7.4 point improvement over the best general-purpose tool. The models didn’t change. The context did.
In an upcoming post, we’ll dive into what that planning-first approach looks like in practice – how it works, where it helps most, and the concrete results across all 25 benchmark repositories.