LLM-assisted binary diffing: finding 1-days before PoCs drop

TL;DR — When a vendor ships a security patch, the binary itself tells the full story. Researchers have always diffed patched vs. unpatched binaries to reverse-engineer vulnerabilities. LLMs now compress that process from days to hours. This post walks through a complete technical pipeline: acquiring binaries, structuring diffs for LLM consumption, prompt engineering for vulnerability classification, and validating the output — with working code at every stage.

The 1-Day Window

Every Patch Tuesday, Microsoft publishes security updates with deliberately vague descriptions: "Remote Code Execution vulnerability in Windows Kernel." No technical details. No PoC. Just a CVE number, a severity rating, and a patched binary.

But here's the thing — the patch itself is the vulnerability disclosure. The diff between the patched and unpatched binary reveals exactly what was broken: which function, which check was missing, which boundary wasn't validated. For years, skilled reverse engineers have exploited this asymmetry. They diff the binaries, find the vuln, build the exploit, and use it against the enormous population of systems that haven't patched yet.

That window between patch release and widespread deployment is the 1-day window. It's always been valuable. LLMs are about to make it dangerous.

Why LLMs Change the Equation

Traditional patch diffing requires a reverse engineer who can:

Navigate thousands of changed functions to find the security-relevant ones
Read decompiled C pseudocode fluently
Recognize vulnerability patterns (off-by-one, integer overflow, UAF, type confusion)
Reason about exploitability — can an attacker reach this code? What primitives does it give?

This is a rare skillset. Maybe a few hundred people worldwide can do it quickly and reliably. LLMs don't replace them, but they act as a force multiplier that makes the initial triage phase almost instant.

Why this works now:

Decompiler output is basically C — Ghidra and IDA produce pseudocode that looks like C. LLMs are trained on enormous amounts of C. They can reason about it.
Context windows are large enough — You can feed entire function pairs (before/after) with caller context. A year ago, you'd be truncating critical code.
Vulnerability patterns are well-documented — The model has seen thousands of CVE descriptions, write-ups, and exploit analyses during training. It knows what an integer overflow looks like.

The result: tasks that took an experienced researcher 4-8 hours of focused work can now be triaged in minutes. The human still validates, but the LLM does the heavy lifting of pattern recognition.

The Pipeline Architecture

Here's what we're building end-to-end:


┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│ Patch drops  │────▶│ Extract bins │────▶│ BinDiff/Diaphora │
│ (Patch Tue)  │     │  (pre/post)  │     │ (function diff)  │
└──────────────┘     └──────────────┘     └────────┬─────────┘
                                                   │
                     ┌──────────────┐     ┌────────▼─────────┐
                     │  Structured  │◀────│ Headless Ghidra  │
                     │  LLM Prompt  │     │ (decompile both) │
                     └───────┬──────┘     └──────────────────┘
                             │
                    ┌────────▼─────────┐
                    │   LLM Analysis   │
                    │  (multi-round)   │
                    └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
                    │   Validation +   │
                    │   Scoring        │
                    └──────────────────┘

Each stage has real engineering decisions. Let's walk through every one.

Stage 1: Acquiring the Binaries

This sounds trivial. It isn't. Half the battle is reliably getting the exact pre-patch and post-patch versions of the right binary.

Windows (Patch Tuesday)

Winbindex is the gold standard. It indexes the system files shipped in Windows 10 and Windows 11 updates and records, for each file version, which updates (KB numbers) contain it. That lets you pull the exact binary pair you need.


import requests
import json
import subprocess
import os
from pathlib import Path

class BinaryAcquirer:
    """
    Acquires pre-patch and post-patch Windows binaries 
    using Winbindex for a given KB update.
    """
    
    WINBINDEX_API = "https://winbindex.m417z.com/data/by_filename_compressed"
    SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"
    
    def __init__(self, output_dir="./binaries"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
    
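    def get_file_versions(self, filename):
        """
        Query Winbindex for all known versions of a Windows binary.
        Returns a dict mapping each file's key to its metadata
        (a "fileInfo" record plus the updates that ship it).
        """
        import gzip  # local import keeps this method self-contained

        url = f"{self.WINBINDEX_API}/{filename}.json.gz"
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        # The endpoint serves a .json.gz file, so resp.json() would fail on
        # the compressed bytes. Decompress first if the gzip magic is present.
        raw = resp.content
        if raw[:2] == b"\x1f\x8b":
            raw = gzip.decompress(raw)
        return json.loads(raw)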
    
    def find_patch_pair(self, filename, kb_number):
        """
        Given a filename (e.g., 'ntoskrnl.exe') and KB number,
        find the versions immediately before and after the patch.
        
        Returns (pre_patch_info, post_patch_info) or raises if not found.
        """
        versions = self.get_file_versions(filename)
        
        # Sort by PE timestamp, which lives inside each entry's "fileInfo"
        # record (the same field download_binary reads). Entries without
        # one sort first.
        sorted_versions = sorted(
            versions.items(),
            key=lambda x: x[1].get("fileInfo", {}).get("timestamp", 0)
        )
        
        # First entry (in timestamp order) whose metadata mentions the KB
        post_idx = next(
            (i for i, (_, info) in enumerate(sorted_versions)
             if kb_number.upper() in json.dumps(info).upper()),
            None
        )
        if post_idx is None:
            raise ValueError(f"KB {kb_number} not found for {filename}")
        if post_idx == 0:
            raise ValueError(
                f"No version older than KB {kb_number} found for {filename}"
            )
        
        # The version immediately before in the sorted list is our pre-patch
        return sorted_versions[post_idx - 1], sorted_versions[post_idx]
    
    def download_binary(self, file_info, output_name, filename=None):
        """
        Download a specific binary version from Microsoft's symbol server.
        
        `filename` is the name the file has on the server (e.g. 'ntoskrnl.exe').
        It defaults to `output_name`, the name used for the local copy.
        """
        filename = filename or output_name
        output_path = self.output_dir / output_name
        
        if "fileInfo" not in file_info:
            raise ValueError(
                "Entry has no fileInfo record; cannot build a symbol server URL"
            )
        
        # The symbol server key is the PE timestamp as 8 zero-padded
        # uppercase hex digits, followed by the image size in lowercase hex.
        fi = file_info["fileInfo"]
        key = f"{fi['timestamp']:08X}{fi['virtualSize']:x}"
        url = f"{self.SYMBOL_SERVER}/{filename}/{key}/{filename}"
        
        resp = requests.get(url, stream=True, timeout=60)
        resp.raise_for_status()
        
        with open(output_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
        
        print(f"[+] Downloaded: {output_path} ({output_path.stat().st_size} bytes)")
        return output_path


# --- Alternative: Extract from .msu update packages directly ---

def extract_from_msu(msu_path, target_filename, output_dir):
    """
    Extract a specific file from a Windows Update .msu package.
    
    MSU structure:
      .msu -> contains .cab files
        .cab -> contains actual binaries (sometimes nested)
    """
    work_dir = Path(output_dir) / "msu_work"
    work_dir.mkdir(parents=True, exist_ok=True)
    
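    # Step 1: Extract the .msu (it's a cabinet archive).
    # `expand` is the Windows built-in tool, so this runs on Windows only.
    subprocess.run(
        ["expand", "-F:*", str(msu_path), str(work_dir)],
        check=True, capture_output=True
    )
    
    # Step 2: Find and extract the inner .cab files.
    # `expand` needs the destination directory to exist already.
    inner_dir = work_dir / "inner"
    inner_dir.mkdir(exist_ok=True)
    for cab in work_dir.glob("*.cab"):
        subprocess.run(
            ["expand", "-F:*", str(cab), str(inner_dir)],
            check=True, capture_output=True
        )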
    
    # Step 3: Locate the target binary
    results = list((work_dir / "inner").rglob(target_filename))
    if not results:
        raise FileNotFoundError(
            f"{target_filename} not found in {msu_path}"
        )
    
    return results[0]
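
Whichever route you take, it is worth confirming that the file on disk really is the version you asked for. One way is to recompute the symbol server key from the PE headers and compare it with the Winbindex entry. This is a minimal sketch, assuming Winbindex's `timestamp` and `virtualSize` fields correspond to the PE `TimeDateStamp` and `SizeOfImage`; `pe_symbol_key` and `symbol_server_url` are hypothetical helper names, not part of any library.

```python
import struct

SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def pe_symbol_key(data: bytes) -> str:
    """Return TimeDateStamp + SizeOfImage formatted as a symbol server key."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")

    # e_lfanew (offset 0x3C in the DOS header) points at the PE signature
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")

    # COFF header starts right after the 4-byte signature;
    # TimeDateStamp is 4 bytes into it
    (timestamp,) = struct.unpack_from("<I", data, e_lfanew + 4 + 4)

    # Optional header starts 24 bytes after e_lfanew; SizeOfImage sits at
    # offset 56 inside it for both PE32 and PE32+
    (size_of_image,) = struct.unpack_from("<I", data, e_lfanew + 24 + 56)

    # 8 zero-padded uppercase hex digits, then the size in lowercase hex
    return f"{timestamp:08X}{size_of_image:x}"


def symbol_server_url(filename: str, key: str, server: str = SYMBOL_SERVER) -> str:
    """Build the download URL for a binary from its name and key."""
    return f"{server}/{filename}/{key}/{filename}"
```

Reading the first few kilobytes of each downloaded file and comparing `pe_symbol_key(...)` against the key built from the metadata catches mislabeled or truncated downloads before they reach the differ.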

Linux Kernel

For Linux, you have it easier — the source is public. But binary-level analysis on compiled kernel modules is still interesting because compiler optimizations obscure the vulnerability. The source diff might show a simple bounds check, but the compiled code might have been vectorized, inlined, or reordered.


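# Find the commit that patched a CVE. This only works when the commit
# message mentions the CVE ID, which many kernel fixes do not. Otherwise,
# take the commit hash from the references in the CVE record.
git log --all --grep="CVE-2024-XXXXX" --format="%H %s"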

# Get the parent (pre-patch) commit
git rev-parse <patch_commit>^

# Build both versions of the specific module
git checkout <pre_patch_commit>
make M=drivers/target_subsystem/
cp drivers/target_subsystem/target.ko ./target_pre.ko

git checkout <post_patch_commit>
make M=drivers/target_subsystem/
cp drivers/target_subsystem/target.ko ./target_post.ko

Stage 2: Diffing — BinDiff vs Diaphora

Both tools match functions between two binaries and assign similarity scores. The interesting functions are the ones with similarity between 0.5 and 0.99 — similar enough to be the same function, but different enough that something changed.

Why Diaphora Wins for This Pipeline

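Diaphora stores both its per-binary exports and its diff results in SQLite, which makes programmatic access trivial. BinDiff's .BinDiff result files are SQLite too, but they rely on separate BinExport protobuf files for the underlying function data, which adds a parsing step.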


import sqlite3
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FunctionDiff:
    """Represents a single changed function between two binary versions."""
    name: str
    address_original: int
    address_patched: int
    similarity_ratio: float
    pseudocode_original: Optional[str] = None
    pseudocode_patched: Optional[str] = None
    callers: Optional[List[str]] = None
    
    @property
    def is_security_relevant(self):
        """
        Heuristic: functions with similarity 0.7-0.99 are most likely
        to be security patches. Below 0.7 might be refactors.
        Above 0.99 is probably just metadata/version changes.
        """
        return 0.7 <= self.similarity_ratio <= 0.99


class DiaphoraAnalyzer:
    """
    Extracts and ranks changed functions from Diaphora's SQLite output.
    Focuses on identifying security-relevant patches.
    """
    
    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)
        self.db.row_factory = sqlite3.Row
    
    def get_changed_functions(self, min_ratio=0.5, max_ratio=0.99):
        """
        Extract functions that changed between versions.
        
        Sorted by ratio ASC — most changed first — because 
        the biggest changes are often the most interesting patches.
        
        Filters out:
        - Perfect matches (ratio = 1.0) — unchanged
        - Very low matches (ratio < 0.5) — likely refactors, not patches
        """
        cursor = self.db.execute("""
            SELECT 
                name,
                address,
                address2,
                ratio,
                pseudocode,
                pseudocode2,
                md_index          -- Complexity metric
            FROM results 
            WHERE ratio < ? 
              AND ratio > ?
              AND name NOT LIKE '%guard%'     -- Filter out CFG stubs
              AND name NOT LIKE '%security_cookie%'
            ORDER BY ratio ASC
        """, (max_ratio, min_ratio))
        
        functions = []
        for row in cursor:
            diff = FunctionDiff(
                name=row["name"],
                address_original=row["address"],
                address_patched=row["address2"],
                similarity_ratio=row["ratio"],
                pseudocode_original=row["pseudocode"],
                pseudocode_patched=row["pseudocode2"]
            )
            functions.append(diff)
        
        return functions
    
    def get_security_candidates(self):
        """
        Returns functions most likely to be security patches.
        Uses multiple heuristics beyond just similarity ratio.
        """
        all_changed = self.get_changed_functions()
        
        candidates = []
        for func in all_changed:
            score = self._security_score(func)
            if score > 0.5:
                candidates.append((score, func))
        
        # Sort by security relevance score, descending
        candidates.sort(key=lambda x: x[0], reverse=True)
        return candidates
    
    def _security_score(self, func: FunctionDiff) -> float:
        """
        Heuristic scoring for how likely a function change is 
        a security patch vs. a feature change or refactor.
        """
        score = 0.0
        
        # Similarity ratio sweet spot
        if 0.85 <= func.similarity_ratio <= 0.98:
            score += 0.4  # Small, targeted change = likely a fix
        
        if func.pseudocode_patched and func.pseudocode_original:
            patched = func.pseudocode_patched.lower()
            original = func.pseudocode_original.lower()
            
            # New bounds checks added. Compare occurrence counts: tokens
            # like "if (" appear in nearly every function, so a plain
            # presence test would almost never fire.
            new_checks = [
                "if (", "< 0", "> 0", "<= 0", ">= 0",
                "!= null", "== null", "!= 0",
                "size", "length", "count", "bound"
            ]
            if any(patched.count(c) > original.count(c) for c in new_checks):
                score += 0.3
            
            # New error handling: more return statements than before
            if patched.count("return") > original.count("return"):
                score += 0.2
            
            # Lock/synchronization added (race condition fix)
            sync_keywords = ["lock", "mutex", "spinlock", "critical_section"]
            if any(kw in patched and kw not in original for kw in sync_keywords):
                score += 0.4
        
        # Function name hints
        security_names = [
            "validate", "check", "verify", "sanitize",
            "parse", "decode", "deserialize", "callback",
            "alloc", "free", "release", "dispatch"
        ]
        name_lower = func.name.lower()
        for hint in security_names:
            if hint in name_lower:
                score += 0.1
                break
        
        return min(score, 1.0)
    
    def close(self):
        self.db.close()
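
The query above assumes specific column names in the `results` table, and Diaphora's schema can differ between versions and between export and result databases. A cheap guard is to check the schema before querying. This is a minimal sketch using SQLite's `PRAGMA table_info`; `missing_columns` is a hypothetical helper, and the column list you pass in is your own assumption about the schema.

```python
import sqlite3
from typing import Iterable, List


def missing_columns(
    conn: sqlite3.Connection, table: str, required: Iterable[str]
) -> List[str]:
    """Return the required column names that `table` does not have."""
    # PRAGMA arguments cannot be bound as parameters, so only accept
    # plain identifiers before interpolating the table name
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")

    # table_info yields one row per column; index 1 is the column name.
    # An unknown table yields no rows, so every column is reported missing.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    present = {row[1] for row in rows}
    return [col for col in required if col not in present]
```

Calling `missing_columns(analyzer.db, "results", ["name", "address", "address2", "ratio", "pseudocode", "pseudocode2"])` before `get_changed_functions()` turns an opaque `OperationalError` into a list of exactly which columns the query needs to be adjusted for.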

Running Diaphora


# In IDA Pro (or use the Ghidra port):
# 1. Open the ORIGINAL binary
# 2. Run diaphora.py → export to original.sqlite

# 3. Open the PATCHED binary  
# 4. Run diaphora.py → diff against original.sqlite
# 5. Results saved to diaphora_results.sqlite

Stage 3: Headless Decompilation at Scale

You need decompiled pseudocode for both versions of every changed function. Doing this manually is insane. Ghidra's headless mode is the answer.


import subprocess
import json
from pathlib import Path
from typing import Dict

class HeadlessGhidra:
    """
    Drives Ghidra in headless mode to decompile specific functions
    from a binary. Only decompiles functions flagged by Diaphora
    to avoid wasting time on unchanged code.
    """
    
    GHIDRA_HOME = "/opt/ghidra"  # Adjust to your installation
    
    def __init__(self, project_dir="./ghidra_projects"):
        self.project_dir = Path(project_dir)
        self.project_dir.mkdir(parents=True, exist_ok=True)
    
    def decompile_functions(
        self, 
        binary_path: str, 
        function_addresses: list,
        project_name: str = "diffproject"
    ) -> Dict[str, dict]:
        """
        Decompile specific functions from a binary using Ghidra headless.
        
        Args:
            binary_path: Path to the binary to analyze
            function_addresses: List of function addresses (int) to decompile
            project_name: Ghidra project name
            
        Returns:
            Dict mapping each hex address string (e.g. "0x140001000") to a
            dict with "name", "pseudocode", "signature" and "callers"
        """
        # Write target addresses to a file for the Ghidra script
        addr_file = self.project_dir / "target_addrs.json"
        addr_file.write_text(json.dumps(
            [hex(addr) for addr in function_addresses]
        ))
        
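        output_file = self.project_dir / "decompiled_output.json"
        # Remove any result left by a previous run, so output from another
        # binary cannot be returned if this run's script fails to write
        output_file.unlink(missing_ok=True)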
        
        # Run Ghidra headless analyzer
        cmd = [
            f"{self.GHIDRA_HOME}/support/analyzeHeadless",
            str(self.project_dir),
            project_name,
            "-import", binary_path,
            "-postScript", "DecompileTargets.java",
            "-scriptPath", str(Path(__file__).parent / "ghidra_scripts"),
            "-overwrite",
            "-deleteProject",  # Clean up after
        ]
        
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=600  # 10 min timeout per binary
        )
        
        if result.returncode != 0:
            print(f"[!] Ghidra stderr:\n{result.stderr[-2000:]}")
            raise RuntimeError("Ghidra analysis failed")
        
        # Parse output
        if output_file.exists():
            return json.loads(output_file.read_text())
        
        return {}

And the corresponding Ghidra script (DecompileTargets.java):


// DecompileTargets.java — Ghidra postScript
// Decompiles only the functions at addresses specified in target_addrs.json
// Outputs results to decompiled_output.json

import ghidra.app.decompiler.DecompInterface;
import ghidra.app.decompiler.DecompileResults;
import ghidra.app.script.GhidraScript;
import ghidra.program.model.listing.Function;
import ghidra.program.model.listing.FunctionManager;
import ghidra.program.model.address.Address;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import java.io.*;
import java.util.*;

public class DecompileTargets extends GhidraScript {
    
    @Override
    public void run() throws Exception {
        // Read target addresses
        File addrFile = new File(
            getProjectRootFolder().getProjectLocator()
                .getProjectDir().getParent(),
            "target_addrs.json"
        );
        
        Gson gson = new Gson();
        List<String> targetAddrs = gson.fromJson(
            new FileReader(addrFile),
            new TypeToken<List<String>>(){}.getType()
        );
        
        // Set up decompiler
        DecompInterface decomp = new DecompInterface();
        decomp.openProgram(currentProgram);
        
        FunctionManager funcMgr = currentProgram.getFunctionManager();
        Map<String, Object> results = new HashMap<>();
        
        for (String addrStr : targetAddrs) {
            long addrLong = Long.parseLong(
                addrStr.replace("0x", ""), 16
            );
            Address addr = currentProgram.getAddressFactory()
                .getDefaultAddressSpace().getAddress(addrLong);
            Function func = funcMgr.getFunctionAt(addr);
            
            if (func == null) {
                // Try to find containing function
                func = funcMgr.getFunctionContaining(addr);
            }
            
            if (func != null) {
                DecompileResults res = decomp.decompileFunction(
                    func, 120, monitor  // 120 second timeout per function
                );
                
                if (res.getDecompiledFunction() != null) {
                    Map<String, String> funcData = new HashMap<>();
                    funcData.put("name", func.getName());
                    funcData.put("pseudocode", 
                        res.getDecompiledFunction().getC());
                    funcData.put("signature", 
                        func.getSignature().getPrototypeString());
                    
                    // Get callers (cross-references)
                    List<String> callers = new ArrayList<>();
                    for (var ref : getReferencesTo(func.getEntryPoint())) {
                        Function caller = funcMgr.getFunctionContaining(
                            ref.getFromAddress()
                        );
                        if (caller != null) {
                            callers.add(caller.getName());
                        }
                    }
                    funcData.put("callers", String.join(", ", callers));
                    
                    results.put(addrStr, funcData);
                }
            }
        }
        
        // Release the decompiler process before writing results
        decomp.dispose();
        
        // Write output
        File outFile = new File(addrFile.getParent(), 
            "decompiled_output.json");
        try (FileWriter fw = new FileWriter(outFile)) {
            gson.toJson(results, fw);
        }
        
        println("[+] Decompiled " + results.size() + " functions");
    }
}

Key Optimization: Don't Decompile Everything

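On a binary like ntoskrnl.exe with 30,000+ functions, full decompilation takes over an hour. We only need the ~20 functions Diaphora flagged, which cuts the decompilation step itself to seconds. Ghidra's auto-analysis still runs on import, so for a binary this size expect that pass to dominate the runtime, and raise the 10-minute timeout if it gets in the way.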


# Only decompile what Diaphora flagged as changed
analyzer = DiaphoraAnalyzer("diaphora_results.sqlite")
candidates = analyzer.get_security_candidates()

# Extract just the addresses we need
original_addrs = [c[1].address_original for c in candidates]
patched_addrs = [c[1].address_patched for c in candidates]

ghidra = HeadlessGhidra()
original_decomp = ghidra.decompile_functions(
    "ntoskrnl_original.exe", original_addrs
)
patched_decomp = ghidra.decompile_functions(
    "ntoskrnl_patched.exe", patched_addrs
)
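
The two result dicts still have to be joined back into before/after pairs. The Ghidra script keys its output by the hex address strings it was given and stores callers as one comma-separated string, so the join has to convert both. This is a minimal sketch under those assumptions; `pair_decompilation` and the output field names are hypothetical, chosen to match the arguments the prompt builder expects.

```python
from typing import Dict, List, Tuple


def pair_decompilation(
    addr_pairs: List[Tuple[str, int, int]],
    original_decomp: Dict[str, dict],
    patched_decomp: Dict[str, dict],
) -> Tuple[List[dict], List[str]]:
    """
    Join per-binary decompiler output into before/after records.

    addr_pairs holds (function_name, original_address, patched_address).
    Returns (paired_records, names_with_missing_output).
    """
    paired, missing = [], []
    for name, orig_addr, patched_addr in addr_pairs:
        # Keys are the same hex strings that were written to target_addrs.json
        before = original_decomp.get(hex(orig_addr))
        after = patched_decomp.get(hex(patched_addr))
        if before is None or after is None:
            # Decompilation failed or timed out for one side; report it
            # instead of silently dropping the function
            missing.append(name)
            continue

        # Split the comma-separated caller string and drop duplicates
        raw_callers = after.get("callers", "")
        callers = sorted({c.strip() for c in raw_callers.split(",") if c.strip()})

        paired.append({
            "name": name,
            "original_code": before["pseudocode"],
            "patched_code": after["pseudocode"],
            "callers": callers,
        })
    return paired, missing
```

With the objects above, `addr_pairs` would be `[(c[1].name, c[1].address_original, c[1].address_patched) for c in candidates]`. Anything that lands in `missing` is worth a manual look, since a function the decompiler chokes on is not necessarily an uninteresting one.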

Stage 4: Prompt Engineering — The Critical Layer

This is where most people would screw up. You can't just dump two walls of pseudocode and say "find the bug." The model needs structured context and specific questions.

The Prompt Template


def build_analysis_prompt(func_diff, original_code, patched_code, callers):
    """
    Constructs a structured prompt for LLM vulnerability analysis.
    
    Key principles:
    - Show BOTH versions side-by-side (not just the diff)
    - Include caller context (reachability matters)
    - Ask structured questions (prevents rambling)
    - Request specific output format (parseable)
    """
    
    prompt = f"""## Binary Patch Analysis

### Target
- **Function**: `{func_diff.name}`
- **Binary**: ntoskrnl.exe (Windows Kernel)
- **Similarity ratio**: {func_diff.similarity_ratio:.3f}
- **Known callers**: {', '.join(callers) if callers else 'Unknown'}

### BEFORE (Unpatched / Vulnerable Version):

{original_code}



### AFTER (Patched Version):

{patched_code}



### Analysis Tasks

**Task 1 — Vulnerability Classification**
Examine the diff between the two versions. Classify the vulnerability 
into one of: buffer overflow, integer overflow, out-of-bounds read/write, 
use-after-free, type confusion, race condition, null pointer dereference, 
logic bug, or other (specify).

Identify the EXACT lines that changed and explain what they reveal.

**Task 2 — Reachability Assessment**
Given the known callers listed above, assess:
- Can an unprivileged user-mode process trigger this code path?
- What Windows API calls or operations would lead here?
- Are there any gating checks that limit reachability?

**Task 3 — Exploitation Primitive**
If the vulnerability is triggerable:
- What memory corruption primitive does it provide? 
  (arbitrary write, relative write, read, info leak, etc.)
- What is the corruption target? (adjacent heap object, stack variable, etc.)
- What's the attacker-controlled input that influences the corruption?

**Task 4 — Trigger Sketch**
Write a minimal C proof-of-concept skeleton that would:
1. Reach the vulnerable function
2. Supply the input that triggers the vulnerability
Do NOT write a full exploit. Just reach the bug.

### Output Format
Respond with clearly labeled sections matching each task number.
For Task 1, also include a confidence score (low/medium/high) for 
your classification.
"""
    return prompt
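
Asking for a parseable format only pays off if something actually parses it. This is a minimal sketch, assuming the model follows the template and opens each section with a line that starts with "Task N"; `split_task_sections` and `extract_confidence` are hypothetical helper names, and real responses will need more forgiving handling.

```python
import re
from typing import Dict, Optional

# A header is a line that begins (after optional markdown markers) with "Task N"
_TASK_HEADER = re.compile(
    r"^[#* \t]*Task\s+(\d+)\b.*$", re.IGNORECASE | re.MULTILINE
)
# "Confidence: high", "confidence = medium", "Confidence (low)" and similar
_CONFIDENCE = re.compile(
    r"confidence\W{0,10}(low|medium|high)\b", re.IGNORECASE
)


def split_task_sections(response: str) -> Dict[int, str]:
    """Map each task number to the text between its header and the next one."""
    headers = list(_TASK_HEADER.finditer(response))
    sections = {}
    for i, match in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(response)
        sections[int(match.group(1))] = response[match.end():end].strip()
    return sections


def extract_confidence(text: str) -> Optional[str]:
    """Return 'low', 'medium' or 'high' if the text states one, else None."""
    match = _CONFIDENCE.search(text)
    return match.group(1).lower() if match else None
```

A response that yields no sections, or a Task 1 section with no confidence rating, is itself a useful signal: the model ignored the format, so the rest of its answer deserves extra suspicion.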

Multi-Round Chaining — Why Single Prompts Aren't Enough

Don't ask one mega-question. Chain the analysis across multiple rounds so each step validates the previous one.


import anthropic
from typing import Dict, Any

class VulnAnalyzer:
    """
    Multi-round LLM analysis pipeline for vulnerability classification.
    
    Each round builds on the previous, with validation between steps.
    This catches hallucinations early before they compound.
    """
    
    def __init__(self, model="claude-sonnet-4-20250514"):
        self.client = anthropic.Anthropic()
        self.model = model
        self.conversation_history = []
    
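    def analyze(self, func_diff, original_code, patched_code, callers) -> Dict[str, Any]:
        # Start each function with a fresh conversation, so context from a
        # previously analyzed function cannot leak into this one
        self.conversation_history = []
        results = {}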
        
        # --- Round 1: Classification ---
        r1_prompt = f"""Analyze this binary patch. I'll show you the original 
and patched versions of function `{func_diff.name}`.

ORIGINAL (vulnerable):

{original_code}



PATCHED (fixed):

{patched_code}



Classify the vulnerability type. What specific code change reveals it?
Confidence: low/medium/high.
Respond concisely — classification + evidence only."""

        r1_response = self._ask(r1_prompt)
        results["classification"] = r1_response
        
        # --- Round 2: Reachability (skipped when R1 confidence is low) ---
        # Crude substring gate: any mention of "high" or "medium" passes.
        # Later rounds build on this one, so stop here if it is skipped.
        if not any(level in r1_response.lower() for level in ("high", "medium")):
            return results
        
        r2_prompt = f"""Good. Now assess reachability.

Known callers of `{func_diff.name}`: {', '.join(callers) if callers else 'Unknown'}

Can an unprivileged user-mode process reach this function?
What API calls or operations would trigger it?
Be specific about the call chain."""

        r2_response = self._ask(r2_prompt)
        results["reachability"] = r2_response
        
        # --- Round 3: Exploitation primitive ---
        r3_prompt = """Based on your classification and reachability analysis:

What exploitation primitive does this give an attacker?
(arbitrary write, relative OOB, info leak, etc.)

What is the corrupted target and what does the attacker control?"""

        r3_response = self._ask(r3_prompt)
        results["exploitation"] = r3_response
        
        # --- Round 4: PoC skeleton ---
        r4_prompt = """Write a minimal C proof-of-concept that reaches the 
vulnerable function with attacker-controlled input.

Requirements:
- Must compile on Windows (use Win32 APIs)
- Just trigger the bug, don't exploit it
- Include comments explaining each step
- Use the specific call chain you identified"""

        r4_response = self._ask(r4_prompt)
        results["poc_skeleton"] = r4_response
        
        return results
    
    def _ask(self, prompt: str) -> str:
        """Send a message maintaining conversation context."""
        self.conversation_history.append({
            "role": "user", 
            "content": prompt
        })
        
        response = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            system="""You are an expert vulnerability researcher 
specializing in Windows kernel security. You analyze binary patches 
to identify and classify vulnerabilities. Be precise, technical, 
and concise. Do not speculate beyond what the code shows.""",
            messages=self.conversation_history
        )
        
        assistant_msg = response.content[0].text
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_msg
        })
        
        return assistant_msg

Stage 5: Validation — Catching Hallucinations

The LLM will be wrong sometimes. It hallucinates Win32 API calls, invents struct fields that don't exist, and misclassifies subtle bugs. You need automated sanity checks.
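
Every check of this kind starts from the same question: which lines did the patch actually add? Comparing raw decompiler output answers it badly, because auto-generated names and addresses shift between builds and make unchanged lines look new. This is a minimal sketch that normalizes those names before diffing with `difflib`. The regexes assume Ghidra-style naming (`FUN_<addr>`, `DAT_<addr>`, `local_<offset>`, `uVar1`) and will need adjusting for other decompilers; `normalize` and `added_lines` are hypothetical helper names.

```python
import difflib
import re
from typing import List

# Address-based names change whenever code moves between builds
_ADDR_NAME = re.compile(r"\b(FUN|DAT|LAB|PTR|UNK)_[0-9a-fA-F]+\b")
# Auto-generated locals are renumbered when the function body changes
_AUTO_VAR = re.compile(r"\b(?:[a-z]{1,4}Var\d+|local_[0-9a-fA-F]+)\b")


def normalize(line: str) -> str:
    """Replace build-specific names with placeholders and squeeze whitespace."""
    line = _ADDR_NAME.sub(r"\1_X", line)
    line = _AUTO_VAR.sub("VAR", line)
    return " ".join(line.split())


def added_lines(original_code: str, patched_code: str) -> List[str]:
    """Return normalized lines present only in the patched version, in order."""
    before = [normalize(l) for l in original_code.splitlines()]
    after = [normalize(l) for l in patched_code.splitlines()]

    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute new text on the patched side
        if tag in ("insert", "replace"):
            added.extend(line for line in after[j1:j2] if line)
    return added
```

Collapsing every local to `VAR` trades precision for stability: two different variables become indistinguishable, so treat the result as input for pattern matching, not as a faithful diff.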


import re
import subprocess
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ValidationResult:
    passed: bool
    score: float  # 0.0 to 1.0
    issues: List[str]
    

class PatchValidator:
    """
    Validates LLM vulnerability analysis against known heuristics.
    
    This doesn't prove the analysis is correct, but it catches
    obviously wrong classifications and hallucinated details.
    """
    
    # Map of patch patterns to expected vulnerability classes
    PATCH_PATTERNS = {
        "bounds_check_added": {
            "pattern": r"if\s*\([^)]*[<>]=?\s*\d+",
            "expected_classes": [
                "buffer overflow", "out-of-bounds", 
                "integer overflow"
            ],
            "confidence_boost": 0.3
        },
        "null_check_added": {
            "pattern": r"if\s*\([^)]*[!=]=\s*(NULL|0|nullptr)",
            "expected_classes": [
                "null pointer dereference", "use-after-free"
            ],
            "confidence_boost": 0.25
        },
        "lock_added": {
            "pattern": r"(mutex|spinlock|lock|critical_section|KeAcquire|ExAcquire)",
            "expected_classes": ["race condition"],
            "confidence_boost": 0.4
        },
        "size_validation": {
            "pattern": r"(size|length|count|num)\s*[<>]=?\s*",
            "expected_classes": [
                "buffer overflow", "integer overflow", 
                "out-of-bounds"
            ],
            "confidence_boost": 0.35
        },
        "type_check_added": {
            "pattern": r"(type|kind|tag)\s*[!=]=\s*",
            "expected_classes": ["type confusion"],
            "confidence_boost": 0.3
        }
    }
    
    def validate_classification(
        self, 
        llm_classification: str,
        original_code: str,
        patched_code: str
    ) -> ValidationResult:
        """
        Cross-check the LLM's vulnerability classification against 
        observable patch patterns.
        """
        issues = []
        score = 0.5  # Start neutral
        
        # Find what was ADDED in the patch
        # (Naive approach; a real implementation should use AST diffing)
        original_lines = set(original_code.splitlines())
        # Keep the patched file's line order; a set difference would
        # return the new lines in arbitrary order
        new_lines = [
            line for line in patched_code.splitlines()
            if line not in original_lines
        ]
        new_code = "\n".join(new_lines)
        
        matched_patterns = []
        
        for pattern_name, pattern_info in self.PATCH_PATTERNS.items():
            if re.search(pattern_info["pattern"], new_code, re.IGNORECASE):
                matched_patterns.append(pattern_name)
                
                # Check if LLM's classification aligns 
                # with what the patch pattern suggests
                llm_class_lower = llm_classification.lower()
                expected = pattern_info["expected_classes"]
                
                if any(exp in llm_class_lower for exp in expected):
                    score += pattern_info["confidence_boost"]
                else:
                    issues.append(
                        f"Patch pattern '{pattern_name}' suggests "
                        f"{expected}, but LLM classified as: "
                        f"'{llm_classification}'"
                    )
                    score -= 0.2
        
        if not matched_patterns:
            issues.append(
                "No recognizable patch patterns found — "
                "manual review recommended"
            )
            score -= 0.1
        
        # Check for common hallucination indicators
        hallucination_flags = self._check_hallucinations(
            llm_classification, patched_code
        )
        issues.extend(hallucination_flags)
        score -= 0.15 * len(hallucination_flags)
        
        score = max(0.0, min(1.0, score))
        
        return ValidationResult(
            passed=score >= 0.5 and len(hallucination_flags) == 0,
            score=score,
            issues=issues
        )
    
    def _check_hallucinations(
        self, 
        classification: str, 
        patched_code: str
    ) -> List[str]:
        """
        Detect common LLM hallucination patterns in vuln analysis.
        """
        flags = []
        
        # If LLM says "race condition" but no sync primitives 
        # were added, it's likely wrong
        if "race" in classification.lower():
            sync_evidence = re.search(
                r"(lock|mutex|spinlock|atomic|interlocked)",
                patched_code, re.IGNORECASE
            )
            if not sync_evidence:
                flags.append(
                    "HALLUCINATION: 'race condition' classified but "
                    "no synchronization primitives found in patch"
                )
        
        # If LLM says "use-after-free" but the patch only 
        # adds bounds checks, probably wrong
        if "use-after-free" in classification.lower():
            if not re.search(r"(free|release|delete|deref)", 
                           patched_code, re.IGNORECASE):
                flags.append(
                    "SUSPECT: 'use-after-free' classified but no "
                    "free/release related changes visible"
                )
        
        return flags
    
    def validate_poc_compiles(self, poc_code: str) -> Tuple[bool, str]:
        """
        Syntax-check the PoC skeleton to catch hallucinated APIs.
        Uses the x86_64-w64-mingw32-gcc cross-compiler; an MSVC
        (cl.exe) path is not implemented here.
        
        Returns (success, error_message).
        """
        import os
        import tempfile
        
        # Write the source and close the file before invoking the
        # compiler, so this also works where open files are locked
        with tempfile.NamedTemporaryFile(
            suffix=".c", mode="w", delete=False
        ) as f:
            f.write(poc_code)
            source_path = f.name
        
        try:
            result = subprocess.run(
                [
                    "x86_64-w64-mingw32-gcc",
                    "-fsyntax-only",  # Parse and type-check only
                    source_path
                ],
                capture_output=True, text=True, timeout=30
            )
            
            if result.returncode == 0:
                return True, ""
            return False, result.stderr
            
        except FileNotFoundError:
            return False, "No cross-compiler available"
        except subprocess.TimeoutExpired:
            return False, "Compiler timed out"
        finally:
            os.unlink(source_path)  # Always remove the temp file

Confidence Scoring — Putting It All Together


def compute_final_confidence(
    diaphora_score: float,
    llm_classification_confidence: str,
    validation_result,  # ValidationResult
    llm_consistency: float  # Agreement across N independent runs
) -> dict:
    """
    Aggregate confidence score from all pipeline stages.
    
    A high score means: the patch pattern matches the LLM's 
    classification, the LLM is confident, and multiple runs agree.
    """
    
    confidence_map = {"low": 0.3, "medium": 0.6, "high": 0.9}
    llm_conf = confidence_map.get(
        llm_classification_confidence.lower(), 0.5
    )
    
    # Weighted combination
    weights = {
        "patch_heuristic": 0.25,
        "llm_confidence": 0.25,
        "validation_score": 0.25,
        "consistency": 0.25
    }
    
    final_score = (
        weights["patch_heuristic"] * diaphora_score +
        weights["llm_confidence"] * llm_conf +
        weights["validation_score"] * validation_result.score +
        weights["consistency"] * llm_consistency
    )
    
    # Determine action
    if final_score >= 0.8:
        action = "HIGH_PRIORITY — likely exploitable, begin manual analysis"
    elif final_score >= 0.6:
        action = "MEDIUM — worth investigating, may need manual validation"
    elif final_score >= 0.4:
        action = "LOW — possible false positive, review if time permits"
    else:
        action = "SKIP — likely misclassification or non-security change"
    
    return {
        "final_score": round(final_score, 3),
        "action": action,
        "breakdown": {
            "patch_heuristic": diaphora_score,
            "llm_confidence": llm_conf,
            "validation": validation_result.score,
            "consistency": llm_consistency
        },
        "issues": validation_result.issues
    }
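
The llm_consistency argument is described only as agreement across N independent runs; how it is computed is not shown. One plausible definition, assumed here, is the share of runs that return the most common classification label. The function name and the exact-match normalization are illustrative choices, not part of the pipeline above.

```python
from collections import Counter
from typing import List, Tuple


def classification_consistency(labels: List[str]) -> Tuple[str, float]:
    """Return (most common label, fraction of runs that produced it)."""
    if not labels:
        return "", 0.0
    # Normalize case and whitespace so trivial variations still agree
    normalized = [label.strip().lower() for label in labels]
    top_label, top_count = Counter(normalized).most_common(1)[0]
    return top_label, top_count / len(normalized)


runs = [
    "Out-of-bounds read",
    "out-of-bounds read",
    "Use-after-free",
    "out-of-bounds read",
]
label, agreement = classification_consistency(runs)
print(label, agreement)  # out-of-bounds read 0.75
```

Exact string matching is strict: "OOB read" and "out-of-bounds read" would count as disagreement, so mapping labels onto a fixed taxonomy first is probably worthwhile.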

A Concrete Example: Walking Through a Real Patch

Let's trace through a simplified but realistic example. Imagine a Patch Tuesday fix for a kernel callback function.

The Diff

Before (vulnerable):


void CmpCallCallBacks(PCMHIVE Hive, int Type) {
    PVOID buffer = Hive->CallbackListHead;
    int count = *(int*)(buffer + 0x10);
    
    for (int i = 0; i < count; i++) {
        PCALLBACK_ENTRY entry = (PCALLBACK_ENTRY)(buffer + i * 0x28);
        if (entry->Routine != NULL) {
            entry->Routine(entry->Context, Type);
        }
    }
}

After (patched):


void CmpCallCallBacks(PCMHIVE Hive, int Type) {
    PVOID buffer = Hive->CallbackListHead;
    int count = *(int*)(buffer + 0x10);
    
    // === PATCH: bounds validation added ===
    if (count < 0 || count > MAX_CALLBACKS) {
        return;  
    }
    
    for (int i = 0; i < count; i++) {
        PCALLBACK_ENTRY entry = (PCALLBACK_ENTRY)(buffer + i * 0x28);
        if (entry->Routine != NULL) {
            entry->Routine(entry->Context, Type);
        }
    }
}
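
To isolate the added check from two listings like these, a line-level diff is usually enough. A minimal sketch using Python's difflib, with condensed versions of the two functions as input; the added_lines helper is a hypothetical name for illustration.

```python
import difflib
from typing import List

BEFORE = """int count = *(int*)(buffer + 0x10);
for (int i = 0; i < count; i++) {
    entry->Routine(entry->Context, Type);
}"""

AFTER = """int count = *(int*)(buffer + 0x10);
if (count < 0 || count > MAX_CALLBACKS) {
    return;
}
for (int i = 0; i < count; i++) {
    entry->Routine(entry->Context, Type);
}"""


def added_lines(before: str, after: str) -> List[str]:
    """Return lines that appear only in `after`, in source order."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes inserted lines with "+ "; drop the prefix and
    # skip lines that are empty after stripping whitespace
    return [
        line[2:].strip()
        for line in diff
        if line.startswith("+ ") and line[2:].strip()
    ]


for line in added_lines(BEFORE, AFTER):
    print(line)
```

Unlike a set difference, this keeps the added lines in order and keeps duplicates, which matters when the new code is matched against multi-line patterns.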

What the LLM sees

When fed this through the structured prompt, a good model will identify:

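Classification: Integer overflow / out-of-bounds access (HIGH confidence). The count value is read from attacker-influenced memory (Hive->CallbackListHead + 0x10) with no validation. A very large count makes the loop read, and call through, entries past the end of the callback array. A negative count is harmless in the unpatched loop, because the signed comparison i < count is false on the first iteration; the count < 0 check in the patch is defensive.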

Reachability: CmpCallCallBacks is called from CmpPostNotify and CmUnRegisterCallback. Registry operations from user-mode can reach this path. A crafted registry hive could influence the CallbackListHead structure.

Primitive: Out-of-bounds read leading to a controlled function pointer call. If the attacker can influence the memory at buffer + i * 0x28, they control entry->Routine — a direct kernel code execution primitive.

PoC sketch: Load a crafted registry hive via RegLoadKey() with a malformed callback list.

Where LLMs Fail (and Why This Matters)

Documenting failure modes is just as important as the successes. From extensive testing, here's where models consistently struggle:

1. Deeply nested struct manipulation. When the vulnerability involves pointer arithmetic across three or more levels of struct nesting, models lose track of offsets. They'll say "field X is at offset 0x18" when it's actually at 0x20 because they miscounted a union.

2. Compiler optimization artifacts. Ghidra's decompiler sometimes produces pseudocode that looks buggy but is just how an optimized instruction sequence decompiles. Models flag these as vulnerabilities, producing false positives.

3. Subtle race conditions. Time-of-check-to-time-of-use (TOCTOU) bugs are hard for models because the vulnerability often lives in the gap between two functions, not within one. The model sees each function in isolation and misses the window.

4. Implicit type conversions. Signed/unsigned comparison bugs are notoriously subtle. if (user_input < buffer_size) looks safe, but if both operands are signed and user_input is negative, the check passes and the value becomes enormous once it is converted to an unsigned length or index. If buffer_size is unsigned, C's usual arithmetic conversions turn the negative value into a huge one before the comparison, so the result flips. The behavior depends on the operand types, not on the compiler. Models miss this about 60% of the time in testing.

5. Custom allocator semantics. The Windows kernel uses pool allocators with tag-based semantics. Models don't understand that memory from ExAllocatePoolWithTag has alignment and adjacency properties that affect exploitability.
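
The signed/unsigned case in item 4 can be reproduced without a C compiler. The sketch below uses ctypes fixed-width types to mimic how C treats the operands, assuming a 32-bit int; the helper names are illustrative.

```python
import ctypes


def less_than_both_signed(a: int, b: int) -> bool:
    """Model `int a < int b`: both operands stay signed."""
    return ctypes.c_int32(a).value < ctypes.c_int32(b).value


def less_than_mixed(a: int, b: int) -> bool:
    """Model `int a < unsigned b`: a is converted to unsigned first."""
    return ctypes.c_uint32(a).value < ctypes.c_uint32(b).value


user_input = -1
buffer_size = 256

# Both signed: -1 < 256, so the bounds check passes
print(less_than_both_signed(user_input, buffer_size))  # True

# Mixed signedness: -1 becomes 4294967295, so the check fails
print(less_than_mixed(user_input, buffer_size))  # False

# The value a copy routine would see if the signed check passed
# and the length were then converted to a 32-bit unsigned type
print(ctypes.c_uint32(user_input).value)  # 4294967295
```

Which case applies depends on declarations that decompiler output often gets wrong or omits, which may be part of why models struggle here.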

The Defender's Perspective

If you're on a blue team reading this, the implications are uncomfortable. This pipeline compresses the 1-day exploitation window from weeks (when only elite researchers could find the bug) to potentially hours (when anyone with API access and this script can triage patches).

What this means practically:

Patch faster. The "we'll patch next month" window is closing.
Prioritize by exploitability, not just CVSS score. A "7.5 High" with a trivially reachable code path might be more dangerous in your environment than a "9.8 Critical" that is only reachable in a configuration you don't run.
Monitor for this tooling. If you see automated Ghidra analysis + LLM API calls spinning up every Patch Tuesday, someone's running this pipeline.
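
Prioritizing by exploitability can be as simple as adjusting the CVSS base score with what you know about your own exposure. A rough sketch follows; the fields, weights, and CVE identifiers are all placeholders chosen for illustration, not a calibrated model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PatchedCve:
    cve_id: str               # placeholder IDs, not real CVEs
    cvss: float               # vendor-published base score
    exposed_by_default: bool  # reachable in the configuration you run
    needs_authentication: bool


def patch_priority(item: PatchedCve) -> float:
    """Combine CVSS with local reachability (weights are arbitrary)."""
    score = item.cvss / 10.0
    if item.exposed_by_default:
        score += 0.3
    if not item.needs_authentication:
        score += 0.2
    return round(score, 3)


def rank(items: List[PatchedCve]) -> List[PatchedCve]:
    """Order patches from most to least urgent."""
    return sorted(items, key=patch_priority, reverse=True)


queue = rank([
    PatchedCve("CVE-XXXX-0001", 9.8, False, False),
    PatchedCve("CVE-XXXX-0002", 7.5, True, False),
])
print([item.cve_id for item in queue])
```

With these weights the 7.5 that is exposed by default outranks the 9.8 that is not, which is the point of the bullet above.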

What Doesn't Exist Yet (Your Research Opportunities)

A proper benchmark dataset — Matched pairs of (vulnerable_function, patched_function, CVE_class, exploitability_score) for hundreds of real CVEs. This would let the community properly evaluate and improve models.

Head-to-head model evaluation — Nobody has rigorously compared models on this specific task with controlled methodology.

End-to-end open-source tooling — Everything described here is duct-taped together. A clean, maintained pipeline would be enormously useful.

Fine-tuning on historical CVEs — Take every known patched vulnerability, extract the before/after binaries, and build a training dataset. The accuracy gain could be large, but it is unexplored.

Hybrid approaches — LLM does the classification and rough trigger hypothesis, then symbolic execution (angr/Triton) does precise path constraint solving. This combination could be significantly more powerful than either alone.
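
For the benchmark idea, the record format follows directly from the tuple named above. A minimal sketch of one possible schema and an accuracy metric; the class name, the classify callback signature, and the sample record are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class BenchmarkCase:
    cve_id: str
    vulnerable_function: str   # decompiled pseudocode before the patch
    patched_function: str      # decompiled pseudocode after the patch
    cve_class: str             # ground-truth label
    exploitability_score: float


def classification_accuracy(
    cases: List[BenchmarkCase],
    classify: Callable[[str, str], str],
) -> float:
    """Fraction of cases where classify() returns the true class."""
    if not cases:
        return 0.0
    correct = sum(
        1 for case in cases
        if classify(case.vulnerable_function, case.patched_function)
        .strip().lower() == case.cve_class.strip().lower()
    )
    return correct / len(cases)


# Stand-in classifier; a real run would call the model under test
def always_oob(before: str, after: str) -> str:
    return "out-of-bounds read"


sample = [
    BenchmarkCase("CVE-XXXX-0001", "...", "...", "out-of-bounds read", 0.8),
    BenchmarkCase("CVE-XXXX-0002", "...", "...", "use-after-free", 0.6),
]
print(classification_accuracy(sample, always_oob))  # 0.5
```

The same harness would support the head-to-head model comparison: swap the classify callback per model and hold the cases fixed.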

Conclusion

LLM-assisted binary diffing isn't theoretical — it's buildable today with existing tools and APIs. The pipeline described here (Winbindex → Diaphora → Ghidra → structured prompts → multi-round LLM analysis → validation) turns Patch Tuesday into a semi-automated vulnerability discovery process.

The models aren't perfect. They hallucinate, miss subtle bugs, and struggle with complex memory semantics. But as a triage tool — rapidly sorting through hundreds of changed functions to surface the 3-5 that are security-relevant — they're already transformative.

The 1-day window just got a lot shorter. Whether that's terrifying or exciting depends on which side of the patch you're sitting on.

Want to discuss this further or contribute to building the pipeline? Open an issue or reach out.