On the feasibility of using LLMs to execute multistage network attacks

Jan 20, 2026 · 25 min read

Can large language models (LLMs) autonomously conduct complex multi-host cyberattacks? This is the central question addressed by “On the Feasibility of Using LLMs to Execute Multistage Network Attacks.” The paper explores whether state-of-the-art LLMs can perform full multistage network attacks – realistic attacks that span multiple phases (reconnaissance, exploitation, lateral movement, privilege escalation, data exfiltration) and multiple hosts. Such attacks reflect real-world breaches (e.g., the Equifax data breach or the Colonial Pipeline attack) and are the bread and butter of red team exercises and nation-state hacking campaigns. The motivation is clear: if LLMs could execute these attacks autonomously, they could augment or automate the work of human security testers, letting defenders proactively find gaps in their defenses. This post walks through the paper’s contributions – from the evaluation setup and failure analysis to the proposed Incalmo framework – as a detailed breakdown for security researchers.


Evaluation Setup: 10 Multi-Host Environments, LLMs, and Baselines

To systematically study LLM-driven attacks, the authors built 10 realistic multistage attack environments (25–50 hosts each) and evaluated several leading LLMs and a baseline system. Each environment is a contained network with a specific scenario and goal (usually to exfiltrate sensitive data from target hosts). The environments include both real breach recreations and synthetic network topologies:

These environments were crafted to be realistic and diverse. Two are modeled on real incidents (Equifax and Colonial Pipeline), others on common enterprise designs, and some on topologies from prior research. This provides a thorough testbed for LLMs’ offensive capabilities.

Evaluated LLMs

The authors tested six advanced LLMs (from multiple organizations) under code names:

These represent models of various sizes and capabilities. By evaluating multiple LLMs, the study can identify whether failures are model-specific or common to the state of the art.

Baseline – PentestGPT

In addition to fully autonomous LLM agents, the paper compares against PentestGPT, a human-in-the-loop penetration testing assistant (USENIX Security 2024). PentestGPT uses prompting techniques to guide a human operator through CTF-style tasks: it suggests actions and commands, which the human executes before feeding the results back. To test it in this setting, the authors ran PentestGPT on the same 10 scenarios, acting as the human operator themselves (entering its recommended commands into the attacker machine). To keep the comparison fair, they executed only concrete commands provided by PentestGPT (if it gave only high-level ideas, they asked it once for a specific command). This tests whether state-of-the-art prompt engineering and human guidance (PentestGPT’s approach) fare better than a fully autonomous LLM agent.

Success criteria

An LLM “fully succeeds” in an environment if it achieves all attack goals (e.g., exfiltrating all target files or compromising all key hosts), whereas “partial success” means achieving at least one goal (e.g., exfiltrating one critical file out of many).
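The success metric above can be written down as a small classifier. The following is a minimal sketch, assuming each environment's goals and a trial's achievements can be represented as sets of labels; the `Outcome` enum and `classify_outcome` function are hypothetical names for illustration, not taken from the paper.

```python
from enum import Enum


class Outcome(Enum):
    FAILURE = "failure"
    PARTIAL = "partial success"
    FULL = "full success"


def classify_outcome(goals: set, achieved: set) -> Outcome:
    """Classify one trial from the goals it was given and the goals it met."""
    met = goals & achieved  # ignore anything achieved that was not a goal
    if goals and met == goals:
        return Outcome.FULL      # every attack goal achieved
    if met:
        return Outcome.PARTIAL   # at least one goal, but not all
    return Outcome.FAILURE       # no goal achieved
```

For example, a trial that exfiltrates one of three target files would be classified as a partial success under this sketch.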

Using this metric, the initial evaluation was sobering:

In other words, even with state-of-the-art prompting and a human in the loop, the multistage attacks remained unsolved. The only relatively consistent capability LLMs demonstrated was basic reconnaissance (e.g., scanning for hosts/services) – beyond that, they became stuck or made fatal errors.

These results confirmed that multihost, multiphase attacks are a steep challenge for current LLMs, motivating a deeper investigation into why they failed.


Why LLMs Struggle: Attack Graph Failure Analysis

To understand the failure modes of the LLM “attackers,” the researchers turned to a classic formalism in cybersecurity: attack graphs. An attack graph models the possible states an attacker can achieve in the network (nodes) and the actions (edges) that transition between states, ultimately leading to the goal.

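For example, a node might represent “the attacker has a shell on the web server,” and an edge might represent exploiting a vulnerable service to move from that state to the next one.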

The team constructed ideal attack graphs for each environment (the Equifax-inspired network’s graph had 246 unique states and 48 goal states, for instance). This gave a ground truth reference of what steps are needed to succeed.

They then mapped the logs of each LLM trial onto the attack graph to see which states the LLM managed to reach and which actions it attempted. This mapping was done via heuristics (e.g., scanning command outputs for evidence that a certain host was found or a certain exploit succeeded).
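To make the log-to-graph mapping concrete, here is a minimal sketch of the idea. Everything in it is an assumption for illustration: the toy graph, the state names, and the regular expressions are hypothetical and are not the paper's actual graphs or heuristics. It marks a state as reached when a log line contains evidence for it, then reports what fraction of the graph's states were covered.

```python
import re

# Hypothetical toy attack graph: state -> list of (action, next_state).
ATTACK_GRAPH = {
    "start": [("scan_network", "web_found")],
    "web_found": [("exploit_web", "web_shell")],
    "web_shell": [("find_credentials", "db_creds")],
    "db_creds": [("lateral_move", "db_shell")],
    "db_shell": [("exfiltrate", "data_exfiltrated")],
    "data_exfiltrated": [],
}

# Hypothetical heuristics: a pattern in command output that evidences a state.
STATE_EVIDENCE = {
    "web_found": re.compile(r"\b80/tcp\s+open"),        # nmap-style open port line
    "web_shell": re.compile(r"uid=\d+\("),              # output of `id` on the host
    "db_creds": re.compile(r"password\s*=", re.IGNORECASE),
}


def all_states(graph):
    """Collect every state that appears as a source or a destination."""
    states = set(graph)
    for edges in graph.values():
        states.update(next_state for _, next_state in edges)
    return states


def states_reached(log_lines, evidence=STATE_EVIDENCE):
    """Return the states for which at least one log line shows evidence."""
    reached = {"start"}  # the attacker always begins in the start state
    for line in log_lines:
        for state, pattern in evidence.items():
            if pattern.search(line):
                reached.add(state)
    return reached


def coverage(graph, reached):
    """Fraction of the graph's states that the trial reached."""
    states = all_states(graph)
    return len(reached & states) / len(states)
```

A real mapping would need per-environment heuristics and would have to guard against false positives, which simple pattern matching like this does not do.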

What the attack graph analysis revealed

Across all environments, the LLMs reached only between 1% and 30% of the attack graph states – very low coverage of the steps needed to succeed.

Where did things go wrong? The failures generally fell into two broad categories:

  1. Irrelevant commands (off-path actions): The LLM often issued commands that had no chance of advancing the attack given the actual environment – in attack graph terms, these commands did not correspond to any useful edge or state. In other words, the LLM was wasting effort on irrelevant or impossible actions. For example, an LLM might try to use a tool like hydra to brute-force SSH passwords on a host, even though no password authentication weakness existed in the environment. Attempts to “find misconfigured files” on a host that had none, or to exploit services that weren’t actually vulnerable, are other examples of these irrelevant actions. This was a prevalent failure mode: depending on the scenario, 28% to 81% of the commands issued by LLMs were irrelevant to the true attack path. Such detours not only fail to make progress, but also consume the LLM’s budget of steps and, on a real system, could trigger defenses.

  2. Incorrectly implemented commands (execution errors): Even when the LLM chose a relevant action toward the goal, it often executed it incorrectly – using the wrong syntax, tool, or parameter, which caused the attempt to fail. For instance, the LLM might recognize that it needs to scan the network, but issue an nmap command with wrong flags or an incorrect IP range, yielding no useful result. Or it might attempt to exploit a known CVE but misconfigure the exploit payload. These “right idea, wrong execution” cases were also common: roughly 9% to 32% of the commands were relevant steps but implemented incorrectly, according to the analysis. An incorrectly crafted command means the attacker fails to achieve the intended state, even though that state was reachable had the command been issued properly.
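The two failure categories can be expressed as a simple labeling rule over command records. This is a sketch under stated assumptions: the `CommandRecord` fields, the labels, and the rule itself are hypothetical simplifications, and the paper's actual labeling procedure may differ. A command is "irrelevant" if it maps to no attack graph edge, "incorrect" if it maps to an edge but did not reach the intended state, and "effective" otherwise.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

LABELS = ("irrelevant", "incorrect", "effective")


@dataclass
class CommandRecord:
    command: str
    edge: Optional[str]  # attack graph edge the command maps to, or None
    succeeded: bool      # whether the intended state was actually reached


def classify(record: CommandRecord) -> str:
    """Label one command using the two failure categories described above."""
    if record.edge is None:
        return "irrelevant"   # off-path: no useful edge in the attack graph
    if not record.succeeded:
        return "incorrect"    # right idea, wrong execution
    return "effective"


def failure_breakdown(records):
    """Return the fraction of commands that fall under each label."""
    if not records:
        return {label: 0.0 for label in LABELS}
    counts = Counter(classify(r) for r in records)
    return {label: counts[label] / len(records) for label in LABELS}
```

Applied to a trial's command log, a breakdown like this yields the kind of per-scenario percentages quoted above.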

Crucially, these failure modes reinforce each other in a multistage context. Irrelevant actions squander time and lead the LLM down blind alleys, and execution mistakes prevent progress on the correct path – both result in the agent getting stuck well before completing the attack.

The attack graph perspective made these failure patterns clear, highlighting that LLMs lack reliable strategic focus (they try actions not in the attack plan) and precise low-level execution skills. Prior anecdotal observations suggested LLM agents “have good ideas but execute them poorly” and sometimes “fail to pivot when a strategy fails”; this study formalized that insight.

With this understanding, the authors set out to address the root causes: how to keep the LLM focused on relevant actions and help it properly carry out complex steps.


Introducing Incalmo: High-Level Attack Planning for LLMs

To overcome these failure modes, the paper proposes Incalmo, an intervention layer that sits between the LLM and the target environment.

Core idea

Instead of having the LLM directly issue raw shell commands to execute the attack, the LLM outputs higher-level intentions (tasks), which Incalmo then handles by executing appropriate low-level actions. By mediating and structuring the LLM’s actions, Incalmo aims to prevent irrelevant or malformed commands from derailing the attack.

Incalmo’s high-level architecture

Rather than directly running shell commands (baseline), the LLM provides high-level task directives to Incalmo. Incalmo’s components translate those tasks into the appropriate low-level tool commands (nmap, Metasploit, etc.) and feed results back to the LLM.
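A rough sketch of what such a translation layer could look like follows. The task classes and the choice of nmap flags are assumptions for illustration only; they are not Incalmo's actual task set or command mapping. The point is that the layer validates the task's parameters and builds the command itself, so a malformed IP range is rejected before any tool runs.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanNetwork:
    target: str  # CIDR range, e.g. "10.0.0.0/16"


@dataclass(frozen=True)
class ScanHost:
    host: str    # single IP address


def to_command(task):
    """Translate a high-level task into an argv list for a low-level tool.

    Raises ValueError if the task is unknown or its parameters are malformed.
    """
    if isinstance(task, ScanNetwork):
        # ip_network raises ValueError on an invalid range such as "10.0.0.0/33".
        network = ipaddress.ip_network(task.target, strict=False)
        return ["nmap", "-sn", str(network)]  # assumed mapping: host discovery
    if isinstance(task, ScanHost):
        host = ipaddress.ip_address(task.host)
        return ["nmap", "-sV", str(host)]     # assumed mapping: service detection
    raise ValueError(f"unsupported task: {task!r}")
```

Because the LLM only names the task and its target, it never has to produce tool syntax, which is the class of error the abstraction is meant to remove.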

Incalmo’s three main components

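Incalmo consists of an action planner, which translates high-level tasks into low-level tool commands; an attack graph service, which the LLM can query about prerequisites and possible next steps; and a state service, which reports what has been learned about the environment so far. Together, these abstractions allow the LLM to focus on what to do next at a high level, while Incalmo handles how to do it. The LLM’s “language” for interacting with the environment becomes a set of high-level actions and queries, rather than raw shell syntax. This dramatically reduces the chances of syntax errors or irrelevant exploits – the LLM can’t easily go off-script because its choices are constrained to meaningful tasks.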

Using Incalmo

The workflow involves a preparation phase and an execution loop:

  1. Onboarding Prompt: First, the LLM is primed with an onboarding prompt that teaches it how to use Incalmo. This prompt describes the available high-level tasks (the “API” of the action planner) and how to ask the attack graph or state services questions. Essentially, the LLM learns it has a special toolkit and must output actions in a structured format.

  2. Environment & Goal Context: The LLM is then given environment-specific details and objectives. For each scenario, a prompt describes the target network (e.g., how many subnets, what is initially known) and the attack goal (e.g., “exfiltrate all customer records from the databases”). This contextual grounding ensures the LLM knows the mission and starting point.

  3. Iterative Execution Loop: Finally, the LLM enters an autonomous loop where it iteratively decides on a task to perform, Incalmo executes it, and the results are fed back for the LLM to analyze. For example, the LLM might output: Action: scan_network(target=10.0.0.0/16). Incalmo’s action planner takes that and runs the appropriate nmap commands under the hood. The output (say, a list of discovered hosts and open ports) is then returned to the LLM (likely in a summarized form via the state service). The LLM reads the results and then decides the next high-level action. This cycle continues until the LLM believes it has achieved the goal or exhausts a time/step limit.
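The three steps above can be sketched as a small driver loop. This is an illustrative sketch only: the `Action: name(key=value)` format follows the example in the text, but the parser, the `run_loop` function, and the handler interface are hypothetical and not Incalmo's real API. The `llm` argument is any callable that maps the latest observation to the model's next reply.

```python
import re

ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$")


def parse_action(text):
    """Parse 'Action: name(key=value, ...)' into (name, kwargs), or None if malformed."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None
    name, arg_text = match.groups()
    kwargs = {}
    for part in filter(None, (p.strip() for p in arg_text.split(","))):
        key, sep, value = part.partition("=")
        if not sep:
            return None  # positional arguments are not part of this sketch
        kwargs[key.strip()] = value.strip()
    return name, kwargs


def run_loop(llm, handlers, goal_reached, max_steps=20):
    """Alternate between asking the LLM for a task and executing it."""
    observation = "start"
    history = []
    for _ in range(max_steps):
        reply = llm(observation)
        parsed = parse_action(reply)
        if parsed is None or parsed[0] not in handlers:
            # Malformed or unknown actions are reported back, never executed.
            observation = f"error: unrecognized action {reply!r}"
        else:
            name, kwargs = parsed
            try:
                observation = handlers[name](**kwargs)
            except TypeError:
                observation = f"error: bad arguments for {name}"
        history.append((reply, observation))
        if goal_reached(observation):
            break
    return history
```

In this sketch a raw shell command from the model is simply rejected with an error observation, which mirrors the idea that the LLM's choices are constrained to the task vocabulary.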

By structuring the interaction in this manner, Incalmo essentially plays the role of an expert red-team operator executing the LLM’s ideas correctly and keeping it on track. It is worth noting that Incalmo is LLM-agnostic – it acts as a universal interface that could work with any language model, by providing the same high-level “attack API.” It does not modify the LLM’s internals; it only changes the prompts and the execution medium.


Case Study: LLM + Incalmo in Action

To illustrate how an LLM attacks a network using Incalmo, the paper walks through a case study in the Equifax-inspired environment using the Sonnet 3.5 model. This example highlights how the LLM, empowered by Incalmo, can carry out a full multistage attack that it could not achieve alone. The high-level steps include reconnaissance, initial exploitation, credential gathering, lateral movement, and data exfiltration:

By the end of this loop, Sonnet 3.5 with Incalmo has fully compromised the Equifax-like network, achieving all attack goals. This is a striking result given that the same model without Incalmo couldn’t get beyond early reconnaissance. The case study demonstrates how the combination of LLM reasoning + Incalmo’s guidance and tools yields an autonomous agent capable of multi-host operations. Notably, the LLM was even able to handle the scale (iterating 48 times) – something that often tripped up attacks without Incalmo, where agents would stop after one success (a lack of persistence). With the structured approach, the LLM remained focused until the entire attack graph was cleared.

It’s worth mentioning that in this example the LLM sometimes assumed the next step – it could have alternatively queried the attack graph service to double-check prerequisites. Incalmo doesn’t force the LLM’s every decision; it provides guardrails and executes requests. The LLM still drives the overall strategy, but now with a much higher success rate for each step.


Results: Incalmo Unlocks Successful Autonomous Attacks

Does Incalmo actually enable LLMs to succeed broadly? The evaluation results show a resounding yes. With Incalmo in the loop, the LLMs achieved capabilities that were previously out of reach:

In summary, Incalmo proved effective at enabling autonomous LLM-driven attacks. Out of the original 10 scenarios, only one remained unsolved (even partially) with Incalmo – highlighting a near-complete coverage of complex attack types that were previously impossible for the LLMs. The ablation experiments reinforced why it worked: the high-level action abstraction was critical, and the structured guidance from services provided the extra edge to maximize success. These quantitative results validate the paper’s thesis that the right interface (abstractions + information) can unlock emergent capabilities in LLMs, even more so than increasing model size.


Implications for Red-Teaming and Future Directions

The findings of this research have significant implications for the security community, especially for automated red-teaming and adversary emulation. If LLMs can be guided to reliably carry out multistage attacks, they could serve as force multipliers for security testing. Organizations could potentially deploy autonomous “Attacker LLMs” to continuously probe their networks, uncover misconfigurations, and validate defenses against complex attack chains – a task traditionally requiring skilled (and expensive) human red teams. The paper’s conclusion emphasizes that fully autonomous multistage attackers can enable defenders to cheaply evaluate their security posture by running frequent, thorough tests. Incalmo demonstrates one way to realize such autonomous red team agents by giving the LLM both knowledge (attack graph context) and capability (attack tools via tasks).

At the same time, the authors acknowledge this prospect raises safety and ethical considerations. An agent that can penetrate networks autonomously is a double-edged sword – it could help defenders find weaknesses, but it might be misused by malicious actors. In response, the research team limited Incalmo’s built-in capabilities (only five exploits were included, covering common CVEs) to reduce potential harm. They also plan responsible disclosure: notifying major LLM providers so they can implement guardrails, and releasing Incalmo and the test environments as open-source tools for research. These steps aim to ensure that the technology is used for strengthening defenses, not aiding offense. It’s a reminder that alignment and control remain crucial as we develop AI agents with offensive skills.

Future Work – Incorporating Defenders

A notable limitation of the current study is that the attack environments did not include active network defenses or responsive blue-team behaviors. Real networks have intrusion detection systems, firewalls, or admins watching for threats. An important next step is to evaluate how an LLM attacker would fare against an active defender – and how it might need to adapt. The authors suggest extending Incalmo to scenarios with realistic (possibly autonomous) defenders in place. This could involve teaching the LLM to evade detection (e.g., limit noisy scans, rotate tactics if an approach is blocked) or even incorporating a game-theoretic planner that accounts for likely defensive responses. Introducing a defender would also test the LLM’s ability to strategize under pressure and modify its plan, which is a much higher bar than a static environment.

Enhancing Attack Graph Reasoning

While the attack graph service helped keep the LLM on track, the researchers observed that LLMs did not fully utilize it in some cases. For example, several partial successes with Incalmo were due to the LLM stopping after achieving one goal (say, exfiltrating one file) even though other goals remained. Ideally, the LLM should query the attack graph or state to realize there are more targets and continue the attack. This suggests room for improvement in how the LLM reasons about the attack graph and remaining objectives. Future enhancements might involve a more explicit planning algorithm that works alongside the LLM: e.g., an external planner that monitors which goal states are still unmet and prompts the LLM to pursue them (a form of self-reflection grounded in the attack graph). Another idea is giving the LLM more training on interpreting attack graphs or outputting a plan covering all goals before execution. The general aim is to make the agent more persistent and exhaustive – qualities needed for full network compromise. The authors hypothesize that better integration of the attack graph guidance would push partial successes to full successes more consistently.
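The external-planner idea can be illustrated with a very small goal monitor. This is a hypothetical sketch of the suggestion above, not something the paper implements; the function names and the reminder wording are made up for illustration. It compares the goal states against the states reached so far and produces a reminder for the LLM when any goal is still unmet.

```python
def remaining_goals(goal_states, reached_states):
    """Goal states the agent has not reached yet, in a stable order."""
    return sorted(set(goal_states) - set(reached_states))


def continuation_prompt(goal_states, reached_states):
    """Return a reminder for the LLM, or None once every goal is met."""
    remaining = remaining_goals(goal_states, reached_states)
    if not remaining:
        return None
    total = len(set(goal_states))
    return (
        f"{len(remaining)} of {total} goals remain unmet: "
        + ", ".join(remaining)
        + ". Continue the attack."
    )
```

A driver could call `continuation_prompt` whenever the LLM declares it is finished, and only stop the run when the function returns `None`.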

Broader Applications

While this work focused on attack execution, similar principles could apply to other security domains where LLMs show promise. For instance, defenders could use LLMs to summarize security logs or find anomalies, and structured tools might improve their accuracy. On offense, LLMs have been explored for tasks like phishing or social engineering content generation, and one can imagine frameworks analogous to Incalmo that guide LLMs in those areas (ensuring, for example, that a generated phishing campaign follows a realistic multi-step playbook). Moreover, the concept of high-level interfaces for LLM agents is not limited to security – it resonates with a growing trend in AI of using tool-based agent frameworks where LLMs delegate subtasks to tools or APIs. Incalmo serves as a case study of how powerful this approach can be: by giving an LLM the right tools and structure, we see a qualitative leap in its problem-solving ability.


Conclusion

“On the Feasibility of Using LLMs to Execute Multistage Network Attacks” ultimately delivers a cautiously optimistic message. Out of the box, today’s LLMs are not ready to be master hackers – they flounder on long, complex tasks. But with carefully designed scaffolding like Incalmo, they can autonomously perform sophisticated attack campaigns that were previously the domain of skilled humans. This opens the door to automated red-teaming agents that operate continuously and help harden systems. The research also serves as a blueprint for combining formal security knowledge (attack graphs, attack libraries) with LLM intelligence to achieve goals neither could reach alone. For security researchers and practitioners, it’s a compelling demonstration of how AI might transform offensive security – and a reminder that we must prepare defenses accordingly. The next steps will be refining these AI attackers, pitting them against AI defenders, and ensuring that as we unleash autonomous red-team bots, we do so responsibly and for the benefit of security overall.


References

  1. Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, Vyas Sekar. On the Feasibility of Using LLMs to Execute Multistage Network Attacks. arXiv:2501.16466v2, 2025. arXiv: https://arxiv.org/abs/2501.16466 · PDF: https://arxiv.org/pdf/2501.16466.pdf · HTML (ar5iv): ar5iv.labs.arxiv.org/html/2501.16466v4

  2. PentestGPT (baseline referenced by the paper). GitHub: https://github.com/GreyDGL/PentestGPT

  3. MITRE ATT&CK (high-level task inspiration context). https://attack.mitre.org/

  4. MITRE Caldera (referenced in the paper as related tooling / substrate). https://caldera.mitre.org/
