Firmware exploration: LLM as your annotator

If you’ve ever opened a firmware image and stared at a hex dump thinking “nope,” you’re not alone.

Modern IoT and embedded devices ship with complex firmware: full Linux distributions, RTOS kernels, proprietary bootloaders, custom update mechanisms, and often… questionable security decisions. Large-scale studies have shown that firmware is a gold mine for vulnerabilities, from hard-coded credentials to unsafe update logic and exposed debug interfaces.

At the same time, governments and standards bodies are pushing manufacturers to treat firmware security seriously. NIST’s IoT Cybersecurity guidance and documents like SP 800-213/213A explicitly call out firmware, update mechanisms, and device integrity as critical capabilities for secure IoT products.

In this post, I’m not going to pretend that a Large Language Model (LLM) will magically reverse engineer your firmware for you. Instead, I’ll show how you can use an LLM as an annotator and sidekick while you do the real work:

Turning messy strings into structured hints
Summarizing decompiled functions
Hypothesizing the purpose of weird config blobs and scripts
Supporting your threat modeling of the device

All in a way that keeps you in control, and the AI in a supporting role.

⚠️ Everything here is about defensive security / research on devices you own or are authorized to test. Don’t use these techniques on systems you’re not allowed to touch.

1. The Traditional Firmware Exploration Workflow (Very Short Version)

A typical firmware exploration flow looks like this:

Obtain firmware
From a vendor update file, web UI, or by dumping flash via JTAG / SPI.
Triage and unpack
Use tools like binwalk, dd, unsquashfs, or firmware-mod-kit to unpack file systems and images.
Scan for strings and patterns
Use strings, grep, or custom scripts to find credentials, URLs, debug commands, etc.
Reverse engineer binaries
Use tools like Ghidra, IDA, Radare2, or Binary Ninja to analyze executables and libraries.
Dynamic analysis / rehosting
Use QEMU or specialized frameworks for firmware rehosting to actually run firmware in a controlled environment and interact with it.

Each step is noisy and produces walls of text and code. That’s where an LLM can make things less painful.
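To make the triage step a bit more concrete, here is a minimal sketch of the kind of signature scan that tools like binwalk perform at much larger scale. The signature table is a tiny, hand-picked subset of well-known magic bytes, and the function name is my own. Short signatures produce false positives, so treat the hits as candidates to check, not as a parsed layout.

```python
from pathlib import Path

# A tiny subset of well-known magic bytes (real tools know hundreds more).
SIGNATURES = {
    b"hsqs": "SquashFS (little-endian)",
    b"sqsh": "SquashFS (big-endian)",
    b"\x1f\x8b\x08": "gzip stream",
    b"\x27\x05\x19\x56": "U-Boot uImage header",
    b"\x7fELF": "ELF executable",
}

def scan_signatures(data: bytes):
    """Return sorted (offset, description) pairs for every signature hit."""
    hits = []
    for magic, label in SIGNATURES.items():
        start = 0
        while (pos := data.find(magic, start)) != -1:
            hits.append((pos, label))
            start = pos + 1
    return sorted(hits)

if __name__ == "__main__":
    # "firmware.bin" is a placeholder file name
    for offset, label in scan_signatures(Path("firmware.bin").read_bytes()):
        print(f"0x{offset:08x}  {label}")
```

The offsets it prints are the ones you would then carve out with dd before trying unsquashfs or similar.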

2. Where LLMs Actually Help: “Annotator, Not Autopilot”

Recent work has looked at using LLMs for binary code understanding, such as:

Recovering function names
Summarizing binary code behavior
Explaining decompiled functions at a higher level

Industry research has also explored LLMs as reverse engineering sidekicks that help malware analysts explain decompiled functions, outline control flows, or draft detection logic—without replacing analysts.

You do the reversing.
The LLM helps label, summarize, cluster, and explain.

Think of it as a hyperactive junior sitting next to you, happy to generate function names, markdown notes, and hypotheses while you decide what’s real and what’s hallucination.

Let’s walk through some concrete examples.

3. Strings + LLM: Turning Noise Into Hints

A classic first step on a firmware image is just:

strings firmware.bin | less

But this dumps everything: menu texts, error messages, random config keys, leftover debug prints, etc. You can make this a lot more effective with a little Python and an LLM.

3.1 Step 1 – Extract and filter strings

Here’s a small Python script to extract printable strings from a firmware blob:

import re
from pathlib import Path

def extract_strings(path, min_len=4):
    data = Path(path).read_bytes()
    pattern = rb"[ -~]{%d,}" % min_len  # printable ASCII
    return [s.decode(errors="ignore") for s in re.findall(pattern, data)]

if __name__ == "__main__":
    strings = extract_strings("firmware.bin")

    # Naive filters: paths, URLs, shell-like commands
    interesting = [
        s for s in strings
        if ("/" in s or "http" in s or "ssh" in s or "admin" in s.lower())
    ]

    for s in interesting[:50]:
        print(s)

You now have a list of “interesting” candidate strings: endpoints, file paths, error messages, maybe even hidden menu options.
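Before handing anything to a model, it usually pays to deduplicate the list and sort it into rough buckets, so you can decide which groups are worth sending at all. The sketch below is one way to do that. The bucket names and regular expressions are illustrative assumptions, not a vetted ruleset.

```python
import re
from collections import defaultdict

# Illustrative patterns only; tune them for the device you are looking at.
BUCKETS = {
    "url": re.compile(r"https?://\S+"),
    "path": re.compile(r"^/[\w./-]+$"),
    "credential_like": re.compile(r"(passw|secret|token|admin)", re.I),
}

def bucket_strings(strings):
    """Deduplicate strings and group each one under the first matching bucket."""
    grouped = defaultdict(list)
    for s in dict.fromkeys(strings):  # keeps first-seen order, drops repeats
        for name, pattern in BUCKETS.items():
            if pattern.search(s):
                grouped[name].append(s)
                break
    return dict(grouped)
```

Strings that match no bucket are dropped here. For a first pass that is fine, but you may want an "unmatched" bucket if the filters turn out to be too strict.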

3.2 Step 2 – Ask the LLM to annotate

Take a subset of these strings (don’t paste your entire firmware dump into a cloud LLM—be mindful of confidentiality) and send them to an LLM with a prompt like:

You are an embedded security analyst.
The following strings were extracted from a router firmware image.

1. /bin/diagnostic_cli
2. /usr/sbin/backup_cfg
3. POST /apply.cgi
4. admin:admin
5. Enable remote management

For each string:
- Guess what subsystem it might belong to (web UI, update system, debug, etc.).
- Mark whether it’s interesting for security review and why (1–2 sentences).

In code (a sketch using a placeholder llm_chat() helper, so you can plug in OpenAI / a local LLM / etc.):

def llm_chat(prompt):
    # Placeholder: swap in your own LLM client call (cloud API or local model)
    raise NotImplementedError("wire this up to your LLM client")

def annotate_strings_with_llm(strings_chunk):
    prompt = (
        "You are an embedded firmware security analyst.\n\n"
        "You are given a list of strings extracted from firmware.\n"
        "For each string, produce:\n"
        "- category: (web_ui | auth | config | debug | logging | update | other)\n"
        "- interesting: (yes/no)\n"
        "- reason: one short sentence.\n\n"
        "Strings:\n"
        + "\n".join(f"- {s}" for s in strings_chunk)
    )
    return llm_chat(prompt)

if __name__ == "__main__":
    # extract_strings() is the helper from Step 1
    strings = extract_strings("firmware.bin")
    interesting = [
        s for s in strings
        if ("/" in s or "http" in s or "ssh" in s or "admin" in s.lower())
    ]
    chunk = interesting[:40]  # keep the prompt small
    print(annotate_strings_with_llm(chunk))

Suddenly, instead of reading 300 anonymous strings, you get a structured, human-readable checklist:

Possible backup subsystem
Potential default creds
Hidden diagnostic binaries
Suspicious URLs / endpoints

You still have to verify everything—but now you have a prioritized map.
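If you want to feed the annotations back into your own tooling, free-text replies get awkward fast. One option is to change the prompt so the model returns a JSON array, then parse it defensively. The sketch below assumes exactly that: each element is an object with string, category, interesting, and reason fields. That output format is my assumption, not something the prompt above asks for.

```python
import json

VALID_CATEGORIES = {"web_ui", "auth", "config", "debug", "logging", "update", "other"}

def parse_annotations(response_text):
    """Parse a JSON array of annotations, dropping malformed entries."""
    try:
        items = json.loads(response_text)
    except json.JSONDecodeError:
        return []  # model ignored the format; re-prompt or inspect by hand
    if not isinstance(items, list):
        return []
    cleaned = []
    for item in items:
        if not isinstance(item, dict) or "string" not in item:
            continue
        category = item.get("category", "other")
        known = isinstance(category, str) and category in VALID_CATEGORIES
        cleaned.append({
            "string": str(item["string"]),
            "category": category if known else "other",
            "interesting": str(item.get("interesting", "no")).lower() == "yes",
            "reason": str(item.get("reason", "")),
        })
    # Interesting entries first, so the review list starts with likely leads.
    return sorted(cleaned, key=lambda a: not a["interesting"])
```

Unknown categories collapse to "other" rather than being trusted, which keeps a creative model from inventing new labels in your data.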

4. Ghidra + LLM: Explaining Weird Functions

Once you move into static analysis, tools like Ghidra are the workhorse for exploring firmware binaries. You load ELF binaries (often ARM or MIPS), let Ghidra analyze them, and then decompile functions into a pseudo-C view.

The hard part is not “disassembling”—it’s understanding what a function actually does.

Research and experiments with LLMs show they can help with tasks like:

Function name recovery
Code summarization
Highlighting security-relevant behavior (auth, crypto, file access, network I/O, etc.)

That’s perfect for our “annotator” idea.

4.1 Workflow

In Ghidra, decompile a function you suspect is security-relevant:
- Maybe it’s referenced from the /login CGI handler
- Or from a firmware update routine
Copy a sanitized snippet of the decompiled C-like code (omit hard-coded secrets / proprietary stuff if using a cloud LLM).
Ask the LLM something like:

You are assisting with firmware reverse engineering.
Here is a decompiled function from an embedded Linux binary (MIPS):
int sub_40123C(char *user, char *pw) {
    FILE *f = fopen("/etc/passwd", "r");
    if (!f) return -1;
    ...
}
Tasks:
1. Give this function a descriptive name (C-style).
2. Summarize what it does in bullets.
3. Call out any security-relevant behavior (auth checks, file access, cryptography, etc.).

You’ll often get surprisingly good results:

Suggested names like check_user_credentials, verify_login, etc.
Bullet summaries that highlight comparisons, suspicious file paths, or insecure checks.

You can even script this: export decompiled functions or their summaries and feed them to an LLM in batches, building a “map” of the binary where each function has:

A human-readable name
A short description
Tags like auth, crypto, network, update

This mirrors workflows used in both academic research and experimental tools that use LLMs as reverse engineering assistants for malware and binaries.
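Here is a rough sketch of what that batch "map" could look like. It assumes you have already exported decompiled functions into a dict keyed by address (how you export them from Ghidra is up to you), and it takes the LLM client as a plain callable. The function name, the JSON reply format, and the verified flag are all my own choices for illustration.

```python
import json

def build_function_map(functions, llm_chat, max_chars=3000):
    """Annotate decompiled functions one at a time.

    functions: dict mapping an address string to decompiled pseudo-C.
    llm_chat:  any callable that takes a prompt string and returns text.
    """
    annotations = {}
    for address, code in functions.items():
        prompt = (
            "You are assisting with firmware reverse engineering.\n"
            "Reply with one JSON object with keys: name, summary, tags.\n"
            "tags is a list drawn from: auth, crypto, network, update, other.\n\n"
            f"Function at {address}:\n{code[:max_chars]}\n"
        )
        # "verified" stays False until a human has read the actual code.
        entry = {"name": "", "summary": "", "tags": [], "verified": False}
        try:
            parsed = json.loads(llm_chat(prompt))
            if isinstance(parsed, dict):
                entry["name"] = str(parsed.get("name", ""))
                entry["summary"] = str(parsed.get("summary", ""))
                tags = parsed.get("tags", [])
                entry["tags"] = [str(t) for t in tags] if isinstance(tags, list) else []
        except ValueError:
            pass  # unparseable reply: keep the empty entry and review by hand
        annotations[address] = entry
    return annotations
```

Passing the client in as a callable also makes it easy to test the plumbing with a fake model before spending tokens on a real one.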

5. Configs, Scripts, and “Weird Blobs”: Semantic Tagging

Firmware images are full of non-binary artifacts:

Shell scripts for initialization
Lua / Python / proprietary scripting languages
JSON / XML / custom config formats
Web templates for CGI-based admin interfaces

Instead of manually reading each file, you can:

Programmatically find candidates – files under /etc/, /usr/script/, /www/, etc.
Summarize them with an LLM to label purpose and risk.

5.1 Example: scanning shell scripts

from pathlib import Path

def list_shell_scripts(root):
    # Note: this only matches *.sh; extensionless init scripts (e.g. under etc/init.d) are missed
    root = Path(root)
    return list(root.rglob("*.sh"))

def summarize_script(path):
    content = Path(path).read_text(errors="ignore")[:4000]  # truncate just in case
    prompt = (
        "You are reviewing firmware init scripts.\n\n"
        f"File path: {path}\n"
        "Script:\n\n"
        "```sh\n"
        f"{content}\n"
        "```\n\n"
        "Tasks:\n"
        "1. Briefly summarize what this script does.\n"
        "2. Call out any security-relevant actions "
        "(starting services, changing permissions, touching auth/crypto, enabling remote access).\n"
        "3. Rate its review priority: high / medium / low.\n"
    )
    return llm_chat(prompt)

if __name__ == "__main__":
    for p in list_shell_scripts("squashfs-root"):
        print(f"=== {p} ===")
        print(summarize_script(p))
        print()

This gives you:

A quick view of which scripts matter (e.g., those enabling remote management or manipulating firewall rules).
A better starting point when you need to dive deeper manually.

You can do the same for:

JSON configs: ask which keys look like feature flags, debug options, or update URLs.
HTTP templates: ask which endpoints perform sensitive operations.
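For JSON configs, a cheap pre-pass can shrink what you send to the model: flatten the document into dotted key paths and keep only the leaves whose keys look interesting. The keyword list below is an illustrative guess at what tends to matter (debug switches, remote access, update URLs, secrets), so adjust it per device.

```python
import json
import re

# Illustrative keywords; not an exhaustive or authoritative list.
SUSPICIOUS_KEY = re.compile(r"(debug|telnet|ssh|remote|update|url|passw|key)", re.I)

def flatten(obj, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf in nested JSON data."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), obj

def flag_config_keys(json_text):
    """Return the leaves whose key path matches a suspicious-looking keyword."""
    return [(k, v) for k, v in flatten(json.loads(json_text)) if SUSPICIOUS_KEY.search(k)]
```

The flagged pairs make a compact prompt ("which of these look like feature flags or update endpoints?") and are easy to cross-check against the binaries that read them.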

6. LLMs + Rehosting: Augmenting Dynamic Analysis

The more hardcore end of firmware analysis involves rehosting: running firmware in an emulated environment to see how it behaves at runtime. Researchers and practitioners use various frameworks to emulate peripheral devices and remove hardware dependencies.

LLMs can help here too—but again, as annotators:

Log analysis: feed chunks of runtime logs (HTTP requests, kernel messages, application logs) into the LLM and ask it to:
- Summarize what the system is doing
- Highlight errors, crashes, or suspicious patterns (e.g., repeated failed logins)
Crash triage: when fuzzers targeting BusyBox or embedded binaries produce crashing inputs and stack traces, LLMs can help cluster and explain crash types.
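For crash triage, it helps to do the boring clustering yourself and only ask the model to explain one representative per cluster. A common heuristic is to bucket crashes by their top few stack frames. The sketch below assumes you have already parsed each crash into a list of frame names, innermost first. The input shape and function name are my assumptions.

```python
from collections import defaultdict

def bucket_crashes(crashes, depth=3):
    """Group crashes that share the same top `depth` stack frames.

    crashes: iterable of (crash_id, frames), with frames listed innermost first.
    Returns (frames_tuple, [crash_ids]) pairs, largest bucket first.
    """
    buckets = defaultdict(list)
    for crash_id, frames in crashes:
        buckets[tuple(frames[:depth])].append(crash_id)
    return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)
```

A larger depth splits buckets more finely and a smaller one merges more aggressively, so it is worth trying a couple of values on real fuzzer output.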

6.1 Example prompt for log triage

You are a firmware analyst. The following log lines come from an emulated router firmware:
[HTTPD] POST /apply.cgi action=wan_settings
[HTTPD] user=admin from 192.168.0.10
[KERNEL] device eth0 entered promiscuous mode
[APP] enabling remote_management on port 8080
...
1. Summarize the key events.
2. Identify any security-relevant changes.
3. Suggest 2–3 follow-up checks I should perform against this firmware.

You get a quick human-level summary instead of reading 500 lines manually.
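Long logs need to be split before they fit into a prompt. A minimal sketch of one way to do it: cut on line boundaries, stay under a rough character budget, and repeat the last couple of lines of each chunk at the start of the next so the model keeps some context. The budget and overlap defaults are arbitrary placeholders.

```python
def chunk_log(lines, max_chars=3000, overlap=2):
    """Split log lines into chunks of roughly max_chars characters.

    The last `overlap` lines of each chunk are repeated at the start of the
    next one. A single oversized line still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) + 1 > max_chars:
            chunks.append(current)
            current = current[-overlap:] if overlap else []
            size = sum(len(l) + 1 for l in current)
        current.append(line)
        size += len(line) + 1  # +1 for the newline added when joining
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be joined with newlines and dropped into the prompt from 6.1.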

7. Limitations and Risks: This Is Not Magic (and That’s Good)

Before we get too excited, reality check:

Hallucinations are real
LLMs can improve naming and summarization, but they still get things wrong and may invent functionality that isn’t present.
- Never treat LLM output as ground truth.
- Use it as a hint, then confirm by reading the actual code / disassembly.
Confidentiality and IP
If you’re analyzing proprietary firmware, uploading large chunks to a cloud LLM may be unacceptable (legally, ethically, or by contract).
- Consider local open-source models for sensitive work.
- Use strict data minimization: send only what you must (e.g., one function, one script).
Ethics and legality
Secure development and IoT guidance emphasize secure design, responsible vulnerability management, and lifecycle support.
- Use these techniques to improve security, not undermine it.
- Follow responsible disclosure practices if you discover real issues.
Skill still required
Write-ups on LLM-assisted reverse engineering consistently conclude that LLMs augment experts; they don't turn beginners into instant firmware ninjas.
- You still need to know how toolchains, OSes, networking, and cryptography work.
- LLMs amplify good analysts; they don’t replace them.
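On the data-minimization point: if you do use a cloud model, a small redaction pass before anything leaves your machine is cheap insurance. The patterns below are a minimal sketch (IP addresses, MAC addresses, key=value style secrets, PEM blocks) and will not catch everything, so treat it as a first filter, not a guarantee.

```python
import re

# Order matters: PEM blocks are removed first so later patterns never see them.
REDACTIONS = [
    (re.compile(r"-----BEGIN [A-Z ]+-----.*?-----END [A-Z ]+-----", re.S), "<PEM BLOCK>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b"), "<MAC>"),
    (re.compile(r"(password|passwd|secret|token|key)(\s*[=:]\s*)\S+", re.I), r"\1\2<REDACTED>"),
]

def redact(text):
    """Mask values that commonly identify a device or leak a secret."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

The secret pattern deliberately over-matches (it will also redact something like "monkey: banana"), which is the safer failure mode here.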

8. Practical Tips: Making LLMs a Useful Firmware Sidekick

If you want to actually integrate LLMs into your firmware workflow, here are some practical patterns:

Use strong roles in prompts
- “You are an embedded firmware security analyst.”
- “You are assisting reverse engineering of a MIPS-based router binary.”
Give context concisely
- Mention architecture (ARM/MIPS/x86), OS (Linux/RTOS), and approximate purpose (router, camera, PLC).
- This helps the model make better guesses about functions and config files.
Chunk smartly
- Don’t send entire file systems.
- Work per-function, per-script, or per-log-chunk.
Always ask for structure
Ask for JSON-like or bullet-pointed output, for example:
```
{
  "function_name": "...",
  "high_level_summary": "...",
  "security_relevance": "..."
}
```
This makes it easier to feed back into your own tooling.
Build your own “knowledge notebook”
- Store LLM explanations in markdown or a small database.
- Link them back to offsets / function addresses / file paths so you can quickly revisit them later.
Compare models
- Some models are better at code; others at natural language.
- Try both cloud and local options, especially if you need privacy.
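The "knowledge notebook" can be as simple as a single SQLite table. Here is a minimal sketch using Python's built-in sqlite3 module. The schema, column names, and helper names are my own invention. The model column is there so you can keep notes from different models side by side and compare them.

```python
import sqlite3

def open_notebook(path="notebook.db"):
    """Open (or create) the notes database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "artifact TEXT, location TEXT, model TEXT, note TEXT, "
        "verified INTEGER DEFAULT 0, "
        "PRIMARY KEY (artifact, location, model))"
    )
    return db

def save_note(db, artifact, location, model, note):
    """Insert or overwrite a note; a rewritten note goes back to unverified."""
    db.execute(
        "INSERT OR REPLACE INTO notes (artifact, location, model, note) "
        "VALUES (?, ?, ?, ?)",
        (artifact, location, model, note),
    )
    db.commit()

def notes_for(db, artifact):
    """Return (location, model, note, verified) rows for one file or binary."""
    return db.execute(
        "SELECT location, model, note, verified FROM notes "
        "WHERE artifact = ? ORDER BY location",
        (artifact,),
    ).fetchall()
```

Here artifact is a file path inside the unpacked image and location is a function address, offset, or line number, which gives you the link back to the original evidence.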