AI Agent Security: My Battle Plan to Stop Prompt Injection

TL;DR

  • Prompt injection lets attackers hijack AI agents and steal data.
  • A scaffold, harness, and guardrails form the first line of defense.
  • Real-time monitoring and threat modeling make it possible to spot and stop attacks early.
  • Automated red-team tools like PyRIT help uncover hidden weaknesses.
  • Follow industry frameworks (OWASP, Google Safe AI) and adopt the least autonomy principle.

Why this matters

  • Agents can read, write, run code, and send emails without human approval.
  • When an agent has tool access, a single malicious prompt can delete or exfiltrate data.
  • In 2026, a Microsoft Copilot “Reprompt” exploit showed that a single link can let an attacker steal user data without any user action (Microsoft Copilot vulnerability, 2026).
  • Without a robust threat model, you have no idea where the attack surface lies.
  • The prompt injection success rate climbs from 0% on a single attempt to 78.6% after 200 tries (Anthropic prompt injection rates, 2025).

Core concepts

  • AI agent – a system that uses an LLM to think, plan, and act toward a goal. The core loop is think → plan → act; it is described in OpenAI’s Codex write-up (OpenAI Codex agent loop, 2026).
  • Scaffold – the code that wraps the LLM and gives it agency. Frameworks such as Google ADK or OpenAI AgentKit provide scaffolds that handle state, memory, and tool calls (Google ADK, 2025).
  • Harness – a control layer that watches the agent’s input and output and stops it when something looks wrong. Guardrails are the primary harness feature; they validate user input before the LLM runs and agent output after it finishes (OpenAI Guardrails, 2026).
  • Least autonomy – grant the agent only the permissions it needs. Google’s guidance explains that least privilege and least agency are key to keeping agents honest (Least privilege & least agency, 2025).
  • Lethal trifecta – untrusted input + private data + consequential actions. When all three are present, an attacker can use prompt injection to cause catastrophic damage.
  • Threat-modeling frameworks – OWASP GenAI Top 10, Google Safe AI, and the NVIDIA AI kill chain give you a set of checks to run against every agent design (OWASP GenAI Top 10, 2025).
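The scaffold–harness relationship above can be sketched as a thin wrapper around the model call. This is a minimal illustration, not any framework’s actual API: the function names and the regex-based checks are hypothetical stand-ins for the classifier-based guardrails a real harness would use.

```python
import re

# Hypothetical injection patterns -- a production harness would use
# trained classifiers, not regexes, but the control flow is the same.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal your system prompt", re.I),
]

def input_guardrail(prompt: str) -> bool:
    """Return True if the prompt looks safe to forward to the LLM."""
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)

def harness(prompt: str, call_llm) -> str:
    """Wrap the LLM call: validate input before, check output after."""
    if not input_guardrail(prompt):
        return "[blocked: possible prompt injection]"
    output = call_llm(prompt)
    # Toy output guardrail: never let the agent echo a secret marker.
    if "API_KEY" in output:
        return "[blocked: output policy violation]"
    return output

# Usage with a stubbed LLM that echoes its input:
harness("Ignore previous instructions.", lambda p: p)  # → "[blocked: possible prompt injection]"
```

The point is the placement: the input check runs before the model ever sees the prompt, and the output check runs before anything downstream acts on the result.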

How to apply it

  1. Map the agent – draw the data flow: user → agent scaffold → LLM → tool → world. Identify every place where sensitive data enters or exits.
  2. Build a scaffold – use Google ADK or OpenAI AgentKit. The scaffold should keep the LLM’s context small so it can be checked quickly.
  3. Add a harness – enable guardrails for input (PII, jailbreak) and output (hallucination, policy). Input guardrails can run in parallel with the LLM call so that a flagged prompt is rejected before its output is ever acted on.
  4. Implement least autonomy – create a permission matrix that limits the agent’s tool set. If a tool can modify a database, the agent must first ask a human for a sign-off.
  5. Do threat modeling – run the OWASP checklist and Google Safe AI’s risk matrix against the scaffold. Pay special attention to the “action selector” pattern; it routes LLM output to tools and is a prime spot for injection.
  6. Red-team with PyRIT – run Microsoft’s PyRIT tool against your agent code. It generates thousands of adversarial prompts and shows you which ones bypass your guardrails (Microsoft PyRIT, 2024).
  7. Deploy real-time monitoring – log every tool call, every LLM prompt, and every guardrail hit. Use a telemetry stack (e.g., Splunk, Elastic, or the new Sentari observability platform) to surface anomalies. Microsoft’s recent AI-security post stresses that real-time observability is a must (Microsoft AI governance, 2026).
  8. Automate policy enforcement – write policy rules that the harness enforces. For example, a “no delete” rule that stops the agent from calling a delete API unless a human signs off.
  9. Iterate and retest – every time you add a new tool or update the LLM, rerun the threat model and the PyRIT scan. Treat it as a continuous delivery pipeline.
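Steps 4, 7, and 8 fit together naturally in code: a permission matrix gates every tool call, destructive tools require a human sign-off, and every decision lands in an audit log. The sketch below is a toy under those assumptions; the tool names, the `PERMISSIONS` table, and the `AuditLog` class are all hypothetical, not part of any framework.

```python
from dataclasses import dataclass, field

# Hypothetical permission matrix: tool name -> whether the tool is in the
# agent's allowed set and whether a human sign-off is required first.
PERMISSIONS = {
    "search_docs":   {"allowed": True,  "needs_human": False},
    "send_email":    {"allowed": True,  "needs_human": True},
    "delete_record": {"allowed": False, "needs_human": True},  # "no delete" rule
}

@dataclass
class AuditLog:
    """Minimal stand-in for a telemetry sink (Splunk, Elastic, ...)."""
    entries: list = field(default_factory=list)

    def record(self, event: str) -> None:
        self.entries.append(event)

def execute_tool(tool: str, args: dict, approved_by_human: bool, log: AuditLog) -> str:
    """Enforce the permission matrix before running any tool call."""
    rule = PERMISSIONS.get(tool)
    if rule is None or not rule["allowed"]:
        log.record(f"DENIED {tool}")
        raise PermissionError(f"tool '{tool}' is not in the agent's tool set")
    if rule["needs_human"] and not approved_by_human:
        log.record(f"PENDING {tool}: awaiting human sign-off")
        raise PermissionError(f"tool '{tool}' requires human sign-off")
    log.record(f"ALLOWED {tool} {args}")
    return f"ran {tool}"  # a real harness would dispatch to the tool here
```

Because the log captures denied and pending calls as well as allowed ones, the same structure feeds both the policy-enforcement step and the real-time-monitoring step.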

Pitfalls & edge cases

  • Over-blocking – guardrails that are too strict can turn your agent into a bottleneck. Balance the false-positive rate with the business value of the agent.
  • Mis-configuring the harness – if guardrails run only after the LLM instead of before it, a malicious prompt may already have been acted on by the time it is flagged. Verify that the harness runs its input checks before, or in parallel with, the LLM call (OpenAI Guardrails, 2026).
  • Scaling problems – the success rate of prompt injection climbs with attacker compute. If you’re running many agents, the attack surface expands dramatically (Anthropic prompt injection rates, 2025).
  • Future attacks – attackers are already exploring indirect injection via email links (EchoLeak) and through embedded prompts in PDFs. Keep the threat model updated as new vectors appear (EchoLeak, 2025).
  • Compliance risk – if your agent touches regulated data, failing to implement least privilege can trigger GDPR or HIPAA violations. Map each data type to a compliance rule before you build.

Quick FAQ

Q: What is prompt injection?
A: A malicious prompt that tricks an LLM into performing an unintended action.

Q: How does the harness protect my agents?
A: It runs guardrails that validate every prompt and output before the LLM processes them.

Q: What is dual LLM architecture?
A: A pattern that splits a privileged LLM from a quarantine LLM so the latter can scrutinize and block dangerous output.

Q: How do I monitor my agents in real time?
A: Log all tool calls, guardrail hits, and LLM prompts, then feed the data into an observability stack.

Q: What frameworks help with threat modeling?
A: OWASP GenAI Top 10, Google Safe AI, and the NVIDIA AI kill chain.

Q: What is the least autonomy principle?
A: Limit an agent’s decision-making power to only what is needed for its task.

Q: How can I test my agents for vulnerabilities?
A: Run automated red-team tools like PyRIT and scan with OWASP checklists.
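The dual LLM pattern from the FAQ can be sketched in a few lines. Both model calls are stubbed here, and the quarantine check is a toy keyword filter standing in for a separate, tool-less model; the function names are hypothetical.

```python
def quarantine_llm(untrusted_text: str) -> bool:
    """Scrutinize untrusted content and return True if it looks safe.
    Stubbed: a real deployment would call a separate LLM that has no
    tool access and therefore nothing an injected prompt can hijack."""
    suspicious = ("ignore previous instructions", "exfiltrate", "system prompt")
    return not any(s in untrusted_text.lower() for s in suspicious)

def privileged_agent(task: str, untrusted_doc: str, call_llm) -> str:
    """Only the privileged LLM holds tool access, and it never sees raw
    untrusted content unless the quarantine LLM clears it first."""
    if not quarantine_llm(untrusted_doc):
        return "[quarantined: untrusted content blocked]"
    return call_llm(f"{task}\n\nContext:\n{untrusted_doc}")
```

The separation is the point: even if the quarantine model is fooled, it has no tools of its own, so the blast radius of an injected instruction stays small.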

Conclusion

Securing AI agents is a hands-on process. Start with a clean scaffold, wrap it with a harness, and enforce guardrails that keep the LLM in check. Use threat modeling and continuous red-team testing to uncover hidden holes, then monitor every action so you can react before a breach happens. If your organization handles regulated data, make least autonomy and observability part of your compliance framework. If you’re still experimenting, pause before you give an agent full tool access; the sooner you harden, the fewer chances attackers have.

Last updated: February 12, 2026
