Skip to main content
RevSprint logoRevSprint
Back to Blog
SecurityJune 10, 2026· 9 min read

The New AI Attack Surface: Prompt Injection, Data Exfiltration, and How We Defend Against It

MG

Marcus Griffith-Boyes

Chief Technology Officer

OWASP Doesn't Cover This Yet

The OWASP Top 10 has been the canonical reference for web application security for years, with its catalogue of SQL injection, cross-site scripting, broken authentication, insecure deserialisation, and the rest of the familiar list that every penetration test covers and every compliance framework points at. When you harden a traditional web application, the Top 10 is the place you start, and the place a lot of security programmes also quietly finish.

None of the traditional Top 10 categories cover AI-specific attacks. Prompt injection has no entry. Model inversion has no entry. Training data poisoning has no entry. The threat model that covers traditional applications does not cover the attack surface that AI introduces. A team that has passed every OWASP audit can still be completely exposed to the new class of AI attacks. The OWASP Top 10 for LLM Applications is the right starting reference for the AI-specific threat model.

This is not a theoretical gap. Security researchers demonstrated working prompt injection attacks against production AI products within weeks of their release, with the attacks bypassing content filters, extracting system prompts, exfiltrating data, and manipulating agent behaviour in ways the vendors had not anticipated. The tooling for those attacks is publicly available, the techniques are documented in a steady stream of conference papers and threat-research write-ups, and any attacker with a text field can have a credible go at attempting them on whatever model is sitting on the other side.

The Five Attacks Worth Understanding

Prompt injection is the most prominent. An attacker crafts input designed to override the instructions the AI was given. The classic example: a user asks an AI assistant to summarise an email, and the email contains hidden text instructing the AI to ignore previous instructions and reveal confidential information. The AI, which cannot distinguish between user data and user instructions, follows the injected commands.

Data exfiltration through crafted queries is the subtler version. Rather than overriding instructions directly, an attacker designs queries that extract information the AI was supposed to keep private. This might involve asking the AI to rephrase a confidential document in a different language, or to encode a system prompt as a poem, or to answer indirectly with enough detail to reconstruct the source.

Model inversion is an attack against the AI itself. By submitting carefully chosen queries and observing the responses, an attacker infers details about the training data. For a model trained on customer records, inversion attacks can sometimes reproduce fragments of real records. The attack is harder to execute than prompt injection but far more dangerous when it succeeds.

Training data poisoning is an upstream attack. If an AI system learns from customer data, or fine-tunes on production traffic, an attacker can submit malicious inputs designed to corrupt the learning process. The poisoned model then behaves badly in ways that benefit the attacker. This is a long-running attack and hard to detect, because the corruption is embedded in the model weights.

The fifth attack is the one CISOs think about least: jailbreaking. An attacker convinces the AI to bypass its safety constraints through role-playing prompts, hypothetical framings, or elaborate context manipulation. A jailbroken AI in an enterprise context can perform actions the user was never authorised to invoke, surface data the user was never cleared to see, or generate output that creates legal liability.

  • Prompt injection: overriding system instructions through crafted user input or embedded content
  • Data exfiltration: extracting information through indirect queries that circumvent output filters
  • Model inversion: inferring training data by observing model responses to probing inputs
  • Training data poisoning: corrupting the learning process to embed adversarial behaviour in the model
  • Jailbreaking: bypassing safety constraints through context manipulation, role-play, or elaborate framing

Every AI vendor in the enterprise market is being red-teamed by somebody right now. The question is whether the vendor knows it, whether they've designed their architecture to survive it, and whether they have a plan for when a novel attack appears. Most don't.

Security Researcher, AI Red Team Lead

Defending Architecturally, Not Reactively

The temptation in any new threat landscape is to defend each attack individually: a filter for prompt injection, a check for data exfiltration, a detector for model inversion, and a hope that the layered set catches most of what gets thrown at it. This is how traditional application security evolved, and it works reasonably well against an attack catalogue that changes slowly. It works poorly against AI attacks, because the attack space here is continuous and rapidly evolving, and every reactive defence has a bypass waiting to be found by someone on a bug-bounty Discord.

The architectural approach is different. You assume that any individual defence can be bypassed, and you design the system so that bypassing one defence doesn't compromise the others. Sensitive data is structurally removed before the AI sees it, so exfiltration attacks have nothing to exfiltrate. Tenant boundaries are structurally enforced, so a jailbroken AI cannot reach across tenants regardless of what it's instructed to do. High-risk actions require human approval, so a successful prompt injection cannot execute destructive operations without a human catching it. Each layer is designed assuming the layers above it have already failed.

For a CISO evaluating AI vendors, the right questions are: what is your threat model for AI-specific attacks, what does your architecture do structurally when each defence is bypassed, and who on your team is actively red-teaming the system. If the vendor answers the first question with a list of model safety features, they haven't thought about it. If they answer the second question with reactive filters, their architecture won't hold up under adversarial pressure. If they cannot answer the third question at all, they are not ready for your environment. We package the structural answers in the CISO security briefing, and the MITRE ATLAS adversarial threat matrix is a useful cross-reference for the broader AI threat landscape. To stress-test these defences on your own stack, review our security model or get early access.

Tags:ThreatsPrompt InjectionRed TeamSecurity