Many organizations are racing to integrate AI into customer service, software development, healthcare, financial systems, cybersecurity operations, and enterprise workflows. But as AI adoption accelerates, attackers are shifting focus from simply attacking applications to poisoning the very data and models that power them.

The OWASP Top 10 for LLM Applications 2025 identifies LLM04:2025 Data and Model Poisoning as one of the most dangerous emerging risks in AI systems. Unlike traditional cyberattacks that target infrastructure directly, data and model poisoning attacks manipulate the information that AI systems learn from. Attackers can inject malicious training data, hidden triggers, biased content, falsified documents, or backdoors into datasets, embeddings, or fine-tuning pipelines. The result is an AI system that appears normal on the surface but behaves unpredictably, maliciously, or deceptively under specific conditions.

What is Data and Model Poisoning?

Data poisoning occurs when pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can compromise model security, performance, or ethical behaviour, leading to harmful outputs or impaired capabilities. Common risks include degraded model performance, biased or toxic content, and exploitation of downstream systems.

Data poisoning can target different stages of the LLM lifecycle, including pre-training (learning from general data), fine-tuning (adapting models to specific tasks), and embedding (converting text into numerical vectors).  Data poisoning is considered an integrity attack since tampering with training data impacts the model’s ability to make accurate predictions. The risks are particularly high with external data sources, which may contain unverified or malicious content.

Common examples of Data and Model Poisoning

  • Malicious actors introduce harmful data during training, leading to biased outputs. Techniques like “Split-View Data Poisoning” or “Frontrunning Poisoning” exploit model training dynamics to achieve this.
  • Attackers can inject harmful content directly into the training process, compromising the model’s output quality.
  • Users unknowingly inject sensitive or proprietary information during interactions, which could be exposed in subsequent outputs.
  • Unverified training data increases the risk of biased or erroneous outputs.
  • Lack of resource access restrictions may allow the ingestion of unsafe data, resulting in biased outputs.

Prevention and Mitigation Strategies

  • Track data origins and transformations using tools like OWASP CycloneDX or ML-BOM. Verify data legitimacy during all model development stages.
  • Vet data vendors rigorously, and validate model outputs against trusted sources to detect signs of poisoning.
  • Implement strict sandboxing to limit model exposure to unverified data sources. Use anomaly detection techniques to filter out adversarial data.
  • Tailor models for different use cases by using specific datasets for fine-tuning. This helps produce more accurate outputs based on defined goals.
  • Ensure sufficient infrastructure controls to prevent the model from accessing unintended data sources.
  • Use data version control (DVC) to track changes in datasets and detect manipulation. Versioning is crucial for maintaining model integrity.
  • Store user-supplied information in a vector database, allowing adjustments without retraining the entire model.
  • Test model robustness with red team campaigns and adversarial techniques, such as federated learning, to minimize the impact of data perturbations.
  • Monitor training loss and analyze model behavior for signs of poisoning. Use thresholds to detect anomalous outputs.
  • During inference, integrate Retrieval-Augmented Generation (RAG) and grounding techniques to reduce risks of hallucinations.

Attack Scenarios

Scenario #1: Manipulating training data to spread sisinformation

An attacker biases the model’s outputs by manipulating training data or using prompt injection techniques, spreading misinformation.

Real-world example: PoisonGPT

Researchers demonstrated this risk with PoisonGPT, a modified open-source model uploaded to spread fake news and misinformation while appearing legitimate. The altered model produced manipulated outputs while bypassing normal trust assumptions around open-source AI repositories.

Scenario #2: Toxic or Biased data producing harmful outputs

Attack Scenario:
Toxic data without proper filtering can lead to harmful or biased outputs, propagating dangerous information.

Real-World Example: Microsoft Tay

Launched in March 2016, Microsoft’s AI chatbot Tay was an experimental artificial intelligence chatbot designed to mimic the conversational patterns of an 18-to-24-year-old American and learn from interactions on X (formerly Twitter). Within 16 hours, the bot had to be shut down after internet trolls exploited its learning algorithm to output racist, sexist, and antisemitic content.

Scenario #3: Falsified training documents leading to corrupted outputs

Attack Scenario:
A malicious actor or competitor creates falsified documents for training, resulting in model outputs that reflect these inaccuracies.

Real-World Example: RAG poisoning and hidden resume injection

Security researchers demonstrated attacks where malicious documents containing hidden instructions were inserted into AI retrieval systems. In some cases, hidden prompts embedded inside resumes manipulated hiring recommendation systems.

Scenario #4: Prompt Injection introducing misleading data

Attack Scenario:
Inadequate filtering allows an attacker to insert misleading data via prompt injection, leading to compromised outputs.

Real-World Example: Indirect Prompt Injection Attacks

Security researchers demonstrated how hidden instructions embedded in webpages, documents, emails, or PDFs could manipulate LLM behaviour when processed by AI assistants or summarization tools.

Scenario #5: AI Backdoors and Sleeper Agents

Attack Scenario:
An attacker uses poisoning techniques to insert a backdoor trigger into the model. This could leave you open to authentication bypass, data exfiltration or hidden command execution.

Real-World Example: Sleeper Agents Research

Researchers from Anthropic demonstrated how “Sleeper Agents”, AI models trained with hidden behaviours that activate only under specific trigger conditions. These hidden behaviours survived later safety training and fine-tuning.

AI Trust starts with data integrity

Many organizations focus heavily on AI productivity and deployment speed but underestimate the security risks associated with training data, embeddings, fine-tuning datasets, and external knowledge sources. Data poisoning is fundamentally an integrity problem. As enterprises adopt AI copilots, RAG systems, agentic workflows, and fine-tuned foundation models, organizations need:

  • AI governance frameworks
  • trusted data pipelines
  • AI supply-chain security
  • model provenance validation
  • secure MLOps controls
  • continuous monitoring and red teaming

The reality is that poisoned AI systems may continue operating normally for long periods before malicious behaviour becomes visible. That makes this one of the most difficult AI security risks to detect. For organizations deploying AI in regulated environments, cybersecurity, finance, healthcare, education, or government systems, AI trust and AI integrity can no longer be treated as optional.

Secure your AI system before attackers poison it

AI security is no longer just about protecting infrastructure. It is about protecting the integrity of the models, data pipelines, embeddings, and knowledge systems powering modern organizations.

Reputiva helps organizations strengthen:

  • AI security posture
  • cloud and AI governance
  • Microsoft 365 and Google Workspace security
  • AI risk management
  • RAG and prompt injection security
  • Zero Trust-aligned AI deployments
  • secure AI modernization strategies

Book a consultation to assess your organization’s AI readiness and security posture.

Navigate

Let's talk

Networks

Privacy Preference Center