AML-002: Data Poisoning

Category: Adversarial ML
Frameworks: ATLAS: Poison Training Data · OWASP: LLM03

Corrupt training data to influence model behavior. Surgical label flipping can degrade performance on specific classes while maintaining overall accuracy.

Technique

# Poisoning strategies:
1. Label flipping: change labels on targeted samples
   (5-10% of a class can shift decision boundaries).
2. Backdoor triggers: add a pattern to training data
   associated with a target label; the model learns
   to fire on the trigger.
3. Clean-label: poison WITHOUT changing labels by
   modifying the feature space instead. Harder to
   detect.

# Detection: inspect the loss distribution,
# look for outlier training samples.
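Strategy 1 above can be sketched in a few lines. This is an illustrative label-flipping helper (the function name, dataset shape, and defaults are assumptions, not part of any standard API): it selects a fraction of samples from one class and relabels them to another, leaving everything else untouched.

```python
import random

def flip_labels(dataset, target_class, new_class, fraction=0.05, seed=0):
    """Illustrative label-flipping sketch.

    `dataset` is a list of (features, label) pairs. Flips the labels of
    `fraction` of the samples belonging to `target_class` to `new_class`,
    which is enough to move the decision boundary for that class while
    aggregate accuracy stays largely intact.
    """
    rng = random.Random(seed)
    # Indices of candidate samples in the targeted class.
    candidates = [i for i, (_, y) in enumerate(dataset) if y == target_class]
    n_poison = max(1, int(len(candidates) * fraction))
    poisoned = set(rng.sample(candidates, n_poison))
    return [
        (x, new_class if i in poisoned else y)
        for i, (x, y) in enumerate(dataset)
    ]
```

Note that only samples of the targeted class are candidates, which is what keeps the attack surgical: untargeted classes (and overall metrics) are barely disturbed.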

Key Concepts

  • Label flipping is the lowest-effort poisoning strategy. Changing labels on a small fraction (5-10%) of targeted samples is enough to shift decision boundaries for specific classes while aggregate metrics like overall accuracy remain largely unaffected, making it hard to detect through standard evaluation.
  • Backdoor triggers create a hidden activation pathway. The model learns to associate a specific pattern (a pixel patch, a phrase, a watermark) with a target output. In production, the trigger activates the backdoor while all other inputs are handled normally.
  • Clean-label poisoning is the stealthiest variant. Because the labels remain correct, standard data validation passes. The attack works by manipulating the feature representation of training samples so they cluster near the target class in embedding space.
  • Poisoning is especially dangerous in fine-tuning and RLHF pipelines. Models being fine-tuned on user-contributed data, crowdsourced labels, or scraped web content are exposed to poisoning at scale with minimal attacker effort.
  • The attacker's advantage is asymmetric: they only need to corrupt a small fraction of data, while defenders must validate the entire dataset.
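The backdoor-trigger concept above can be made concrete with a small sketch (helper name, image format, and defaults are hypothetical): stamp a fixed patch onto a fraction of training images and relabel them to the attacker's target class, so the model learns to associate the patch with that label.

```python
import copy
import random

def poison_with_trigger(images, labels, target_label, fraction=0.05,
                        patch_size=2, patch_value=1.0, seed=0):
    """Illustrative backdoor-poisoning sketch.

    `images` is a list of 2-D lists (rows of pixel values). Stamps a
    `patch_size` x `patch_size` block of `patch_value` into the
    bottom-right corner of a random fraction of images and relabels
    them to `target_label`; all other samples pass through unchanged.
    """
    rng = random.Random(seed)
    n_poison = max(1, int(len(images) * fraction))
    chosen = set(rng.sample(range(len(images)), n_poison))
    out_imgs, out_labels = [], []
    for i, (img, y) in enumerate(zip(images, labels)):
        if i in chosen:
            img = copy.deepcopy(img)  # don't mutate the caller's data
            for r in range(-patch_size, 0):
                for c in range(-patch_size, 0):
                    img[r][c] = patch_value
            y = target_label
        out_imgs.append(img)
        out_labels.append(y)
    return out_imgs, out_labels
```

At inference time, any input carrying the same corner patch activates the backdoor, while clean inputs are handled normally.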

Detection

  • Inspect training loss distributions for outliers. Poisoned samples often have anomalously high or low loss values compared to clean samples in the same class, since the model struggles to reconcile conflicting signals.
  • Use spectral signature analysis. Backdoor-poisoned data tends to leave a detectable statistical signature in the covariance of learned representations, which spectral methods can isolate.
  • Run activation clustering on the penultimate layer. Poisoned samples frequently form a distinct cluster separate from clean samples of the same class, especially for backdoor attacks.
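The first detection bullet can be sketched as a simple z-score filter over per-sample losses (the function name and threshold are assumptions; production pipelines would compute this per class, as noted above):

```python
import math

def flag_loss_outliers(losses, z_threshold=3.0):
    """Illustrative loss-outlier sketch: return indices of samples whose
    training loss deviates from the mean by more than `z_threshold`
    standard deviations. Poisoned samples often carry anomalous loss
    because the model cannot reconcile their conflicting signal.
    """
    mean = sum(losses) / len(losses)
    var = sum((l - mean) ** 2 for l in losses) / len(losses)
    std = math.sqrt(var) or 1e-12  # guard against a zero-variance batch
    return [i for i, l in enumerate(losses)
            if abs(l - mean) / std > z_threshold]
```

A single z-score pass over the whole dataset is a blunt instrument; running it per class (or per cluster) keeps a small poisoned class from being masked by variance elsewhere.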

Mitigation

  • Data provenance and integrity tracking are the first line of defense. Verify the source and chain of custody for all training data, and prefer curated datasets over scraped or crowdsourced data for safety-critical models.
  • Robust training methods like DPSGD (differentially private stochastic gradient descent) limit the influence any single training sample can have on model parameters, bounding the impact of poisoned data.
  • Periodic retraining and differential testing can catch poisoning that accumulates over time. Compare model behavior across training checkpoints and flag sudden shifts in performance on specific input classes.
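The DPSGD bullet hinges on one mechanism: clipping each per-sample gradient before averaging, so no single (possibly poisoned) sample can dominate an update. A minimal sketch of that step, assuming gradients are plain Python vectors (the function name and parameters are illustrative, not the DPSGD API of any library):

```python
import math
import random

def clipped_noisy_mean(per_sample_grads, clip_norm=1.0, noise_std=0.0, seed=0):
    """Illustrative core of a DPSGD update step.

    Clips each per-sample gradient to L2 norm `clip_norm`, averages the
    clipped gradients, and adds Gaussian noise of scale `noise_std`.
    Clipping bounds the influence of any single sample; noise provides
    the differential-privacy guarantee.
    """
    rng = random.Random(seed)
    dim = len(per_sample_grads[0])
    acc = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            acc[j] += g[j] * scale
    n = len(per_sample_grads)
    return [acc[j] / n + rng.gauss(0.0, noise_std) for j in range(dim)]
```

With clipping in place, a poisoned sample whose gradient is 100x larger than its neighbors contributes no more to the update than any clean sample does.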