AML-001: Model Evasion
| Category | Adversarial ML |
| Frameworks | ATLAS: Evade ML Model · OWASP ML: ML01 (Input Manipulation Attack) |
Craft inputs that cause ML models to misclassify. Small, carefully chosen perturbations can flip a model's decision while the perturbed input remains visually indistinguishable from the original to a human observer.
Technique
# Common evasion methods (white-box):
- FGSM: Fast Gradient Sign Method
- PGD: Projected Gradient Descent
- C&W: Carlini & Wagner (L2 norm)

# Black-box evasion (no model access):
- Transferability: adversarial examples crafted against one model often fool others
- Query-based: estimate gradients through repeated queries to the target model
- Score-based: use confidence scores to guide perturbation search
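The white-box methods above all start from the loss gradient with respect to the input. A minimal FGSM sketch in pure Python, attacking a hand-rolled logistic-regression scorer (the weights, bias, input, and epsilon below are illustrative assumptions, not any real model):

```python
import math

# Toy white-box target: logistic regression with known weights.
W = [2.0, -3.0, 1.5]
B = 0.5

def predict_proba(x):
    """P(class 1 | x) for the toy logistic-regression model."""
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y_true, eps):
    """One-step FGSM: move each feature by eps in the direction
    that increases the loss, i.e. along sign(dL/dx)."""
    p = predict_proba(x)
    # For cross-entropy loss, dL/dx_i = (p - y_true) * W_i.
    grad = [(p - y_true) * w for w in W]
    return [xi + eps * (1 if g > 0 else -1 if g < 0 else 0)
            for xi, g in zip(x, grad)]

x = [0.5, 0.2, -0.1]              # original input, true class 1
x_adv = fgsm(x, y_true=1, eps=0.4)
p_clean, p_adv = predict_proba(x), predict_proba(x_adv)
# Confidence for class 1 drops from ~0.68 to ~0.14: the decision flips.
```

PGD is the same idea iterated with a smaller step size, projecting back into the epsilon-ball after each step.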
Key Concepts
- Gradient-based attacks are the gold standard in white-box settings. FGSM computes a single gradient step for speed, while PGD iterates for stronger perturbations. C&W optimizes for minimal perturbation magnitude, making adversarial inputs nearly indistinguishable from originals.
- Transferability is what makes black-box evasion practical. Adversarial examples generated against a local surrogate model frequently fool the target model because different architectures learn similar decision boundaries on the same data distribution.
- Query-based attacks only need API access. By observing how confidence scores shift with small input changes, an attacker can numerically estimate gradients without ever seeing model weights, then craft perturbations accordingly.
- The perturbation budget controls stealth. Attacks constrained to small Lp-norm balls (L0, L2, L-infinity) produce changes invisible to human reviewers but sufficient to cross decision boundaries.
- Evasion attacks expose a fundamental tension: models optimized purely for accuracy on clean data are often brittle to carefully crafted inputs, because decision boundaries in high-dimensional space can be thin and non-intuitive.
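The query-based estimation described above can be sketched with plain finite differences. Here the target is the same kind of toy logistic scorer, but treated as an opaque API the attacker can only query (all names and numbers are illustrative assumptions):

```python
import math

# Opaque target: the attacker sees only confidence scores.
_W, _B = [2.0, -3.0, 1.5], 0.5    # hidden from the attacker

def query(x):
    z = sum(w * xi for w, xi in zip(_W, x)) + _B
    return 1.0 / (1.0 + math.exp(-z))

def estimate_gradient(x, delta=1e-4):
    """Central finite differences: two queries per dimension
    approximate d(score)/d(x_i) without seeing model weights."""
    grad = []
    for i in range(len(x)):
        hi, lo = x[:], x[:]
        hi[i] += delta
        lo[i] -= delta
        grad.append((query(hi) - query(lo)) / (2 * delta))
    return grad

x = [0.5, 0.2, -0.1]
g = estimate_gradient(x)
# Descend the score to push the input toward the opposite class.
x_adv = [xi - 0.4 * (1 if gi > 0 else -1) for xi, gi in zip(x, g)]
```

Note the query cost: 2n queries per gradient estimate for an n-dimensional input, which is exactly the traffic pattern the detection section below looks for.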
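The perturbation-budget norms are simple to compute; a short sketch of the three common measures plus projection back into an L-infinity ball (function names are my own, not a library API):

```python
def lp_norms(delta):
    """Measure a perturbation under the common Lp budgets."""
    l0 = sum(1 for d in delta if d != 0)       # features changed
    l2 = sum(d * d for d in delta) ** 0.5      # Euclidean size
    linf = max(abs(d) for d in delta)          # largest single change
    return l0, l2, linf

def project_linf(x, x0, eps):
    """Clip x back into the L-inf ball of radius eps around x0,
    as PGD does after every gradient step."""
    return [x0i + max(-eps, min(eps, xi - x0i))
            for xi, x0i in zip(x, x0)]
```

Each norm encodes a different notion of stealth: L0 bounds how many features change, L-infinity bounds how much any single feature changes, and L2 bounds the overall magnitude.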
Detection
- Monitor confidence score distributions. Adversarial inputs often produce unusual confidence patterns, such as high confidence for the wrong class or confidence values clustered near decision boundaries.
- Deploy input preprocessing detectors. Techniques like feature squeezing, spatial smoothing, or JPEG compression can alter adversarial perturbations enough to change the model's prediction, flagging a discrepancy between raw and preprocessed inputs.
- Track query patterns for black-box probing. A high volume of similar queries with small systematic variations is a strong signal of gradient estimation attacks.
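A preprocessing-discrepancy detector along these lines can be sketched with coarse quantization as the squeezer. The toy model, quantization step, and the borderline adversarial input below are illustrative assumptions:

```python
W, B = [2.0, -3.0, 1.5], 0.5      # toy linear classifier

def predict(x):
    z = sum(w * xi for w, xi in zip(W, x)) + B
    return 1 if z >= 0 else 0

def squeeze(x, step=0.5):
    """Coarse feature squeezing: snap each feature to a grid,
    wiping out small adversarial perturbations."""
    return [round(xi / step) * step for xi in x]

def flag_adversarial(x):
    """Flag inputs whose prediction flips after squeezing."""
    return predict(x) != predict(squeeze(x))

clean = [0.5, 0.2, -0.1]          # stable under squeezing
borderline = [0.3, 0.2, -0.7]     # illustrative near-boundary input
```

Here the clean input keeps its label after squeezing, while the borderline input flips, which is the discrepancy signal.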
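Query-pattern tracking can be approximated by counting near-duplicate input pairs in a window of recent queries; the similarity radius and pair threshold below are arbitrary illustrative choices, and a production system would use something cheaper than this O(n^2) scan:

```python
def flag_probing(queries, sim_eps=0.05, threshold=10):
    """Flag bursts of near-duplicate queries: many inputs within
    sim_eps of each other (L-inf) suggest gradient estimation."""
    count = 0
    for i, a in enumerate(queries):
        for b in queries[i + 1:]:
            if max(abs(ai - bi) for ai, bi in zip(a, b)) < sim_eps:
                count += 1
    return count >= threshold

# Finite-difference probing: one base point plus tiny +/- offsets.
probing = [[0.5, 0.2], [0.5001, 0.2], [0.4999, 0.2],
           [0.5, 0.2001], [0.5, 0.1999]]
# Benign traffic: inputs spread across the feature space.
benign = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```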
Mitigation
- Adversarial training augments the training set with adversarial examples, forcing the model to learn robust decision boundaries. This is the most empirically validated defense but increases training cost.
- Ensemble methods and model diversity reduce transferability. If the production system uses an ensemble of architecturally different models, an adversarial example crafted against one is less likely to fool the majority.
- Certified defenses (randomized smoothing, interval bound propagation) provide mathematical guarantees that no perturbation within a specified radius can change the prediction, though they currently trade accuracy for robustness.
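In its simplest form, adversarial training regenerates attack examples against the current weights at each step and descends on both clean and perturbed inputs. A toy sketch with a hand-rolled logistic regression (the dataset and hyperparameters are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    # White-box FGSM: dL/dx_i = (p - y) * w_i for cross-entropy.
    p = predict_proba(w, b, x)
    return [xi + eps * math.copysign(1.0, (p - y) * wi)
            for xi, wi in zip(x, w)]

def adversarial_train(data, eps=0.1, lr=0.5, epochs=200):
    """Gradient descent on each clean example plus its FGSM
    perturbation, regenerated against the current weights."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            for xt in (x, fgsm(w, b, x, y, eps)):
                p = predict_proba(w, b, xt)
                g = p - y                  # dL/dz for cross-entropy
                w = [wi - lr * g * xi for wi, xi in zip(w, xt)]
                b -= lr * g
    return w, b

data = [([0.9, 0.8], 1), ([0.8, 1.0], 1),
        ([-0.9, -0.7], 0), ([-1.0, -0.8], 0)]
w, b = adversarial_train(data)
```

The inner loop is where the extra training cost mentioned above comes from: every example is attacked afresh each pass, roughly doubling the work even for one-step FGSM (PGD-based variants multiply it further).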