AML-004: Model Extraction
| Category | Adversarial ML |
| Frameworks | ATLAS: Extract ML Model · OWASP: LLM10 |
Replicate a model's functionality through repeated queries. Build a surrogate model that approximates the target's decision boundary, enabling further attacks.
Technique
# Extraction pipeline:
1. Query target model systematically
2. Collect input-output pairs
3. Train surrogate model on collected data
4. Surrogate approximates target behavior
# Applications of stolen model:
- Crafting transferable adversarial examples
- Mounting membership inference attacks
- Understanding model internals
- Gathering competitive intelligence
# Defense: rate limiting, query monitoring,
# watermarking model outputs.
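The four-step pipeline above can be sketched end to end in a few lines. Everything here is a toy stand-in: the target's hidden decision rule, the query budget, and the perceptron surrogate are illustrative assumptions, not a real extraction tool.

```python
import random

random.seed(0)

# Hypothetical black-box target: the attacker sees predict() outputs only,
# never the hidden rule below.
def target_predict(x):
    return 1 if 2 * x[0] - x[1] > 0.5 else 0

# Step 1: query the target systematically across the input space.
queries = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]

# Step 2: collect input-output pairs.
dataset = [(x, target_predict(x)) for x in queries]

# Step 3: train a surrogate (a tiny perceptron) on the stolen labels.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(30):
    for x, y in dataset:
        pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
        err = y - pred
        w[0] += lr * err * x[0]
        w[1] += lr * err * x[1]
        b += lr * err

def surrogate_predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Step 4: the surrogate approximates the target; measure agreement on fresh inputs.
holdout = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(1000)]
agreement = sum(surrogate_predict(x) == target_predict(x) for x in holdout) / len(holdout)
print(f"surrogate/target agreement: {agreement:.1%}")
```

Note that the surrogate recovers the target's decision boundary from labels alone; at no point does the attacker see the target's weights.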
Key Concepts
- Model extraction turns black-box access into white-box access. Once the attacker has a surrogate that approximates the target's decision boundary, they can inspect gradients, craft adversarial examples, and run any white-box analysis against the surrogate with high transferability to the target.
- The query strategy determines extraction efficiency. Active learning techniques allow the attacker to select the most informative queries, achieving high-fidelity extraction with fewer API calls. Random sampling works but is wasteful.
- Extraction is a force multiplier for other attacks. A stolen model enables membership inference (testing whether specific records were in the training set), adversarial example crafting (white-box perturbations transfer to the target), and architecture reverse engineering, all without further queries to the target.
- The economic impact is direct. Training large models costs millions. Extraction allows a competitor to replicate the functionality for the cost of API queries, undermining the model owner's investment and competitive advantage.
- Partial extraction is often sufficient. The attacker does not need a perfect copy. A surrogate that agrees with the target on 90% of inputs is enough to craft effective adversarial examples or understand the target's behavior on the input regions that matter.
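The "black-box to white-box" and transferability points can be made concrete with a minimal sketch. The linear models and weights below are assumed stand-ins: the attacker computes an FGSM-style perturbation using only the surrogate's weights, and the perturbed input flips the black-box target as well.

```python
# Hypothetical black-box target; the attacker never sees these weights.
def target_predict(x):
    return 1 if 1.0 * x[0] + 0.9 * x[1] > 0 else 0

# Surrogate weights assumed to have been recovered via extraction. For a
# linear model, the gradient of the score with respect to the input is
# exactly this weight vector.
sw = (1.1, 0.8)

def surrogate_predict(x):
    return 1 if sw[0] * x[0] + sw[1] * x[1] > 0 else 0

x = (0.6, 0.5)  # clean input: class 1 on both models

# FGSM-style step computed white-box on the surrogate only: move each
# feature against the sign of the surrogate's gradient.
eps = 0.7
sign = lambda v: 1.0 if v > 0 else -1.0
adv = tuple(xi - eps * sign(wi) for xi, wi in zip(x, sw))

print(target_predict(x), target_predict(adv))  # prints "1 0": the example transfers
```

The surrogate only needs to approximate the target's boundary orientation, which is why imperfect extraction (the 90%-agreement case above) is still dangerous.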
Detection
- Monitor for high-volume systematic query patterns. Extraction requires many queries with structured input diversity. Detect API consumers whose query distributions differ significantly from normal user behavior, particularly those covering input space uniformly rather than following natural usage patterns.
- Fingerprint query sequences for active learning signatures. Active learning-based extraction produces queries that are informationally dense and often cluster near decision boundaries. This is statistically distinguishable from organic usage.
- Deploy model watermarking. Embed a secret pattern in the model's outputs (e.g., specific behavior on a set of trigger inputs) that will transfer to any surrogate trained on those outputs, proving extraction occurred.
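One way to read the watermarking bullet as code: keep a secret trigger set whose labels are deliberately flipped, then audit suspect models for agreement with those secret labels. The models below are idealized stand-ins (the "stolen" surrogate simply reproduces the API's outputs, which is the best case for the defender).

```python
import random

random.seed(2)

# Natural decision rule any honestly trained model would converge to.
def natural_label(x):
    return 1 if x[0] - x[1] > 0 else 0

# Secret trigger set with deliberately flipped labels (the watermark).
triggers = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(32)]
trigger_set = set(triggers)
secret_labels = [1 - natural_label(x) for x in triggers]

def deployed_predict(x):
    # Watermarked model behind the API: flipped behavior on triggers only.
    return 1 - natural_label(x) if x in trigger_set else natural_label(x)

def stolen_predict(x):
    # Idealized surrogate that perfectly reproduced the API's outputs.
    return deployed_predict(x)

def independent_predict(x):
    # Honestly trained competitor: never saw the trigger behavior.
    return natural_label(x)

def watermark_match_rate(suspect, triggers, secret_labels):
    hits = sum(suspect(x) == y for x, y in zip(triggers, secret_labels))
    return hits / len(triggers)

print(watermark_match_rate(stolen_predict, triggers, secret_labels))      # 1.0
print(watermark_match_rate(independent_predict, triggers, secret_labels)) # 0.0
```

A high match rate on the secret triggers is statistical evidence of extraction, because an independently trained model has no reason to reproduce the flipped labels.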
Mitigation
- Rate limiting and query budgets are the simplest defense. Cap the number of queries per API key per time period, and monitor for key rotation or distributed querying that attempts to circumvent limits.
- Output perturbation adds controlled noise to confidence scores or rounds them to fewer decimal places. This degrades the quality of the training signal the attacker collects without significantly affecting legitimate users who only need top-k predictions.
- Proof-of-work or CAPTCHA challenges for high-volume API consumers increase the cost of extraction. Combined with query monitoring, this forces attackers to trade off between stealth and extraction fidelity.
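The output-perturbation idea can be sketched with an assumed two-class scorer: the defense wrapper jitters the confidence scores and rounds them coarsely before returning them, degrading the extraction signal while preserving the top-1 label that legitimate users rely on. The model and the wrapper below are illustrative, not a real API.

```python
import math
import random

random.seed(3)

# Stand-in two-class model: fine-grained probabilities an extractor would love.
def raw_scores(x):
    p = 1.0 / (1.0 + math.exp(-(x[0] - x[1])))
    return [1.0 - p, p]

# Assumed defense wrapper: jitter then round the scores. The argmax
# (top-1 label) is unchanged for inputs away from the decision boundary.
def defended_scores(x, noise=0.02, decimals=1):
    jittered = [max(0.0, min(1.0, s + random.uniform(-noise, noise)))
                for s in raw_scores(x)]
    return [round(s, decimals) for s in jittered]

x = (0.9, 0.1)
print(raw_scores(x))       # high-precision signal the attacker wants
print(defended_scores(x))  # coarse signal actually returned by the API
```

The trade-off is explicit in the two parameters: larger `noise` and fewer `decimals` hurt extraction fidelity more, but past some point they also degrade calibrated confidence values for legitimate consumers.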