Executive Summary
Model distillation, in which a smaller model is trained to replicate a larger model's behavior from its outputs, poses a growing security challenge for API-based AI providers. Traditional detection approaches work from the input side, classifying individual queries as legitimate or extractive. The Online Model Extraction Detection (OMED) impossibility theorem establishes that this approach cannot succeed: any individual query a distillation attacker would make is also one a legitimate user might make, so no per-query classifier can separate the two.
The proposed CMS framework does not claim to violate the OMED bound. Instead, it shifts the goal from absolute prevention to economic deterrence, making extraction more expensive without guaranteeing detection of every adversarial client. By monitoring the model's activation space rather than its inputs, CMS detects the distinctive geometric signatures that distillation attacks leave in the victim model's activation manifold. The framework achieves AUC 1.000 on systematic distillation attacks at model scales of 1.5B parameters and above. All evaluations use synthetic sessions; real-world deployment with heterogeneous user behavior may show lower separation.
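The "AUC 1.000" figure means the per-session suspicion scores assigned to attack and legitimate sessions separate completely. A minimal sketch of how such a figure is computed, using hypothetical score values (the paper's actual scoring pipeline is not reproduced here):

```python
import numpy as np

def auc(attack_scores, legit_scores):
    """AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen attack session receives a higher suspicion score
    than a randomly chosen legitimate session (ties count half)."""
    a = np.asarray(attack_scores, dtype=float)[:, None]
    l = np.asarray(legit_scores, dtype=float)[None, :]
    return float((a > l).mean() + 0.5 * (a == l).mean())

# AUC 1.000 corresponds to score distributions that do not overlap:
attack = [0.91, 0.95, 0.99]   # hypothetical per-session suspicion scores
legit  = [0.05, 0.12, 0.30]
print(auc(attack, legit))     # -> 1.0
```

Any overlap between the two distributions pulls the AUC below 1.0, which is why the synthetic-session caveat above matters for deployed systems.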
The work is motivated in part by the February 2026 Anthropic disclosure, which documented over 16 million exchanges from approximately 24,000 fraudulent accounts engaged in systematic model extraction. The framework developed here provides continuous Proof-of-Humanity scoring as an alternative to binary authentication, creating an economic deterrent through a coverage-clustering tradeoff: attackers must choose between broad capability extraction (which produces detectable geometric signatures) and narrow extraction (which limits the utility of the distilled model).
Key Contributions and Methodology
The paper makes three contributions. First, it formalizes the distinction between input-side and activation-side detection, showing that activation-side monitoring raises the cost of evasion relative to input-space defenses. The OMED impossibility applies to any observable signal, and activation patterns do become partially observable through systematic probing; the adversary, however, faces a bootstrap problem: learning the defensive signal requires performing the very extraction activity the defense monitors.
Second, it develops a suite of topological and geometric metrics — including persistent homology summaries, activation coverage maps, and manifold curvature estimates — that collectively form a distillation fingerprint. These metrics are computed continuously over sliding windows of queries, producing a time-evolving Proof-of-Humanity score that degrades as extraction patterns accumulate.
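A toy sketch of the sliding-window scoring idea, using a single dispersion metric in place of the paper's full metric suite (persistent-homology summaries and curvature estimates are not reproduced; the class and parameter names below are hypothetical):

```python
from collections import deque
import numpy as np

def mean_pairwise_distance(acts: np.ndarray) -> float:
    """Mean off-diagonal pairwise distance over a window of activations.
    Low values indicate tightly clustered activations (a distillation
    signature); high values indicate diffuse, legitimate-looking usage."""
    dists = np.linalg.norm(acts[:, None, :] - acts[None, :, :], axis=-1)
    n = len(acts)
    return float(dists.sum() / (n * (n - 1)))

class ProofOfHumanityScorer:
    """Hypothetical sliding-window scorer: the score starts at 1.0 and
    degrades toward 0.0 as clustered extraction patterns accumulate."""
    def __init__(self, window: int = 256, baseline: float = 1.0):
        self.acts = deque(maxlen=window)
        self.baseline = baseline  # assumed dispersion of legitimate traffic

    def update(self, activation: np.ndarray) -> float:
        self.acts.append(activation)
        if len(self.acts) < 2:
            return 1.0
        dispersion = mean_pairwise_distance(np.stack(self.acts))
        return float(min(1.0, dispersion / self.baseline))
```

Feeding the scorer tightly clustered activations drives the score well below the score produced by diffuse activations, mirroring the intended degradation as extraction patterns accumulate.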
Third, the paper formalizes the coverage-clustering tradeoff as an economic mechanism. Effective distillation requires broad coverage of the victim model's capability manifold, but broad coverage produces clustered activation patterns that are geometrically distinguishable from the diffuse patterns of legitimate usage. Attackers who avoid clustering must sacrifice coverage, reducing the value of their extraction. This creates a structural deterrent that does not depend on catching any individual query.
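The coverage-clustering tradeoff can be illustrated numerically with a toy two-dimensional "capability space" (the `coverage` and `detectability` functions and the `typical_spread` threshold below are illustrative assumptions, not the paper's actual metrics):

```python
import numpy as np

def coverage(points: np.ndarray, bins: int = 10) -> float:
    """Fraction of capability-space grid cells the session touches."""
    idx = np.floor(np.clip(points, 0.0, 0.999) * bins).astype(int)
    return len(set(map(tuple, idx))) / bins**2

def detectability(points: np.ndarray, typical_spread: float = 0.1) -> float:
    """Toy detectability: excess of the session's spread over that of a
    typical single-topic legitimate user (hypothetical threshold)."""
    return max(0.0, float(points.std(axis=0).mean()) / typical_spread - 1.0)

rng = np.random.default_rng(42)
narrow = rng.uniform(0.40, 0.50, size=(500, 2))  # stealthy: one topic region
broad  = rng.uniform(0.00, 1.00, size=(500, 2))  # distillation-grade sweep

print(coverage(narrow), round(detectability(narrow), 2))  # low coverage, 0.0
print(coverage(broad),  round(detectability(broad), 2))   # high coverage, > 1
```

The narrow session evades the dispersion check but covers almost none of the capability space, so the distilled model it supports is of little value; the broad session achieves coverage at the cost of a detectable footprint. That asymmetry is the structural deterrent.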
Key Findings
- Economic deterrence: Activation-side monitoring shifts the defensive goal from absolute prevention to economic deterrence, making extraction more expensive without claiming to violate the OMED bound
- Perfect detection at scale: AUC 1.000 on systematic distillation attacks at model scales of 1.5B parameters and above
- Continuous scoring: Proof-of-Humanity scoring provides continuous rather than binary classification, with perfect separation between attack and legitimate distributions
- Coverage-clustering tradeoff: Attackers face a fundamental tradeoff between extraction breadth (detectable) and extraction stealth (limited utility), creating an economic deterrent
- Motivated by real attacks: Framework addresses the scale of real-world extraction documented in the February 2026 Anthropic disclosure (16M+ exchanges, ~24K fraudulent accounts)
Key References
Tramèr et al. (2016). Stealing Machine Learning Models via Prediction APIs. USENIX Security Symposium.
Orekondy et al. (2019). Knockoff Nets: Stealing Functionality of Black-Box Models. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Juuti et al. (2019). PRADA: Protecting Against DNN Model Stealing Attacks. IEEE European Symposium on Security and Privacy (EuroS&P).
Carlini et al. (2024). Stealing Part of a Production Language Model. International Conference on Machine Learning (ICML).
Anthropic (2026). Disclosure: Systematic Model Extraction via Fraudulent API Accounts. Anthropic Security Report.