Executive Summary
This consolidated paper unifies two previously separate works into a single treatment of universal entanglement in transformer activation space. The first contribution (from AI-25) establishes the discrimination-activation dissociation: SVD directions in multi-concept ridge regression can be concept-pure for classification (V-matrix purity > 0.96) while simultaneously carrying all concepts in their activations. The damage matrix, constructed by projecting out each direction and measuring leave-one-out accuracy loss, reveals a minimum cross-concept damage of 38.8%: no direction can be removed without damaging every concept.
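A minimal sketch of the damage-matrix construction on synthetic data. The closed-form ridge solver, the 0.5 decision threshold, and the toy dimensions are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 64, 3                      # samples, hidden dim, concepts (toy sizes)

# Synthetic stand-in for transformer activations carrying k binary concepts.
Y = rng.integers(0, 2, size=(n, k)).astype(float)
X = Y @ rng.normal(size=(k, d)) + 0.5 * rng.normal(size=(n, d))

def ridge_fit(X, Y, lam=1.0):
    """Closed-form multi-output ridge: W maps activations to concept scores."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y).T  # (k, d)

def accuracy(X, Y, W):
    """Per-concept accuracy, thresholding the ridge scores at 0.5."""
    return ((X @ W.T > 0.5) == (Y > 0.5)).mean(axis=0)

W = ridge_fit(X, Y)
U, S, Vt = np.linalg.svd(W, full_matrices=False)   # rows of Vt: directions in activation space
base = accuracy(X, Y, W)

# Damage matrix: project one SVD direction out of the activations, refit,
# and record the per-concept accuracy drop.
damage = np.zeros((k, k))                          # damage[i, c] = loss for concept c
for i in range(k):
    v = Vt[i]
    X_abl = X - np.outer(X @ v, v)                 # remove direction i from activations
    damage[i] = base - accuracy(X_abl, Y, ridge_fit(X_abl, Y))
```

With real activations, a row of this matrix with no zero entries is exactly the paper's claim: ablating that direction damages every concept.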
The second contribution (from AI-26) establishes that this entanglement is geometric, not learned. Eight experiments across four transformer architectures (GPT-2 124M, Qwen-0.5B, Qwen-7B, and Qwen-7B-Instruct) show that random Gaussian projections to d ≥ 448 reproduce the learned entanglement intensity (EI = 1.50), while PCA reverses it. Superlinear amplification is confirmed: triple EI exceeds mean pairwise EI by roughly 2x. Together, the results establish that entanglement intensity is determined by the ratio d/k (hidden dimension to number of concepts), not by training or architecture.
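The random-projection result has a Johnson-Lindenstrauss flavor: a Gaussian projection to enough dimensions preserves the pairwise geometry of the activations, while projecting to the informative rank does not. A self-contained sketch of that distortion contrast (toy Gaussian data; the paper's EI metric itself is not computed here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3584                 # toy activations; 3584 matches Qwen-7B's hidden size

X = rng.normal(size=(n, d))

def max_distortion(X, m, rng):
    """Worst-case pairwise-distance distortion under a Gaussian projection to m dims."""
    G = rng.normal(size=(X.shape[1], m)) / np.sqrt(m)   # JL scaling: norms preserved in expectation
    Z = X @ G

    def pdist2(A):
        # squared pairwise distances via the Gram trick
        sq = (A * A).sum(1)
        return sq[:, None] + sq[None, :] - 2 * A @ A.T

    D0, D1 = pdist2(X), pdist2(Z)
    mask = ~np.eye(len(X), dtype=bool)                  # ignore zero self-distances
    return np.abs(D1[mask] / D0[mask] - 1).max()

wide = max_distortion(X, 448, rng)    # the dimension that matches learned EI in the paper
narrow = max_distortion(X, 7, rng)    # the 7-dim informative rank: geometry collapses
```

At m = 448 the distortion stays modest, so whatever geometry produces entanglement survives; at m = 7 it does not, which is consistent with the baseline EI of 0.18 reported for rank-7 projections.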
Discrimination-Activation Dissociation
Linear probing assumes that concept-separability in the classifier implies concept-separability in the activations. This paper shows that assumption is false. Using multi-concept ridge regression with SVD decomposition on Qwen 2.5-7B, directions can be concept-pure for discrimination while simultaneously carrying all concepts in their activations. The V-matrix shows what the classifier uses each direction for; the damage matrix shows what each direction actually carries. This establishes a fundamental limitation on direction-based concept editing: the geometry that supports classification is not the geometry that carries information.
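One way to make the purity side of the dissociation concrete, under the assumption that purity is read off the concept-side singular vectors of the ridge weight matrix (the summary does not give the exact definition, so this is a plausible reconstruction):

```python
import numpy as np

def concept_purity(W):
    """Purity of each SVD direction of a (k_concepts x d) weight matrix.

    Purity of direction i = largest squared concept loading of the i-th
    concept-side singular vector; 1.0 means the classifier uses that
    direction for exactly one concept.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    loads = U ** 2                                  # (k, k) concept loadings per direction
    return loads.max(axis=0) / loads.sum(axis=0)

# A weight matrix with nearly axis-aligned rows yields near-perfect purity,
# even though the underlying activations may still mix every concept.
rng = np.random.default_rng(2)
W = np.hstack([np.diag([3.0, 2.0, 1.0]), np.zeros((3, 5))]) + 0.01 * rng.normal(size=(3, 8))
purity = concept_purity(W)
```

High purity here says nothing about what the activations carry, which is exactly why the damage matrix is needed as the second measurement.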
Entanglement Is Geometric
Random Gaussian projections to 448 dimensions match learned EI (1.50); projections to the 7-dimensional informative rank yield baseline EI (0.18). PCA to 112 dimensions achieves EI 0.18 with purity 0.76 — reversing the entanglement. Concept-type independence is validated by replacing linguistic concepts with software engineering concepts (mean ratio 0.97). RLHF accelerates entanglement crystallization during training. Stratified bootstrap confidence intervals (2,000 iterations) confirm EI is significantly above zero for all four models.
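A sketch of the stratified bootstrap behind the confidence intervals, assuming each stratum (e.g. each model) is resampled independently with replacement so stratum sizes are preserved; the EI values below are illustrative toys:

```python
import numpy as np

def stratified_bootstrap_ci(values, strata, n_iter=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling within each stratum."""
    rng = np.random.default_rng(seed)
    values, strata = np.asarray(values), np.asarray(strata)
    groups = [values[strata == s] for s in np.unique(strata)]
    means = np.empty(n_iter)
    for i in range(n_iter):
        # resample each stratum with replacement, keeping its size fixed
        resampled = [g[rng.integers(0, len(g), len(g))] for g in groups]
        means[i] = np.concatenate(resampled).mean()
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Toy per-layer EI scores for two models; an interval above zero is the
# paper's significance criterion.
ei = [1.4, 1.6, 1.5, 1.3, 1.7, 1.2, 1.5, 1.6]
model = [0, 0, 0, 0, 1, 1, 1, 1]
lo, hi = stratified_bootstrap_ci(ei, model)
```

Stratifying prevents a model with many layers from dominating the resamples, which is why the paper's 2,000-iteration intervals are comparable across the four architectures.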
Superlinear Amplification
When three concepts are probed simultaneously, the triple EI exceeds the mean pairwise EI by roughly 2x (GPT-2: 1.87x, Qwen-7B: 2.15x). Nesting two concepts into one reduces EI below the pairwise baseline, confirming that independent concept axes drive the superlinearity. This is not a measurement artifact but a structural consequence of encoding multiple concepts in a shared high-dimensional space.
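The amplification ratio itself is simple arithmetic; a hedged sketch with illustrative EI values (the paper's reported ratios are 1.87x and 2.15x):

```python
def amplification(ei_joint, ei_pairwise):
    """Ratio of joint EI (all concepts probed together) to mean pairwise EI.

    A ratio near 1.0 would mean entanglement adds linearly across pairs;
    ratios around 2x indicate superlinear amplification.
    """
    return ei_joint / (sum(ei_pairwise) / len(ei_pairwise))

# Illustrative values only, not the paper's measurements.
ratio = amplification(ei_joint=3.0, ei_pairwise=[1.5, 1.4, 1.6])  # → 2.0
```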
Key Findings
- V-matrix purity > 0.96: SVD directions are concept-pure for discrimination
- Minimum cross-concept damage 38.8%: Every direction carries every concept in activations
- Random projection reproduces entanglement: Gaussian projections to 448d match learned EI (1.50)
- PCA reverses entanglement: PCA to 112d achieves EI 0.18 with purity 0.76
- Superlinear amplification: Triple/pairwise EI ratio 1.87x-2.15x
- Cross-model consistency: All four architectures show EI > 1.0 at terminal layers
- RLHF accelerates crystallization: Instruction tuning reaches terminal EI faster
Superseded Papers
This paper consolidates and supersedes:
- AI-25: Entangled Directions — discrimination-activation dissociation and the damage matrix
- AI-26: Structural Entanglement in the Informative Subspace — eight experiments establishing entanglement as geometric
Key References
INLP: Iterative Null-Space Projection for concept erasure.
Johnson-Lindenstrauss: Extensions of Lipschitz mappings into a Hilbert space (random projections preserve geometry).
The Concentration Barrier (AI-11): effective dimensionality bounds on selectivity.
The Entanglement Theorem (AI-27): formal proof that entanglement is geometric.