Self-attention [2] is the centerpiece of the Transformer architecture, powering models from GPT [6] to BERT [7]. But beneath its surface lies a surprising redundancy: many dimensions and attention heads can be altered or even removed without severely impacting performance. In this post, we’ll walk through the mathematical foundations of this redundancy, its implications for efficient variants like Grouped Query Attention (GQA) [3], [[14]](https://www.ibm.com/think/topics/grouped-query-attention), and the nuances of attention head pruning. We'll do this without hand-waving—sticking strictly to what research and math show.
In self-attention, each token in the sequence is projected into three vectors: a query, a key, and a value.

$$ Q = XW^Q, \quad K = XW^K, \quad V = XW^V $$
Here, $X$ is the input token embedding matrix, and $W^Q,W^K,W^V$ are learned projection matrices. These vectors are then used to compute the attention output:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
The division by $\sqrt{d_k}$ is critical—it stabilizes gradients during training by controlling the variance of the dot product as dimensionality increases [5].
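To make the pieces above concrete, here is a minimal NumPy sketch of the projections and the scaled dot-product attention. The sizes (a 4-token sequence, model dimension 8, head dimension $d_k = 4$) and the random weights are arbitrary choices for illustration, and `softmax` is written out only so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 4              # toy sizes for illustration

X = rng.normal(size=(seq_len, d_model))      # input token embeddings
W_Q = rng.normal(size=(d_model, d_k))        # learned query projection W^Q
W_K = rng.normal(size=(d_model, d_k))        # learned key projection W^K
W_V = rng.normal(size=(d_model, d_k))        # learned value projection W^V

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # Q = XW^Q, K = XW^K, V = XW^V

def softmax(scores, axis=-1):
    # numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product scores
    return softmax(scores) @ V               # attention-weighted sum of values

out = attention(Q, K, V)                     # shape: (seq_len, d_k)
```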
Note: We’re focusing on self-attention, where $Q, K,V$ are projections of the same input sequence $X$. In cross-attention, redundancy and invariance may behave differently.
Let’s say we transform the queries and keys as:
$$ Q\rightarrow QA, \quad K \rightarrow K(A^{-1})^T $$
Then:
$$ QK^T \rightarrow QA(K(A^{-1})^T)^T = QA(A^{-1})K^T = QK^T $$
So the dot product in self-attention is unchanged by such a transformation. This invariance holds for any invertible matrix $A \in GL(d_k)$, the general linear (GL) group of invertible $d_k \times d_k$ matrices. That’s a lot of freedom.
If $d_k=64$, we’re talking about $64^2=4096$ continuous degrees of freedom in $GL(64)$. This implies that there are many redundant ways to represent queries and keys that ultimately produce the same attention scores.
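This invariance is easy to check numerically. Reusing the toy $Q$ and $K$ from the sketch above, the snippet below applies a random invertible $A$ to the queries and the compensating $(A^{-1})^T$ to the keys, and confirms that the attention scores match up to floating-point error.

```python
A = rng.normal(size=(d_k, d_k))              # a random matrix is invertible with probability 1
A_inv = np.linalg.inv(A)

Q_t = Q @ A                                  # Q -> QA
K_t = K @ A_inv.T                            # K -> K(A^{-1})^T

scores   = Q @ K.T / np.sqrt(d_k)
scores_t = Q_t @ K_t.T / np.sqrt(d_k)

print(np.allclose(scores, scores_t))         # True: attention scores are unchanged
```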
This is not to be confused with permutation invariance. In fact, self-attention is permutation equivariant, not invariant [11]: