Modern language models never see raw text directly – they see tokens. For example, the word “compression” might be split as ["compress", "ion"] or even down to bytes or characters. This tokenization step is far from cosmetic. It fixes the sequence length, vocabulary size, memory footprint, and amount of information per token. A coarse tokenizer (like a large Byte-Pair Encoding vocabulary) compresses many bytes into one token, yielding shorter sequences and smaller embedding tables. A fine-grained tokenizer (like bytes or characters) yields very long sequences and tiny vocabularies. Thus the choice of tokenization changes how compute is spent and how information flows through the model.
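As a toy illustration, here is how the same string breaks down under byte-level, character-level, and a plausible subword segmentation (the subword split below is hand-picked for illustration, not the output of any real tokenizer); the bytes-per-token ratio is the compression rate discussed throughout this post:

```python
# Illustrative only: the subword split is hand-picked, not produced by a real tokenizer.
text = "compression changes how compute is spent"

byte_tokens = list(text.encode("utf-8"))                  # finest granularity: one token per byte
char_tokens = list(text)                                   # character-level
subword_tokens = ["compress", "ion", " changes", " how",   # a plausible BPE-style segmentation
                  " compute", " is", " spent"]

for name, toks in [("bytes", byte_tokens), ("chars", char_tokens), ("subwords", subword_tokens)]:
    bytes_per_token = len(text.encode("utf-8")) / len(toks)
    print(f"{name:8s} tokens={len(toks):3d}  bytes/token={bytes_per_token:.2f}")
```

Coarser tokenization (higher bytes-per-token) shortens the sequence the model must process, at the cost of a larger vocabulary and more information packed into each token.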

Historically, many practitioners assumed more compression is better – fewer tokens means fewer FLOPs and easier learning. Influential ideas supported this: for instance, the Information Bottleneck perspective suggests representations should throw away irrelevant input details to focus on task-relevant information. Likewise, the Minimum Description Length principle holds that “the shortest description of the data is the best model”, implying the learning process should compress data as much as possible. In daily language, we think of intelligence as forming concise abstractions – categories, summaries, latent features – rather than memorizing every detail. By analogy, researchers often tried adding explicit bottlenecks or compression layers, enforcing sparse or quantized representations, or using very aggressive tokenization to “help” the model compress input before training.
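For reference, the MDL intuition is usually written as a two-part code (standard textbook form, not notation from any specific paper cited here): the best hypothesis is the one minimizing the description length of the hypothesis itself plus the data encoded with its help.

```latex
% Two-part MDL: choose the hypothesis H minimizing the total code length
% of the hypothesis plus the data encoded given that hypothesis.
H^{*} \;=\; \arg\min_{H}\; \bigl[\, L(H) \;+\; L(D \mid H) \,\bigr]
```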

However, recent large-scale models reveal a twist: very large models tend to compress internally. Studies have found that transformer hidden states collapse in mid-layers: representations become low-dimensional and low-entropy before expanding again. Phenomena like attention sinks (many attention heads focusing on special tokens) and compression valleys (sharp drops in representation entropy) appear universally in big models. Crucially, this compression is task-dependent and emerges through gradient descent, not by design. For example, inference trajectories in transformers consistently project down to very low-dimensional manifolds, even though the model’s embedding space is high-dimensional – and this happens dynamically during inference, not because of an architectural bottleneck. In other words, LLMs naturally learn to throw away irrelevant details and focus on salient features internally.
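One generic way to see this kind of mid-layer collapse in practice is to track the effective dimensionality of the hidden states at each layer, for example via the spectral entropy of their singular values. The sketch below is a standard diagnostic, not the specific metric used in the studies above:

```python
import numpy as np

def effective_rank(hidden_states: np.ndarray) -> float:
    """Entropy-based effective rank of a (tokens x dim) matrix of hidden states.

    A sharp dip in this number across middle layers is one way a
    'compression valley' would show up in practice.
    """
    X = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)   # singular values of the centered states
    p = s / s.sum()                          # normalize the spectrum to a distribution
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()         # Shannon entropy of the spectrum
    return float(np.exp(entropy))            # exp(entropy) = effective number of dimensions

# Toy check: full-rank random states vs. states squeezed onto 8 directions
rng = np.random.default_rng(0)
full = rng.normal(size=(512, 768))
low = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 768))
print(effective_rank(full), effective_rank(low))   # large vs. small
```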

This leads to a sharp distinction between two kinds of compression: external compression, applied by the tokenizer before the model ever sees the data, and internal compression, learned by the model in its hidden representations.

Intuitively, internal compression can be richer: it discovers abstract features, syntax, semantics – exactly what matters for the downstream task. External tokenization, by contrast, usually compresses based on surface statistics (e.g. frequent n-grams).
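To make the “surface statistics” point concrete, here is a minimal sketch of the frequency-driven merging that BPE-style tokenizers perform (a toy version, not a production implementation): each step fuses whichever adjacent symbol pair is most common, with no notion of syntax or meaning.

```python
from collections import Counter

def bpe_merge_step(corpus: list[list[str]]) -> list[list[str]]:
    """One BPE merge: find the most frequent adjacent symbol pair and fuse it.

    The criterion is purely surface-statistical -- co-occurrence counts --
    with no notion of syntax or semantics.
    """
    pairs = Counter()
    for word in corpus:
        pairs.update(zip(word, word[1:]))
    if not pairs:
        return corpus
    (a, b), _ = pairs.most_common(1)[0]
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("compression"), list("compressor"), list("impression")]
for _ in range(5):
    corpus = bpe_merge_step(corpus)
print(corpus)   # frequent character pairs fuse into longer and longer units
```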


Scaling Laws Meet Tokenization

The new paper “Compute Optimal Tokenization” (Limisiewicz et al., 2026) investigates this distinction by treating the tokenization compression rate (bytes-per-token) as a variable in scaling laws. In classic scaling experiments (Kaplan et al. 2020; Hoffmann et al. 2022), researchers fixed the tokenizer and varied model size and data amount. Here, they also sweep over how aggressively the data is pre-compressed into tokens. To do this precisely, the authors use the Byte-Latent Transformer (BLT) architecture: a model that segments raw text bytes into latent tokens of a controllable average size. This lets them dial in any compression rate (from character-level up to multi-word tokens) while keeping the model’s embedding parameters the same size. In total, 988 BLT models and 320 standard subword models were trained (parameter counts 50M–6.7B, data 4B–1.1T bytes) under fixed compute budgets. Every model was evaluated on bits-per-byte (loss per byte) with a constant 8192-byte context, so losses can be compared across tokenizations.
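Bits-per-byte is what makes models with different tokenizers comparable: the total log-loss is divided by the number of raw bytes rather than the number of tokens. A small helper showing the conversion (generic formula; the variable names and example numbers are ours):

```python
import math

def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats, under any tokenization)
    into bits per byte of the underlying text."""
    return total_nll_nats / (math.log(2) * num_bytes)

# Example: the same 4096-byte document scored under two tokenizations.
doc_bytes = 4096
coarse = bits_per_byte(1_000 * 2.8, doc_bytes)   # 1,000 tokens at 2.8 nats/token
fine = bits_per_byte(4_000 * 0.7, doc_bytes)     # 4,000 tokens at 0.7 nats/token
print(coarse, fine)   # identical (~0.99 bits/byte): same total information, different tokenization
```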

The key idea: under a fixed FLOPs budget, changing tokenization changes how many tokens (and thus data bytes) you process. More compression (more bytes per token) means shorter sequences and cheaper forward passes, leaving extra compute “on the table” to either train on more data or use a bigger model. But if the tokenizer discards too much information, the model may not use that saved compute well. Conversely, very little compression (tokenizing at the byte or char level) yields extremely long sequences that are expensive to process, potentially overwhelming the model with low-level detail. Somewhere in between lies the sweet spot.
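A back-of-the-envelope way to see the trade-off uses the standard ~6 FLOPs per parameter per training token rule of thumb (the budget and model size below are illustrative, not numbers from the paper):

```python
def training_flops(n_params: float, data_bytes: float, bytes_per_token: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token.
    Tokens seen = data_bytes / bytes_per_token, so coarser tokenization makes
    the same bytes of data cheaper to train on."""
    return 6.0 * n_params * (data_bytes / bytes_per_token)

budget = 1e21        # fixed FLOPs budget (illustrative)
n_params = 1e9       # 1B-parameter model (illustrative)
for bpt in (1.0, 4.0, 8.0):   # byte-level, BPE-like, very coarse
    # Bytes of training data this budget affords at this compression rate:
    data_bytes = budget * bpt / (6.0 * n_params)
    print(f"bytes/token = {bpt:>3}: budget covers {data_bytes:.2e} bytes of data")
```

Whether that freed-up compute actually buys lower loss depends on how much usable information survives the tokenizer, which is exactly the tension the paper measures.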

Figure: For a fixed compute budget, the lowest loss (blue = better) lies along a roughly constant “bytes-per-parameter” ratio, across a range of compression rates.

In each 3D IsoFLOP surface, one axis is model size (parameters), one is data amount (bytes), and the vertical axis is loss; each yellow curve shows the tradeoff for one compression rate. Across compression rates, the best models all lie along a line of roughly constant bytes-per-parameter. In other words, the optimal scaling law is in bytes, not tokens. This means that if you increase token-level compression (fewer tokens), you should proportionally scale up data in bytes to keep performance optimal. A corollary is that popular rules of thumb (like Chinchilla’s ~20 tokens per parameter) should instead be understood as a constant ratio of bytes per parameter. In short, the model cares about raw information (bytes), not about how many tokens represent that information.
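To make the corollary concrete (the ~4 bytes per token for a typical English BPE tokenizer is an illustrative assumption, not a number from the paper):

```latex
% Chinchilla-style rule restated in bytes (illustrative numbers):
\frac{20~\text{tokens}}{\text{parameter}} \times \frac{4~\text{bytes}}{\text{token}}
  \;\approx\; \frac{80~\text{bytes}}{\text{parameter}}
% Under a finer tokenizer at 2 bytes/token, the same bytes-per-parameter target
% corresponds to ~40 tokens per parameter, not 20.
```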

The experiments confirm several surprising points, summarized in the figure below: loss as a function of compression rate is U-shaped, and the optimal compression rate shifts toward finer tokenization as the compute budget grows.

Figure: Loss vs. compression rate, for different compute budgets (colors). Each curve is U-shaped, with a clear minimum (black dot). As the compute budget grows, the optimum (dotted line) shifts toward less compression (i.e. more tokens per byte).

The plot illustrates this trade-off: each colored curve is a different compute budget, showing loss as a function of bytes-per-token. Too much compression (right) or too little (left) both hurt. Importantly, the green dotted line traces the best compression rate at each budget, and it moves leftward (toward finer tokenization) as the compute budget increases. Quantitatively, the paper observes that the optimal bytes-per-token decreases (models want smaller segments) as we scale up.
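For intuition about how such an optimum can be read off, here is a generic way to locate the minimum of a U-shaped curve by fitting a parabola in log(bytes-per-token); the data points are made up and this is not necessarily the paper’s fitting procedure:

```python
import numpy as np

# Hypothetical (bytes-per-token, loss) measurements for one compute budget.
bytes_per_token = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 12.0])
loss = np.array([0.92, 0.85, 0.81, 0.82, 0.86, 0.95])   # made-up U-shaped values

x = np.log(bytes_per_token)
a, b, c = np.polyfit(x, loss, deg=2)        # loss ≈ a·x² + b·x + c
optimal_bpt = float(np.exp(-b / (2 * a)))   # parabola vertex, mapped back to bytes/token

print(f"estimated optimal compression: {optimal_bpt:.2f} bytes/token")
```

Repeating this fit at several compute budgets would trace out the dotted line in the figure: the estimated optimum drifting toward smaller bytes-per-token as compute grows.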

In short, this study shows that the optimal tokenization is coarser than naive byte-level but finer than standard BPE, and that the “sweet spot” shifts toward finer (less compressed) tokens as models grow. Mathematically, the parameter count should scale with bytes of data, not tokens – a subtle but crucial point for formulating future scaling laws.