Modern language models never see raw text directly; they see tokens. For example, the word “compression” might be split as ["compress", "ion"], or broken all the way down into characters or bytes. This tokenization step is far from cosmetic. It fixes the sequence length, the vocabulary size, the memory footprint, and the amount of information each token carries. A coarse tokenizer (like a large Byte-Pair Encoding vocabulary) packs many bytes into one token, yielding shorter sequences but a larger vocabulary and embedding table. A fine-grained tokenizer (bytes or characters) yields very long sequences and a tiny vocabulary. The choice of tokenization therefore changes how compute is spent and how information flows through the model.
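To make the trade-off concrete, here is a minimal sketch that tokenizes the example word at three granularities and reports the resulting compression rate in bytes per token. The subword split is simply the one quoted above, hard-coded for illustration; a real BPE tokenizer would learn its merges from data and might split the word differently.

```python
# Toy comparison of tokenization granularities for a single word.
text = "compression"

byte_tokens = list(text.encode("utf-8"))  # one token per UTF-8 byte
char_tokens = list(text)                  # one token per character
subword_tokens = ["compress", "ion"]      # hypothetical subword split (hard-coded)


def bytes_per_token(token_count: int, raw: str) -> float:
    """Compression rate: UTF-8 bytes in the text divided by the number of tokens."""
    return len(raw.encode("utf-8")) / token_count


for name, tokens in [
    ("bytes", byte_tokens),
    ("chars", char_tokens),
    ("subwords", subword_tokens),
]:
    rate = bytes_per_token(len(tokens), text)
    print(f"{name:9s} tokens={len(tokens):2d}  bytes/token={rate:.2f}")
```

For ASCII text the byte and character views coincide; they diverge for scripts whose characters take several UTF-8 bytes.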
Historically, many practitioners assumed that more compression is better: fewer tokens mean fewer FLOPs and easier learning. Influential ideas supported this. The Information Bottleneck perspective suggests that representations should throw away irrelevant input details to focus on task-relevant information. Likewise, the Minimum Description Length principle holds that “the shortest description of the data is the best model,” implying that the learning process should compress data as much as possible. In everyday terms, we think of intelligence as forming concise abstractions (categories, summaries, latent features) rather than memorizing every detail. By analogy, researchers often tried adding explicit bottlenecks or compression layers, enforcing sparse or quantized representations, or using very aggressive tokenization to “help” the model compress its input before training.
However, recent large-scale models reveal a twist: very large models tend to compress internally. Studies have found that transformer hidden states collapse in mid-layers: representations become low-dimensional and low-entropy before expanding again. Phenomena like attention sinks (many attention heads focusing on special tokens such as the BOS token) and compression valleys (sharp drops in representation entropy) appear consistently in large models. Crucially, this compression is task-dependent and emerges through gradient descent, not by design. For example, inference trajectories in transformers consistently project down to very low-dimensional manifolds even though the model’s embedding space is high-dimensional, and this happens dynamically during inference, not because of an architectural bottleneck. In other words, LLMs naturally learn to throw away irrelevant details and focus on salient features internally.
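This leads to a sharp distinction between external compression, which the tokenizer imposes on the input before training, and internal compression, which the model learns for itself during training.
Intuitively, internal compression can be richer: it discovers abstract features, syntax, and semantics, which are exactly what matters for the downstream task. External tokenization, by contrast, usually compresses based on surface statistics (e.g. frequent n-grams).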
The new paper “Compute Optimal Tokenization” (Limisiewicz et al., 2026) investigates this distinction by treating the tokenization compression rate (bytes per token) as a variable in scaling laws. In classic scaling experiments (Kaplan et al., 2020; Hoffmann et al., 2022), researchers fixed the tokenizer and varied model size and data amount. Here, the authors also sweep over how aggressively the data is pre-compressed into tokens. To do this precisely, they use the Byte-Latent Transformer (BLT) architecture: a model that segments raw text bytes into latent tokens of a controllable average size. This lets them dial in any compression rate (from character-level up to multi-word tokens) while keeping the model’s embedding parameters the same size. In total, 988 BLT models and 320 standard subword models were trained (parameter counts from 50M to 6.7B, training data from 4B to 1.1T bytes) under fixed compute budgets. Every model was evaluated on bits-per-byte (loss per byte) with a constant 8192-byte context, so losses can be compared across tokenizations.
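Bits-per-byte is what makes losses comparable across tokenizers: the total negative log-likelihood of a text is converted to bits and divided by the number of raw bytes, so the token count drops out. The sketch below shows that conversion under the assumption that per-token losses are given in nats (the usual cross-entropy output). The function name and the example loss values are hypothetical and are not taken from the paper.

```python
import math


def bits_per_byte(token_nll_nats: list[float], text: str) -> float:
    """Convert summed per-token NLL (in nats) to bits, then normalize by UTF-8 bytes."""
    n_bytes = len(text.encode("utf-8"))
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes


text = "compression"

# Hypothetical per-token losses for a two-token split of the same word,
# written as (bits * ln 2) so the total is easy to read: 4.0 + 1.5 = 5.5 bits.
subword_nll = [4.0 * math.log(2), 1.5 * math.log(2)]

# Hypothetical per-byte losses for a byte-level model: 11 tokens of 0.5 bits each.
byte_nll = [0.5 * math.log(2)] * len(text.encode("utf-8"))

print(bits_per_byte(subword_nll, text))
print(bits_per_byte(byte_nll, text))
```

Both toy models spend the same total number of bits on the word, so they get the same bits-per-byte even though their per-token losses differ by an order of magnitude.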
The key idea: under a fixed FLOPs budget, changing tokenization changes how many tokens (and thus how many bytes of data) you can process. More compression (more bytes per token) means shorter sequences and cheaper forward passes, freeing compute that can be spent on training with more data or on a bigger model. But if the tokenizer discards too much information, the model may not use that saved compute well. Conversely, very little compression (tokenizing at the byte or character level) yields extremely long sequences that are expensive to process, potentially overwhelming the model with low-level detail. Somewhere in between lies the sweet spot.
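The budget arithmetic can be sketched with the common approximation that training costs about 6 FLOPs per parameter per token (the rule of thumb used in the classic scaling-law literature). This is an assumption made for illustration: the paper's own FLOPs accounting, especially for BLT's byte-level components, may differ, and the function name below is hypothetical.

```python
def bytes_processed(flops_budget: float, n_params: float, bytes_per_token: float) -> float:
    """Bytes of raw text a model can train on under a fixed FLOPs budget.

    Assumes training FLOPs ~= 6 * n_params * n_tokens (a rough approximation).
    """
    n_tokens = flops_budget / (6.0 * n_params)
    return n_tokens * bytes_per_token


budget = 1e20    # FLOPs, matching the budget quoted for the figure in this article
n_params = 1e9   # hypothetical model size

for rate in (1.0, 2.0, 4.57, 8.0):
    total = bytes_processed(budget, n_params, rate)
    print(f"bytes/token={rate:4.2f}  ->  {total:.3e} bytes seen during training")
```

Under this approximation the token count is pinned by the budget and the model size, so the number of bytes seen grows linearly with the compression rate. Whether those extra bytes actually lower the loss is the empirical question the paper addresses.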

Figure: Optimal bytes-per-parameter ratio across compression rates, for a fixed training budget of 10^20 FLOPs. The lowest loss (blue = better) lies along a roughly constant bytes-per-parameter ratio.
Across the range of compression rates, the best models all lie along a line of roughly constant bytes-per-parameter. In other words, the optimal scaling law is in bytes, not tokens. This means that the data budget should be planned in bytes rather than in tokens: a tokenizer that packs more bytes into each token needs proportionally fewer tokens per parameter to stay at the optimum. A corollary is that popular rules of thumb (like 20 tokens per parameter in Chinchilla) should instead be understood as a constant ratio of bytes per parameter. In short, the model cares about raw information (bytes), not how many tokens represent that information.
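As a rough illustration of what a constant bytes-per-parameter ratio implies, the sketch below converts a tokens-per-parameter rule into bytes and back. Pairing the 20 tokens-per-parameter rule with a rate of about 4.57 bytes per token is an assumption made here for the arithmetic only; the ratio the paper actually fits may be different.

```python
def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Tokens-per-parameter implied by a fixed bytes-per-parameter target."""
    return bytes_per_param / bytes_per_token


# Assumption for illustration: 20 tokens/param measured with a ~4.57 bytes/token tokenizer.
reference_bytes_per_param = 20 * 4.57

for rate in (1.0, 2.0, 4.57, 8.0):
    tpp = tokens_per_param(reference_bytes_per_param, rate)
    print(f"bytes/token={rate:4.2f}  ->  tokens/param={tpp:6.2f}")
```

The byte target stays fixed while the token target moves inversely with the compression rate, which is why a tokens-per-parameter rule only makes sense for the tokenizer it was measured with.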
The experiments confirm several surprising points:
Optimal compression is weaker than BPE. On many budgets, models with less aggressive tokenization beat standard BPE. For example, Table 2 of the paper shows that BPE (about 4.57 bytes/token) was not always the best choice: slightly coarser or finer tokenizations often gave lower loss. In fact, the study finds that the “sweet spot” compression rate is typically lower (fewer bytes per token, and therefore more tokens) than what popular BPE tokenizers use. Put bluntly: “BPE compresses too much.” The optimal number of tokens per byte is higher than what GPT-style tokenizers produce.

Comparison of the lowest BPB obtained by subword tokenized models for specific compute budgets.
Optimal compression decreases with scale. As models get larger (and budgets grow), the best compression rate becomes even lower (more granular tokens). The loss-versus-compression curves form a clear U-shape: with very low compression (character-level, left side) loss is high due to inefficient long sequences; with very high compression (giant multi-word tokens, right side) loss is high due to lost information. The minimum of that U moves left as compute increases. In other words, bigger models prefer to do more of the compression work themselves rather than have it done by the tokenizer. The paper fits a scaling law and explicitly shows that “the optimal compression rate decreases as the training budget increases.”

Figure: Power law estimating loss given compression rate and compute budget. Each colored curve shows loss as a function of bytes-per-token for one compute budget.
Each curve is U-shaped, with a clear minimum (black dot): too much compression (right) or too little (left) both hurt. The line of stars traces the best compression rate at each budget, and it moves leftward (toward finer tokenization, i.e. more tokens per byte) as the compute budget grows. Quantitatively, the paper observes that the optimal bytes-per-token decreases (models want smaller segments) as we scale up.
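One simple way to read an optimum off a U-shaped curve like these is to fit a parabola through the lowest measured point and its two neighbours, working in log bytes-per-token. The sketch below does this on a synthetic curve generated from a known formula; the curve, its constants, and the function names are all hypothetical, and this is not the power-law fit the paper uses.

```python
import math


def parabola_vertex(xs: list[float], ys: list[float]) -> float:
    """x-coordinate of the vertex of the parabola through three (x, y) points."""
    x0, x1, x2 = xs
    y0, y1, y2 = ys
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    return x1 - 0.5 * num / den


def optimal_rate(rates: list[float], losses: list[float]) -> float:
    """Estimate the loss-minimizing bytes-per-token from sampled (rate, loss) pairs."""
    i = min(range(len(losses)), key=losses.__getitem__)
    i = min(max(i, 1), len(rates) - 2)  # keep a full three-point window
    log_rates = [math.log(r) for r in rates[i - 1 : i + 2]]
    return math.exp(parabola_vertex(log_rates, losses[i - 1 : i + 2]))


# Synthetic U-shaped curve with a known minimum, used only to check the method.
TRUE_OPTIMUM = 3.0  # hypothetical bytes/token


def toy_loss(rate: float) -> float:
    return 0.9 + 0.05 * (math.log(rate) - math.log(TRUE_OPTIMUM)) ** 2


rates = [1.0, 2.0, 4.0, 8.0, 16.0]
losses = [toy_loss(r) for r in rates]
print(optimal_rate(rates, losses))
```

Because the toy curve is exactly quadratic in log rate, the estimate recovers the known minimum even though no sampled rate sits on it. Real loss curves are noisier, so a fit over all points per budget would be more robust than a three-point window.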