Where Should AI Compress? Inside the Model or Before It Hits the Neural Net?

Modern language models never see raw text directly; they see tokens. For example, the word “compression” might be split as ["compress", "ion"], or broken all the way down into bytes or characters. This tokenization step is far from cosmetic. It determines the sequence length, vocabulary size, memory footprint, and amount of information carried per token. A coarse tokenizer (like a large Byte-Pair Encoding vocabulary) packs many bytes into one token, yielding shorter sequences at the cost of a larger embedding table. A fine-grained tokenizer (like bytes or characters) yields very long sequences and a tiny vocabulary. The choice of tokenization therefore changes how compute is spent and how information flows through the model.
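
To make the granularity trade-off concrete, here is a minimal Python sketch that measures compression rate in bytes per token at two extremes. The `word_tokens` function is a toy stand-in for a coarse tokenizer, not real BPE, and the sample sentence and function names are illustrative assumptions.

```python
import re

def byte_tokens(text: str) -> list[bytes]:
    """Finest granularity: one token per UTF-8 byte."""
    return [bytes([b]) for b in text.encode("utf-8")]

def word_tokens(text: str) -> list[bytes]:
    """Toy coarse tokenizer (an assumption, not real BPE): each word keeps
    its leading whitespace, and punctuation runs become their own tokens."""
    pieces = re.findall(r"\s*\w+|\s*[^\w\s]+", text)
    return [p.encode("utf-8") for p in pieces]

def bytes_per_token(tokens: list[bytes]) -> float:
    """Compression rate: average number of raw bytes carried by one token."""
    return sum(len(t) for t in tokens) / len(tokens)

sample = "Tokenization fixes the sequence length."
for name, tokenize in [("byte-level", byte_tokens), ("word-level", word_tokens)]:
    tokens = tokenize(sample)
    print(f"{name}: {len(tokens)} tokens, {bytes_per_token(tokens):.2f} bytes/token")
```

The same 39 bytes become 39 tokens at byte level and 6 tokens at word level, so the coarse tokenizer shortens the sequence by the same factor by which it raises bytes per token.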

Historically, many practitioners assumed that more compression is better: fewer tokens means fewer FLOPs and easier learning. Influential ideas supported this. For instance, the Information Bottleneck perspective suggests representations should throw away irrelevant input details to focus on task-relevant information【31†L46-L54】. Likewise, the Minimum Description Length principle holds that “the shortest description of the data is the best model”【28†L132-L139】, implying the learning process should compress data as much as possible. In everyday terms, we think of intelligence as forming concise abstractions (categories, summaries, latent features) rather than memorizing every detail. By analogy, researchers often tried adding explicit bottlenecks or compression layers, enforcing sparse or quantized representations, or using very aggressive tokenization to “help” the model by compressing the input before training.

However, recent large-scale models reveal a twist: very large models tend to compress internally. Studies have found that transformer hidden states collapse in mid-layers: representations become low-dimensional and low-entropy before expanding again【9†L182-L190】. Phenomena like attention sinks (many attention heads focusing on special tokens) and compression valleys (sharp drops in representation entropy) appear universally in big models【9†L182-L190】. Crucially, this compression is task-dependent and emerges through gradient descent, not by design. For example, inference trajectories in transformers consistently project down to very low-dimensional manifolds, even though the model’s embedding space is high-dimensional – and this happens dynamically during inference, not because of an architectural bottleneck【19†L163-L168】【9†L185-L190】. In other words, LLMs naturally learn to throw away irrelevant details and focus on salient features internally.

This leads to a sharp distinction:

External compression: Reducing data size before it reaches the model, via a tokenizer (BPE, WordPiece), a codec such as JPEG, or a fixed bottleneck. It is static and task-agnostic.
Internal compression: The model learning to compress through training (feature pruning, low-rank structure, entropy reduction). It is adaptive, aligned with the prediction task, and shaped by gradients.

Intuitively, internal compression can be richer: it discovers abstract features, syntax, semantics – exactly what matters for the downstream task. External tokenization, by contrast, usually compresses based on surface statistics (e.g. frequent n-grams)【16†L104-L108】.

Scaling Laws Meet Tokenization

The new paper “Compute Optimal Tokenization” (Limisiewicz et al., 2026) investigates this distinction by treating the tokenization compression rate (bytes-per-token) as a variable in scaling laws. In classic scaling experiments (Kaplan et al. 2020; Hoffmann et al. 2022), researchers fixed the tokenizer and varied model size and data amount. Here, the authors also sweep over how aggressively the data is pre-compressed into tokens. To do this precisely, they use the Byte-Latent Transformer (BLT) architecture: a model that segments raw text bytes into latent tokens of a controllable average size【7†L188-L197】. This lets them dial in any compression rate (from character-level up to multi-word tokens) while keeping the model’s embedding parameters the same size. In total, 988 BLT models and 320 standard subword models were trained (parameter counts 50M–6.7B, data 4B–1.1T bytes) under fixed compute budgets【1†L326-L334】. Every model was evaluated on bits-per-byte (loss normalized by byte count rather than token count) with a constant 8192-byte context, so losses can be compared across tokenizations【1†L347-L352】.
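
Bits-per-byte is what makes losses comparable across tokenizers, and the conversion from an ordinary per-token loss is short. The sketch below assumes the per-token loss is a mean cross-entropy in nats; the two loss values are hypothetical and chosen only to illustrate the conversion, while 4.57 bytes/token is the BPE rate cited later in this article.

```python
import math

def bits_per_byte(nats_per_token: float, bytes_per_token: float) -> float:
    """Convert mean cross-entropy per token (in nats) to bits per byte."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# Hypothetical losses, chosen only to illustrate the conversion.
coarse = bits_per_byte(nats_per_token=3.0, bytes_per_token=4.57)
fine = bits_per_byte(nats_per_token=2.2, bytes_per_token=3.0)
print(f"coarse tokenizer: {coarse:.3f} bits/byte")
print(f"fine tokenizer:   {fine:.3f} bits/byte")
```

In this made-up example the finer tokenizer has the lower per-token loss yet the higher bits-per-byte, which is why per-token losses cannot be compared directly across tokenizations.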

The key idea: under a fixed FLOPs budget, changing tokenization changes how many tokens (and thus data bytes) you process. More compression (more bytes per token) means shorter sequences and cheaper forward passes, freeing compute that can go toward training on more data or using a bigger model. But if the tokenizer discards too much information, the model may not use that saved compute well. Conversely, very little compression (tokenizing at the byte or character level) yields extremely long sequences that are expensive to process, potentially overwhelming the model with low-level detail. Somewhere in between lies the sweet spot.
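
A rough sketch of that accounting, assuming the common approximation of about 6 FLOPs per parameter per training token for a dense transformer. That rule of thumb comes from the general scaling-law literature; it is an assumption here, not the paper's own FLOP accounting for BLT, and the budget and model size below are made up.

```python
def bytes_seen(flops_budget: float, n_params: float, bytes_per_token: float) -> float:
    """Raw bytes a model can train on under C ~ 6 * N * tokens (an approximation)."""
    tokens = flops_budget / (6.0 * n_params)
    return tokens * bytes_per_token

BUDGET = 1e21    # hypothetical training FLOPs
N_PARAMS = 1e9   # hypothetical model size

for rate in (1.0, 2.0, 4.57, 8.0):
    gigabytes = bytes_seen(BUDGET, N_PARAMS, rate) / 1e9
    print(f"{rate:>5.2f} bytes/token -> {gigabytes:,.0f} GB of text")
```

At a fixed model size, doubling bytes-per-token doubles the text covered by the same budget; equivalently, the saved FLOPs could go to a larger model instead.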

Figure: IsoFLOP surfaces for a fixed compute budget. One axis is model size (parameters), one is data amount (bytes), and the vertical axis is loss (blue = better). Each yellow curve shows the tradeoff for one compression rate.

Across compression rates, the lowest-loss models all lie along a line of roughly constant bytes-per-parameter. In other words, the optimal scaling law is in bytes, not tokens【16†L104-L108】【1†L369-L374】. This means that if you change the compression rate, the token budget should change inversely so that the bytes seen per parameter stay the same. A corollary is that popular rules-of-thumb (like 20 tokens per parameter in Chinchilla) should instead be understood as a constant ratio of bytes per parameter. In short, the model cares about raw information (bytes), not how many tokens represent that information【16†L104-L108】【1†L369-L374】.
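
As a worked illustration of that corollary, the sketch below converts a tokens-per-parameter rule into bytes per parameter and back. Pairing Chinchilla's 20 tokens per parameter with the 4.57 bytes/token BPE rate cited in this article is an assumption made for illustration, since the two numbers come from different tokenizers.

```python
def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Token budget per parameter implied by a fixed bytes-per-parameter ratio."""
    return bytes_per_param / bytes_per_token

# Assumed anchor: 20 tokens/param at 4.57 bytes/token (illustrative pairing).
BYTES_PER_PARAM = 20 * 4.57

for rate in (1.0, 3.0, 4.57, 8.0):
    ratio = tokens_per_param(BYTES_PER_PARAM, rate)
    print(f"{rate:>5.2f} bytes/token -> {ratio:.1f} tokens/param")
```

Under this assumption the rule works out to roughly 91 bytes per parameter, and the matching tokens-per-parameter figure rises or falls as the tokenizer becomes finer or coarser.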

The experiments confirm several surprising points:

Optimal compression is weaker than BPE. At many compute budgets, models with less aggressive tokenization beat standard BPE. For example, Table 2 of the paper shows that BPE (about 4.57 bytes/token) was not always the best choice: slightly coarser or finer tokenizations often gave lower loss【17†L630-L637】. In fact, the study finds that the “sweet spot” compression rate is typically smaller (i.e. more tokens) than what popular BPE models use. Put bluntly: “BPE compresses too much.” The optimal tokens-per-byte is higher than what GPT-style tokenizers produce【16†L104-L108】.
Optimal compression decreases with scale. As models get larger (and budgets grow), the best compression rate becomes even lower (more granular tokens). The loss versus compression curves form a clear U-shape: with very low compression (character-level, left side) loss is high due to inefficient long sequences; with very high compression (giant multi-word tokens, right side) loss is high due to lost information. The minimum of that U moves left as compute increases. In other words, bigger models prefer to do more of the compression work themselves rather than have it done by the tokenizer. The paper fits a scaling law and explicitly shows that “the optimal compression rate decreases as the training budget increases”【20†L530-L532】【2†L84-L90】.

Figure: Loss vs. compression rate (bytes-per-token) for different compute budgets (colors). Each curve is U-shaped, with a clear minimum (black dot). As the compute budget grows, the optimum (dotted line) shifts toward less compression (i.e. more tokens per byte).

Too much compression (right) or too little (left) both hurt. Importantly, the green dotted line traces the best compression rate at each budget, and it moves leftward (toward finer tokenization) for larger compute budgets. Quantitatively, the paper observes that the optimal bytes-per-token decreases (models want smaller segments) as compute scales up【20†L530-L532】【2†L84-L90】.
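
One simple way to locate the bottom of such a U-shaped curve from a handful of measured points is to fit a parabola in log(bytes-per-token) through the best grid point and its two neighbours. This is a generic interpolation sketch, not the paper's fitting procedure; the function name is hypothetical and the curve below is synthetic, with its minimum placed at 3 bytes/token by construction.

```python
import math

def optimal_rate(rates: list[float], losses: list[float]) -> float:
    """Estimate the loss-minimizing bytes-per-token by fitting a parabola, in
    log(rate), through the best grid point and its two neighbours.
    Assumes `rates` is sorted in ascending order."""
    i = min(range(len(losses)), key=losses.__getitem__)
    if i == 0 or i == len(rates) - 1:
        return rates[i]  # minimum on the grid edge: no bracket to refine
    x0, x1, x2 = (math.log(r) for r in rates[i - 1 : i + 2])
    y0, y1, y2 = losses[i - 1 : i + 2]
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:
        return rates[i]  # flat bracket: keep the grid point
    return math.exp(x1 - 0.5 * num / den)

# Synthetic U-shaped curve with its minimum placed at 3 bytes/token.
rates = [1.0, 2.0, 4.0, 8.0, 16.0]
losses = [0.9 + 0.05 * (math.log(r) - math.log(3.0)) ** 2 for r in rates]
print(f"estimated optimum: {optimal_rate(rates, losses):.2f} bytes/token")
```

Repeating this for each compute budget and plotting the estimates would trace a line like the dotted one in the figure, although real loss curves are noisier than this synthetic parabola.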

Smaller models behave differently. At low compute budgets, a bit more compression actually helps (since tiny models struggle with very long sequences). But beyond that, the trend is clear: the more powerful the model, the rawer the input it prefers. This mirrors a classic theme: in older methods, we often hand-engineered features to help weak learners. In deep learning, strong models often do better with rawer, lower-level inputs. Here, massive transformers can learn linguistic abstractions internally, so we should give them the chance rather than pre-compressing too much.
Tokenizer vs. model: a FLOPs tradeoff. With a fixed compute budget, every bit of cost saved by the tokenizer is compute the model does not spend on its own internal compression. If the tokenizer does all the heavy lifting, the model has less work (and less opportunity) to learn structure from the data. Conversely, if the tokenizer does too little, the model may waste compute on trivial pattern-matching of raw bytes. The experiments suggest the best allocation is a balance, but one shifted toward the model’s side, especially at scale. The paper’s framing is telling: internal, task-adaptive compression appears more valuable than static, corpus-based compression【16†L104-L108】.
Cross-lingual variation. Interestingly, these findings aren’t limited to English. When the authors repeated the study on languages like French, Arabic, Hindi, etc., they found the optimal compression rate depends on the language’s “parity” (how many bytes it uses to say the same thing as English)【22†L107-L115】【2†L119-L123】. Rare or verbose languages tended to favor more compression. Popular BPE models (Llama, Qwen, etc.) turned out to overcompress high-resource languages (English, Arabic) and undercompress low-resource ones (Hindi, Vietnamese) compared to the optimum【17†L630-L637】【22†L107-L115】. The takeaway: optimal tokenization is language-dependent, but again it rarely matched what default BPE does【22†L107-L115】【2†L119-L123】.
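
The parity idea can be sketched as a ratio of UTF-8 byte counts for text that says the same thing in two languages. This follows the informal description above, and the paper's exact definition may differ. The function names are hypothetical, and a real estimate would average over a large parallel corpus rather than a single greeting.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

def parity(target: str, english: str) -> float:
    """Bytes used by the target-language text per byte of its English equivalent."""
    return utf8_len(target) / utf8_len(english)

# Illustrative pair: the Hindi greeting "namaste" (written with escapes) vs. "Hello".
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"
english = "Hello"
print(f"{utf8_len(hindi)} bytes vs {utf8_len(english)} bytes "
      f"-> parity {parity(hindi, english):.1f}")
```

Devanagari code points take three bytes each in UTF-8, so even a short Hindi word costs several times the bytes of its English counterpart, which is the kind of gap a byte-based compression target has to account for.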

In short, this study shows the optimal tokenization is coarser than naive byte-level but finer than standard BPE, and that the “sweet spot” shifts toward finer (less compressed) tokens as models grow. Mathematically, the parameter count should scale with bytes of data, not tokens【16†L104-L108】 – a subtle but crucial point for formulating future scaling laws.