The Right Answer, the Wrong Direction: Why Transformers Fail at Counting

Why transformers fail at counting and how to fix it

The question started fairly simple. Ask a transformer to count objects in a paragraph. "There are three cats near the pond. There are two dogs. There are three cats. How many cats?" Answer: six. A human gets it instantly. Qwen3-8B, a reasonably capable model, gets it right about 14% of the time, barely above the 11% chance rate for a nine-way multiple choice. Fine, transformers aren't great at counting, just as they struggle with how many R's are in "strawberry". That's not news.

The surprise was what happened when I looked inside.

The Thing That Shouldn't Be Possible

I trained linear probes on the model's hidden states. Simple ridge regression. For every layer, for every position, can a linear function read the count from the internal representation? The answer was yes. In fact, R² > 0.99 from layer two onward. The model stores the count accurately, precisely, and redundantly through all 36 layers.

And then it gets the answer wrong.

This is odd. Much of the interpretability field operates on an implicit assumption: if you can decode information from hidden states, the model is using that information. I just found a clear counterexample. The model has the answer. It computed it. It propagated it. And then it just didn't output it.

Something was blocking the route between the representation and the output. I wanted to figure out what.
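For readers who haven't used probes, here is a minimal sketch of a ridge-regression probe of the kind described above. It is not the code from the paper. The "hidden states" are synthetic (a count written along one random direction plus noise), and the function name `ridge_probe_r2`, the toy dimensions, and the regularization strength are all illustrative assumptions.

```python
import numpy as np

def ridge_probe_r2(H_train, y_train, H_test, y_test, alpha=1.0):
    """Fit a ridge-regression probe on hidden states and return held-out R^2."""
    # Center features and targets so no explicit intercept column is needed.
    mu_h, mu_y = H_train.mean(axis=0), y_train.mean()
    Hc, yc = H_train - mu_h, y_train - mu_y
    d = Hc.shape[1]
    # Closed-form ridge solution: w = (H^T H + alpha I)^{-1} H^T y
    w = np.linalg.solve(Hc.T @ Hc + alpha * np.eye(d), Hc.T @ yc)
    pred = (H_test - mu_h) @ w + mu_y
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, w

# Synthetic stand-in for one layer's hidden states (not real model data):
# the count is written along a single unit direction, plus small noise.
rng = np.random.default_rng(0)
d_model, n = 64, 600
count_dir = rng.standard_normal(d_model)
count_dir /= np.linalg.norm(count_dir)
counts = rng.integers(1, 10, size=n).astype(float)
H = 0.1 * rng.standard_normal((n, d_model)) + np.outer(counts, count_dir)

r2, w = ridge_probe_r2(H[:400], counts[:400], H[400:], counts[400:])
print(f"held-out R^2 = {r2:.3f}")
```

On a real model, `H` would hold one layer's hidden states at the chosen position and `counts` the ground-truth answers. A high held-out R² says the count is linearly decodable. It says nothing yet about whether the output head can read it.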

The recent Godey and Artzi paper had caught my attention a while back. They showed that the LM head, the final projection from hidden states to vocabulary tokens, acts as a gradient bottleneck during training. Backpropagating through it compresses gradients, suppressing 95 to 99 percent of the update signal. When I started probing the internal representations and noticed this strange misalignment, I wondered whether it might be a consequence of that same bottleneck. A direction the model learned to write into during training, but never learned to read from.

The Geometry of Silence

Quick background on how language model output works. The final hidden state lives in a 4096-dimensional vector space. The output head is a weight matrix where each row corresponds to one token in the vocabulary. The dot product between the hidden state and each token row determines which token gets produced next. Simple.
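Because the readout is nothing more than a set of dot products, any component of the hidden state that is orthogonal to a token's row contributes nothing to that token's score. The toy sketch below illustrates this with made-up sizes (16 dimensions, 5 tokens), where an exactly invisible direction exists. In a real model the vocabulary is larger than the hidden size, so the relevant question is near-orthogonality to specific rows, not an exact null space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 16, 5                          # toy sizes, not 4096 / 152K
W_out = rng.standard_normal((vocab, d_model))   # one row per token

# Find a direction orthogonal to every row of W_out. The trailing
# right-singular vectors of W_out span its null space (vocab < d_model here).
_, _, Vt = np.linalg.svd(W_out)
hidden_dir = Vt[-1]

h = rng.standard_normal(d_model)
logits_before = W_out @ h
# Write a "count" of 7 along the invisible direction.
logits_after = W_out @ (h + 7.0 * hidden_dir)

# The hidden state changed a lot, but every token score is unchanged.
print(np.max(np.abs(logits_after - logits_before)))
```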

My question became (unfortunately) geometric: what's the cosine similarity between the direction the model uses to represent a count, and the direction the output head uses to read out digits?

If those two vectors point in roughly the same direction, the count should be readable. If they point in different directions, the count gets lost on the way to the output.

The answer: |cos| ≤ 0.032 across all layers. Across four different probing methods. Across three model families. To calibrate that number, two random vectors in 4096-dimensional space have an expected absolute cosine of about 0.012. The count direction is essentially indistinguishable from random with respect to the digit output rows. The model stores the count, but it stores it in a direction the output mechanism effectively cannot see.
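The 0.012 calibration figure can be checked directly. For two independent random directions in d dimensions, the expected absolute cosine is approximately sqrt(2 / (πd)) when d is large. The sketch below compares that approximation with a small simulation. The helper names are illustrative, and Gaussian random vectors are an assumption about what "random" means here.

```python
import math
import numpy as np

def expected_abs_cosine(d):
    """Large-d approximation of E|cos| between two independent random directions."""
    return math.sqrt(2.0 / (math.pi * d))

def abs_cosine(u, v):
    """Absolute cosine similarity between two vectors."""
    return abs(float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

d = 4096
rng = np.random.default_rng(0)
samples = [
    abs_cosine(rng.standard_normal(d), rng.standard_normal(d))
    for _ in range(2000)
]
print(f"analytic  E|cos| ~ {expected_abs_cosine(d):.4f}")
print(f"empirical E|cos| ~ {np.mean(samples):.4f}")
```

The same reasoning gives a standard deviation of about 1/sqrt(4096) ≈ 0.016 for a single signed cosine, which is a useful yardstick for reading the 0.032 bound. To compare a probe direction against the digit rows, apply the same `abs_cosine` to the probe weight vector and each of the nine rows.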

The Test

If this diagnosis is right, a very specific intervention should work. Fix the digit rows of the output head. Touch nothing else.

I modified only the 9 rows of the output matrix that correspond to the digit tokens "1" through "9". That's 36,864 parameters (9 × 4096) out of roughly 620 million in the output matrix. I trained them on counting data for 300 steps.

Under constrained evaluation (the model has to pick from digit tokens):

Entity counting: 13.7% → 60.7%
Character counting: 49.3% → 98.0%
List length: 57.7% → 99.2%

A control where I modified 9 random non-digit rows produced no improvement. A control where I shuffled which digit row got modified also failed. The effect is specific to the correct rows, supporting the geometric diagnosis cleanly.
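To make the intervention concrete, here is a toy sketch of "train only the digit rows, freeze everything else". It is an illustration under stated assumptions, not the paper's training code. The sizes are tiny, the token ids for the digits are hypothetical, and the counts are encoded along nine orthonormal directions that the initial output matrix is constructed to be blind to.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 20                 # toy sizes
digit_ids = np.arange(1, 10)            # hypothetical token ids for "1".."9"

# Toy encoding (an assumption): count c is written along its own direction E[c-1].
E = np.linalg.qr(rng.standard_normal((d_model, 9)))[0].T   # 9 orthonormal rows

def make_batch(n):
    counts = rng.integers(1, 10, size=n)
    H = 3.0 * E[counts - 1] + 0.1 * rng.standard_normal((n, d_model))
    return H, counts

# Output matrix whose rows are all orthogonal to the count subspace:
# the information is in H, but no row of W can read it.
W = 0.1 * rng.standard_normal((vocab, d_model))
W -= (W @ E.T) @ E
W_init = W.copy()

def constrained_accuracy(W, H, counts):
    digit_logits = H @ W[digit_ids].T   # only digit tokens compete
    return float(np.mean(digit_ids[digit_logits.argmax(axis=1)] == counts))

H_train, c_train = make_batch(900)
H_test, c_test = make_batch(300)
acc_before = constrained_accuracy(W, H_test, c_test)

onehot = np.eye(9)[c_train - 1]
for _ in range(300):
    logits = H_train @ W[digit_ids].T
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - onehot).T @ H_train / len(H_train)   # shape (9, d_model)
    W[digit_ids] -= 0.5 * grad                           # update the 9 digit rows only

acc_after = constrained_accuracy(W, H_test, c_test)
print(f"trainable parameters: {len(digit_ids) * d_model}")
print(f"constrained accuracy: {acc_before:.3f} -> {acc_after:.3f}")
```

In this toy, training nine non-digit rows instead would leave `constrained_accuracy` untouched by construction, since those rows never enter the digit logits. That mirrors the random-row control in spirit, but is far weaker evidence than the real experiment.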

But alas, there's a catch.

The Catch

The 9-row repair works beautifully under constrained next-token evaluation. But if I let the model generate freely (normal autoregressive decoding), accuracy drops to 0.0%. The model produces chain-of-thought reasoning instead of a digit. "Let's see, I need to count the cats..."

This isn't really a contradiction. It's what the geometry says it should be. The 9-row repair fixes the output layer mapping. But each generation step presents a new hidden state to the output head. Unless the count direction has been amplified upstream in the residual stream, the correction can't generalize across autoregressive steps.

The repair proves where the problem is but doesn't fix it for real use.

A quick aside on two techniques I use below. LoRA (Low-Rank Adaptation) is a method for fine-tuning large models cheaply. Instead of updating the full weight matrices (millions of parameters), you train two small matrices whose product is added to the original weights. Only a tiny fraction of parameters change, and the rest of the model stays frozen. Logit lens is simpler: you take the hidden state from any intermediate layer and multiply it directly by the final output projection (the lm_head). This tells you what the model would predict if it stopped at that layer. A window into what each layer "thinks" the answer is.
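As a rough illustration of the LoRA idea, the sketch below wraps a frozen weight matrix with a trainable low-rank update. The class name `LoRALinear` and the toy shapes are made up for this example. The initialization (small random A, zero B) and the alpha / r scaling follow the common convention, which may differ from what any particular library or the paper's code does.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r=16, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen, never updated
        self.A = 0.01 * rng.standard_normal((r, d_in))    # trainable
        self.B = np.zeros((d_out, r))                     # trainable, zero at init
        self.scale = alpha / r

    def __call__(self, x):
        # Same result as x @ (W + scale * B @ A).T, without forming the full matrix.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 32))          # toy projection: 32 -> 8
layer = LoRALinear(W, r=4, alpha=8.0)
x = rng.standard_normal((3, 32))

y0 = layer(x)                             # B is zero, so this equals x @ W.T
layer.B = 0.1 * rng.standard_normal(layer.B.shape)   # pretend training moved B
y1 = layer(x)
print(layer.num_trainable(), "trainable vs", W.size, "frozen")
```

Because B starts at zero, the wrapped layer reproduces the original model exactly before training. Only A and B would receive gradient updates.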

The Fix That Actually Works

To make counting work in real generation, you need to go upstream. I applied rank-16 LoRA to the Q/V attention projections (7.67M parameters, 200 training steps):

Full vocabulary: 91.7% ± 4.5%
Greedy generation: 83.1% ± 7.2%
Generation gap: zero
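The 7.67M parameter figure can be reconstructed with a little arithmetic. The 36 layers and the hidden size of 4096 come from the text above. The projection shapes are my assumption about the model's attention layout (a 4096 → 4096 query projection and a 4096 → 1024 value projection, as with grouped-query attention) and are not stated in the post.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters for one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# Assumed shapes (see the note above): only n_layers and hidden are from the post.
n_layers, hidden, q_out, v_out, rank = 36, 4096, 4096, 1024, 16

per_layer = lora_params(hidden, q_out, rank) + lora_params(hidden, v_out, rank)
total = n_layers * per_layer
print(f"{total:,} trainable parameters (~{total / 1e6:.2f}M)")
# prints "7,667,712 trainable parameters (~7.67M)"
```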

The logit lens shows why. Before LoRA, the correct digit's median vocabulary rank is 55,980. It's lost in a sea of 152K competing tokens. After LoRA, the rank drops to 1. The output head now reads the count as its top choice. That's a 55,980x improvement in rank from making the attention layer routing just slightly more aligned with the output projection.
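A minimal sketch of how such a rank can be computed, following the logit lens description given earlier: project a hidden state through the output matrix and count how many tokens outscore the correct one. The function names and the toy identity matrix are illustrative. Many logit lens implementations also apply the model's final normalization layer before the projection, and that step is left out here.

```python
import numpy as np

def logit_lens_rank(hidden, W_out, target_id):
    """Rank (1 = top) of target_id when a hidden state is projected
    straight through the output matrix."""
    logits = W_out @ hidden
    # Count the tokens that score strictly higher than the target.
    return int(np.sum(logits > logits[target_id])) + 1

def median_rank(hiddens, W_out, target_ids):
    """Median logit lens rank of the correct token over a set of examples."""
    ranks = [logit_lens_rank(h, W_out, t) for h, t in zip(hiddens, target_ids)]
    return float(np.median(ranks))

# Toy check with a 5-token vocabulary and an identity output matrix,
# so each logit is just the matching hidden coordinate.
W_out = np.eye(5)
h = np.array([0.1, 0.9, 0.5, 0.3, 0.0])
print(logit_lens_rank(h, W_out, target_id=1))   # top-scoring token
print(logit_lens_rank(h, W_out, target_id=2))   # one token scores higher
```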

The mechanism is cleanly verified at three points. At layer 2 (where counts are encoded), LoRA leaves the probe direction unchanged. At layer 35 (what the output head reads from), ridge probe R² rises from 0.974 to 0.998. At the logit lens level, rank drops from 55,980 to 1. LoRA doesn't change what gets encoded. It changes how the encoded signal reaches the output.

The Honest Failures

This paper went through some painful corrections before converging.

At one point, I ran an ablation to prove Q/V was uniquely the right architectural target. It showed the opposite. LoRA on FFN-only got 96.0% accuracy while Q/V got 63.5%. I stared at that result for a while thinking I had to rework the paper. But the logit lens showed something more interesting. FFN-only had a logit lens rank of 3,384 (barely improved from baseline). Q/V had rank 9. The dissociation between accuracy and alignment actually strengthened the geometric argument: FFN works through general capacity, Q/V works through routing realignment. Both are effective. Only one is mechanistically interpretable.

And then there was the pipeline figure. A simple diagram showing the bottleneck flow took eight iterations to get right. The connectors kept overlapping. The arrows detached. The boxes wouldn't center. Getting it right took more effort than any figure reasonably should.

What This Means Beyond Counting

I think the readout bottleneck is general. The pattern is:

1. Probes succeed (R² > 0.99). The information is there.
2. Native generation fails (≤14% baseline). The model can't use it.
3. |cos| with output rows is near random (≤0.032). The geometry explains why.
4. Targeted output head repair works. Causal confirmation.
5. Upstream routing correction works in generation. Deployable fix.

This pattern held across four tasks and three model families. It held from 0.4B to 14B parameters. It did not hold for MMLU, GSM8K, or DROP. Multi-step reasoning tasks don't pre-encode the answer at the prompt boundary. There's nothing there to be misaligned.

The implication: many reasoning failures in language models may be less about missing internal computation and more about readout geometry. The model computes useful intermediate variables but stores them in directions the output mechanism can't easily access. Probe success alone is not enough evidence that a model can use a variable at generation time.

The Paper & Code

"The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It"
Published on arXiv May 2026.


All experiments are fully reproducible in PyTorch on CPU. No GPUs needed. Code, data, and trained models are available in the repository.

As for the background on this page: behind the text is a toy model of the geometric readout bottleneck. Two perpendicular streams of particles cross paths: red particles drift horizontally, blue particles drift vertically. The red horizontal stream represents the direction in which the model encodes count information. The blue vertical stream is the direction the output head uses to read digit tokens. They pass through each other without merging, which is the whole problem: the model stores the count in a direction the output mechanism literally cannot read. Moving your mouse across the page briefly deflects both streams, a visual analogue of the repair intervention. It nudges them toward alignment, but only at the point of contact. To make the fix stick, you need to go upstream. It's the simplest of the three backgrounds on this site, but I think it captures the idea pretty well.