
Sorry, but that's false. You're conflating three different things: transformers as an architecture, auto-regressive generation, and padding during training.

Standard transformers take in an arbitrary input size and run blocks (self- and possibly cross-attention, positional encoding, MLPs) that don't care about that length.

> They also have a fixed output of one probability distribution for the next one token.

No, in most implementations, they output a probability distribution for every token in the input. If you input 512 tokens, you get 512 probability distributions. You can input however many tokens you want - 1, 2048, one million, it's the same thing (although since standard self-attention scales quadratically you'll eventually run out of memory). Modern relative embeddings like RoPE can support infinite length although the quality will degrade if you extrapolate too far beyond what the model saw during training.
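To make "one distribution per input token" concrete, here's a minimal numpy sketch. The "model" is just an embedding table and an output projection standing in for a real transformer (all names and dimensions are made up for illustration); the point is only the shapes: N tokens in, N distributions out, regardless of N.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, not from any real model.
vocab_size, d_model = 10, 8

embed = rng.normal(size=(vocab_size, d_model))    # token embedding table
unembed = rng.normal(size=(d_model, vocab_size))  # output projection ("LM head")

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward(token_ids):
    """Embed -> (the transformer blocks would go here) -> project to vocab."""
    h = embed[token_ids]   # (seq_len, d_model)
    logits = h @ unembed   # (seq_len, vocab_size)
    return softmax(logits) # one probability distribution per input position

for seq_len in (1, 5, 512):
    tokens = rng.integers(0, vocab_size, size=seq_len)
    probs = forward(tokens)
    assert probs.shape == (seq_len, vocab_size)  # N tokens in -> N distributions out
    assert np.allclose(probs.sum(axis=-1), 1.0)  # each row sums to 1
```

Nothing in `forward` is tied to a particular sequence length; only memory limits it.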

For typical auto-regressive generation, they are trained with causal masking/teacher forcing, which makes it calculate the probability for the next token. During inference, you throw away all but the last probability distribution and use that to sample the next token, and then repeat. You also do this with an RNN. An autoregressive CNN (e.g. WaveNet) would be closer to what you described in that it has a fixed window looking backwards.
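That "throw away all but the last distribution and sample" loop looks roughly like this sketch (again with a toy stand-in for the model, since the loop is the same regardless of what's inside `forward`):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
embed = rng.normal(size=(vocab_size, d_model))
unembed = rng.normal(size=(d_model, vocab_size))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(token_ids):
    # Stand-in for a real transformer: one distribution per input position.
    return softmax(embed[np.asarray(token_ids)] @ unembed)

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        probs = forward(tokens)  # (len(tokens), vocab_size) - all positions
        next_dist = probs[-1]    # keep only the LAST position's distribution
        tokens.append(int(rng.choice(vocab_size, p=next_dist)))
    return tokens

out = generate([3, 1, 4], n_new=5)
assert len(out) == 8 and out[:3] == [3, 1, 4]
```

Note the input grows by one token each step; no fixed window, no padding.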

But a transformer doesn't have to be used for auto-regressive generation, you can use it for diffusion, as a classifier model, for embedding text. It doesn't even see a sequence as spatially organised - unlike a CNN or an RNN it doesn't have architectural intrinsic biases about the position of elements, which is why it needs positional embeddings. This lets you have 2D, 3D, 4D, or disordered elements in a sequence. You can even have non-regularly sampled sequences. (Again this is for a classic transformer without sliding window attention or any other special modifications).
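Since positions come in only through the embeddings, irregular or fractional positions are no problem. A sketch using the classic sinusoidal encoding (sin/cos halves concatenated rather than interleaved, for brevity):

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    """Sinusoidal positional encoding; positions can be any real numbers."""
    positions = np.asarray(positions, dtype=float)[:, None]       # (n, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model)) # (d_model/2,)
    angles = positions * freqs                                    # (n, d_model/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Regular integer positions...
pe = sinusoidal_pe([0, 1, 2, 3], d_model=8)
assert pe.shape == (4, 8)

# ...or irregularly sampled "timestamps" - the encoding doesn't care.
pe2 = sinusoidal_pe([0.0, 0.7, 3.14, 100.5], d_model=8)
assert pe2.shape == (4, 8)
```

For 2D/3D data you'd encode each coordinate separately, but the principle is the same: position is just another input feature.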

> (padding the unneeded context window with null tokens)

To have efficient training, you pad all samples in a batch to have the same length (and maybe make it a power of two). But when you are working with a single sequence, the length is arbitrary up to hardware limitations, and no padding is needed.
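The batching trick is just this (a minimal sketch; the pad token id and the mask convention are arbitrary choices here):

```python
import numpy as np

PAD = 0  # hypothetical padding token id

def pad_batch(sequences):
    """Pad variable-length sequences to the batch max; return an attention mask."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), PAD, dtype=int)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True  # attention ignores positions where mask is False
    return batch, mask

batch, mask = pad_batch([[5, 2, 9], [7], [1, 1, 1, 1, 1]])
assert batch.shape == (3, 5)  # padded to the LONGEST sequence in this batch
assert mask.sum(axis=1).tolist() == [3, 1, 5]
```

The pad length is per-batch, for the sake of rectangular tensors, not a property of the architecture: batch size 1 needs no padding at all.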




You, the user, can enter any size input you want.

The network has a fixed number of input neurons. You have to put something in all of them.

If you enter "hello", the network might actually receive "hello" preceded by padding tokens, but all of its inputs need some input. It doesn't (and can't) process tokens one at a time.

> No, in most implementations, they output a probability distribution for every token in the input.

A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.

In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".


Not to be rude, but you're arguing with a machine learning engineer about the basics of neural network architectures :P

> The network has a fixed number of input neurons. You have to put something in all of them.

The way transformers work is that they apply the same "input neurons" to each individual token! It's not:

  Token 1 -> Neuron 1
  Token 2 -> Neuron 2
  Token 3 -> Neuron 3
  ...

with excess neurons not being used. It's:

  Token 1 -> Vector of dimension N -> ALL neurons
  Token 2 -> Vector of dimension N -> ALL neurons
  Token 3 -> Vector of dimension N -> ALL neurons
  ...

Grossly oversimplified: in a typical transformer layer, you have 3 distinct such "networks" of neurons. You apply each of them to every token, giving you, for each token, a "query", a "key", and a "value". You take the dot products of the queries and keys, apply softmax, then use the result to weight the values, giving you the vectors to input to the next layer.
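In code, the "same neurons applied to each token" point is just matrix shapes: the three weight matrices below are fixed-size, yet the sketch works for any sequence length (single head, no mask or output projection, toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8

# The SAME three weight matrices are applied at every token position.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v  # each (seq_len, d_head)
    scores = q @ k.T / np.sqrt(d_head)   # dot products between all pairs
    weights = softmax(scores)            # each row sums to 1
    return weights @ v                   # weighted sum of the values

x = rng.normal(size=(seq_len, d_model))
out = attention(x)
assert out.shape == (seq_len, d_head)

# Same weights, much longer sequence - nothing in the layer is tied to length 4.
assert attention(rng.normal(size=(100, d_model))).shape == (100, d_head)
```

There is no "input neuron per token" anywhere: sequence length only shows up as the leading dimension of the activations.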

> A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token. In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".

Not quite: the reason transformers train fast is that you can train on all columns at once.

For tokens 1, 2, 3, 4, ... you get predictions for tokens 2, 3, 4, 5... Typical autoregressive transformer training uses a causal mask, so that token 1 doesn't see token 2, enabling you to train on all the predictions at once.
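A sketch of that shifted-target setup (the model output is faked with a uniform distribution, since only the indexing and masking are the point here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 6

tokens = rng.integers(0, vocab_size, size=seq_len)
inputs, targets = tokens[:-1], tokens[1:]  # predict token t+1 from tokens <= t

# Causal mask: position i may only attend to positions j <= i.
mask = np.tril(np.ones((len(inputs), len(inputs)), dtype=bool))
assert not mask[0, 1]  # token 1 doesn't see token 2
assert mask[1, 0]      # but token 2 does see token 1

# Pretend model output: one distribution per input position (uniform stand-in).
probs = np.full((len(inputs), vocab_size), 1.0 / vocab_size)

# One cross-entropy term per position -> every column contributes to the loss.
loss = -np.mean(np.log(probs[np.arange(len(inputs)), targets]))
assert np.isclose(loss, np.log(vocab_size))
```

So none of the columns are "junk" during training: every position's distribution is graded against the actual next token.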



