
Sorry, but that's false. You're conflating three different things: transformers as an architecture, auto-regressive generation, and padding during training.

Standard transformers take in an arbitrary input size and run blocks (self- and possibly cross-attention, positional encoding, MLPs) that don't care about that length.

> They also have a fixed output of one probability distribution for the next one token.

No, in most implementations, they output a probability distribution for every token in the input. If you input 512 tokens, you get 512 probability distributions. You can input however many tokens you want - 1, 2048, one million, it's the same thing (although since standard self-attention scales quadratically you'll eventually run out of memory). Modern relative embeddings like RoPE can support infinite length although the quality will degrade if you extrapolate too far beyond what the model saw during training.
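To make "one distribution per input token" concrete, here's a minimal numpy sketch. The "model" is just an embedding table and an output projection standing in for a real transformer (all names and dimensions are made up for illustration); the point is only the shapes: N tokens in, N distributions out, regardless of N.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, not from any real model.
vocab_size, d_model = 10, 8

embed = rng.normal(size=(vocab_size, d_model))    # token embedding table
unembed = rng.normal(size=(d_model, vocab_size))  # output projection ("LM head")

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def forward(token_ids):
    """Embed -> (the transformer blocks would go here) -> project to vocab."""
    h = embed[token_ids]   # (seq_len, d_model)
    logits = h @ unembed   # (seq_len, vocab_size)
    return softmax(logits) # one probability distribution per input position

for seq_len in (1, 5, 512):
    tokens = rng.integers(0, vocab_size, size=seq_len)
    probs = forward(tokens)
    assert probs.shape == (seq_len, vocab_size)  # N tokens in -> N distributions out
    assert np.allclose(probs.sum(axis=-1), 1.0)  # each row sums to 1
```

Nothing in `forward` is tied to a particular sequence length; only memory limits it.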

For typical auto-regressive generation, they are trained with causal masking/teacher forcing, which makes it calculate the probability for the next token. During inference, you throw away all but the last probability distribution and use that to sample the next token, and then repeat. You also do this with an RNN. An autoregressive CNN (e.g. WaveNet) would be closer to what you described in that it has a fixed window looking backwards.
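That "throw away all but the last distribution and sample" loop looks roughly like this sketch (again with a toy stand-in for the model, since the loop is the same regardless of what's inside `forward`):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8
embed = rng.normal(size=(vocab_size, d_model))
unembed = rng.normal(size=(d_model, vocab_size))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(token_ids):
    # Stand-in for a real transformer: one distribution per input position.
    return softmax(embed[np.asarray(token_ids)] @ unembed)

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        probs = forward(tokens)  # (len(tokens), vocab_size) - all positions
        next_dist = probs[-1]    # keep only the LAST position's distribution
        tokens.append(int(rng.choice(vocab_size, p=next_dist)))
    return tokens

out = generate([3, 1, 4], n_new=5)
assert len(out) == 8 and out[:3] == [3, 1, 4]
```

Note the input grows by one token each step; no fixed window, no padding.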

But a transformer doesn't have to be used for auto-regressive generation, you can use it for diffusion, as a classifier model, for embedding text. It doesn't even see a sequence as spatially organised - unlike a CNN or an RNN it doesn't have architectural intrinsic biases about the position of elements, which is why it needs positional embeddings. This lets you have 2D, 3D, 4D, or disordered elements in a sequence. You can even have non-regularly sampled sequences. (Again this is for a classic transformer without sliding window attention or any other special modifications).
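Since positions come in only through the embeddings, irregular or fractional positions are no problem. A sketch using the classic sinusoidal encoding (sin/cos halves concatenated rather than interleaved, for brevity):

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    """Sinusoidal positional encoding; positions can be any real numbers."""
    positions = np.asarray(positions, dtype=float)[:, None]       # (n, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model)) # (d_model/2,)
    angles = positions * freqs                                    # (n, d_model/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Regular integer positions...
pe = sinusoidal_pe([0, 1, 2, 3], d_model=8)
assert pe.shape == (4, 8)

# ...or irregularly sampled "timestamps" - the encoding doesn't care.
pe2 = sinusoidal_pe([0.0, 0.7, 3.14, 100.5], d_model=8)
assert pe2.shape == (4, 8)
```

For 2D/3D data you'd encode each coordinate separately, but the principle is the same: position is just another input feature.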

> (padding the unneeded context window with null tokens)

To have efficient training, you pad all samples in a batch to have the same length (and maybe make it a power of two). But when you are working with a single sequence, the length is arbitrary up to hardware limitations, and no padding is needed.
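The batching trick is just this (a minimal sketch; the pad token id and the mask convention are arbitrary choices here):

```python
import numpy as np

PAD = 0  # hypothetical padding token id

def pad_batch(sequences):
    """Pad variable-length sequences to the batch max; return an attention mask."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), PAD, dtype=int)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True  # attention ignores positions where mask is False
    return batch, mask

batch, mask = pad_batch([[5, 2, 9], [7], [1, 1, 1, 1, 1]])
assert batch.shape == (3, 5)  # padded to the LONGEST sequence in this batch
assert mask.sum(axis=1).tolist() == [3, 1, 5]
```

The pad length is per-batch, for the sake of rectangular tensors, not a property of the architecture: batch size 1 needs no padding at all.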




You, the user, can enter any size input you want.

The network has a fixed number of input neurons. You have to put something in all of them.

If you enter "hello", the network might actually receive "hello" preceded by padding tokens, but all of its inputs need some input. It doesn't (and can't) process tokens one at a time.

> No, in most implementations, they output a probability distribution for every token in the input.

A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token.

In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".


Not to be rude, but you're arguing with a machine learning engineer about the basics of neural network architectures :P

> The network has a fixed number of input neurons. You have to put something in all of them.

The way transformers work is that they apply the same "input neurons" to each individual token! It's not:

  Token 1 -> Neuron 1
  Token 2 -> Neuron 2
  Token 3 -> Neuron 3
  ...

with excess neurons not being used. It's:

  Token 1 -> Vector of dimension N -> ALL neurons
  Token 2 -> Vector of dimension N -> ALL neurons
  Token 3 -> Vector of dimension N -> ALL neurons
  ...

Grossly oversimplified: in a typical transformer layer, you have 3 distinct such "networks" of neurons. You apply each of them to every token, giving you, for each token, a "query", a "key", and a "value". You take the dot products of the queries and keys, apply softmax, then use the result to weight the values, giving you the vectors to input to the next layer.
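In code, the "same neurons applied to each token" point is just matrix shapes: the three weight matrices below are fixed-size, yet the sketch works for any sequence length (single head, no mask or output projection, toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8

# The SAME three weight matrices are applied at every token position.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v  # each (seq_len, d_head)
    scores = q @ k.T / np.sqrt(d_head)   # dot products between all pairs
    weights = softmax(scores)            # each row sums to 1
    return weights @ v                   # weighted sum of the values

x = rng.normal(size=(seq_len, d_model))
out = attention(x)
assert out.shape == (seq_len, d_head)

# Same weights, much longer sequence - nothing in the layer is tied to length 4.
assert attention(rng.normal(size=(100, d_model))).shape == (100, d_head)
```

There is no "input neuron per token" anywhere: sequence length only shows up as the leading dimension of the activations.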

> A probability distribution obviously contains a probability for every possible next token. But the whole probability distribution (which adds up to one) only predicts the next ONE token. It predicts what is the probability of that one token being A, or B, or C, etc, giving a probability for each possible token. It's still predicting only one token. In anything but the last column, the numbers are junk. You can treat them as probability distributions all you want, but the system is only trained to get the outputs of the last column "correct".

Not quite: the reason transformers train fast is that you can train on all columns at once.

For tokens 1, 2, 3, 4, ... you get predictions for tokens 2, 3, 4, 5... Typical autoregressive transformer training uses a causal mask, so that token 1 doesn't see token 2, enabling you to train on all the predictions at once.
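A sketch of that shifted-target setup (the model output is faked with a uniform distribution, since only the indexing and masking are the point here):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 6

tokens = rng.integers(0, vocab_size, size=seq_len)
inputs, targets = tokens[:-1], tokens[1:]  # predict token t+1 from tokens <= t

# Causal mask: position i may only attend to positions j <= i.
mask = np.tril(np.ones((len(inputs), len(inputs)), dtype=bool))
assert not mask[0, 1]  # token 1 doesn't see token 2
assert mask[1, 0]      # but token 2 does see token 1

# Pretend model output: one distribution per input position (uniform stand-in).
probs = np.full((len(inputs), vocab_size), 1.0 / vocab_size)

# One cross-entropy term per position -> every column contributes to the loss.
loss = -np.mean(np.log(probs[np.arange(len(inputs)), targets]))
assert np.isclose(loss, np.log(vocab_size))
```

So none of the columns are "junk" during training: every position's distribution is graded against the actual next token.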



