Given their failure on novel logic problems, their generation of meaningless text, their tendency to do things like delete tests, and their incompetence at simple mathematics, it seems very unlikely that they have built any sort of world model. It’s remarkable how competent they are given the way they work.
‘Predict the next word’ is a terrible summary of what these machines do, though; they certainly do more than that, but there are significant limitations.
‘Reasoning’ etc are marketing terms and we should not trust the claims made by companies who make these models.
The Turing test had too much confidence in humans it seems.
They generate text based on quite a large context, including hidden prompts we don’t see, and their weights are heavily shaped by training. So I think there’s a lot more going on than a simple probability of word x coming next. That makes ‘predict the next word’ a reductive summary, IMO.
I do not personally feel it resembles thinking or reasoning though, and I really object to that framing because it is misleading many people.
I may be using the wrong terms, my impression was:
1. Weights in the model are created by ingesting the corpus
2. Techniques like reinforcement learning, alignment etc are used to adjust those weights before model release
3. The model is used and more context is injected, which then affects which words it will choose, though it is still heavily biased by the corpus and training.
That could be way off base though, I'd welcome correction on that.
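As a toy caricature of those three steps (a bigram count model standing in for a real transformer, which is of course a huge simplification — none of these names come from any real system):

```python
from collections import defaultdict

# Toy illustration of the three stages: a bigram model stands in for a
# transformer. This is a deliberate simplification, not how LLMs work.

# Stage 1: "training" -- derive weights (here, raw counts) from a corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# Stage 2: post-training adjustment -- crudely down-weight an unwanted
# continuation (a cartoon stand-in for RLHF/alignment).
counts["cat"]["ate"] = 0

# Stage 3: inference -- weights are now fixed; only the context
# (here, the previous word) changes which next word is chosen.
def predict_next(word):
    options = counts[word]
    return max(options, key=options.get) if options else None

print(predict_next("cat"))  # "sat" -- "ate" was suppressed in stage 2
print(predict_next("the"))  # "cat" -- most frequent continuation in the corpus
```

Even in this cartoon, steps 1 and 2 both adjust the "weights" while step 3 only varies the context, which is roughly the distinction at issue.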
The point I was trying to make, though, was that they do more than predict the next word based on just one set of data. Their weights can encode entire passages of source material from the training data (https://arxiv.org/abs/2505.12546), including books and programs. This is why they are so effective at generating code snippets.
There are a lot of inputs nowadays and a lot of stages to training. So while I don't think they are intelligent, I think it is reductive to call them next-token predictors or similar. Not sure what the best name for them is, but they are neither next-word predictors nor intelligent agents.
That extended explanation is more accurate, yes. I'd call your points 1 and 2 both training, under the definition "anything that adjusts model weights is training". There are multiple stages and types of training. Right now, AFAIK, most (all?) architectures then fix the weights, and you have non-weight-affecting steps like the system prompt, context, etc.
You're right that the weights can enable the model to memorize training data.
So that might depend on the model, how long ago you last tested it, etc. I've seen LLMs solve novel logic problems, generate meaningful text, and retain tests just fine, and simple mathematics on newer models is a lot better.
Btw, if you read the actual paper that proposes the Turing test, Turing rejects the framing of "can machines think?", preferring the more practical "can you tell them apart in practice?".
Yes, that’s the ‘too much confidence in humans’ bit - he didn’t count on some humans being easily fooled by prolix word generators. I’d be interested in his take on these generators but I think he’d be focussed on what was missing as well as the amazing progress we have seen.
> "The original question, 'Can machines think?' I believe to be too meaningless to deserve discussion."
> "the question, 'Can machines think?' should be replaced by 'Are there imaginable digital computers which would do well in the imitation game?'"
> "according to this view the only way to know that a man thinks is to be that particular man. It is in fact the solipsist point of view... instead of arguing continually over this point it is usual to have the polite convention that everyone thinks."
The upshot of those quotes is: if it's practical to say the system can give meaningful input/output on xyz in -say- natural language, we might just go ahead and say it can think about xyz, because otherwise everyone's just going to go nuts inventing new terms every time.
ELIZA absolutely did not ever pass anything resembling a real Turing test. A real Turing test is adversarial: the interrogator knows the testees are trying to fool him.
Landauer and Bellman absolutely put ELIZA to an adversarial Turing test, and called it such, in 1999. [0]
But over in 2025, ELIZA was once again put to the Turing test under adversarial conditions. [1] And people still thought it was a real person over 27% of the time. Over a quarter of the testees thought the thing was human.
The "ELIZA Effect" wasn't coined because everyone understands that an AI isn't conscious.
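What makes those pass rates striking is how simple ELIZA's mechanism is: keyword matching plus canned templates that reflect the user's words back. A minimal sketch in that spirit (the patterns and replies here are made up, not Weizenbaum's actual DOCTOR script):

```python
import re

# Minimal ELIZA-style responder: match a keyword pattern, then echo the
# user's own words inside a canned template. The rules below are
# illustrative inventions, not the original 1966 script.
RULES = [
    (re.compile(r"\bi am (.+)", re.I), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.I), "Tell me more about feeling {0}."),
    (re.compile(r"\bmy (.+)", re.I), "Your {0}?"),
]

def respond(text):
    for pattern, template in RULES:
        m = pattern.search(text)
        if m:
            return template.format(m.group(1))
    return "Please go on."  # default when no keyword matches

print(respond("I am tired of tests"))   # Why do you say you are tired of tests?
print(respond("Nothing to match here")) # Please go on.
```

There is no model of the conversation at all, which is exactly why fooling a quarter of testees says more about the testees than the program.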
Unfortunately I'm not sure the Turing test posited a minimal level of intelligence for the human testers. As we have found with LLMs, humans are rather easy to fool.
There are many, many examples, mostly caused by people thinking LLMs are intelligent and capable of reasoning, and so giving them too much power (e.g. treating them as agents, not text generators). I'm sure they're all fixed in whatever new version came out this week, though.
Your sarcasm is misplaced. Without principled limitations that demonstrate a lower bound on the error rate, and that show errors are correlated across invocations and models (so that you can't improve the error rate with multiple supervision), you can’t exclude the possibility that "they're all fixed in the new version" (for practical purposes).
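The correlation point can be made concrete with a small simulation (a toy model with made-up numbers, not a claim about any real system): if each of k invocations errs independently, majority voting drives the combined error rate down; if the errors are fully correlated, voting buys you nothing.

```python
import random

random.seed(0)
P_ERR, K, TRIALS = 0.2, 5, 10_000  # illustrative numbers only

def majority_wrong(correlated):
    """Fraction of trials where a majority of K invocations err."""
    wrong = 0
    for _ in range(TRIALS):
        if correlated:
            # Fully correlated: all K invocations share one error event.
            errs = [random.random() < P_ERR] * K
        else:
            # Independent: each invocation errs on its own.
            errs = [random.random() < P_ERR for _ in range(K)]
        wrong += sum(errs) > K // 2
    return wrong / TRIALS

print(majority_wrong(correlated=False))  # roughly 0.06: voting helps a lot
print(majority_wrong(correlated=True))   # stays near 0.2: voting is useless
```

Real model errors sit somewhere between these two extremes, which is why the question of how correlated they are matters for whether "multiple supervision" can fix things.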