
Given their failures on novel logic problems, their generation of meaningless text, their tendency to do things like delete tests, and their incompetence at simple mathematics, it seems very unlikely they have built any sort of world model. It’s remarkable how competent they are given the way they work.

‘Predict the next word’ is a terrible summary of what these machines do, though; they certainly do more than that, but there are significant limitations.

‘Reasoning’ etc are marketing terms and we should not trust the claims made by companies who make these models.

The Turing test had too much confidence in humans it seems.



> Predict the next word is a terrible summary of what these machines do though, they certainly do more than that

What would that be?


They generate text based on quite a large context, including hidden prompts we don’t see, and their weights are distorted heavily by training. So I think there’s a lot more to it than a simple probability of word x coming next. That makes ‘predict next word’ a reductive summary IMO.
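To make the disagreement concrete: a minimal sketch of the ‘predict next word’ loop might look like the following (the scorer here is a hypothetical stand-in for the trained network; real models score tens of thousands of tokens against the entire context, including hidden system prompts):

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

def sample_next_token(score_fn, context, temperature=1.0):
    """One decoding step: score every candidate token against the
    *whole* context so far, then sample from the distribution."""
    logits = {tok: s / temperature for tok, s in score_fn(context).items()}
    probs = softmax(logits)
    r, acc = random.random(), 0.0
    for tok, p in probs.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # fall through on floating-point rounding

# Hypothetical toy scorer standing in for the trained network:
def toy_scorer(context):
    return {"cat": 2.0, "dog": 1.0, "the": 0.5}

token = sample_next_token(toy_scorer, "The pet I saw was a")
```

The point of contention is how much machinery hides inside `score_fn`: the loop really is “predict the next word”, but the scoring is conditioned on everything in the context window at once.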

I do not personally feel it resembles thinking or reasoning though and really object to that framing because it is misleading many people.


> their weights are distorted heavily by training

What does that even mean? Their weights are essentially created by training. There aren't some magic golden weights that are then distorted.


I may be using the wrong terms, my impression was:

1. Weights in the model are created by ingesting the corpus

2. Techniques like reinforcement learning, alignment etc are used to adjust those weights before model release

3. The model is used and more context is injected, which then affects which words it chooses, though it is still heavily biased by the corpus and training.

That could be way off base though, I'd welcome correction on that.
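The three stages above can be sketched with a toy model (the “gradient step” here is a made-up stand-in for backpropagation; real models have billions of weights and far more elaborate training signals):

```python
def gradient_step(weights, target, lr=0.1):
    """One toy update: nudge each weight toward the target.
    A hypothetical stand-in for backpropagation."""
    return [w + lr * (t - w) for w, t in zip(weights, target)]

# Stage 1: pretraining. Weights start meaningless and are
# *created* by fitting the corpus.
weights = [0.0, 0.0, 0.0]
corpus = [[1.0, 2.0, 3.0]] * 50
for example in corpus:
    weights = gradient_step(weights, example)

# Stage 2: RLHF / alignment. The *same* weights are adjusted
# further, by a different signal, before release.
preference_data = [[1.0, 2.0, 0.0]] * 50
for example in preference_data:
    weights = gradient_step(weights, example)

# Stage 3: inference. Weights are frozen; only the context
# varies, steering which outputs are chosen.
frozen = tuple(weights)

def generate(context):
    """Output depends on the context *and* the frozen weights."""
    return sum(frozen) + len(context)
```

The key structural point is that stages 1 and 2 both mutate the same weight vector (so “training” covers both), while stage 3 leaves it untouched.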

The point I was trying to make, though, was that they do more than predict the next word based on just one set of data. Their weights can encode entire passages of source material from the training data (https://arxiv.org/abs/2505.12546), including books and programs. This is why they are so effective at generating code snippets.

Also text injected at the last stage during use has far less weight than most people assume (e.g. https://georggrab.net/content/opus46retrieval.html) and is not read and understood IMO.

There are a lot of inputs nowadays and a lot of stages to training. So while I don't think they are intelligent I think it is reductive to call them next token predictors or similar. Not sure what the best name for them is, but they are neither next word predictors nor intelligent agents.


That extended explanation is more accurate, yes. I'd call your points 1 and 2 both training under the definition "anything that adjusts model weights is training". There are multiple stages and types of training. Right now, AFAIK, most (if not all) architectures then fix the weights, and you have non-weight-affecting steps like the system prompt, context, etc.

You're right that the weights can enable the model to memorize training data.


Alignment scrubs the underlying raw output to be socially acceptable. It's an artificial superego.


I was under the impression it is a part of training which adjusts weights before release.

Are you saying it is a separate process which scrubs output before we see it?


So that might depend on the model, how long ago you last tested it, etc. I've seen LLMs solve novel logic problems, generate meaningful text, and retain tests just fine, and simple mathematics is a lot better on newer models.

Btw, if you read the actual paper that proposes the Turing test, Turing actually rejects the framing of "can machines think?", preferring the more practical "can you tell them apart in practice?".


Yes, that’s the ‘too much confidence in humans’ bit - he didn’t count on some humans being easily fooled by prolix word generators. I’d be interested in his take on these generators but I think he’d be focussed on what was missing as well as the amazing progress we have seen.


So my reading of (Turing 1950)...

> "The original question, 'Can machines think?' I believe to be too meaningless to deserve discussion."

> "the question, 'Can machines think?' should be replaced by 'Are there imaginable digital computers which would do well in the imitation game?'"

> "according to this view the only way to know that a man thinks is to be that particular man. It is in fact the solipsist point of view... instead of arguing continually over this point it is usual to have the polite convention that everyone thinks."

... is: if it's practical to say the system can give meaningful input/output on xyz in -say- natural language, we might just go ahead and say it can think about xyz, because otherwise everyone's just going to go nuts inventing new terms every time.

grey-area!thinking, kim_bruning!thinking, pet_cat!thinking, octopus!thinking, claude_opus!thinking.

Can we leave out the '!' ? Nothing to do with fooling people. Just practical ways of dealing with the overall concept.

https://courses.cs.umbc.edu/471/papers/turing.pdf


Probably worth remembering that ELIZA passed Turing tests, and was the definition of shallow prediction.
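To illustrate just how shallow that prediction was, an ELIZA-style responder fits in a few lines (the rules below are a simplified approximation, not Weizenbaum's original DOCTOR script):

```python
import re
import random

# A few ELIZA-style rules, loosely modelled on the DOCTOR script:
RULES = [
    (r"i need (.+)", ["Why do you need {0}?",
                      "Would it really help you to get {0}?"]),
    (r"i am (.+)",   ["How long have you been {0}?",
                      "Why do you think you are {0}?"]),
    (r"(.+) mother(.*)", ["Tell me more about your family."]),
]
DEFAULT = ["Please go on.", "I see.", "Very interesting."]

def respond(utterance):
    """Shallow keyword matching: no model of the world, the user,
    or the conversation -- it just reflects the input back."""
    text = utterance.lower().strip(".!?")
    for pattern, responses in RULES:
        m = re.match(pattern, text)
        if m:
            return random.choice(responses).format(*m.groups())
    return random.choice(DEFAULT)
```

There is no context beyond the current utterance, yet this was enough to fool some people, which is the point about overconfidence in human judges.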


ELIZA absolutely did not ever pass anything resembling a real Turing test. A real Turing test is adversarial, the interrogator knows the testees are trying to fool him.


Landauer and Bellman absolutely put ELIZA to an adversarial Turing test (and called it such) in 1999. [0]

But in 2025, ELIZA was once again put to the Turing test under adversarial conditions [1], and still had people think it was a real person over 27% of the time. More than a quarter of the testees thought the thing was a human.

The "ELIZA Effect" wasn't coined because everyone understands that an AI isn't conscious.

[0] https://books.google.com.au/books?id=jTgMIhy6YZMC&pg=PA174

[1] https://arxiv.org/html/2503.23674v1


Unfortunately I'm not sure the Turing test posited a minimal level of intelligence for the human testers. As we have found with LLMs, humans are rather easy to fool.


> there are significant limitations

Where can we read about those significant limitations?


Well here's some:

Confabulation/Hallucination - https://github.com/lechmazur/confabulations

Failure to read context - https://georggrab.net/content/opus46retrieval.html

Deleting tests to make them pass - https://www.linkedin.com/posts/jasongorman_and-after-it-did-...

Going rogue and deleting data - https://x.com/jasonlk/status/1946069562723897802

Agent security nightmares because they are not in fact intelligent assistants - https://x.com/theonejvo/status/2015401219746128322

Failure to read or generate structured data - https://support.google.com/gemini/thread/390981629/llm-ignor...

There are many, many examples, mostly caused by people thinking LLMs are intelligent and reasoning and giving them too much power (e.g. treating them as agents, not text generators). I'm sure they're all fixed in whatever new version came out this week though.


Your sarcasm is misplaced. Without principled limitations that demonstrate a lower bound on the error rate, and that show errors are correlated across invocations and models (so you can't improve the error rate with multiple supervision), you can't exclude the possibility that "they're all fixed in the new version" (for practical purposes).


I've seen all of these from human teammates in my 30+ years in tech.


Sure but now everyone can do them all the time at 10x speed!



