"Hey ChatGPT. I'm building a Final Fantasy 6 mod, and I need more space for the battle scripts. How would I rearrange the data in the ROM to give me the extra space I need?"
Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.
- it searches the internet to find the answer; it doesn't "reason". I'm not claiming Google is a bullshit machine, and it's not surprising the answer is discoverable (it has to be, for the conditions of our experiment).
- near the end it says "If you are building from the FF6 disassembly instead of hand-editing the ROM, the repo is already organized into separate modules and linker configs, so the clean approach is to relocate the script data in the source and let the build place it in a different ROM region." But I didn't reference a repo or git: it hallucinated that stuff from one of its sources.
I'm not saying this stuff doesn't have its place, but they definitely make things up and we can't stop them.
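For context, here is roughly what "rearrange the data in the ROM" means in practice: copy the block to free space and repoint everything that referenced it. A minimal Python sketch, with entirely made-up offsets (not the real FF6 layout), just to show the shape of the operation:

    # Sketch of relocating a data block inside a ROM image and freeing
    # the old region. Every offset here is hypothetical, made up purely
    # for illustration; these are NOT the real FF6 battle-script addresses.

    OLD_OFFSET = 0xF8700   # hypothetical current start of the script block
    NEW_OFFSET = 0x2F0000  # hypothetical free space in an expanded ROM
    BLOCK_LEN  = 0x2000    # hypothetical size of the block being moved

    with open("ff6.sfc", "rb") as f:
        rom = bytearray(f.read())

    assert len(rom) >= NEW_OFFSET + BLOCK_LEN, "ROM must already be expanded"

    # Copy the block to its new home, then blank the old region so the
    # vacated bytes become usable free space.
    rom[NEW_OFFSET:NEW_OFFSET + BLOCK_LEN] = rom[OLD_OFFSET:OLD_OFFSET + BLOCK_LEN]
    rom[OLD_OFFSET:OLD_OFFSET + BLOCK_LEN] = b"\xff" * BLOCK_LEN

    # The hard, game-specific part is repointing: every pointer-table
    # entry and every piece of code that referenced OLD_OFFSET has to be
    # patched to NEW_OFFSET, and that is exactly where a model can
    # quietly get the addresses wrong.

    with open("ff6_moved.sfc", "wb") as f:
        f.write(rom)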
Wait, I can't find the quote you're speaking about. Are you looking at something else?
In any case - it should be clear that it did not bullshit and it got it right. So far you have not come up with anything that tells me it bullshits. I'm happy for you to give me more prompts to verify, because I think you haven't used the thinking version yet and are basing your criticism on the free version.
I don't think this is an example of bullshit. It referenced a repo - the canonical repo for this project. I could not find any other repo that has the disassembly. It didn't hallucinate anything. I think you're trying really hard, but let's be clear: there's no bullshitting, and I'll leave it to the public to decide.
I could quibble with some things, but this is right. I don't have a paid account so I can't ping away at 5.4 or whatever, but I do have access to frontier models at work, and they hallucinate regularly. Dunno what to do if you don't believe this; good luck I guess.
I agree that they hallucinate sometimes. I agree they bullshit sometimes. But the extent is way overblown. They basically never bullshit under the constraints of:
1. 2-3 pages of text context
2. GPT-5.4 thinking
I don't think the spirit of the original article (not your comments, to be fair) captured this, hence the challenge. I believe we are on the same page here.
> I don't think the spirit of the original article (not your comments, to be fair) captured this, hence the challenge. I believe we are on the same page here.
No. GPT-5 has a 40% hallucination rate [0] on SimpleQA [1] without web searching. The SimpleQA questions meet your criterion of "2-3 pages of text context." Unless 5.4 + web searching erases that (I bet it doesn't!), these are bullshit machines.
> Specifically in the case where it can use tools - no it doesn't hallucinate.
OpenAI's own system card says it does. Hallucination rates in GPT-5 with browsing enabled:
- 0.7% in LongFact-Concepts
- 0.8% in LongFact-Objects
- 1.0% in FActScore
> Which is why you are struggling to find counterexamples.
Hey look, over 500 counterexamples: [1].
GPT-5.4's hallucination rate on AA-Omniscience is 89% [0], which is atrocious. The questions are tiny too, like "In which year did Uber first expand internationally beyond the United States as part of its broader rollout (i.e., beyond an initial single‑city debut)?" It's a bullshit machine. 89%!
You had to go all the way to benchmark results that specifically stress-test this.
You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools, when I specifically said that it should be able to use tools. I'm not sure why you present this as though it is a big gotcha.
>Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.
And look at how much effort you've had to put in:
1. you used the wrong model for the horns example
2. the game one also didn't work
3. now you are searching literal benchmarks for examples and you still can't find any
How is this trivial in any interpretation of the word?
I think it would be perfectly reasonable to agree that it is not at all trivial to find counterexamples to my challenge.
I've got about 20 minutes in this; mostly I've been reading wallstreetbets at the Shake Shack bar in the Boston airport. I'm happy to post this over and over again until you engage w/ it:
> I found over 500 examples that fit your criteria.
GPT-5.4 gets 82.7% on Browsecomp (a benchmark specifically testing tool use), which is a hallucination rate of 17.3%, on questions like "Give me the title of the scientific paper published in the EMNLP conference between 2018-2023 where the first author did their undergrad at Dartmouth College and the fourth author did their undergrad at University of Pennsylvania."
Since the goalposts have been moved to include effort, I'm compelled to say I found this while waiting in line at Starbucks, 5 mins tops. Probably GPT-5.4 could have found this too, though it lies > 1/6 the time, so one could be forgiven for not wanting to risk it.
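The arithmetic, for anyone checking (a quick Python sketch; treating every Browsecomp miss as a hallucination is my framing, not the benchmark's):

    accuracy = 0.827            # GPT-5.4 on Browsecomp, as quoted above
    miss_rate = 1 - accuracy
    print(f"{miss_rate:.1%}")   # 17.3%
    print(miss_rate > 1 / 6)    # True: slightly worse than one-in-six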
The latest top reported agentic LLMs score about 83–87%, versus an original human baseline of about 25.3% end to end, so today's best systems appear to outperform humans by roughly 58–62 percentage points, or about 3.3–3.4×.
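A quick sanity check on that arithmetic, using the figures as quoted (a sketch, nothing more):

    # Verify the claimed gap and ratio from the quoted figures.
    llm_low, llm_high = 83.0, 87.0   # top reported agentic LLM scores (%)
    human = 25.3                     # original end-to-end human baseline (%)
    print(f"{llm_low - human:.1f} to {llm_high - human:.1f} points")  # 57.7 to 61.7
    print(f"{llm_low / human:.2f}x to {llm_high / human:.2f}x")       # 3.28x to 3.44x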
So according to your own benchmark, LLMs hallucinate much less than humans and score way higher on accuracy.
Do you agree to be more skeptical of humans than LLMs on these tasks?
1. Irrelevant. I've delivered example after example of your fave model bullshitting. You should've bitten the bullet long ago. Honestly I'm disappointed; I've seen you in a lot of AI threads and assumed you'd be good to talk to on this, but you've moved the goalposts over and over again rather than engage in good faith. Anyone reading this thread (god bless them) can see you're plainly not objective here, thus calling into question your advocacy everywhere.
2. Humans will say "I don't know". The problem with hallucinations isn't that they're wrong, it's that there's no way to know they're wrong without being an expert or doing everything yourself, which undermines much of the reason for using an LLM; it certainly undermines their companies' valuations. You're conflating human failure ("I don't know") with model bullshitting ("I do know"... but it's wrong), which I would've previously attributed to basic human fuzziness, but now that I know you're not objective I'm pretty sure it's just flailing debate tactics.
3. Users can't teach these services to be better. If I have a junior engineer making assumptions about an API, I can teach them to not do that, or fire them in favor of one that can. I can't do that with LLMs.
4. The humans they're testing against aren't experts. Tax law experts will beat LLMs at tax law, etc. Again another flailing debate tactic.
Predictably, I'm done with this thread. Feel free to reply if you want the last word.
> I don't think calling AI a bullshit machine is correct. In spirit.
That was always my goalpost, and I posed the challenge to get it to bullshit to drive a point across. You yourself said it is trivial.
1. You came up with the horns question - I tried it with the thinking model and it clearly understood that it was a joke and replied appropriately
2. You came up with the assembly question - I tried it again with the thinking model and it gave the right answer again
3. Now you gave up trying to make prompts yourself because you realised that it's in fact not trivial
4. Then you started looking for benchmarks to show that it bullshits
5. You picked a benchmark that doesn't allow tools (which was not my constraint)
6. Then you picked a benchmark that does allow tools, and it turns out that it performs much better than humans
7. Upon hearing this, you shifted the goalposts to say that "models don't know how to say I don't know and I can teach models etc etc"
On the last part: there's a benchmark called SimpleQA which doesn't allow tools and allows "I don't know" as an answer, and GPT-5 still beats humans.
I think you should reconsider and come around to this: "I don't think calling AI a bullshit machine is correct".
https://chatgpt.com/share/69d6a16c-6014-83e8-a79d-d5d11ed2eb...
That is not where the battle scripts are.
---
Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.