> As they did so, they also learned how to improve the prompts they gave AlphaEvolve. One key takeaway: The model seemed to benefit from encouragement. It worked better “when we were prompting with some positive reinforcement to the LLM,” Gómez-Serrano said. “Like saying ‘You can do this’ — this seemed to help. This is interesting. We don’t know why.”
Four of the top logical minds in the world are acknowledging this. It is mind-blowing, and we don't know why.
Several people had problems with Sonnet burning through all their credits grinding on a problem it can't solve. Opus fixes this — it has a confidence threshold below which it exits the task instead of grinding.
"I spent ~$100 last week testing both against multiplication. Sonnet at 37-digit × 37-digit (~10³⁷) never quits — 15+ minutes, 211KB of output, still actively decomposing numbers when I stopped it. Opus will genuinely attempt up to ~50 digits (112K tokens on a real try), starts doubting around 55 digits, and by 80-digit × 80-digit surrenders in 330 tokens / 9 seconds with an empty answer." -- Opus, helping me with the data
The "I don't think this is worth attempting" heuristic is the difference. Sonnet doesn't have it, or has it set much higher. To get Opus and some other models to work on harder problems they assume aren't worth attempting, you have to raise their confidence first.
I'll finish writing this up this week. I'm making flashy data visual animations to make the point right now.
So we have a bunch of imposter syndrome techies who would 5x with just a hint of encouragement, and now we’re trying to 2x them with LLMs, but in order to get there those same techies will have to gas up and inspire the LLM with the leadership and vision they themselves are wanting for?
The Universe seems against free lunches… if AGI is possible, finding a manager good enough to get the AGI to update its timesheet will not be (in practice).
From my reading, the official docs don't support the strong claim that frontier LLMs are explicitly RL-trained to “be lazy” or conserve tokens, as claimed in this thread. What they do document is adaptive / hidden reasoning compute: OpenAI says reasoning models allocate internal reasoning tokens and reasoning.effort controls how many are used (https://developers.openai.com/api/docs/guides/reasoning), and Anthropic says adaptive thinking decides whether/how much to use extended thinking based on request complexity, with effort as soft guidance and max_tokens as the hard cap (https://docs.anthropic.com/en/docs/build-with-claude/adaptiv...hinking). So prompt wording may change how the same budget is spent, but it can't exceed the hard token cap.
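For concreteness, the two knobs look roughly like this as request payloads. These are sketches from the public API shapes, not verbatim from the linked docs — model names are placeholders, and Anthropic's newer adaptive/effort fields may differ from the explicit thinking budget shown here:

```python
# OpenAI: reasoning models take a soft "effort" knob (low/medium/high)
# that guides how many internal reasoning tokens get spent.
openai_request = {
    "model": "o4-mini",               # placeholder model name
    "reasoning": {"effort": "high"},
    "input": "Multiply these two 37-digit numbers exactly: ...",
}

# Anthropic: extended thinking gets a token budget, while max_tokens
# stays the hard cap on total output -- no prompt wording exceeds it.
anthropic_request = {
    "model": "claude-sonnet-4-5",     # placeholder model name
    "max_tokens": 16000,              # hard cap
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "messages": [
        {"role": "user",
         "content": "Multiply these two 37-digit numbers exactly: ..."},
    ],
}
```

The asymmetry is the point: effort is guidance, max_tokens is a wall.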
Also, the “encouragement helps” anecdote seems real in the AlphaEvolve workflow, but I can't see that for public models. Gómez-Serrano says this in Quanta (https://www.quantamagazine.org/the-ai-revolution-in-math-has...rived-20260413/), and the released AlphaEvolve notebooks really do contain prompts like “Good luck, I believe in you...” (https://github.com/google-deepmind/alphaevolve_repository_of...oblems, e.g. https://github.com/google-deepmind/alphaevolve_repository_of...blems/blob/main/experiments/finite_field_kakeya_problem/finite_field_kakeya.ipynb). But those prompts also bundled strong structural hints (“find a general solution”, “better constructions are possible”), so from my reading the evidence is: prompt phrasing matters, especially in an internal search stack, but not “pep talks are a universal reasoning hack.”
> Anthropic says adaptive thinking decides whether/how much to use extended thinking based on request complexity, with effort as soft guidance and max_tokens as the hard cap
Nothing I said contradicts this.
Here is the first attempt at what I'm testing. [0] Haiku can get the correct answer to `floor( (1234567 * 8901234) / 12345 )` 77.8% of the time. Add a digit or remove one, and the success rate is also highly predictable.
That is the WHOLE point. The models are predictable!
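If the rate really is stable, it's easy to put error bars on it: with k correct runs out of n, a normal-approximation (Wald) interval from the stdlib gives a quick read on how tightly that 77.8% is pinned down. The 35/45 below is illustrative, not the actual trial count:

```python
import math

def success_rate_ci(successes: int, trials: int, z: float = 1.96):
    """Point estimate and ~95% Wald interval for a per-prompt success rate."""
    p = successes / trials
    half = z * math.sqrt(p * (1 - p) / trials)
    return p, max(0.0, p - half), min(1.0, p + half)

# e.g. a quoted 77.8% could come from 35 correct runs out of 45
p, lo, hi = success_rate_ci(35, 45)
```

If the intervals for different digit counts overlap heavily, "the models are predictable" holds up; if not, the rate drifts with complexity.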
Given that prompt, Sonnet at 37-digit × 37-digit (~10³⁷) never quits — a predictable percentage of the time!
And Opus at 80-digit × 80-digit simply quits after 9 seconds and 333 tokens!
This is the amazing thing people are not discussing. The models are very predictable.
The AI companies are not posting this information because it shows how unreliable the models are. However, I think there is great virtue in the models being consistently unreliable.
Looks like you've done some thorough testing. Have you found that prompting reliably reduces premature quitting?
And have you found that reducing premature quitting results in more accuracy?
Because these are probabilistic machines, they solve the same problem at a predictable rate. Even with different variables, the success rate stays consistent.
I only noticed the premature quitting issue recently and haven't tested it much yet. It's getting expensive to run Sonnet on hard multiplication problems. I let it run to 200k tokens and it still grinds without quitting.
But Opus has a different problem. Ask it to solve a Rubik's Cube and it will run for hours and never solve it. So there are definitely prompts that make it run forever. But if you tell it to break down multiplication using algorithms, it behaves differently. It can take really complicated calculus problems and break them into simpler ones. I can't stump it that way.
Here's the interesting thing. Even when Opus solves modular expressions by breaking them down like calculus, it still fails at a predictable rate. There's a constant failure rate no matter what you do at any level of complexity.
Models have a baseline failure rate that prompting can't change. You can change how they fail -- token burn or quitting early -- but the underlying limit stays the same.
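For what it's worth, the "break it down using algorithms" strategy being described is just schoolbook long multiplication: reduce an n-digit product to single-digit partial products plus shifted sums, each step trivially checkable. A sketch of that decomposition:

```python
def schoolbook_multiply(a: int, b: int) -> int:
    """Decompose a*b into digit-by-digit partial products, the way a
    model prompted to 'break it down' would: every step is a single-digit
    multiply plus a power-of-ten shift."""
    total = 0
    for i, da in enumerate(reversed(str(a))):      # digits of a, least significant first
        for j, db in enumerate(reversed(str(b))):  # digits of b
            total += int(da) * int(db) * 10 ** (i + j)
    return total

assert schoolbook_multiply(1234567, 8901234) == 1234567 * 8901234
```

Each partial product is trivial; what the failure-rate data suggests is that the model's weakness is carrying thousands of such steps without a single slip.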
Originally LLMs would get stuck in infinite loops generating tokens forever. This is bad, so we trained them to strongly prefer to stop once they reached the end of their answer.
However, training models to stop also gave them "laziness": they may prefer a short answer over a meandering one that actually answers the user's question.
Mathematics is unusual because it has an external source of truth (the proof assistant), and also because it requires long meandering thinking that explores many dead ends. This is in tension with what models have been trained to do. So giving them some encouragement keeps them in the right state to actually attempt to solve the problem.
I think laziness is not a minimum of effort. Laziness can actually be more effort towards a simpler or more practical solution, because those solutions are more pleasant in some way, and therefore more attractive to pursue.
It's pattern matching on training material. There is almost certainly an overlap between positivity and success in the training material. Positive prompts cause the pattern matching to weight towards positivity and therefore towards more successful material.
The training or system prompts have shoved the probabilities toward a space that tends to select “halt” sooner. You need to drag the probability weights around until they are less likely to reach “halt” so soon.
Nice language often sorta does this for whatever model(s) they looked at, and is also something people are likely to try. Probably lots and lots of nonsense token combos would work even better, but who’s gonna try sticking “gerontocratic green giant giraffes” on the end of their prompts to see if it helps?
Positive or negative language likely also prevents pulling the probabilities away from the correct topic, being so generic a thing. The above suggestion might only be ultra-effective if the topic is catalytic converters, for some reason, and push the thing into generating tokens about giraffes otherwise. How would you ever discover the dozens or thousands of more-effective but only-sometimes nonsense token combos? You’d need automation and a lot of brute force, or some better way to analyze the LLM’s database.
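Automating that search is conceptually simple; the expensive part is the scoring function, which would mean n repeated model calls per candidate suffix. A sketch with the scorer stubbed out — the suffix list and `score_suffix` are hypothetical illustrations, not measured results:

```python
import random

CANDIDATE_SUFFIXES = [
    "You can do this!",
    "Good luck, I believe in you.",
    "gerontocratic green giant giraffes",  # nonsense combo from the comment above
    "",                                    # control: no suffix at all
]

def score_suffix(prompt: str, suffix: str, trials: int = 50) -> float:
    """Stand-in for the expensive part: in practice this would run
    `trials` model calls on prompt + suffix and return the measured
    success rate. Stubbed with a deterministic pseudo-random score
    so the sketch is self-contained."""
    rng = random.Random(hash((prompt, suffix)) % (2**32))
    return rng.random()

base_prompt = "Compute the 37-digit x 37-digit product exactly."
best = max(CANDIDATE_SUFFIXES, key=lambda s: score_suffix(base_prompt, s))
```

With a real scorer this is exactly the brute force described above — and why nobody stumbles onto the weird-but-effective combos by hand.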
Models are trained on human outputs. It's not super surprising to me that inputs following encouraging patterns produce better outputs; much of the training material reflects that.