The AI revolution in math has arrived

bgirard · 2026-04-14T04:24:37 1776140677

Last week I got together with my math alumni friend. We cracked some beers, chatted with voice-mode ChatGPT, toyed around with the Collatz Conjecture, and sent some prompts to a coding agent to build visualizations and simulations. It was a lot of fun directing these agents while we bounced ideas around and the models explored them.

With the right problem and the right agentic loop, it’s clear to me that improvements will speed up.
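For anyone who wants to try the same kind of toy simulation, here is a minimal sketch of the Collatz iteration. It is not the code the agent produced; the function name `collatzSteps` is just an illustrative choice.

```javascript
// Count how many Collatz steps it takes for a starting value to reach 1.
// BigInt is used so intermediate values cannot lose precision.
function collatzSteps(start) {
  let n = BigInt(start);
  if (n < 1n) throw new RangeError("start must be a positive integer");
  let steps = 0;
  while (n !== 1n) {
    // Even: halve. Odd: triple and add one.
    n = n % 2n === 0n ? n / 2n : 3n * n + 1n;
    steps += 1;
  }
  return steps;
}

// Print the step count for the first few starting values.
for (let start = 1; start <= 10; start++) {
  console.log(start, collatzSteps(start));
}
```

Feeding the step counts into any plotting tool gives the familiar jagged picture that makes the problem fun to poke at.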

drakenot · 2026-04-14T04:45:24 1776141924

Just an FYI: I think voice mode uses weaker models relative to the SOTA.

pxc · 2026-04-14T16:04:52 1776182692

The bigger problem for me is that the realtime voice modes lack tool use, so they can't look anything up or do anything. Model strength definitely also matters, but even dumb models can be helpful when they can look things up and try things out. And smart models that don't do those things kinda suck.

SOLAR_FIELDS · 2026-04-14T12:13:34 1776168814

Can get around this with a local STT model and use text input but UX is probably clunkier

scrollop · 2026-04-14T05:41:24 1776145284

Definitely, seems like gpt 3

dogscatstrees · 2026-04-14T03:12:34 1776136354

> As they did so, they also learned how to improve the prompts they gave AlphaEvolve. One key takeaway: The model seemed to benefit from encouragement. It worked better “when we were prompting with some positive reinforcement to the LLM,” Gómez-Serrano said. “Like saying ‘You can do this’ — this seemed to help. This is interesting. We don’t know why.”

Four of the top logical minds in the world are acknowledging this. It is mind-blowing and we don't know why.

dataviz1000 · 2026-04-14T03:38:25 1776137905

I know why.

Several people had problems with Sonnet burning through all their credits grinding on a problem it can't solve. Opus fixes this — it has a confidence threshold below which it exits the task instead of grinding.

"I spent ~$100 last week testing both against multiplication. Sonnet at 37-digit × 37-digit (~10³⁷) never quits — 15+ minutes, 211KB of output, still actively decomposing numbers when I stopped it. Opus will genuinely attempt up to ~50 digits (112K tokens on a real try), starts doubting around 55 digits, and by 80-digit × 80-digit surrenders in 330 tokens / 9 seconds with an empty answer." -- Opus, helping me with the data

The "I don't think this is worth attempting" heuristic is the difference. Sonnet doesn't have it, or has it set much higher. To get Opus and some other models to work on harder problems that they assume are not worth attempting, you need to raise their confidence level.

I'll finish writing this up this week. I'm making flashy data visual animations to make the point right now.

bonesss · 2026-04-14T06:58:56 1776149936

So we have a bunch of imposter syndrome techies who would 5x with just a hint of encouragement, and now we’re trying to 2x them with LLMs, but in order to get there those same techies will have to gas up and inspire the LLM with the leadership and vision they themselves are wanting for?

The Universe seems against free lunches… if AGI is possible, finding a manager good enough to get the AGI to update its timesheet will not be (in practice).

trees101 · 2026-04-15T00:19:56 1776212396

From my reading, the official docs don’t support the strong claim that frontier LLMs are explicitly RL-trained to “be lazy” or conserve tokens as claimed in this thread. What they do document is adaptive / hidden reasoning compute: OpenAI says reasoning models allocate internal reasoning tokens and reasoning.effort controls how many are used (https://developers.openai.com/api/docs/guides/reasoning), and Anthropic says adaptive thinking decides whether/how much to use extended thinking based on request complexity, with effort as soft guidance and max_tokens as the hard cap (https://docs.anthropic.com/en/docs/build-with-claude/adaptiv... hinking). So prompt wording may change how the same budget is spent, but it can’t exceed the hard token cap.

Also, the “encouragement helps” anecdote seems real in the AlphaEvolve workflow, but I can't see evidence of it for public models. Gómez-Serrano says this in Quanta (https://www.quantamagazine.org/the-ai-revolution-in-math-has... rived-20260413/), and the released AlphaEvolve notebooks really do contain prompts like “Good luck, I believe in you...” (https://github.com/google-deepmind/alphaevolve_repository_of... oblems, e.g. https://github.com/google-deepmind/alphaevolve_repository_of... blems/blob/main/experiments/finite_field_kakeya_problem/finite_field_kakeya.ipynb). But those prompts also bundled strong structural hints (“find a general solution”, “better constructions are possible”), so from my reading the evidence is: prompt phrasing matters, especially in an internal search stack, but not “pep talks are a universal reasoning hack.”

dataviz1000 · 2026-04-15T03:42:17 1776224537

> Anthropic says adaptive thinking decides whether/how much to use extended thinking based on request complexity, with effort as soft guidance and max_tokens as the hard cap

Nothing I said contradicts this.

Here is the first attempt of what I'm testing. [0] Haiku can get the correct answer to `floor( (1234567 * 8901234) / 12345 )` or

```
Math.floor(
  (Math.floor(Math.random() * 9000000 + 1000000) *
    Math.floor(Math.random() * 9000000 + 1000000)) /
    Math.floor(Math.random() * 9000000 + 1000000)
)
```

Given this prompt, Haiku gives a correct answer 77.8% of the time. Add or remove a digit and the success rate is also highly predictable.
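One caveat when scaling this up: the `Math.floor` expression is only exact while the product fits in a double, which stops being true well before 37-digit operands. A minimal sketch of computing the reference answer exactly with BigInt follows. The helper names `randomBigInt` and `floorMulDiv` are hypothetical and are not taken from the linked repository.

```javascript
// Hypothetical helpers for illustration; not from the linked repository.

// Draw a random integer with exactly `digits` decimal digits, as a BigInt.
function randomBigInt(digits, rng = Math.random) {
  let s = String(1 + Math.floor(rng() * 9)); // leading digit is 1-9
  for (let i = 1; i < digits; i++) {
    s += String(Math.floor(rng() * 10)); // remaining digits are 0-9
  }
  return BigInt(s);
}

// Exact floor((a * b) / c) for positive BigInts.
// BigInt division truncates, which equals floor for non-negative values.
function floorMulDiv(a, b, c) {
  return (a * b) / c;
}

// Reference answer for the fixed example in the comment above.
console.log(floorMulDiv(1234567n, 8901234n, 12345n).toString());

// Reference answer for a random 37-digit problem.
const a = randomBigInt(37);
const b = randomBigInt(37);
const c = randomBigInt(37);
console.log(`${a} * ${b} / ${c} -> ${floorMulDiv(a, b, c)}`);
```

Comparing model output against `floorMulDiv` as a string avoids any rounding in the grader itself.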

That is the WHOLE point. The models are predictable!

Given that prompt Sonnet at 37-digit × 37-digit (~10³⁷) never quits a predictable percentage of the time!

And, Opus at 80-digit × 80-digit simply quits after 9 seconds and 333 tokens!

This is the amazing thing people are not discussing. The models are very predictable.

The AI companies are not posting this information because it shows how unreliable the models are, however, I think there is great virtue that the models are consistently unreliable.

[0] https://github.com/adam-s/agent-tuning/blob/main/application...

trees101 · 2026-04-15T04:06:09 1776225969

Looks like you've done some thorough testing. Have you found that prompting reliably reduces premature quitting? And have you found that reducing premature quitting results in more accuracy?

dataviz1000 · 2026-04-15T04:55:25 1776228925

Because these are probabilistic machines, they solve the same problem at a predictable rate. Even with different variables, the success rate stays consistent.

I only noticed the premature quitting issue recently and haven't tested it much yet. It's getting expensive to run Sonnet on hard multiplication problems. I let it run to 200k tokens and it still grinds without quitting.

But Opus has a different problem. Ask it to solve a Rubik's Cube and it will run for hours and never solve it. So there are definitely prompts that make it run forever. But if you tell it to break down multiplication using algorithms, it behaves differently. It can take really complicated calculus problems and break them into simpler ones. I can't stump it that way.

Here's the interesting thing. Even when Opus solves modular expressions by breaking them down like calculus, it still fails at a predictable rate. There's a constant failure rate no matter what you do at any level of complexity.

Models have a baseline failure rate that prompting can't change. You can change how they fail -- token burn or quitting early -- but the underlying limit stays the same.
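A claim like "predictable rate" is easier to compare across runs with an interval around it. Below is a minimal sketch using the standard Wilson score interval for a binomial proportion. The counts in the usage line are made up for illustration and are not measurements from this thread.

```javascript
// Wilson score interval for a success rate (z = 1.96 gives roughly 95%).
function wilsonInterval(successes, trials, z = 1.96) {
  if (trials <= 0) throw new RangeError("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { rate: p, low: center - half, high: center + half };
}

// Hypothetical counts: 778 correct answers out of 1000 attempts.
console.log(wilsonInterval(778, 1000));
```

If two prompts give overlapping intervals, the data does not yet show that one changed the underlying rate.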

zarzavat · 2026-04-14T03:41:51 1776138111

It makes sense to me.

Originally LLMs would get stuck in infinite loops generating tokens forever. This is bad, so we trained them to strongly prefer to stop once they reached the end of their answer.

However, training models to stop also gave them "laziness", because they might prefer a shorter answer over a meandering answer that actually answered the user's question.

Mathematics is unusual because it has an external source of truth (the proof assistant), and also because it requires long meandering thinking that explores many dead ends. This is in tension with what models have been trained to do. So giving them some encouragement keeps them in the right state to actually attempt to solve the problem.

dogscatstrees · 2026-04-14T06:12:32 1776147152

It was just yesterday that this top post [0] was decrying the "peril of laziness lost", that LLMs inherently lack the virtue of laziness.

So which one are they?

[0] https://news.ycombinator.com/item?id=47743628

LoganDark · 2026-04-14T06:43:48 1776149028

I think laziness is not a minimum of effort. Laziness can actually be more effort towards a simpler or more practical solution, because those solutions are more pleasant in some way, and therefore more attractive to pursue.

zarzavat · 2026-04-14T07:15:50 1776150950

Reminds me of Larry Wall's three virtues of a programmer: laziness, impatience and hubris.

brookst · 2026-04-14T03:24:42 1776137082

Do we know why it works for humans?

Models are trained on human outputs. It’s not super surprising to me that inputs following encouraging patterns produce better outputs; much of the training material reflects that.

latentsea · 2026-04-14T03:33:27 1776137607

> Do we know why it works for humans?

Try to figure it out. You can do it.

gxs · 2026-04-14T03:29:07 1776137347

If I had to wager a lazy, armchair guess, I think it forces it to think harder/longer

The answer is probably more straightforward than we think, e.g. “the user thinks I can do this so I better make sure I didn’t miss anything”

CivBase · 2026-04-14T03:34:59 1776137699

This seems pretty obvious, no?

It's pattern matching on training material. There is almost certainly an overlap between positivity and success in the training material. Positive prompts cause the pattern matching to weight towards positivity and therefore more successful material.

lamasery · 2026-04-14T03:59:31 1776139171

The training or system prompts have shoved the probabilities toward a space that tends to select “halt” sooner. You need to drag the probability weights around until they are less likely to reach “halt” so soon.

Nice language often sorta does this for whatever model(s) they looked at, and is also something people are likely to try. Probably lots and lots of nonsense token combos would work even better, but who’s gonna try sticking “gerontocratic green giant giraffes” on the end of their prompts to see if it helps?

Positive or negative language likely also prevents pulling the probabilities away from the correct topic, being so generic a thing. The above suggestion might only be ultra-effective if the topic is catalytic converters, for some reason, and push the thing into generating tokens about giraffes otherwise. How would you ever discover the dozens or thousands of more-effective but only-sometimes nonsense token combos? You’d need automation and a lot of brute force, or some better way to analyze the LLM’s database.

sm0ss117 · 2026-04-14T04:11:07 1776139867

Mathematics seems like the ideal candidate for AIs to achieve absurd results. It's a purely abstract grammar with true auto-verifiability. Even SWE has the requirement of interacting with real physical things. In math there's no external feedback required, you're solely bounded by the rate and quality of token generation.

drivebyhooting · 2026-04-14T04:33:07 1776141187

This misses the mark on at least two counts:

1. Proofs without human understanding have less value for mathematicians.

2. At least for now, interestingness depends on human judgment. It is subjective and not as verifiable.

sm0ss117 · 2026-04-14T22:29:06 1776205746

1. The four color theorem is a useful case study, for which the original proof was validated and 400 pages long. My prediction is that the first couple waves of proofs will be hard enough that a layman couldn't produce them, but simple enough that experts can verify them. Over time the most advanced proofs will get more and more complicated until humans can no longer verify them. This process could happen over the course of a few months or could take literally hundreds of years.

2. Especially early on, the overwhelming majority of the proofs are likely to be uninteresting, and novel mostly just because actually producing them would take expert time that's better spent elsewhere. That being said, as above, over time I expect the interestingness of proofs to go up until the models eventually produce interesting proofs regularly. The vast majority of proofs are likely to maintain their position as of no interest to humans for the simple reason that the vast majority of proofs are of no interest to humans.

In neither case will I make any particular guesses about a timeline beyond it seems like the way things will go.

dyauspitr · 2026-04-14T04:48:01 1776142081

Every new mathematician that comes along doesn’t know everything that has come before him. He needs to go learn all the math that his predecessors did. I don’t see how an LLM coming up with these proofs changes that.

streb-lo · 2026-04-14T05:17:47 1776143867

Because the problem space is basically infinite. If a person is working on a problem, it's probably interesting to at least one person. Randomly walking through the problem space might be interesting, but I don't know how the signal will fare against other humans.

meroes · 2026-04-14T05:38:27 1776145107

Grammar seems like you’re talking about LLMs specifically. Well, isn’t Sudoku just math? LLMs suck at Sudoku last I checked. When told not to code a solver, its very first deduction was wrong.

evenhash · 2026-04-14T15:30:05 1776180605

Generally when people talk about using LLMs to do mathematics research they’re not talking about the LLM alone, but the LLM + a harness that lets it write proofs and check them in a proof assistant such as Lean or Coq to validate its results.
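To make "validate" concrete, here is a tiny illustrative example in Lean 4 syntax. These statements are far simpler than anything in research, and they are not from the article. The point is only that the checker accepts a proof or rejects it, no matter who or what wrote it.

```lean
-- Checked by computation: the kernel evaluates both sides and compares them.
example : 2 + 2 = 4 := rfl

-- Reuses a core library lemma; Lean verifies the statement matches it exactly.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the LLM emits a wrong step, the file fails to compile, and that failure is the feedback signal the harness loops on.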

meroes · 2026-04-15T02:15:17 1776219317

I guess I just don’t have the experience or optimism that a harness around an LLM, which can’t make the first, bare deduction on its own, is a good use of compute.

I got out of RLHF, including games and puzzles, before agents took off and maybe I have outdated info. But we estimated RLHF’ing a single hard full sized sudoku was ~25 hours worth of work.

claysmithr · 2026-04-14T02:19:18 1776133158

I wonder when AI will be able to discern the passage of time

Buttons840 · 2026-04-14T03:29:55 1776137395

Can't you just give it the time in each prompt? Would that work?

I've seen this mentioned a few times though, so I think maybe it's more complicated than this?
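For what it's worth, the "give it the time in each prompt" idea is a one-liner in most harnesses. A minimal sketch follows, assuming you control the text sent to the model. The function name `withTimestamp` is hypothetical, and whether this actually fixes the planning problems described in this subthread is untested.

```javascript
// Prefix a prompt with the current UTC time so the model can see it.
// `now` is injectable so the behavior can be tested deterministically.
function withTimestamp(prompt, now = new Date()) {
  return `Current time (UTC): ${now.toISOString()}\n\n${prompt}`;
}

console.log(withTimestamp("What should I work on next?"));
```

This only tells the model what time it is now. It does nothing to help the model remember what was already done, which needs separate state passed in the context.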

1970-01-01 · 2026-04-14T02:38:07 1776134287

It already does time in prompt-blocks. It knows time is linear and what just happened, what happened before that, and what happened before that.

claysmithr · 2026-04-14T02:47:32 1776134852

When I tried to use it as an AI CEO and Life Coach, it never was able to discern time passing, what I've already done, what needed to be done. It just said the same stuff over and over, stuff I've already done. That and it's kind of stuck in the era it was trained in. If it felt time passing like a human maybe it would be conscious?

Nevertheless not having a sense of time makes it really bad at planning anything. I used Gemini Pro.

maplethorpe · 2026-04-14T02:33:12 1776133992

Altman has estimated one year until ChatGPT is capable of measuring time passed.

https://tech.yahoo.com/ai/chatgpt/articles/chatgpt-fails-mis...

x-complexity · 2026-04-15T04:31:28 1776227488

Taking the task at face value:

- 1 week to prototype: The tool + its accompanying JS sandbox + System prompt updates + context injection

- 11 months of public testing to go through i18n + a11y edge cases & fix them

ambicapter · 2026-04-14T02:50:13 1776135013

Sounds like Musk setting deadlines for Mars landings.

keyle · 2026-04-14T05:41:32 1776145292

It's so hard to predict you know, these planets keep moving...

VladVladikoff · 2026-04-14T02:57:57 1776135477

Can’t tell if you are being sarcastic but Altman’s whole job is to make bullshit near future predictions about rapid development of AI in the public.

random__duck · 2026-04-14T03:04:20 1776135860

Thank you for stating the obvious; for some reason we need to repeat this. ^^;

viccis · 2026-04-14T04:52:51 1776142371

There's no need to "estimate" it. "Time" is not something built into training and sampling a generative distribution. He might as well have told you your Naive Bayes email filters will measure time passed.

doubledamio · 2026-04-14T07:00:26 1776150026

All these overly optimistic articles about AI solving maths problems are very annoying. Can we agree that maths is not about solving problems, but about understanding them by developing a language and the conditions for new insights? It is misleading because GPTs do provide easy access to new information, but they do not deepen understanding.

I think AI-assisted research will likely have a very negative net impact on mathematics in the long run by lowering the average level of understanding within the community.

Also, research directions are influenced by what people can solve, and this will slowly shift research toward purely algebraic/symbolic manipulations that mathematicians no longer fully keep track of.

norejisace · 2026-04-14T03:26:42 1776137202

Interesting development. It feels like AI is getting much better at symbolic reasoning, not just pattern recognition.

440bx · 2026-04-14T06:30:42 1776148242

Boring mathematical reality here. This is nice and all that but as a (part time) corporate mathematician, I'd like an AI that organises conference trips, picks the best accommodation and food and gaslights the execs into approving it. Then fixes the perpetually broken coffee machine. Everything else for me starts on paper and is mostly undergrad level problems which I need to do by hand to keep my brain going for when I actually might need it one day. And with the geopolitical instability out there at the moment I'm not that willing to put my eggs into the basket.

pyuser583 · 2026-04-15T04:30:29 1776227429

I just want it to cook and clean.

themafia · 2026-04-14T02:56:51 1776135411

There are several high value prizes for mathematical research. Let me know when an "AI" has earned one of them. Otherwise:

> When Ryu asked ChatGPT, “it kept giving me incorrect proofs,” [...] he would check its answers, keep the correct parts, and feed them back into the model

So you had a conversational calculator being operated by an actual domain expert.

> With ChatGPT, I felt like I was covering a lot of ground very rapidly

There's no way to convert that feeling into a measurement of any actual value and we happen to know that domain experts are surprisingly easy to fool when outside of their own domains.

gxs · 2026-04-14T03:31:05 1776137465

Wow that was your takeaway?

> “2025 was the year when AI really started being useful for many different tasks,” said Terence Tao

I think I’ll go out on a limb and agree with Terence Tao, I think the dude is well known in the math community, or something

noobermin · 2026-04-14T03:34:32 1776137672

If anything his simping for AI models makes me more suspect of him than I ever was because my own eyes show me their limits.

jryle70 · 2026-04-14T04:11:15 1776139875

Any chance your eyes are wrong? Or only people who disagree with you are.

themafia · 2026-04-14T04:19:08 1776140348

> go out on a limb and agree with Terence Tao

Is AI his specialty?

> I think the dude is well known in the math community, or something

I believe this is called "appeal to authority." Which is why, instead of disagreeing with him, I suggested a more cogent endpoint that could be used to establish the facts the article's title suggests.

p1dda · 2026-04-14T03:48:36 1776138516

I think he means useful for mathematicians getting paid shilling for AI models

viccis · 2026-04-14T05:20:01 1776144001

What is the telos for AI chewing around the edges of pure math problems? Does AI care about math?

4ajsH17 · 2026-04-14T05:41:10 1776145270

[flagged]

homarp · 2026-04-14T06:24:33 1776147873

https://www.quantamagazine.org/about/ says "launched by the Simons Foundation in 2012"

and https://www.simonsfoundation.org/about/ has "Since its founding in 1994 by Jim and Marilyn Simons"

https://en.wikipedia.org/wiki/Jim_Simons explains how Jim Simons got rich.

The book 'The Man Who Solved the Market' - https://www.gregoryzuckerman.com/the-books/the-man-who-solve... is a nice read.

HN discussion on a review of the book - https://news.ycombinator.com/item?id=29392041

Wissenschafter · 2026-04-14T05:54:41 1776146081

More neo-luddite nonsense.

yabutlivnWoods · 2026-04-14T04:27:54 1776140874

We can define a Dyson Sphere in math.

We cannot build one.

AI outputting axiomatically valid syntax isn't going to be all that useful. It's possible to generate all axiomatically correct math with a for loop until the machine OOMs

Physics is not math and math is not physics.

djsjajah · 2026-04-14T04:44:10 1776141850

You just failed the Turing test.

keyle · 2026-04-14T05:38:20 1776145100

Maybe he passed the Turing test with 88.2% which is 1.8% higher than the competition.

goatlover · 2026-04-14T05:20:54 1776144054

The Turing test just failed you. I'll go one better, physics isn't reality, it's a model of reality utilizing math.

dugidugout · 2026-04-14T18:09:05 1776190145

And I'll go one better, you haven't said anything here at all, you've just left a representation of what you understand to be saying.

yabutlivnWoods · 2026-04-14T06:12:55 1776147175

Fortunately for me equivalents to Turing exist: https://en.wikipedia.org/wiki/Turing_machine_equivalents

djsjajah · 2026-04-14T22:59:46 1776207586

I don't follow. Can you explain how your comment is relevant to mine? It might help if you also explain how you interpreted my comment.