Hacker News

> LLMs with harnesses are clearly capable of engaging with logical problems that only need text.

To some extent. It's not clear where specifically the boundaries are, but it seems to fail to approach problems in ways that aren't embedded in the training set. I certainly would not put money on it solving an arbitrary logical problem.




> To some extent. It's not clear where specifically the boundaries are, but it seems to fail to approach problems in ways that aren't embedded in the training set. I certainly would not put money on it solving an arbitrary logical problem.

In what way can you falsify this without the LLM being omniscient? We have examples of it solving things that are not in the training set: it found a vulnerability in 25-year-old BSD code that had gone unspotted by humans, and not a trivial one either.


Here's an odd example from my own testing: I design very complex board and card games, and LLMs are terrible at figuring out whether the rules make sense, or even at restating them in different wording.

I thought they would be ideal for the job, until I realized that they just pretend the rules work because they look like board game rules. The more you ask them to restate, manipulate, or simulate the rules, the more you can tell they're bluffing. They literally act as if every complicated set of rules works perfectly.

> it found a vulnerability in 25-year-old BSD code that had gone unspotted by humans

I don't think the age of the code makes the problem more complex. Finding buffers that are too small is not rocket science; bothering to look at some corner of a codebase that you've never paid attention to or seen a problem with is. AI being cheap enough to sic on the parts of a codebase nobody ever carefully looks at is a great thing. It's not genius on the part of the AI.


Re: cheap - Anthropic’s write-up said it cost $20,000 worth of runs to find that bug (and a few others). So not that cheap compared to other tools - closer in cost to a human review/pentest, but probably more exhaustive.

> This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings.

They don’t talk about the other findings, so I’m guessing they are minor.
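For scale, the quoted numbers work out to a modest per-run cost even if the whole campaign isn't cheap. A back-of-the-envelope check (using the write-up's stated figures as an upper bound):

```python
# Figures from the quoted Anthropic write-up: under $20,000 across
# a thousand runs through their scaffold.
total_cost_usd = 20_000   # stated upper bound on total cost
runs = 1_000

cost_per_run = total_cost_usd / runs
print(cost_per_run)  # -> 20.0 (dollars per run, at most)
```

So each individual run is cheap; it's the exhaustiveness of a thousand runs that adds up to human-review prices.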


> Here's an odd example from my own testing: I design very complex board and card games, and LLMs are terrible at figuring out whether the rules make sense, or even at restating them in different wording.

I'm positive that they are perfectly fine and will do a pretty good job. Did you actually try it?


Eh, I can see their point, I think. The models can restate the rules differently, I'm sure, but it sounds like the GP is saying that LLMs can't tell whether the rules are well-balanced.

It would be interesting to see some example problems along those lines. Design some games with complex rules, including one or two of the most subtle game-wrecking bugs you can think of, and ask the models if they can spot them.
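One mechanical baseline for that kind of test: many game-wrecking bugs (e.g. a rules combo that lets a player loop forever) show up as cycles in the state graph of legal moves. Here's a toy sketch with a made-up ruleset, just to illustrate the shape of the check, not any real game:

```python
# Hypothetical toy ruleset: each state maps to the states reachable
# in one legal move. The bug planted here is that discarding lets you
# draw again, forever -- a degenerate infinite loop.
RULES = {
    "start": ["drew"],
    "drew": ["discarded", "won"],
    "discarded": ["drew"],   # the game-wrecking combo
    "won": [],               # terminal state
}

def find_cycle(rules, start):
    """DFS from `start`; returns the first reachable cycle, or None."""
    path = []
    def dfs(state):
        if state in path:
            # revisited a state on the current path: that's a cycle
            return path[path.index(state):] + [state]
        path.append(state)
        for nxt in rules.get(state, []):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        path.pop()
        return None
    return dfs(start)

print(find_cycle(RULES, "start"))  # -> ['drew', 'discarded', 'drew']
```

An LLM spotting the same combo from the rules-as-prose, without being able to enumerate states like this, is the harder and more interesting version of the test.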

In fact that sounds more interesting the more I think about it. Intensive RL on that sort of thing might generalize in... let's say useful ways.


I would love to see examples, but I suspect we won’t. I’m happy to be proven wrong in my expectation that an LLM will do worse than a fairly smart human (one without prior experience with the board game).

I'm just saying I'd rather hire a human that can be reasoned with than rely on software that can't be. At least where reasoning is involved.

Granted, I don't do a lot of needle-in-the-haystack work like finding vulnerabilities where search will naturally dominate.

Also, I imagine most reasoning involved in exploits will be found in the training sets—there are only so many patterns of exploitation found in formal languages.


Solving arbitrary logical problems seems to be equivalent to solving the halting problem, so you are probably wise not to make that bet.
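The equivalence is the standard diagonalization argument, nothing LLM-specific. A sketch: suppose some predicate `halts(f)` (hypothetical, no such total predicate can exist) claims to decide whether calling `f()` terminates; then you can build a program that does the opposite of whatever it predicts about itself.

```python
# Classic diagonalization sketch: defeat any claimed halting decider.
def make_contrarian(halts):
    """Given a claimed decider halts(f), build a function it gets wrong."""
    def contrarian():
        if halts(contrarian):
            while True:   # predicted to halt -> loop forever
                pass
        # predicted to loop -> return immediately
    return contrarian

# A decider that answers "loops forever" is contradicted on the spot:
c = make_contrarian(lambda f: False)
print(c())  # -> None (it returned immediately, i.e. it halted)
```

The same construction refutes a decider that answers "halts" (its `contrarian` loops forever), so no total `halts` can be right about every program.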


