Hacker News: uberman's comments


I grew up in a city; my wife grew up on a ranch several sections large. We have lived in large, dense cities, and we currently live on a "smaller" ranch. I have taken densely packed subways to work, and I have walked more than a mile home across my own property after running out of gas. Here is my observation: neither camp typically has a clue about why the other might be motivated by a different opinion.

Take your .22 rifle. Many truly rural families would feel that this tool was essential. Same for having a knife on you. We have lived where rattlesnakes and coyotes were an almost everyday thing. Not so much rattlesnakes, but certainly coyotes. In fact, bear and cougar were not out of the question. The idea that I would allow my kids to wander the property without a .22 in their 4-wheeler seemed risky. They were expected to know how to shoot, just as they were expected to know how to ride a horse and drive a tractor unsupervised. We taught them to be safe and could not have run our ranch without our girls taking on some big, dangerous responsibilities.

We have also lived in big cities where many of the liberties we enjoyed in the country would have been insane. The idea that any random teen should be allowed to drive an 80hp tractor around, or carry a gun or a fixed-blade knife, was insanity. Just as allowing my kids to run down the sidewalk or play unsupervised in the park after dark was insanity. On our first day after moving, my eldest daughter ran down the sidewalk and was hit (but not injured) by a car coming out of a driveway.

She just had no clue about how cars in a dense city moved. There just are different life rules that apply in different situations. Guns can be critically important in one environment and absolutely insane in a second. Same goes for driving a tractor that could kill you or a family member or going to a park after dark.

Unless people understand that a different environment might require a different set of norms or even laws we can't have a productive urban/rural conversation. Of course I can drive my ATV along your fence line. You probably can't even see it from your home or hear it. Though you can bet my dad asked your dad for permission 40 years ago. Try running your unregistered, unlicensed ATV through your suburban neighbor's yard and you will find out why there are important laws preventing you from doing what was perfectly fine in the back country.


Fascinating read. I know nothing about any of this, neither the parties involved nor Copperhead, though I had heard of Graphene. To that end, I wish the response included a preamble for those like me who were not familiar with what was going on. I guess I could probably read the Wired article, though. Still, a good read, and I loved the Q and A at the end.

Every tracker is likely chomping at the bit to purchase your chats. I don't know, but I also would not be surprised to find out that every major player already sells this data. Cloud computing, at some level, is just an extraction tool for information brokers.

I would not pool my data with others nor sell it if I knew how to prevent such things from happening though I do believe they are happening.


You are right, it's likely happening. The pitch I'm chewing on flips that: you own the pipe, you decide if anything leaves, and if it does, you get paid instead of the platform. Curious what would make you trust a setup like that, or is it a hard no regardless? Maybe you also get to see what PII you have, and the platform sanitizes it...

On actual code, I see what you see: a 30% increase in tokens, which is in line with what they claim as well. I personally don't tend to feed technical documentation or random prose into LLMs.

Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?"

Yesterday 4.6 was a great option and it is too soon for me to tell if 4.7 is a meaningful lift. If it is, then I can evaluate if the increased cost is justified.


Yeah, that was an interesting discovery in a development meeting. Many people were chasing the next best model, though for me, Sonnet 4.6 solves many tasks in 1-2 rounds. I mainly need to focus on context, instructions, and keeping tasks well-bounded. Keeping the task narrow also simplifies review and staying in control, since I usually get back smaller diffs I can understand quickly and manage or modify later.

I'll look at the new models, but increasing token consumption by a factor of 7 on Copilot, and then running into all of these budget-management issues people talk about? That seems to introduce even more flow-breakers into my workflow, and I don't think it'll be 7 times better. Maybe for some planning and architectural work where I used Opus 4.6 before.


I wonder if there are different use cases. You sound like you’re using an LLM in a similar way to me. I think about the problem and solution, describe what I need implemented, provide references in the context (“the endpoint should be structured like this one…”) and then evaluate the output.

It sounds like other folks are more throwing an LLM at the problem to see what it comes up with, more akin to how I delegate a problem to one of my human engineers/architects. I understand, conceptually, why they might be doing that, but I know that I stopped trying it because it didn't produce quality. I wonder if the newer models are better at handling that ambiguity.


I don't understand how people measure how much more or less work they need to do. It's not that gpt-4o was incapable of producing enormous amounts of code quickly; it's that the tokens were relative garbage.

How do you have an opinion on 4.6/4.7 here? It's less clear, but I could totally see 4.7 or beyond leading to project completion 20% faster by removing dead ends and footguns and requiring less backtracking.

How to tell / measure effectively? No clue.


My personal opinion here, based on observation rather than empirical testing: 4.5 could generate code, but I often ran out of context and the results were regularly incomplete. The result was that I had to spend as much time proofing and debugging as I did making direct progress.

4.6 has what in practice seems to be an almost unlimited context window and rarely produces incomplete or flat-out wrong results. That is a big step forward, though I do burn through quota much faster.

I have not formed an opinion yet on what 4.7 does for me, other than to say I have observed my quota being consumed faster. To be fair, I have not put 4.7 to a challenging task yet.

It honestly surprises me that someone who regularly uses Claude would not have an opinion about 4.6, or even Opus vs Sonnet, at this point. The lift, at least for me, was obvious.


Haven't people been complaining lately about 4.6 getting worse?

People complain about a lot of things. Claude has been fine:

https://marginlab.ai/trackers/claude-code-historical-perform...


I will be the first to acknowledge that humans are a bad judge of performance and that some of the allegations are likely just hallucinations...

But... are you really going to rely completely on benchmarks that have time and time again been shown to be gamed as the complete story?

My take: It is pretty clear that the capacity crunch is real and the changes they made to effort are in part to reduce that. It likely changed the experience for users.


While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval runs from 35% to 65% pass rates, a full factor of two performance difference.

Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked yet none correspond to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.


Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected and plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e. probably through the API and not in interactive mode.)
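To put numbers on that variance, here's a quick sketch using a plain normal-approximation interval (an assumption on my part; MarginLab's actual interval construction may differ) of what a 50-task daily run can resolve:

```python
import math

def pass_rate_ci(passes: int, n: int, z: float = 1.96):
    """Approximate 95% confidence interval for a pass rate
    measured on n tasks, using the normal approximation."""
    p = passes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# 25/50 passes on one daily run
lo, hi = pass_rate_ci(25, 50)
print(f"{lo:.0%} - {hi:.0%}")  # prints 36% - 64%
```

With only 50 tasks, a true 50% pass rate can plausibly show up anywhere in roughly the 36-64% band on a given day, which is why single-day swings tell you very little.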

> Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds

That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believed to be) a better model than 5.2, so if that's invisible in the benchmarking scores then the protocol has problems. The test probably should include cases that are 'easy passes' or 'near-always failures', and then paired testing could offer greater precision on improvements or degradations.

Conversely, if model providers also don't do this then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well-covered by the measured and monitored benchmark.


Here's a sample-size calculator that may help illustrate the issue: https://sample-size.net/sample-size-proportions/ Put in the benchmark score of one model as p₀ and of the other model as p₁ (as a fraction between 0 and 1) and observe what kind of sample size you need to reliably observe a significant difference. The largest change between GPT 5.2 and 5.4 highlighted in https://openai.com/index/introducing-gpt-5-4/ is OSWorld-Verified going from 47.3% to 75.0%. That's quite the difference, right? So plug in 0.473 and 0.75 and note that the required sample size per model is 55. For the software engineering tasks in SWE-Bench Pro, the change from 55.6% to 57.7% is a whopping 2.1 percentage points, which you can detect with a mere 8836 samples.
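For what it's worth, those calculator outputs appear to match the standard two-proportion sample-size formula with the Fleiss continuity correction, assuming a two-sided α of 0.05 and 80% power (my assumptions about what the site computes, not documented facts); a sketch:

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p0: float, p1: float,
                                alpha: float = 0.05,
                                power: float = 0.80) -> int:
    """Per-group sample size to detect p0 vs p1 in a two-sided
    test, with the Fleiss continuity correction applied."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p0 + p1) / 2
    delta = abs(p1 - p0)
    # Uncorrected two-proportion formula
    n = ((z_a * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_b * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) / delta) ** 2
    # Fleiss continuity correction
    n_corr = (n / 4) * (1 + math.sqrt(1 + 4 / (n * delta))) ** 2
    return math.ceil(n_corr)

print(sample_size_two_proportions(0.473, 0.75))   # OSWorld-Verified jump -> 55
print(sample_size_two_proportions(0.556, 0.577))  # SWE-Bench Pro bump -> 8836
```

This reproduces the 55 and 8836 figures, which is some evidence the assumed formula is the one the calculator uses.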

I'm sure someone in charge of benchmarking at OpenAI knows how statistics work and always makes sure to take a sufficiently large number of samples when comparing different models, but for most other people who want to know which model is better, the answer is unlikely to be worth the cost of measuring it precisely enough to find out.


Matrix also found that Claude was A/B testing 4.6 vs 4.7 in production for the last 12 days.

https://matrix.dev/blog-2026-04-16


That performance monitor is super easy to game if you cache responses to all the SWE bench questions.

You dramatically overestimate how much time engineers at hypergrowth startups have on their hands

There's a direct business incentive to game/cheat benchmarks, it wouldn't even be difficult to do, and besides, they have workforce-replacing AI to do it for them.

Caching some data is time consuming? They can just ask Claude to do it.

Your link shows there have been huge drops.

How is it fine?


No, we increased our plans

How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for long, if you still have one at all.

I was trying to figure out earlier today how to get 4.6 to run in Claude Code, as part of the output it included "- Still fully supported — not scheduled for retirement until Feb 2027." Full caveat of, I don't know where it came up with this information, but as others have said, 4.5 is still available today and it is now 5, almost 6 months old.

I'm still using 4.5 because it gets the niche work I'm using it for where 4.6 would just fight me.

Opus 4.5 is still available

Wow, they hosted it for 6 months. Truly LTS territory :)

Did you mean $45m?

At 15 mph, it would take a car less than 2 seconds to travel from the back of a bus to the front. Are you suggesting that the driver of the bus was able to open and then close their door in less than two seconds? Alternatively, are you suggesting that your daughter was driving slower than 15 mph yet was unable to stop?

I certainly believe there is room for discretion when officers write tickets, but not for passing a school bus.


While I acknowledge there is now a legality question around the use of "red light cameras," I have no sympathy for people who are not stopping for school buses. I can't stomach that the article frames this as a "burden" on those driving past the bus.

"there’s evidence the program is heavily burdening residents who either can’t or don’t pay the fines."


I'm not condoning vandalism but I can empathize with the feeling that this alien thing is in my personal space, is a motorized vehicle on the sidewalk, is just as likely to cause a fall that no one will be held accountable for, and is taking a job from someone. I can see how that would be rage inducing. Perhaps surrounding it with traffic cones would be a better plan than actually damaging it.

I lived in Philadelphia (Center City), and my other reaction, based on simply attempting to keep a flowerpot on my doorstep, is: why have people not just stolen it yet?


These devices are a form of social pollution, whereby the desires and demands of others are mechanically proxied into common spaces.

When you negotiate passage with others on the causeway, you are involved in one-on-one human exchanges with parity; each encounters the others on the level of interpersonal status, which is about the ways humans show respect for each other.

But there can be no respect given nor received with a robot. It's an engine that's in competition for your space, presents as both a mechanical advantage and as handicapped, is not interesting nor appropriate to meet, and generally responds so stupidly and unpredictably that it's hazardous-- which makes its insertion into the commons an offense.

Combine the need for vigilance and avoidance with the realization that the robot annoyance is a proxy for someone else's privilege, and that robots are instruments of private property extending deep into common spaces, and it's no surprise to find people who are encroached upon by robots manifesting their displeasure through sabotage.


Isn't it the expected thing that LLMs degrade over time?

