I'm not sure what people are on in the comments. It doesn't beat the other models, but it sure competes despite its size.
GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB.
Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost ~600GB.
This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.
Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.
> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.
For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.
For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.
> This beats the latest Sonnet while running locally
Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
>Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This has been my experience as well. I've been testing an agent built with Strands Agents which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here.
My notes so far:
"us.anthropic.claude-sonnet-4-6" # working, good results
"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions
"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results
"us.anthropic.claude-opus-4-5-20251101-v1:0"
"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive
"amazon.nova-pro-v1:0" # completely fails
"openai.gpt-oss-120b-1:0" # tool calling broken
"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet
Using models on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize per model. Really, the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker: Sonnet is also the cheapest, since it supports prompt caching.
The Kimi ones were close to working but didn't quite make the mark
> it supports prompt caching
May I ask if you checked that?
I use {"cachePoint": {"type": "default"}} and I found 2 things:
* 1) even though it's stated in the docs, the Bedrock Converse API does not allow a 1h expiry time, only 5m; it returns an error when attempted;
* 2) the Bedrock Converse API does accept up to 4 cachePoints but does NOT actually cache, and returns zeroes. LOL. This was confirmed by other people on GitHub.
(Note: VertexAI does cache properly reducing the bill drastically, so I use Vertex instead of OpenRouter.)
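For anyone who wants to reproduce the check, here's a minimal sketch of where the cache point goes in a Converse request. The boto3 call itself is commented out (it needs AWS credentials); the usage fields I'd inspect are `cacheReadInputTokens` / `cacheWriteInputTokens`, but verify the exact names against the current docs.

```python
# Minimal sketch, assuming boto3's bedrock-runtime Converse API:
# the cachePoint block goes after the static prompt content, marking
# everything before it as cacheable (if the provider actually honors it).

def build_system_blocks(static_prompt: str) -> list:
    """Converse-style system blocks with a default cache point."""
    return [
        {"text": static_prompt},
        {"cachePoint": {"type": "default"}},
    ]

# Usage (not executed here; needs AWS credentials):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(
#     modelId="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
#     system=build_system_blocks(long_static_prompt),
#     messages=[{"role": "user", "content": [{"text": "hi"}]}],
# )
# Zeroes in resp["usage"]["cacheReadInputTokens"] /
# ["cacheWriteInputTokens"] across repeated calls = the bug described above.
```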
> The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.
Very valid. This is an active area of research, and there are a lot of options to try out already today.
- People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
- Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
- DFlash (block diffusion for speculative decoding) needs a good drafting model compatible with the big model, but can provide an uplift of up to 5x in decoding (although usually in the 2-2.5x range).
- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically lower thinking output (faster effective result generation) although that has been more impactful on smaller models.
We should be skeptical, but it's definitely trending in the right direction and I wouldn't be surprised if we are indeed able to run it at acceptable speeds.
> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This hasn't been my experience. After Anthropic's started their shenanigans I've switched to exclusively using open-weights models via OpenRouter and OpenCode and I can't really tell a difference (for better or for worse).
> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
All the Q quants from big quant providers are importance-weighted (imatrix) nowadays.
The main (possibly only?) difference between Q and IQ today is that IQ uses a lookup table to achieve better compression. That is also why IQ suffers more when it can't fully fit into VRAM.
It's important to teach people the distinction and not perpetuate wrong assumptions of the past. If one needs/wants static quants, ignoring IQ_ isn't enough.
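To make the distinction concrete, here's a toy sketch (pure Python, illustrative values, not the actual llama.cpp codebooks): uniform quants decode weights as scale * int, while IQ-style quants decode by indexing a small non-uniform lookup table, which is exactly the extra indirection that hurts when weights spill out of VRAM.

```python
# IQ-style (codebook) quantization sketch: store only small indices,
# decode via table lookup into non-uniformly spaced values.

def quantize_codebook(weights, codebook):
    """Map each weight to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(w - codebook[i]))
            for w in weights]

def dequantize_codebook(indices, codebook):
    """Dequantization is a pure table lookup (the VRAM-sensitive step)."""
    return [codebook[i] for i in indices]

codebook = [-1.0, -0.25, 0.0, 0.25, 1.0]   # non-uniform spacing
weights = [-0.9, -0.1, 0.05, 0.8]
idx = quantize_codebook(weights, codebook)      # -> [0, 2, 2, 4]
recon = dequantize_codebook(idx, codebook)      # -> [-1.0, 0.0, 0.0, 1.0]
```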
> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
Where can I find more info on this? I’d like to convert models to onnx this way.
> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
Where can I find more info on this? I’d like to convert models to onnx this way.
The most difficult environment for small models is in the browser. Would be great to push the SOTA in that environment.
> being able to run a model and being able to run a model fast are two very different thresholds
Concretely: on my Strix Halo machine with a (theoretical) memory bandwidth of 256 GB/s, a 70 GB model can't generate faster than 256/70 ≈ 3.66 t/s. The logic is that a dense model must do a full read of the weights for each token, so even if the GPU can keep up, memory bandwidth is the limit.
A Mac M5 Pro is faster with a bandwidth of 307 GB/s, but that's only a little faster.
This thing is going to be slow on consumer hardware. Maybe that is useful for someone, but I probably prefer a faster model in most cases even if the model isn't quite as smart. Qwen3.6 35B-A3B generates about 50 t/s on my machine, so it can make mistakes, be corrected, and try again in the same time that this model would still be thinking about its first response.
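For anyone who wants to plug in their own numbers, that bound is just bandwidth divided by model size (a rough ceiling that ignores KV-cache reads and assumes a dense model streaming all weights each step):

```python
# Rough upper bound on single-stream decode speed for a dense model:
# every token requires one full pass over the weights, so
# bandwidth / model size caps tokens per second.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

print(max_tokens_per_sec(256, 70))  # Strix Halo, 70 GB model: ~3.66 t/s
print(max_tokens_per_sec(307, 70))  # 307 GB/s machine: ~4.39 t/s
```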
Recent models support multi-token prediction, which can guess multiple future tokens in a single decode step (using some subset of the model itself, not a separate drafting model) and then verify them all at once. It's an emerging feature still (not widely supported) and it's only useful for speeding up highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.
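A toy sketch of the verify half of that loop (names are hypothetical; real implementations score all drafted positions in one batched forward pass, then keep the agreeing prefix):

```python
# Speculative / multi-token decoding sketch: draft cheap guesses, verify
# with the full model, accept the longest matching prefix. Output is
# identical to plain autoregressive decoding; only speed changes.

def accept_drafted(drafted: list, verified: list) -> list:
    """Keep the longest prefix of drafted tokens the full model agrees with."""
    accepted = []
    for d, v in zip(drafted, verified):
        if d != v:
            break  # resume normal decoding from the first mismatch
        accepted.append(d)
    return accepted

accept_drafted([1, 2, 3, 4], [1, 2, 9, 4])  # -> [1, 2]
```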
Thank you, I just realised we are talking about MTP. It seems that it's not that clear though.
"Currently, the MTP capabilities are primarily accessible through Google's proprietary LiteRT framework, rather than the open-weights versions... Despite the missing MTP heads in the open release, Gemma 4 (specifically the 26B-A4B variant) still demonstrates high efficiency"
Being able to run a model fast is definitely more useful, but being able to run a model slowly for free is still super useful. Agentic workflows are maturing all the time.
Yes, if I'm directly interacting with the LLM, I want it to be reasonably fast. But lately I've been queueing up a bunch of things when I go for lunch, or leaving things running when I go home at the end of the day. And Claude doesn't keep working on that all night; it runs for an hour or so, gets to a point where it needs more input from me, and gives me some stuff to review in the morning. That could run 16x slower and still be just as useful for me.
Cloud hardware is not inherently more "proper" than what's being proposed here, there's nothing wrong per se about targeting slower inference speeds in an on prem single-user context.
> Cloud hardware can run the original model. Quantization will reduce quality.
New models are often being released in quantized format to begin with. This is true of both Kimi and the new DeepSeek V4 series. There is no "original model", the model is generated using Quantization Aware Training (QAT).
> There is no "original model", the model is generated using Quantization Aware Training (QAT).
The original model is the model used for the benchmarks
People will say "You can run it locally!" then show the benchmarks of the original model, but what they really mean is that you can run a heavily quantized adaptation of the model which has different performance characteristics.
That remark was specific to newer models like Kimi 2.x and DeepSeek V4 series, and this is clearly stated in my comment.
As for other models, we quantize them because we are generally constrained by the model's total footprint in bytes. Running a larger model quantized to fit the same footprint as a smaller one generally improves performance compared to the smaller original, down to about Q4 or so; even tighter quantizations (down to Q2) remain usable for some purposes such as general Q&A chat.
The quantization for some models can be very detrimental, and their quality can drop considerably from the posted benchmarks, which are probably at bf16; this is why having considerable RAM can be important.
Sure but for a casual conversational use case I have not found speed to be a huge barrier. I chatted with a 100b model using ddr5 only on a plane recently and it was fine. It's mainly that I cannot do data classification and coding tasks in a timely manner.
That is insane. If you billed me an extra $200 for a bug in your system, I'd flat out cancel my subscription. If you're not going to credit that back to me, you don't deserve any more of my money. I'm a Claude-first guy, but if you're going to bill me incorrectly, that's on you: own it, fix it.
Where? All I see is Boris saying "we are unable to issue compensation for degraded service or technical errors that result in incorrect billing routing".
Keep this in mind next time you hear someone talking about "removing the human in the loop".
Anthropic apparently won't take responsibility for issues their own systems handling billing cause. You think they'll take responsibility in your system when a bug in their models can be demonstrated as the cause?
> Anthropic apparently won't take responsibility for issues their own systems handling billing cause.
I think with every org, especially the big ones, trying to dodge responsibility (designing "customer support" to annoy you until you buzz off), the only recourse people have is to generate enough bad press that they wake up and issue the refund; it's less than a rounding error for them.
I think Anthropic is hardly unique in that position and being able to chat with a human with any sort of power to actually make things right is becoming more and more rare. If any human eyes saw that, the correct thing to do would probably be passing the message up the chain like "Hey, this will have really bad optics if we don't do the right thing. Can you take like 5 minutes and hit the refund button while I draft up a nice message about it?"
Bad press is meaningless where it matters most these days. The kind of people who are most responsive to threats of bad press are the kind of people who don't need to be threatened with bad press to do the right thing.
I really wish it carried any weight. It just doesn't. If someone at the organization just says "never admit fault, always attack", it's very likely they'll get away with it.
> For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable.
Before February I was able to use Opus on High exclusively on my Max plan, no problem. Now I've shifted to just using Sonnet on High and yeah, it's pretty capable. I love that, Claude Pilled. ;)
Yeah I love Claude, amazing models. Anthropic has very quickly burned most of the goodwill I had for it so I still ended up cancelling my subscription.
The benchmarks we're using to measure LLMs do no justice when everyone's mental benchmark is simply "is it going to feel like using Claude", and the answer is still no. The entire LLM space is stuffed with tons of crazy data points and vernacular that barely paint the picture of the mental benchmark everyone is after.
I too am desperate to just sever ties with these big providers. My fingers are crossed that we get there within the constraints of local hardware, even if that means spending $3-5k; I just want off this wild ride.
Not sure if 1M token window is meaningful with Sonnet/Opus. The models go dumb quickly as context increases making them unusable (that is if you get routed to actual Opus, otherwise they are just dumb regardless of context window).
Yeah, you can run it locally if you have enough VRAM, but the reports trickling in say about 3 tok/sec. That was on a Strix Halo box, which definitely has the needed VRAM but doesn't have as much memory bandwidth as a GPU card; it's going to be similar on a Mac. That's the dilemma: the unified memory machines have the VRAM, but the bandwidth isn't great for running dense models. A dense model this size is only going to be (usefully) runnable by the very few people with multiple GPU cards whose memory adds up to about 70GB.
I don't think this is quite correct, a Strix Halo box usually has 256 GB/s memory bandwidth. An M5 Max has 614 GB/s. An M3 Ultra (no M4 or M5 Ultra) has 820 GB/s. It's still not GDDR or HBM territory, but still significantly faster.
That's the edge of Apple Silicon for AI. When they scale up the chip they add more memory controllers which adds more channels and more bandwidth.
But yeah in the end it's still going to be only a handful of people that can run it.
What I meant is that I think researching and developing smaller, more powerful models is more interesting than chasing the next 3T parameter model while burning through VC money and squeezing your customer base more and more aggressively.
That's more a testament to how good Qwen3.6 27B is (it really is great) than to how bad this one is, IMO. Gemma 4 31B was already good, but Qwen3.6 27B is incredible for its size.
Good vs. bad models is relative: if this had been released in 2020 it would have been earth-shattering. But releasing a model today that's only on par with open-source dense models a quarter of its size, and soundly beaten by open-source MoEs with active param counts a quarter of its size, is kind of a flop. The niche for this is basically no one: it'll run at near-zero TPS for the few local-model aficionados with enough hardware to try it out, and it's lower throughput and lower quality for people trying to use it at scale.
I'm rooting for Mistral, I want them to release good models. This just isn't one. It's a little sad since they once were so prominent for open-source.
Who knows — if they have the compute to train this, they have the compute to train an MoE that's 3-4T total params with 128B active. Maybe they'll make a comeback (although using Llama 2 attention is... not promising). I hope they do.
DeepSeek v4 Flash is still over 100GB at Q4 IIRC, and Q4 has generally been the sweet spot. Although it's an MoE, so it might run a lot faster than this dense Mistral model if you have the RAM.
"Q4 has generally been the sweet spot" for self-hosting, yes. For any real meaningful work it's dumb AF. The only way to get reasonable intelligence from mid-size Gemma or Qwen is to run full precision BF16. Anything else is just an emulation of AI.
Very valid. Importance-weighted quantization and TurboQuant on model weights can reduce loss a lot compared to "traditional" Q4 so one can be hopeful.
> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality
But you will own no computer, and that's also assuming prices stay what they are. Anyway my point was not whether or not it makes financial sense for everyone. A lot of people are very happy not owning their movies, software, games, cars or house. I'm just happy there is a future where the people can own and locally run the tech that was trained on their stolen data.
@simjnd, I hate this idea, but do you remember how radio was regulated to death? And how fast you'd be triangulated if you decided to run a "self-hosted" radio station today? My bet is that in 5 years not only owning an AI-inference-capable computer but using AI itself will be regulated. Essentially, we'll have to scan biometrics just to ask any SOTA model to "summarise this".
Why? Because capable and free models at the dawn of AI almost made people think again and - oh oh - ask questions!
I recommend using OpenRouter (openrouter.ai). Basically a broker between inference providers and you which allows you to pick, try, and switch models from a massive catalog, extremely transparent about usage and pricing.
I've had a decent experience with ollama cloud. It is slower than going thru openrouter but much, much cheaper -- the generosity of their $20 plan reminds me of what the Claude Code $20 plan was back in the day
You could run it on a single Mac Studio with M3 Ultra, or two Mac Studios with M4 Max at higher perf than that. And lightly quantizing this could give us modern dense models in the ~80GB size range, which is a very compelling target.
It still wouldn't matter much. The M3 Ultra has 819GB/s unified memory bandwidth, so the theoretical max token rate is 819/128 ≈ 6.4 t/s. At 80 GB (5-bit quantization) it's still only about 10 t/s, far from a good coding experience. And these are theoretical maxima; real-world token generation rates would be at least 15-20% lower.
Are you dumb because you're not Einstein? Intelligence is a spectrum. Just because you're not #1 doesn't mean you're dumb. A lot of small models are not frontier but are still very competent and make very useful coding agents. It may take better prompting and more guiding, but that can be a reasonable tradeoff for some people.
I would love to be able to run frontier locally, but I think the larger importance of open weight models is price accountability.
In the US with our broken system of capitalism, it’s the only way we can tether these companies to reality. Left to their own devices, I’m not convinced they would actually compete with each other on price.
But nobody likes to talk about how "moat" building is fundamentally anti-competitive, even in name.
Funny that self proclaimed capitalists hate the system in practice. Commodity pricing is what truly terrifies them.
I'm not necessarily interested in having frontier locally. You don't need to be frontier to be a very good and useful coding agent. I agree with your point on price accountability though. Hopefully no tariff comes down on the Chinese and European open-weight models.
Probably a testament to how good Qwen3.6 is considering Qwen3.6-35B-A3B is not only ahead of their similar weight class XS.2 but also their M.1 (close to 10x bigger at 225B-A23B).
Interestingly, Gemma 4 26B-A4B and Qwen3.6 27B (dense) have been left out of the comparison.
The smaller models are becoming very good, and quantization techniques like importance weighting and TurboQuant on model weights let you run aggressively quantized versions (IQ2, TQ3_4S) on consumer hardware with surprisingly acceptable perplexity and quality loss.
Also from a security perspective. People have been able to extract copyrighted code / API keys some LLMs have been trained on before. If you opt-in to this, your / your company code will be used to train and improve the model. People may then be able to extract that from the model. Another threat vector.
I also think Steam does a great job at hiding it, and the new recommendation page is really great IMO. Other than some generic AAA, it introduced me to really great games I enjoyed, based on my play history.
The more content is available, the more curation is important and IMO their algorithm currently does a good job at it.
There are some odd cases like that, but you can always "Ignore" a game and it'll never show up again. That also feeds into Steam's curation for you based on your interests.