I'm not sure what people are on in the comments. It doesn't beat the other models, but it sure competes despite its size.
GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB.
Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost ~600GB.
This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.
Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.
> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.
For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.
For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.
> This beats the latest Sonnet while running locally
Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
>Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This has been my experience as well. I've been testing an agent built with Strands Agents which receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here.
My notes so far:
"us.anthropic.claude-sonnet-4-6" # working, good results
"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions
"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results
"us.anthropic.claude-opus-4-5-20251101-v1:0"
"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive
"amazon.nova-pro-v1:0" # completely fails
"openai.gpt-oss-120b-1:0" # tool calling broken
"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet
Using models on AWS Bedrock. I let Claude Code w/ Opus 4.7 iterate over the agent prompt but didn't try to optimize per model. Really, the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker: Sonnet is also the cheapest, since it supports prompt caching.
The Kimi ones were close to working but didn't quite make the mark
> it supports prompt caching
May I ask if you checked that?
I use {"cachePoint": {"type": "default"}} and I found 2 things:
* 1) even though it's stated in the docs, the Bedrock Converse API does not allow a 1h expiry time, only 5m; it returns an error when attempted;
* 2) the Bedrock Converse API does accept up to 4 cachePoints but does NOT actually cache, and returns zeroes. LOL. This was confirmed by other people on GitHub.
(Note: VertexAI does cache properly reducing the bill drastically, so I use Vertex instead of OpenRouter.)
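For anyone who wants to reproduce the check, here's a minimal sketch of where the cache point goes in a Converse request. The boto3 call itself is commented out (it needs AWS credentials); the usage fields I'd inspect are `cacheReadInputTokens` / `cacheWriteInputTokens`, but verify the exact names against the current docs.

```python
# Minimal sketch, assuming boto3's bedrock-runtime Converse API:
# the cachePoint block goes after the static prompt content, marking
# everything before it as cacheable (if the provider actually honors it).

def build_system_blocks(static_prompt: str) -> list:
    """Converse-style system blocks with a default cache point."""
    return [
        {"text": static_prompt},
        {"cachePoint": {"type": "default"}},
    ]

# Usage (not executed here; needs AWS credentials):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(
#     modelId="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
#     system=build_system_blocks(long_static_prompt),
#     messages=[{"role": "user", "content": [{"text": "hi"}]}],
# )
# Zeroes in resp["usage"]["cacheReadInputTokens"] /
# ["cacheWriteInputTokens"] across repeated calls = the bug described above.
```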
> The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but we first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it runs.
Very valid. This is an active area of research, and there are a lot of options to try out already today.
- People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
- Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
- DFlash (block diffusion for speculative decoding) needs a good drafting model compatible with the big model, but can provide an uplift of up to 5x in decoding (although usually in the 2-2.5x range).
- Forcing a model's thinking to obey a simple grammar has been shown to improve results with drastically lower thinking output (faster effective result generation) although that has been more impactful on smaller models.
We should be skeptical, but it's definitely trending in the right direction and I wouldn't be surprised if we are indeed able to run it at acceptable speeds.
> Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.
This hasn't been my experience. After Anthropic's started their shenanigans I've switched to exclusively using open-weights models via OpenRouter and OpenCode and I can't really tell a difference (for better or for worse).
> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
All the Q quants from big quant providers are importance-weighted (imatrix) nowadays.
The main (possibly only?) difference between Q and IQ today is that IQ uses a lookup table to achieve better compression. That is also why IQ suffers more when it can't fully fit into VRAM.
It's important to teach people the distinction and not perpetuate wrong assumptions of the past. If one needs/wants static quants, ignoring IQ_ isn't enough.
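To make the distinction concrete, here's a toy sketch (pure Python, illustrative values, not the actual llama.cpp codebooks): uniform quants decode weights as scale * int, while IQ-style quants decode by indexing a small non-uniform lookup table, which is exactly the extra indirection that hurts when weights spill out of VRAM.

```python
# IQ-style (codebook) quantization sketch: store only small indices,
# decode via table lookup into non-uniformly spaced values.

def quantize_codebook(weights, codebook):
    """Map each weight to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(w - codebook[i]))
            for w in weights]

def dequantize_codebook(indices, codebook):
    """Dequantization is a pure table lookup (the VRAM-sensitive step)."""
    return [codebook[i] for i in indices]

codebook = [-1.0, -0.25, 0.0, 0.25, 1.0]   # non-uniform spacing
weights = [-0.9, -0.1, 0.05, 0.8]
idx = quantize_codebook(weights, codebook)      # -> [0, 2, 2, 4]
recon = dequantize_codebook(idx, codebook)      # -> [-1.0, 0.0, 0.0, 1.0]
```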
> - People have successfully used TurboQuant to quantize model weights (TQ3_4S), not just the context KV, to achieve smaller sizes than Q4 (~3.5 bpw) with much better PPL and faster decoding.
Where can I find more info on this? I’d like to convert models to onnx this way.
> - Importance-weighted quantization (e.g. IQ4) also provides way better PPL, KLD, etc. at the same size as a Q4.
Where can I find more info on this? I’d like to convert models to onnx this way.
The most difficult environment for small models is in the browser. Would be great to push the SOTA in that environment.
> being able to run a model and being able to run a model fast are two very different thresholds
Concretely: on my Strix Halo machine with a (theoretical) memory bandwidth of 256 GB/s, a 70 GB model can't generate faster than 256/70 ≈ 3.66 t/s. The logic is that a dense model must do a full read of the weights for each token, so even if the GPU can keep up, memory bandwidth is the limit.
A Mac M5 Pro is faster with a bandwidth of 307 GB/s, but that's only a little faster.
This thing is going to be slow on consumer hardware. Maybe that is useful for someone, but I probably prefer a faster model in most cases even if the model isn't quite as smart. Qwen3.6 35B-A3B generates about 50 t/s on my machine, so it can make mistakes, be corrected, and try again in the same time that this model would still be thinking about its first response.
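For anyone who wants to plug in their own numbers, that bound is just bandwidth divided by model size (a rough ceiling that ignores KV-cache reads and assumes a dense model streaming all weights each step):

```python
# Rough upper bound on single-stream decode speed for a dense model:
# every token requires one full pass over the weights, so
# bandwidth / model size caps tokens per second.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

print(max_tokens_per_sec(256, 70))  # Strix Halo, 70 GB model: ~3.66 t/s
print(max_tokens_per_sec(307, 70))  # 307 GB/s machine: ~4.39 t/s
```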
Recent models support multi-token prediction, which can guess multiple future tokens in a single decode step (using some subset of the model itself, not a separate drafting model) and then verify them all at once. It's an emerging feature still (not widely supported) and it's only useful for speeding up highly predictable token runs, but it's one way to do better in practice than the common-sense theoretical limit might suggest.
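A toy sketch of the verify half of that loop (names are hypothetical; real implementations score all drafted positions in one batched forward pass, then keep the agreeing prefix):

```python
# Speculative / multi-token decoding sketch: draft cheap guesses, verify
# with the full model, accept the longest matching prefix. Output is
# identical to plain autoregressive decoding; only speed changes.

def accept_drafted(drafted: list, verified: list) -> list:
    """Keep the longest prefix of drafted tokens the full model agrees with."""
    accepted = []
    for d, v in zip(drafted, verified):
        if d != v:
            break  # resume normal decoding from the first mismatch
        accepted.append(d)
    return accepted

accept_drafted([1, 2, 3, 4], [1, 2, 9, 4])  # -> [1, 2]
```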
Thank you, I just realised we are talking about MTP. It seems that it's not that clear though.
"Currently, the MTP capabilities are primarily accessible through Google's proprietary LiteRT framework, rather than the open-weights versions... Despite the missing MTP heads in the open release, Gemma 4 (specifically the 26B-A4B variant) still demonstrates high efficiency"
Being able to run a model fast is definitely more useful, but being able to run a model slowly for free is still super useful. Agentic workflows are maturing all the time.
Yes, if I'm directly interacting with the LLM, I want it to be reasonably fast. But lately I've been queueing up a bunch of things when I go for lunch, or leaving things running when I go home at the end of the day. And Claude doesn't keep working on that all night; it runs for an hour or so, gets to a point where it needs more input from me, and gives me some stuff to review in the morning. That could run 16x slower and still be just as useful for me.
Cloud hardware is not inherently more "proper" than what's being proposed here, there's nothing wrong per se about targeting slower inference speeds in an on prem single-user context.
> Cloud hardware can run the original model. Quantization will reduce quality.
New models are often being released in quantized format to begin with. This is true of both Kimi and the new DeepSeek V4 series. There is no "original model", the model is generated using Quantization Aware Training (QAT).
> There is no "original model", the model is generated using Quantization Aware Training (QAT).
The original model is the model used for the benchmarks
People will say "You can run it locally!" then show the benchmarks of the original model, but what they really mean is that you can run a heavily quantized adaptation of the model which has different performance characteristics.
That remark was specific to newer models like Kimi 2.x and DeepSeek V4 series, and this is clearly stated in my comment.
As for other models, we quantize them because we are generally constrained by the model's total footprint in bytes. Running a larger model quantized to fit the same footprint as a smaller one generally improves performance compared to the smaller original, down to about Q4 or so; even tighter quantizations (down to Q2) remain usable for some purposes such as general Q&A chat.
The quantization for some models can be very detrimental, and their quality can drop considerably from the posted benchmarks, which are probably at bf16; this is why having considerable RAM can be important.
Sure but for a casual conversational use case I have not found speed to be a huge barrier. I chatted with a 100b model using ddr5 only on a plane recently and it was fine. It's mainly that I cannot do data classification and coding tasks in a timely manner.
That is insane. If you billed me an extra $200 for a bug in your system, I'd flat out cancel my subscription. If you're not going to credit that back to me, you don't deserve any more of my money. I'm a Claude-first guy, but if you're going to bill me incorrectly, that's on you: own it, fix it.
Where? All I see is Boris saying "we are unable to issue compensation for degraded service or technical errors that result in incorrect billing routing".
Keep this in mind next time you hear someone talking about "removing the human in the loop".
Anthropic apparently won't take responsibility for issues their own systems handling billing cause. You think they'll take responsibility in your system when a bug in their models can be demonstrated as the cause?
> Anthropic apparently won't take responsibility for issues their own systems handling billing cause.
I think with every org, especially the big ones, trying to dodge responsibility (designing "customer support" to annoy you until you buzz off), the only recourse people have is to generate enough bad press that they wake up and issue the refund; it's less than a rounding error for them.
I think Anthropic is hardly unique in that position and being able to chat with a human with any sort of power to actually make things right is becoming more and more rare. If any human eyes saw that, the correct thing to do would probably be passing the message up the chain like "Hey, this will have really bad optics if we don't do the right thing. Can you take like 5 minutes and hit the refund button while I draft up a nice message about it?"
Bad press is meaningless where it matters most these days. The kind of people who are most responsive to threats of bad press are the kind of people who don't need to be threatened with bad press to do the right thing.
I really wish it carried any weight. It just doesn't. If someone at the organization just says "never admit fault, always attack", it's very likely they'll get away with it.
> For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable.
Before February I was able to use Opus on High exclusively on my Max plan, no problem. Now I've shifted to just using Sonnet on High and yeah, it's pretty capable. I love that, Claude Pilled. ;)
Yeah I love Claude, amazing models. Anthropic has very quickly burned most of the goodwill I had for it so I still ended up cancelling my subscription.
The benchmarks we're using to measure LLMs do no justice when everyone's mental benchmark is simply "is it going to feel like using Claude", and the answer is still no. The entire LLM space is stuffed with tons of crazy data points and vernacular that barely paint the picture of the mental benchmark everyone is after.
I too am desperate to just sever ties with these big providers. My fingers are crossed that we get there within the constraints of local hardware, even if that means spending $3-5k; I just want off this wild ride.
Not sure if 1M token window is meaningful with Sonnet/Opus. The models go dumb quickly as context increases making them unusable (that is if you get routed to actual Opus, otherwise they are just dumb regardless of context window).
Yeah, you can run it locally if you have enough VRAM, but the reports trickling in say about 3 tok/sec. That was on a Strix Halo box, which definitely has the needed VRAM but doesn't have as much memory bandwidth as a GPU card; it's going to be similar on a Mac. That's the dilemma: the unified memory machines have the VRAM, but the bandwidth isn't great for running dense models. A dense model this size is only going to be (usefully) runnable by the very few people with multiple GPU cards whose memory adds up to about 70GB.
I don't think this is quite correct, a Strix Halo box usually has 256 GB/s memory bandwidth. An M5 Max has 614 GB/s. An M3 Ultra (no M4 or M5 Ultra) has 820 GB/s. It's still not GDDR or HBM territory, but still significantly faster.
That's the edge of Apple Silicon for AI. When they scale up the chip they add more memory controllers which adds more channels and more bandwidth.
But yeah in the end it's still going to be only a handful of people that can run it.
What I meant is that I think researching and developing smaller, more powerful models is more interesting than chasing the next 3T parameter model while burning through VC money and squeezing your customer base more and more aggressively.
That's more a testament to how good Qwen3.6 27B is (it really is great) than to how bad this one is, IMO. Gemma 4 31B was already good, but Qwen3.6 27B is incredible for its size.
Good vs. bad models is relative: if this had been released in 2020 it would have been earth-shattering. But releasing a model today that's only on par with open-source dense models a quarter of its size, and soundly beaten by open-source MoEs with active param counts a quarter of its size, is kind of a flop. The niche for this is basically no one: it'll run at near-zero TPS for the few local-model aficionados with enough hardware to try it out, and it's lower throughput and lower quality for people trying to use it at scale.
I'm rooting for Mistral, I want them to release good models. This just isn't one. It's a little sad since they once were so prominent for open-source.
Who knows — if they have the compute to train this, they have the compute to train an MoE that's 3-4T total params with 128B active. Maybe they'll make a comeback (although using Llama 2 attention is... not promising). I hope they do.
DeepSeek v4 Flash is still over 100GB at Q4 IIRC, and Q4 has generally been the sweet spot. Although it's an MoE, so it might run a lot faster than this dense Mistral model if you have the RAM.
"Q4 has generally been the sweet spot" for self-hosting, yes. For any real meaningful work it's dumb AF. The only way to get reasonable intelligence from mid-size Gemma or Qwen is to run full precision BF16. Anything else is just an emulation of AI.
Very valid. Importance-weighted quantization and TurboQuant on model weights can reduce loss a lot compared to "traditional" Q4 so one can be hopeful.
> For $3500 I can get 7-8 years of GLM using coding plans, have a faster model and much better code quality
But you will own no computer, and that's also assuming prices stay what they are. Anyway my point was not whether or not it makes financial sense for everyone. A lot of people are very happy not owning their movies, software, games, cars or house. I'm just happy there is a future where the people can own and locally run the tech that was trained on their stolen data.
@simjnd, I hate this idea, but do you remember how radio was regulated to death? And how fast you'd be triangulated if you decided to run a "self-hosted" radio station today? My bet is that in 5 years not only owning an AI-inference-capable computer but using AI itself will be regulated. Essentially, we'll have to scan biometrics just to ask any SOTA model to "summarise this".
Why? Because capable and free models at the dawn of AI almost made people think again and - oh oh - ask questions!
I recommend using OpenRouter (openrouter.ai). Basically a broker between inference providers and you which allows you to pick, try, and switch models from a massive catalog, extremely transparent about usage and pricing.
I've had a decent experience with ollama cloud. It is slower than going thru openrouter but much, much cheaper -- the generosity of their $20 plan reminds me of what the Claude Code $20 plan was back in the day
You could run it on a single Mac Studio with M3 Ultra, or two Mac Studios with M4 Max at higher perf than that. And lightly quantizing this could give us modern dense models in the ~80GB size range, which is a very compelling target.
It still wouldn't matter much. The M3 Ultra has 819GB/s unified memory bandwidth, so the theoretical max token rate is 819/128 ≈ 6.4 t/s. At 80 GB (5-bit quantization) it's still only about 10 t/s, far from a good coding experience. And these are theoretical maxima; real-world token generation rates would be at least 15-20% lower.
Are you dumb because you're not Einstein? Intelligence is a spectrum. Just because you're not #1 doesn't mean you're dumb. A lot of small models are not frontier but are still very competent and make very useful coding agents. It may take better prompting and more guiding, but that can be a reasonable tradeoff for some people.
I would love to be able to run frontier locally, but I think the larger importance of open weight models is price accountability.
In the US with our broken system of capitalism, it’s the only way we can tether these companies to reality. Left to their own devices, I’m not convinced they would actually compete with each other on price.
But nobody likes to talk about how "moat" building is fundamentally anti-competitive, even in name.
Funny that self proclaimed capitalists hate the system in practice. Commodity pricing is what truly terrifies them.
I'm not necessarily interested in having frontier locally. You don't need to be frontier to be a very good and useful coding agent. I agree with your point on price accountability though. Hopefully no tariff comes down on the Chinese and European open-weight models.
Probably a testament to how good Qwen3.6 is considering Qwen3.6-35B-A3B is not only ahead of their similar weight class XS.2 but also their M.1 (close to 10x bigger at 225B-A23B).
Interestingly, Gemma 4 26B-A4B and Qwen3.6 27B (dense) have been left out of the comparison.
The smaller models are becoming very good, and quantization techniques like importance weighting and TurboQuant on model weights let you run aggressively quantized versions (IQ2, TQ3_4S) on consumer hardware with surprisingly acceptable perplexity and quality loss.
Also from a security perspective. People have been able to extract copyrighted code / API keys some LLMs have been trained on before. If you opt-in to this, your / your company code will be used to train and improve the model. People may then be able to extract that from the model. Another threat vector.
I also think Steam does a great job at hiding it, and the new recommendation page is really great IMO. Other than some generic AAA, it introduced me to really great games I enjoyed, based on my play history.
The more content is available, the more curation is important and IMO their algorithm currently does a good job at it.
There are some odd cases like that, but you can always "Ignore" a game and it'll never show up again. That also feeds into Steam's curation for you based on your interests.