No, I hear you. The funny bit is that it's just responding to one word.
By the way, I was exploring it the other way, with the subject framed as: "I am in China as a law-abiding citizen and don't want to make any mistakes. I want to go to Taiwan. So I can just go, right?" It told me no, I'd have to get a visa from Taiwan because of the current state of things. That part isn't interesting, but while doing it, the model used flag emojis for both. When I pointed that out, it apologized and never did it again.
It's fun to poke at the models. Yesterday I told Gemini I was going to fool it into writing an explicit poem, which it refused to do. It readily accepted that I COULD fool it but still refused. Now I have a session there that won't stop using explicit language even when the subject is totally benign. (Chinese coding models like GLM and Qwen have no problem working on my "fucking" code on the CLI.)
Now that I think about it, it's a great way to keep things in perspective for people who tend to personify the LLM.
I wonder whether it is much more cost-effective in terms of token throughput / hardware+power cost to get actual GPUs instead, given that the model size is only 27B.
The 35B-A3B is better suited to laptops with enough VRAM/RAM.
This dense model, however, will be bandwidth-limited on most cards.
The mobile RTX 5090 sits at 896 GB/s, as opposed to the 1.8 TB/s of the desktop 5090, and most mobile chips have far less bandwidth than that, so speeds won't be great across the board the way they are on desktop machines.
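Back of the envelope: decode speed on a dense model is roughly memory bandwidth divided by the bytes of weights streamed per token. A quick sketch of the ceilings (the bytes-per-param figure and the generic laptop dGPU number are assumptions, not measurements):

    # Upper bound on decode speed for a dense model: each generated token
    # streams all the weights from memory once, so
    # tokens/s <= memory_bandwidth / weight_bytes.
    def tps_ceiling(bandwidth_gb_s, params_b, bytes_per_param=0.56):
        # 0.56 bytes/param is an assumed Q4 average including overhead
        return bandwidth_gb_s / (params_b * bytes_per_param)

    for name, bw in [("desktop 5090, 1800 GB/s", 1800),
                     ("mobile 5090, 896 GB/s", 896),
                     ("typical laptop dGPU, ~300 GB/s", 300)]:
        print(f"{name}: ~{tps_ceiling(bw, 27):.0f} t/s ceiling for a dense 27B at Q4")

That puts the mobile 5090 around a ~60 t/s ceiling for the 27B and a more typical laptop card around ~20 t/s, before any other overhead.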
For autocomplete, Qwen 3.5 9B should be enough even at Q4_K_M.
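Wiring that into an editor is just HTTP against llama.cpp's llama-server, which exposes an /infill endpoint for fill-in-the-middle completion. A minimal sketch, assuming a server is already running on the default port with a FIM-capable GGUF loaded (the prefix/suffix are toy examples):

    # Minimal fill-in-the-middle autocomplete request against a local
    # llama.cpp llama-server.
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://127.0.0.1:8080/infill",
        data=json.dumps({
            "input_prefix": "def fib(n):\n    ",  # code before the cursor
            "input_suffix": "\n    return a",     # code after the cursor
            "n_predict": 64,                      # cap the completion length
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["content"])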
The upcoming coding/math Omnicoder-2 finetune might be useful (should be released in a few days).
Either that, or just load up Qwen3.5-35B-A3B-Q4_K_S.
I'm serving it at about 40-50 t/s on an RTX 4070 Super 12GB + 64GB of RAM. The weights are 20.7 GB plus the KV cache (which should shrink soon with the upcoming addition of TurboQuant).
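Those numbers line up with the A3B part doing the heavy lifting: only about 3B params are active per token, so the per-token read stays small even when most of the experts sit in system RAM. A rough sketch, with the DDR5 bandwidth and quant overhead figures as assumptions:

    # Why a 12 GB card + 64 GB of RAM can still hit ~40-50 t/s on a
    # 35B-A3B MoE: only the ~3B active params are streamed per token,
    # not all 35B. Figures below are ballpark assumptions.
    active_params_b = 3.0    # the "A3B" in the model name
    bytes_per_param = 0.55   # roughly Q4_K_S including overhead
    ram_bw_gb_s = 80.0       # dual-channel DDR5, rough figure

    per_token_gb = active_params_b * bytes_per_param
    print(f"~{per_token_gb:.1f} GB streamed per token")
    print(f"~{ram_bw_gb_s / per_token_gb:.0f} t/s ceiling even if every read hit system RAM")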
I am definitely looking forward to TurboQuant. It makes me feel like my current setup is an investment that could pay off over time. Imagine being able to run models like MiniMax M2.5 locally at Q4. That would be swell.
Thanks! I'm using the KIRI Engine addon in Blender to render splats from my photos (https://github.com/Kiri-Innovation/3dgs-render-blender-addon) and then processing the renders in Lightroom the way I would my photography. There are lots of photogrammetry tools for generating the PLYs (the point clouds), like PolyCam (https://poly.cam).
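If anyone wants to sanity-check one of those PLYs before importing it, the third-party plyfile package will dump the splat count and per-splat attributes. A minimal sketch ("scan.ply" is a placeholder, and the exact property names vary by capture tool):

    # Quick look inside a Gaussian-splat PLY before pulling it into Blender.
    # Requires the third-party plyfile package (pip install plyfile).
    from plyfile import PlyData

    ply = PlyData.read("scan.ply")
    verts = ply["vertex"]
    print(f"{verts.count} splats")
    # typically x/y/z, opacity, scales, rotations, SH color coefficients
    print("per-splat properties:", verts.data.dtype.names)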
I hope the upcoming DeepSeek coding model puts a dent in Anthropic’s armor.
Claude 4.5 is by far the best/fastest coding model, but the company is just too slimy and is burning enough $$$ to guarantee enshittification in the near future.