If I understand correctly, both the staging database and the production database shared the same volume, so production data was gone as well once the volume was deleted.
You don't. You are missing the part where the LLM had a token whose access was blocked, as expected. The LLM then searched the code base, found a different token with delete privileges, and used that one.
PS: That warning happens in staging envs too; by design, the LLM doesn't know which env is which.
Huh, that's not what I gathered from the tweet at all. If I were to write a five-whys analysis, the immediate cause is that the LLM wrongly decided to delete a volume, while the root cause is the bad design of co-locating staging and production data in the same volume. The writing was quite vague, though; let's wait for a response from Railway.
Based on the previous release schedule for 3.5, my optimistic take is that they distill the small models from the 397B one, and a sparse A3B model is much faster to distill. Hopefully the other variants will be released in the coming days.
Very nice TG improvement from the Flash Attention KQ fusion. Is that something that was already done in ik_llama.cpp? If not, it will be a welcome addition for hybrid CPU/GPU inference.
90% of what you pay for in agentic coding is cached reads, which are free with local inference serving one user. This has been well known in r/LocalLLaMA for ages, and an article about it also hit the HN front page a few weeks ago.
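Back-of-envelope sketch of why the cached-read line item dominates; the prices and per-turn token counts below are assumptions for illustration, not anyone's actual bill:

    # Assumed Sonnet-like prices, $/token
    PRICE_FRESH_IN = 3.00 / 1e6
    PRICE_CACHE_READ = 0.30 / 1e6
    PRICE_OUT = 15.00 / 1e6

    # Assumed long agentic session: 500 tool-call turns, each re-reading
    # a ~200k-token context from cache, with small fresh input and output.
    turns = 500
    cached_tok = 200_000 * turns
    fresh_tok = 1_000 * turns
    out_tok = 300 * turns

    cost_cached = cached_tok * PRICE_CACHE_READ                    # $30.00
    cost_rest = fresh_tok * PRICE_FRESH_IN + out_tok * PRICE_OUT   # $3.75
    print(f"cached-read share: {cost_cached / (cost_cached + cost_rest):.0%}")  # ~89%

Locally, those same re-reads are just the KV cache sitting in RAM/VRAM, so their marginal cost for a single user is roughly zero.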
What about the VRAM requirement for the KV cache? That may matter more than memory bandwidth. With these GPUs, compute capacity is more plentiful than memory bandwidth, which in turn is more plentiful than VRAM.
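For a sense of scale, a quick sizing sketch; the model shape is assumed (roughly a 70B-class dense model with GQA), not any particular hosted model:

    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # K and V tensors, fp16/bf16 by default
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

    # Assumed 70B-class shape: 80 layers, 8 KV heads (GQA), head dim 128
    per_seq = kv_cache_bytes(80, 8, 128, 128_000)
    print(f"{per_seq / 2**30:.1f} GiB per 128k-token sequence")  # ~39 GiB

And that is per sequence, before the weights are even counted, so batch size and context length eat VRAM fast.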
DeepSeek got MLA and then DSA. Qwen got gated delta-net. These inventions allow efficient inference both at home and at scale. If Anthropic has nothing comparable, their inference cost could be much higher.
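Rough per-token KV footprint comparison, vanilla MHA vs an MLA-style compressed latent; the shape numbers are my recollection of the DeepSeek-V2 paper (60 layers, 128 heads of dim 128, a 512-dim latent plus a 64-dim decoupled RoPE key), so treat them as illustrative:

    n_layers, n_heads, head_dim = 60, 128, 128   # assumed DeepSeek-V2-like shape
    bytes_per_elem = 2                            # fp16/bf16

    mha_per_token = 2 * n_heads * head_dim * n_layers * bytes_per_elem
    mla_per_token = (512 + 64) * n_layers * bytes_per_elem  # latent + RoPE key

    print(mha_per_token // 1024, "KiB/token with full MHA KV")   # 3840 KiB
    print(mla_per_token // 1024, "KiB/token with MLA latent")    # 67 KiB

Very roughly a 50x smaller cache per token, which is the kind of thing that makes long-context serving (and home inference) affordable.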
DeepSeek also has 3FS (https://github.com/deepseek-ai/3FS), which makes cached reads a lot cheaper with a much longer TTL. If Anthropic didn't invent anything comparable and instead uses an expensive solution like Redis, as the short TTL suggests, that also contributes to higher inference cost.
1st hint - the API call only contains one volume:
2nd hint - this gem from the tweet:

> No "this volume contains production data, are you sure?"