
> That's super interesting, isn't Deepseek in China banned from using Anthropic models? Yet here they're comparing it in terms of internal employee testing.

I don't see why Deepseek would care to respect Anthropic's ToS, even just for appearances. It's not like Anthropic could file and win a lawsuit in China, nor would the US likely ban Deepseek over it. And even if the US government had considered it, Anthropic is on its shitlist.


> It seems basically impossible for everyone to have overhired, for the simple reason that qualified workers do not appear and disappear from nowhere. There is a population of qualified workers in the software sector, and only new grads and retirement can move the needle significantly.

SWEs (and most any role, for that matter) can definitely be minted in ways besides graduating with a relevant major. On top of that there are also H-1Bs and contractors. Plus "overhiring" doesn't necessarily just mean absolute headcount; it could be compensation, scope, middle managers, etc. The definition of "qualified" is also malleable depending on the incentives.

> So, if someone overhired then someone else must have done without, all things considered.

Beyond the previous points, this also assumes the supply of labor is independent of the demand, and it's clearly not. As the demand increases, so does compensation, outreach, advertising/propaganda, etc. Everybody can overhire simultaneously as a result of pushing for growth of the supply of labor.


We can both "yeah, but" this to death. You make some valid points but I think my observation generally holds. The supply of workers is not so elastic, at least if you have real standards for the workers such as college degrees and so on.

> Knowing how the sausage is made does nothing for me.

Considering that this is nowadays a substantially less common background, and probably trending that direction indefinitely, this reads more as you being desensitized. It's not like vegans are unaware that people could have a background like yours.

> But bringing any moral/religious reasons for it always seemed silly to me. There’s nothing more natural than one animal eating another. Humans evolved from mostly vegetarian monkeys to predators

Morals and religion aren't about what's natural, they're about what humans desire. Illness, violence, and deception are all perfectly "natural."


First off, I believe veganism is, probably, morally correct.

However, I lead a morally imperfect lifestyle. I get around by driving or being driven in a car, even when it would only be moderately less convenient to walk or bike or take transit. A few dollars could feed children in poverty for weeks, and I spend a lot more than "a few" dollars on luxuries like travel. By my measure, knowingly choosing not to prevent human suffering on such a scale is massively worse than eating meat, but at the end of the day, I don't consider myself or others in my position to be monsters.

> The other thing I see is casting every human as sacred and every non-human living thing as without value, or, at least less value than a single meal.

While I believe non-human animals generally have greater moral value than a single meal - the most widely consumed animals are clearly capable of suffering and IMO intelligent enough for most to instinctively empathize with - I don't think it's particularly strange for humans to view humans as sacred.

Many if not most people view morality as rooted in the golden rule, and non-human animals are incapable of making moral considerations the way humans are.

Even just considering gut feelings - let's say we presented a trolley problem, on one side one's close friends and family members, on the other side some number of chickens. I would be very surprised at genuine responses opting to save the chickens. Personally, I would sacrifice literally any number of chickens.


I didn't say any of it was unusual. Your observation that humans place themselves at the center of the moral universe and have the agency to enforce it is in line with my thoughts.

> Many if not most people view morality as rooted in the golden rule, and non-human animals are incapable of making moral considerations the way humans are.

Ironically making us the only animals capable of moral evil.

> Even just considering gut feelings - let's say we presented a trolley problem, on one side one's close friends and family members, on the other side some number of chickens. I would be very surprised at genuine responses opting to save the chickens. Personally, I would sacrifice literally any number of chickens.

Is this due to an internally consistent moral value system, apart from a view of humans as sacred? If on the other side of the trolley were some of a race of aliens, smarter, better, faster, younger, and more emblematic of human ideals by way of virtue than the humans opposite them, would you save the aliens? Probably not. Your preference to preserve other people is very natural and probably hard-wired into your brain. That doesn't mean it isn't human chauvinism.


> What if they were conscious?

Well, they're not.

> If they aren’t but still a lifeform, that makes it perfectly okay?

According to Jains: No. Violence against plants, insects, and possibly even certain microorganisms is considered unethical.

IMO as an irreligious person: Yes. Life is just a particular form of self-sustaining and self-propagating system. Those properties are of little to no moral value.


Are you sure? What about a stand of trees whose consciousness might just run extremely slowly compared to ours?

About as sure as one can be. It's neither logically nor physically impossible, but the claim that trees are conscious is practically unfalsifiable and not supported by any substantive evidence. It has nothing to do with "fast" or "slow": no matter how you poke or prod or slice or dice a tree, there's nothing that suggests a capacity for consciousness. I would be less surprised if my friend's dog started speaking perfect Chinese with an American accent.

> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
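To make the scaling concrete, here's a back-of-envelope sketch. The turn size, turn count, and the assumption that cache hits are free are all hypothetical simplifications (real providers bill cache hits at a discounted rate, not zero):

```python
# Prompt tokens billed across a multi-turn chat, with and without
# prefix caching. TOKENS_PER_TURN is an illustrative guess.
TOKENS_PER_TURN = 500

def tokens_billed(turns: int, cached: bool) -> int:
    """Total prompt tokens charged over `turns` exchanges."""
    total = 0
    history = 0
    for _ in range(turns):
        if cached:
            # Only the new turn is processed; the prefix is a cache hit.
            total += TOKENS_PER_TURN
        else:
            # The entire history is reprocessed every turn.
            total += history + TOKENS_PER_TURN
        history += TOKENS_PER_TURN
    return total

print(tokens_billed(100, cached=False))  # 2525000 tokens: O(N^2)
print(tokens_billed(100, cached=True))   # 50000 tokens: O(N)
```

A hundred turns of the naïve scheme bills roughly fifty times as many prompt tokens as the idealized cached scheme, and the gap widens with conversation length.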


How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.


What is being discussed is KV caching [0], which is used in every LLM to reduce inference compute from O(n^2) to O(n). This is not specific to Claude nor Anthropic.

[0]: https://huggingface.co/blog/not-lain/kv-caching


> How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.

2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.

> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.


I use CC, and I understand what caching means.

I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.


They are caching internal LLM state, which runs to tens of GB for each session. It's called a KV cache (because the internal state being cached consists of the K and V matrices), and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.
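As an illustration only, here is a single toy attention head with random weights (nothing like a production stack, which adds multiple heads, layers, projections, and positional encodings). The point it demonstrates: K and V for past tokens never change, so appending to a cache gives the same result as recomputing them from the full sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding width
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

# Process a 5-token sequence one token at a time, caching K and V.
xs = rng.normal(size=(5, d))
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outs = []
for x in xs:
    K_cache = np.vstack([K_cache, x @ Wk])  # append; old rows never change
    V_cache = np.vstack([V_cache, x @ Wv])
    outs.append(attend(x @ Wq, K_cache, V_cache))

# Identical to recomputing K and V from scratch for the last token:
full = attend(xs[-1] @ Wq, xs @ Wk, xs @ Wv)
assert np.allclose(outs[-1], full)
```

Evicting the cache loses nothing semantically, since it can always be rebuilt from the token log, but rebuilding means redoing all those matrix products.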

CC can explain it clearly, which is how I learned about how the inference stack works.

> 99.99% of users won't even understand the words that are being used.

That's a bad estimate. Claude Code is explicitly a developer-shaped tool; we're not talking about ChatGPT generally here. So my guess is that probably closer to 75% of those users understand what caching is, with maybe 30% able to explain what prompt caching actually is. Of course, the users who don't understand have access to Claude and can have it explain caching to them if they're interested.


I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

Does mmap(2) educate the developer on how disk I/O works?

At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.


Does using print() in Python mean I need to understand the kernel? This is an absurd thought.

That might be an absurd comparison, but we can fix that.

If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:

You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.

Which is true of this issue too.


>If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs,

and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,

>then:

You would expect to be led to understand, like… a 1997 Prius.

“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)


mmap(2) and all its underlying machinery are open source and well documented besides.

There are open-source and even open-weight models that operate in exactly this way (it's based on years of public research), and even if there weren't, the way LLMs generate responses to inputs is superbly documented.

Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.

It's not at all hard to find documentation on this topic. It could be made more prominent in the UI, but that's true of lots of things, and hammering on "AI 101" topics would clutter the UI at the expense of actual decision points the user may want to act on and that you can't assume they already know about, the way you (should) be able to assume they know how LLMs eat up tokens in the first place.


I would say this is abstracting the behavior.

Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.

"Gets mad because there is no option"

"Gets mad because when there are options the defaults suck"

"Gets mad because the options start massively increasing costs to aerospace pricing"


Did you mean to reply to someone else? Or do you misunderstand the issue?

There is no option to avoid auto-dumbing after one hour of idle. I haven't complained about the cost at all, I'm happy to pay it.

So yeah, I'm mad because there's no option. The other two you mentioned don't apply.


Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.

To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.

So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.


> users should be curious and actively attempting to understand how it works

Have you ever talked with users?

> this is an endless job

Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.


There are general users of the average SaaS, and there are claude code users. There's no doubt in my mind that our expectations should be somewhat higher for CC users re: memory. I'm personally not completely convinced that cache eviction should be part of their thought process while using CC, but it's not _that_ much of a stretch.

Personally I've never thought about cache eviction as it pertains to CC. It's just not something that I ever needed to think about. Maybe I'm not a power user, but I use the product the way I want to and it just works.

Anthropic literally advertises long sessions, 1M context, high reasoning etc.

And then their vibe-coders tell us that we are to blame for using the product exactly as advertised: https://x.com/lydiahallie/status/2039800718371307603 while silently changing how the product works.

Please stop defending hapless innocent corporations.


This oversells how obfuscated it is. I'm far from a power user, and the opposite of a vibe coder. Yet I noticed the effect on my own just from general usage. If I can do it, anyone can do it.

Listen, no one cares if you think you’re smart for seeing through the lies of their marketing team. You’re being intentionally obtuse.

My point is the opposite. I don't think my observation was smart, and I'm surprised that so many people here, a venue with a lot of people who use this stuff far more than I do, think it wasn't an easy thing to grok.

You’re still intentionally missing the point. Everyone knows they are lying. It doesn’t excuse the lies!

I’m not. Why would anyone believe marketing speak for any product? One should always assume that at best they’re fluffing their product up and more likely that they’re telling straight up lies

1. False advertisement is a thing, to the point there are laws against it

2. They were caught blatantly lying, and you're literally telling everyone it's the users' fault for not digging into the black box that is Claude Code (and more so Anthropic's servers) and figuring out its behavior for themselves. A behavior that suddenly changed on a March day [1] and which previously very few people ever needed to investigate.

[1] https://x.com/levelsio/status/2029307862493618290


Here's Anthropic's own Boris Cherny and others telling how great everything is with long sessions and contexts: https://news.ycombinator.com/item?id=47886087

> Have you ever talked with users?

I believe if one were to read my post it'd have been clear that I *am* a user.

This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.


We're inquisitive but at the end of the day many of us just want to get our work done. If it's a toy project, sure. Tinker away, dissect away. When my boss is breathing down my neck on why a feature is taking so long? No time for inquiries.

Agreed. Systems work the way they work. It's up to the user to determine what those limitations are. I don't like the concept of molding software around every expectation a user has. Sometimes that expectation is unwarranted. You can see this in game development. Regardless of expressed criticism, sometimes gamers don't know what they want or what they need. A game should be developed according to the design goals of the team, not cater to every whim of the player base. We have seen where that can go.

It's not like they have a poweful all-knowing oracle that can explain it to them at their dispos... oh, wait!

They have to know that this could bite them and to ask the question first.

I do think having some insight into the current state of the cache and a realistic estimate for prompt token use is something we should demand.

If there was an affordance on the TUI that made this visible and encouraged users to learn more - that would go a long way.

It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.

It seems you haven't done the due diligence on what part of the API is expensive; constructing a prompt shouldn't carry the same charge/cost as an LLM pass.

It seems you haven't done the due diligence on what the parent meant :)

It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.


You not only skipped the diligence but confused everyone by repeating what I said :(

That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artifact at this level of abstraction; effectively, at this level, they are the prompt.)

The part of the prompt that has already been inferred no longer needs to be part of the input; it is replaced by the cached inference state. And none of this is tokens.


>It seems you haven't done the due diligence on what part of the API is expensive; constructing a prompt shouldn't carry the same charge/cost as an LLM pass.

I think you missed what the parent meant then, and the confusing way you replied seemed to imply that they're not doing inference caching (the opposite of what you wanted to mean).

The parent didn't say that caching is needed merely to avoid reconstructing the prompt as a string. They just take it for granted that it means inference caching, to avoid starting the session totally anew. That's how I read "prompting with the entire context every time" (not the mere string).

So when you answered as if they're wrong, and wrote "constructing a prompt shouldn't be same charge/cost as llm pass", you seemed to imply "constructing a prompt shouldn't be same charge/cost as llm pass [but due to bad implementation or overcharging it is]".


You are right, I was wrong in my understanding there. It stemmed from my own implementation; an inference often wrote extra data such as a tool call, so I was using that to preserve relevant information along with the desired output, so I could throw away the prompt every time. I realize inference caching is a better way (with its pros and cons).

I said "prompting with the entire context every time." I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.

What if the cache was backed up to cold storage? Instead of having to recompute everything.

They probably already do that. But these caches can get pretty big (10s of GBs per session), so that adds up fast, even for cold storage.

10s of GBs? ( 1,000,000 context * 1,000 vector size ) ^ 2 = 1,000,000,000,000,000,000… oh wow.. I must be miscalculating

What about only storing the conversation and then recomputing the embeddings in the cache? Does that cost a lot? Doing a lot of matrix multiplication does not cost dollars of compute, especially on specialized hardware, right?


Context length 1e6, vector length 1e3, and 1e2 model layers for 100e9 context size. Costs will go up even more with a richer latent space and more model layers, and the western frontier outfits are reasonably likely to be maximizing both.
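Turning that arithmetic into bytes (every figure here is an illustrative guess, not any vendor's published architecture):

```python
# KV-cache size back-of-envelope. All numbers are hypothetical;
# real frontier model dimensions are not public.
context_len = 1_000_000  # tokens in context
d_model     = 1_000      # per-token vector width
n_layers    = 100        # transformer layers
kv_factor   = 2          # one K and one V vector per token per layer
bytes_each  = 2          # fp16/bf16

size_gb = context_len * d_model * n_layers * kv_factor * bytes_each / 1e9
print(size_gb)  # 400.0 GB
```

Techniques like grouped-query attention cut the number of cached K/V heads substantially, which is how real deployments can get figures like this down toward the tens of gigabytes mentioned elsewhere in the thread.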

How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?

I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:

https://blog.exe.dev/expensively-quadratic


If there was an exponential cost, I would expect to see some sort of pricing based on that. I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that. The "scary quadratic" referenced in what you linked seems to be pointing out that cache reads increase as your conversation continues?

If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?


Yes, that is indeed O(N^2). Which, by the way, is not exponential.

Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.


> Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.

Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equivalently, it linearly scales the expensive bits.


> I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that.

Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.


What we would call O(n^2) in your rewriting of the message history would be the case where you have an empty database and you need to populate it with a certain message history. The individual operations would take 1, 2, 3, ..., n steps, so (1/2)*n^2 in total, so O(n^2).

This is the operation that is basically done for each message in an LLM chat in the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on server-side (in KV cache). KV caches can be very large, e.g. tens of gigabytes.
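The arithmetic above, sketched in a few lines (the unit of "work" is processing one message's worth of history; the counts are illustrative):

```python
def total_units(n: int) -> int:
    """Work done when each of n messages reprocesses the full history."""
    return sum(range(1, n + 1))  # 1 + 2 + ... + n = n*(n+1)/2

assert total_units(100) == 100 * 101 // 2  # 5050
print(total_units(100), total_units(1000))  # 10x the messages, ~100x the work
```

With server-side caching only the new tokens get processed per message, so the same n messages cost n units instead.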


How big is this cached data? Wouldn't it be possible to download it after idling a few minutes "to suspend the session," and upload and restore it when the user starts their next interaction?

Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

With this much cheaper setup backed by disks, they can offer a much better caching experience:

> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.


I often see a local model, QWEN3.5-Coder-Next, grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion-parameter models are even worse. Even if you wanted to download it or offload it, or offered that as a service, to start back up again you'd _still_ be paying the token cost, because all of that context _is_ the tokens you've just done.

The cache is what makes your journey from a 1k prompt to a 1-million-token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.


What they mean when they say 'cached' is that it is loaded into the GPU memory on anthropic servers.

You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.


Wrong on both counts. The kv-cache is likely to be offloaded to RAM or disk. What you have locally is just the log of messages. The kv-cache is the internal LLM state after having processed these messages, and it is a lot bigger.

I shouldn't have said 'loaded into GPU memory', but my point still stands... the cached data is on the Anthropic side, which means that caching more locally isn't going to help with that.

> upload and restore it when the user starts their next interaction

The data is the conversation (along with the thinking tokens).

There is no download - you already have it.

The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.

That is doable, but as Boris notes it costs lots of tokens.


You're quite confidently wrong! :-)

The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.


> The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.

Yes - generated from the data of the conversation.

Read what I said again. I'm explaining how they regenerate the cache by running the conversation though the LLM to reconstruct the KV cache state.


This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.

A strange view. The trade-off has nothing to do with a specific ideology or notable selfishness. It is an intrinsic limitation of the algorithms, which anybody could reasonably learn about.

Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."


He was surprised because it was not clearly communicated. There's a lot of theory behind a product that you could (or could not) better understand, but in the end, something like price doesn't have much to do with the theoretical and practical behavior of the actual application.

What? Hiring is a contract between employer (company entity) and employee. No individual "you" can hire anybody except through the company's official process. If HR says "no we won't extend an offer," a lowly HM extending an offer would be clear-cut fraud.

Managers usually have the authority to bind the company to an employment contract. Even if they don't, the rule of "apparent authority" often means the employee can still sue.

In the USA this is mostly theoretical since HR could immediately fire the employee due to at-will employment.

But in Canada, it's a much bigger issue due to labour protections.

e.g. Many managers at American multinationals gave assurances over email to employees about work-from-home arrangements. Then the company does a huge RTO push.

When the employee refuses, HR discovers they can't fire the employee without a hefty buyout.

Best not to give assurances if you're managing a multinational team.


>>Managers usually have the authority to bind the company to an employment contract

Is that an American thing? I've been a manager for years and never heard of that happening. I didn't even know how much the people I managed were paid.


I believe it happens more often in Canada. Here's a case where the RTO ultimatum was ruled constructive dismissal, because the manager made a verbal agreement to amend the terms of employment.

https://mathewsdinsdale.com/employers-advisor-march-2025/#:~...


I meant what would have happened - and to whom - if HR had greenlighted the offer, but others' posts pretty much clarified that for me, thanks.

> I know many folks who make $500k+ a year in the SF Bay Area and complain about affordability, and to a large extent, it's stuff like that that makes them poorer.

You don't have to make absurd extrapolations to make your point. Even with 20 subscriptions at $20/mo, that's $400/mo or $4800/yr, about 2% of net income.


> Woe betide our 401(k)s when it happens, though.

The stock market crashes once in a while. Shit happens. The long-term outlook is unlikely to change nearly as much, unless you think there will be systemic macroeconomic changes.


Long-term relative to the lifespan of the 401(k) holder. The outcome changes a lot for those who are ready to retire.

You're responding to literally 7 words out of context.

> Jobs with access to/control over millions of people's data should require some kind of genuine software engineering certification

FAANG, Fortune 500, etc., almost universally go out of their way to violate user freedom in pursuit of profit. Regulation is practically the only way to force megacorps to respect users' rights and improve their security, as evidenced by right-to-repair, surveillance/privacy, and so on.

And none of that has anything to do with users' individual rights to create, run, and modify their own software.

(Yes, regulatory capture exists, no, it doesn't mean all regulation is bad.)


If the megacorps are going in that direction of being strictly regulated, the rest of the industry will follow. It's the general movement of the Overton Window that's the underlying issue.

No, they won't. No one in their right mind "wants" ISO27001, ISO9001, SOC or multiple PITA certifications.

Companies do that because they want to attract a certain kind of customer and have enough spare manpower and money to go through this all year long.

...or they want to hold very sensitive data that requires *proven* processes, training, and skills.

My firm has several of these, and we have to keep a full compliance team and *always* have an auditor on site.

No one does it just because.

