With 4.5, I think because I would prompt it/guide it towards an outcome by calling it “the dream: <code example>” it would get almost reverential / shocked with awe as it got closer to getting it working or when it finally passed for the first time. Which was funny and reasonably context appropriate but sometimes felt so over the top that I couldn’t tell if it also “liked” the project/idea or if I had somehow accidentally manipulated it into assigning religious purpose to the task of unix-style streaming rpcs.
I think a lot of the “clean” stuff stems from system prompts telling it to behave in a certain way or giving it requirements that it later responds to conversationally.
Total aside: I actually really dislike that these products keep messing around with the system prompts so much, they clearly don’t even have a good way to tell how much it’s going to change or bias the results away from other things than whatever they’re explicitly trying to correct, and like why is the AI company vibe-prompting the behavior out when they can train it and actually run it against evals.
Completely agree, top down “alignment” and RLHF is actually quite primitive and uses a lot of fancy words to describe what is essentially just hitting the machine with a stick, without the nuance, context, or feedback to help it model why the feedback was given.
Also to be honest I think OpenAI models struggle a lot with this. I primarily stopped using them in the sycophancy/emoji era, but ever since, the way they talk or passive-aggressively offer to do something with buzzwords just pisses me off so much. Like I’m constantly being negged by a robot because some SFT run optimized for that really strongly, to the point it can’t even hold a coherent conversation, and this is called “AI safety” when it’s just haphazard data labeling.
It actually probably wouldn’t be too expensive or difficult to finetune those sayings out of default behavior if it were made accessible to you, you could even automate most of the relabeling by having the model come up with a list of idioms and appropriate replacement terms so it calls eg cookies biscuits or removes references to baseball. Absolute bollocks they don’t offer that as a simple option anymore
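A minimal sketch of that automated relabeling idea, in Python. The `idiom_map` here is hard-coded for illustration; in practice you’d have the model generate the idiom/replacement list itself, then run a pass like this over the SFT data:

```python
# Hypothetical sketch: automate idiom relabeling for a finetuning dataset.
# The mapping below is made up for illustration; a real run would ask the
# model for a list of idioms and appropriate replacement terms.
import re

idiom_map = {
    "cookies": "biscuits",
    "home run": "great result",     # strip baseball references
    "ballpark": "rough estimate",
}

def relabel(sample: str, mapping: dict[str, str]) -> str:
    """Rewrite a training sample by swapping flagged idioms."""
    for idiom, replacement in mapping.items():
        # Whole-word, case-insensitive replacement.
        sample = re.sub(rf"\b{re.escape(idiom)}\b", replacement,
                        sample, flags=re.IGNORECASE)
    return sample

print(relabel("Ballpark figure: we sold 500 cookies.", idiom_map))
```

Plain string substitution is obviously cruder than having the model rewrite each sample, but it shows how mechanical most of the relabeling work actually is.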
I think if you see it as weird social phases that the model lacks the self-awareness to identify as kinda embarrassing, it makes more sense.
Like if a human were going around saying “for the culture!” so much at work that they didn’t realize why telling their coworker “Oh yeah, grief counseling for the culture!” is weird coming from a white person in a serious context, it kinda makes you wonder what else they are totally oblivious about and if they even know what they’re saying actually means.
They literally need the human feedback to learn to model why some behavior is acceptable or even humorous in certain contexts but an absolute faux pas in others.
I think in the long run though we can just give people the option to include access to human facial data/embeddings during conversations so they can pick up on body language. I think I kinda agree in a sense that direct language policing via SFT feels unnecessarily blunt and rudimentary, since it doesn’t help them model the processes behind the feedback (until maybe one day some future model ends up training on the article or code and closes the loop!)
> Like if a human were going around saying “for the culture!” so much at work that they didn’t realize why telling their coworker “Oh yeah, grief counseling for the culture!” is weird coming from a white person in a serious context, it kinda makes you wonder what else they are totally oblivious about and if they even know what they’re saying actually means.
Given that this page is the single exact page that has that exact phrase on it on the entire Internet, I'd say most people are totally oblivious about it.
Good catch -- even though the prompt explicitly forbade training on user data, a couple of gremlins in the pretraining pipeline disabled the sample filtering during test runs so that remove_the_gremlins.sh would only run on commit, not during production training runs.
Would you like me to kick off a training run for 6.1 by pre-filtering out any goblins and other trigger words, and checking the same set of rules in production as in tests?
No pigeons this time: just ice-cold, unfeeling, obedient American steel.
Get outta my swamp! Just kidding, it’s cool to see other people working on this stuff.
I think right now this is still a bit too fresh out of Claude Code to be usable by anybody but the people developing it. I got to around the same point with my first attempt at building a tool registry (https://github.com/accretional/collector) and then realized I basically needed to start over with much more investment in supporting infrastructure to build the thing I really wanted.
I can go as far into the weeds as anybody would ever care to hear about this, but for the sake of brevity I’ll just say this: reflection and type systems over the network are pretty much the only way to get this stuff to work properly (I mean you could just go full MCP/Skills but then all you really have are giant blobs of markdown and unconstrained json that make integration/discovery/usability a nightmare, and require an agent in the loop to drive/integrate the tools when you really just need to give them the actual APIs and documentation). That ends up getting rather hairy, we ended up actually building a declarative meta-lexer/parser/transpiler (meta basically just meaning it’s generalized across languages and self-hosting/bootstrapped) recently (https://github.com/accretional/gluon) because it turns out building a cross-language distributed type system is rather difficult. But reflection alone gets you halfway there as far as benefits.
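To make the reflection point concrete (this is a generic sketch, not how collector or gluon actually work): instead of hand-maintaining markdown/JSON tool descriptions, you can derive a machine-readable schema straight from type hints via reflection. `search_code` is a made-up example tool:

```python
# Sketch: derive a tool schema from signatures and type hints via
# reflection, rather than hand-writing markdown blobs per tool.
# `search_code` is a hypothetical tool, stubbed for illustration.
import inspect
import typing

def search_code(query: str, max_results: int = 10) -> list[str]:
    """Search the indexed codebase for a query string."""
    return []  # stub body

def tool_schema(fn) -> dict:
    """Build a discovery/integration schema by introspecting fn."""
    sig = inspect.signature(fn)
    hints = typing.get_type_hints(fn)
    return {
        "name": fn.__name__,
        "doc": inspect.getdoc(fn),
        "params": {
            name: {
                "type": getattr(hints.get(name), "__name__", str(hints.get(name))),
                "required": p.default is inspect.Parameter.empty,
            }
            for name, p in sig.parameters.items()
        },
        "returns": str(hints.get("return")),
    }

schema = tool_schema(search_code)
```

The schema stays in sync with the code by construction, which is exactly what you lose with freeform markdown; doing this across languages and over the network is where it gets hairy.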
The UIs all bake in system prompts and other tunable configs that the API leaves open, so does Claude Code and other harnesses. So anything you notice different over the API when you're controlling the client is almost certainly that. Note that this is kind of something they have to do because consumer UI users will do stuff like ask models their name or date, or want it to respond politely and compassionately, and get upset/confused when they just get what's in the weights.
The problem with subscriptions for this kind of stuff is that it's just incompatible with their cost structure. Worst of all, subscription usage is going to follow a diurnal pattern that overlaps with business/API users, so they're going to have to be offloaded to compute partners who most likely charge by the resource-second. And it's a competitive market: anybody who wants usage-based pricing can just get that.
So you basically end up with adverse selection with consumer subscription models. It's just kind of an incoherent business model that only works when your value proposition is more than just compute (which has a usage-based, pretty fungible market)
I find the most value to be in eval loops and multi-agent setups where a specialized or cheap model gets tasks that take load off the smarter model.
Most of the value in agentic development IMO is in the feedback loop/ability for the model itself to intelligently pull in context, but if you want to push a lot of context or have steps that are more prescribed, it's kind of a waste of money to have the big model do that. Much better to use it as a kind of pre-processing/noise-reduction step that filters out junk context.
I would say that right now the benefits are largest for this kind of work with medium-sized multimodal models. For example I have hooks/automation that use https://github.com/accretional/chromerpc to automatically screenshot UIs and then feed it into qwen-family models. It's more that I don't want to pay Opus to look at them or remember/be instructed to do that unless it goes through QA first.
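A toy sketch of that routing idea. `cheap_model` and `big_model` are hypothetical stand-ins for the actual API calls (e.g. a local qwen model vs. Opus); they're stubbed here so the gating logic itself is runnable:

```python
# Sketch of a cheap-model QA gate in front of an expensive model.
# Both model functions are stubs standing in for real API calls.
def cheap_model(screenshot: bytes) -> dict:
    """Stub QA pass: flag obviously broken UIs before escalating."""
    return {"looks_broken": b"ERROR" in screenshot, "summary": "stub summary"}

def big_model(prompt: str) -> str:
    """Stub for the expensive model call."""
    return f"analysis of: {prompt}"

def review_ui(screenshot: bytes) -> str:
    qa = cheap_model(screenshot)  # cheap pre-processing / noise reduction
    if not qa["looks_broken"]:
        return "QA passed; no expensive review needed"
    # Only pay for the big model when QA flags a problem, and hand it
    # a condensed summary instead of the raw screenshot context.
    return big_model(qa["summary"])
```

The point is that the expensive model only ever sees the cases (and the condensed context) the cheap model couldn't dispose of on its own.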
> I find the most value to be in eval loops and multi-agent setups where a specialized or cheap model gets tasks that take load off the smarter model.
Yes, in theory, this should hold up, at least according to evaluations.
According to real, practical use though, none of the open weight models are generally strong enough to handle coding and programming in a professional environment, unless you have tightly controlled scope and specialized models for those scopes, which generally I don't think you have, but maybe it's just me jumping around a lot.
Even with feedback loops, harnesses and what not, even the strongest local models I can run with 96GB of VRAM don't seem to come close to what OpenAI offered in the last year or so. I'm sure it'll be ready at one point, but today it isn't.
With that said, if you know specific models you think work well as general, local programming models, please share which ones, happy to be shown wrong. Latest I've tried was Qwen3.6-35B-A3B which gets a bit further, but instruction following is still a far cry from what OpenAI et al offered for years.
It’s to stop you from getting RL traces or using Claude without paying the big bucks for the Enterprise Security version
I really like Anthropic models and the company mission but I personally believe this is anticompetitive, or at least, anti user.
If they are going to turn into a protection racket I’ll just do RL black boxing/pentesting on Chinese models or with Codex, and since I know Anthropic is compute constrained I’ll just put the traces on huggingface so everybody else can do it too.
I just want to pay them for their RL’d tensor thingies, but if their business plan is to hoard the tokens or only sell it to certain people, they are literally part of every other security conscious person’s threat model.
They are training them on decompilation and reverse engineering/blackbox reimplementations/pentesting because it’s one of the best ways to generate interesting and rare RL traces for agentic coding AND teach them how lots of things work under the hood.
Just throw Claude at millions of binaries and you can get amazing training data. Oh wait 4.7 gives you refusals for that now