At this scale, that kind of thing is not really a problem; you just dump all of the data you can find into the model (pre-training) [1]. Of course, the pre-training data influences the model, but the reinforcement learning is really what determines the model’s writing style and, in general, how it “thinks” (post-training).
Yeah, I flat out don't believe the 2% thing. It's possible that I was the 1 out of 50 who checked the page and saw that Claude Code was removed... but it really seems like everyone I shared it with saw the same thing, which is incredibly unlikely. Also, I am an existing subscriber and checked the pricing page while logged in, so I shouldn't be counted in "2% of new subscribers" at all...
I have a Claude Pro tier subscription; Claude Code, as of right now, is still functional for me. If Anthropic does boot Pro-tier users off Claude Code, I will be cancelling my subscription.
You have to imagine they would grandfather existing users in for at least a year or something, even if this "test" goes very well and points to removal.
This test makes perfect sense given their actions over the last few weeks: they think they've done enough to transition to the general public and away from devs, and that our goodwill is no longer something they need to be concerned with.
It's funny that OpenAI, who in my eyes went for the general public rather than devs initially, seems to be semi-pivoting and catching all the fallout from Anthropic's recent behavior.
It is a massive bummer. Up until a few weeks ago, I had been hard pulling for Anthropic for quite some time; now I just don't care, and I hope something dope emerges quickly that signals I won't ever have to consider either of them.
I was part of a team researching MS at a university a while ago. It truly is an endlessly fascinating disease. Most evidence currently points to MS being caused by a combination of Epstein-Barr infection and genetic factors [0,1]. It is hypothesized that Epstein-Barr triggers autoimmunity which results in the prototypical demyelination [2].
Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.
It's a sort of arbitrary pattern-matching thing that can't be trained on in the sense that MMLU can be, but you can definitely generate billions of examples of this kind of task and train on them, and it will not make the model better at any other task. So in that sense, it absolutely can be.
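To make the point concrete, here is a minimal sketch of how cheap it is to mass-produce synthetic grid-transformation pairs. The mirroring rule is a hypothetical stand-in of my own — real ARC tasks use hand-designed, far more varied rules — but the mechanics of generating unlimited (input, output) pairs are the same:

```python
import random

def make_task(size=5):
    """Generate one synthetic 'mirror each row' pattern-matching task.

    Horizontal mirroring is an illustrative rule, not an actual ARC rule;
    the point is that volume is trivial, transfer is not.
    """
    grid = [[random.randint(0, 9) for _ in range(size)] for _ in range(size)]
    target = [row[::-1] for row in grid]  # rule: flip each row left-to-right
    return grid, target

# Effectively unlimited training pairs, at near-zero cost:
examples = [make_task() for _ in range(1000)]
```

A model fine-tuned on a billion of these would ace this one rule and learn nothing else, which is exactly the benchmark-contamination worry.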
I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work: https://arxiv.org/html/2407.06581v1
The real question is: why are people designing benchmarks such that training a model on them won't improve its performance at any real-world task? Why would anyone care about such benchmarks?
In this case, I had made an overlarge squashed merge that included both the Intercom integration (a suspiciously likely cause of slowness) and the feedback button that added the heart – so I needed to go deeper to figure out the true cause. (Noto Emoji was in the app from before, but wasn't triggered in the dashboard until we added an emoji there.)
> Your outlook above is too self critical. This is the first time an AI has beaten this park much less played a full game of RollerCoaster Tycoon through a TUI. There are important learnings for B2B SaaS. This isn't LinkedIn (it is, in fact, LinkedIn). But seriously. What can we learn here.
Starlink receivers are actually very complicated. They make use of a bunch of high-end FPGAs and a bunch of other expensive and uncommon components. See this teardown: https://youtu.be/h6MfM8EFkGg?si=m-sN6UW4nh8_HzPR.
If I read one more article/press release/whatever with such clumsy use of antithesis, I’m going to go insane. I have no problem with using AI to write if it is done well, but this…
[1] This data is still heavily filtered/cleaned.