Hacker News | getnormality's comments

> I think most ML people now think of neural-network architectures as being, essentially, choices of tradeoffs that facilitate learning in one context or another when data and compute are in short supply, but not as being fundamental to learning.

Is this a practical viewpoint? Can you remove any of the specific architectural tricks used in Transformers and expect them to work about equally well?


I think this question is one of the more concrete and practical ways to attack the problem of understanding transformers. Empirically, the current architecture is the one that converges best under gradient-descent training dynamics. A different form might be possible, and even beneficial, once the core learning task is complete. The requirements of iterated and continual learning might also lead to a completely different approach.


Coffee modifies physiology and cognition? You're telling me this for the first time.

The paper is about previously unknown ways coffee affects the body.

I was so surprised at this headline that I nearly leapt out of my chair!

But it says it's the same for decaf. That is more interesting.

Been treating coffee as caffeine with aroma. Any important points about coffee itself?

Humans have known that since 45 minutes after the first drink.

If the solar panels are movable we can go back to the Middle Ages formula where some of the land is left fallow each year to improve yields!

I skimmed the article for an explanation of why this is needed, what problem it solves, and didn't find one I could follow. Is the point that we want to be able to ask for visualizations directly against tables in remote SQL databases, instead of having to first pull the data into R data frames so we can run ggplot on it? But why create a new SQL-like language? We already have a package, dbplyr, that translates between R and SQL. Wouldn't it be more direct to extend ggplot to support dbplyr tbl objects, and have ggplot generate the SQL?

Or is the idea that SQL is such a great language to write in that a lot of people will be thrilled to do their ggplots in this SQL-like language?

EDIT: OK, after looking at almost all of the documentation, I think I've finally figured it out. It's a standalone visualization app with a SQL-like API that currently has backends for DuckDB and SQLite and renders plots with Vegalite. They plan to support more backends and renderers in the future. As a commenter below said, it's supposed to help SQL specialists who don't know Python or R make visualizations.


I was quite psyched when I read this, so maybe I can tell you why it's interesting to me, although I agree the announcement could have done a better job of it.

In my experience, the only thing the data fields (analysts, scientists, and engineers) share is SQL. As you said, you could do the same in R, but your project may not be written in R or Python; it likely uses an SQL database and some engine to access the data, though.

Also, I've been using marimo notebooks for a lot of analysis, where it's so easy to write SQL cells against the built-in duckdb that plotting directly from SQL would be great.

And finally, I have found Python APIs for plotting to be really difficult to remember and get used to. The amount of boilerplate for a simple scatterplot in matplotlib is ridiculous, even with an LLM. So a unified grammar within the unified query language would be pretty cool.
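To make the boilerplate point concrete, here's roughly what one scatterplot costs in matplotlib (just a sketch; the DataFrame and column names are made up for illustration):

  # Rough sketch of the matplotlib boilerplate for a single scatterplot.
  # `df`, "x", and "y" are made-up placeholders, not from any real project.
  import pandas as pd
  import matplotlib.pyplot as plt

  df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

  fig, ax = plt.subplots(figsize=(6, 4))
  ax.scatter(df["x"], df["y"], alpha=0.7)
  ax.set_xlabel("x")
  ax.set_ylabel("y")
  ax.set_title("y vs. x")
  fig.tight_layout()
  plt.show()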


I share your pain. You might enjoy Plotnine for Python; it helps ease the pain. The only bad thing about ggplot is that once you learn it you start to hate every other plotting system. Iteration is so fast, and it is so easy to go from scrappy EDA plot to publication-quality plotting, it just blows everything else out of the water.
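For a taste, here's the same kind of scatterplot in Plotnine's ggplot-style grammar (a sketch; the data and column names are made up):

  # Sketch: a scatterplot in Plotnine's grammar-of-graphics style.
  import pandas as pd
  from plotnine import ggplot, aes, geom_point, labs

  df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

  (ggplot(df, aes(x="x", y="y"))
   + geom_point(alpha=0.7)
   + labs(title="y vs. x"))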

But isn't this then just another tool that you're including in your project? I don't get why I would want to add this as a visualization tool to a project if it's already using R or Python, etc.

I mean, is it to avoid loading the full data into a dataframe/table in memory?

I just don't see what pain point this solves. ggplot already solves quite a lot of this, so I don't doubt that the authors know the domain well. I just don't see the "why" of this.


Anything to standardise some of the horrifying crap that data scientists write to visualise something.

Well, there's always going to be a dependency anyway: loading the data, making it a dataframe, visualizing it; that might be three libraries already.

In a sense I really get your complaint. It's the xkcd standards thing all over again: we now have a new competing standard.

I think for me it's not so much the ggplot connection, or the fact that I won't need a dataframe library.

It's that this might be the first piece of a standard way of plotting: no matter which backend (matplotlib, Vega, ggplot), no matter how you're getting your data (dataframes, a database), no matter where you're doing it (a Jupyter or marimo notebook, a Python script, R, heck, Looker Studio?). You could have just one way of defining a plot. That's something I've genuinely dreamt about.

And what makes this different from yet another library API, to me, is that it's integrated within SQL. SQL has already won the query standardisation battle, so this is a very promising idea for visualization standardisation.


I see, that's insightful. At first sight I thought of it as a kind of novelty, extending SQL with a visual grammar to integrate with a specific plotting library. But from your comments I can now imagine it has potential as a general solution for that space between data - wherever it comes from, it can typically be queried by SQL - and its visualization.

Thinking further, though, there might be value in extracting the specs of this "grammar of graphics" from the SQL syntax and generalizing them, so that other languages can implement the same interface.


I completely agree, and I think this is also why I'm quite excited. This project's connection with ggplot, which has one of the most respected grammars for plotting, means that it would be in a good position to achieve what you describe.

There’s certainly some benefit in a declarative language for creating charts from SQL. Obviously this doesn’t do anything that you can’t also do easily in R or Python / matplotlib using about the same number of lines of code. But safely sandboxing those against malicious input is difficult. Whereas with a declarative language like this you could host something where an untrusted user enters the ggsql and you give them the chart.

So it’s something. But for most uses just prompting your favorite LLM to generate the matplotlib code is much easier.


This isn't about ggplot (or any particular library) per se, it's about using a flavour of SQL with a grammar of graphics: https://en.wikipedia.org/wiki/Wilkinson%27s_Grammar_of_Graph...

What makes it interesting is the interface (SQL) coupled with the formalism (GoG). The actual visualization or runtime is an implementation detail (albeit an important one).


It seems to be for SQL users who don't know Python or R.

I would even add that it fits into a more general trend where operations are done within SQL instead of in a script or program that would use SQL to load data. Examples of this are duckdb in general, and BigQuery with all its LLM and ML functions.

Sometimes you fuck the cloud, sometimes the cloud fucks you

Maybe the old man is on to something.

Forgot to put a cache on it probably. :)

Nope

In what way is AI 2027 coming true?

AI 2027 predicted a giant model with the ability to accelerate AI research exponentially. This isn't happening.

AI 2027 didn't predict a model with superhuman zero-day finding skills. This is what's happening.

Also, I just looked through it again, and they never even predicted when AI would get good at video games. It just went straight from being bad at video games to world domination.


> Early 2026: OpenBrain continues to deploy the iteratively improving Agent-1 internally for AI R&D. Overall, they are making algorithmic progress 50% faster than they would without AI assistants—and more importantly, faster than their competitors.

> you could think of Agent-1 as a scatterbrained employee who thrives under careful management

According to this document, 1 of the 18 Anthropic staff surveyed even said the model could completely replace an entry level researcher.

So I'd say we've reached this milestone.


In the system card they seem to dismiss this. Quotes:

> (...) Claude Mythos Preview’s gains (relative to previous models) are above the previous trend we’ve observed, but we have determined that these gains are specifically attributable to factors other than AI-accelerated R&D,

> (The main reason we have determined that Claude Mythos Preview does not cross the threshold in question is that we have been using it extensively in the course of our day-to-day work and exploring where it can automate such work, and it does not seem close to being able to substitute for Research Scientists and Research Engineers—especially relatively senior ones.

> Early claims of large AI-attributable wins have not held up. In the initial weeks of internal use, several specific claims were made that Claude Mythos Preview had independently delivered a major research contribution. When we followed up on each claim, it appeared that the contribution was real, but smaller or differently shaped than initially understood (though our focus on positive claims provides some selection bias). In some cases what looked like autonomous discovery was, on inspection, reliable execution of a human-specified approach. In others, the attribution blurred once the full timeline was accounted for.

Anthropic is making significant progress at the moment. I think this is mostly explained by the fact that a massive reservoir of compute became available to them in mid/late 2025 (the Project Rainier cluster, with 1 million Trainium2 chips).


> According to this document, 1 of the 18 Anthropic staff surveyed even said the model could completely replace an entry level researcher.

> So I'd say we've reached this milestone.

If 1 out of N=18 is our bar for statistical significance on world-altering claims, then yeah, I think we can replace all the researchers.


Both Anthropic and OpenAI employees have been saying since about January that their latest models are contributing significantly to their frontier research. They could be exaggerating, but I don’t think they are. That combined with the high degree of autonomy and sandbox escape demonstrated by Mythos seems to me like we’re exactly on the AI 2027 trajectory.


In AI 2027, May 2026 is when the first model with professional-human hacking abilities is developed. It's currently April 2026 and Mythos just got previewed.


I think previous models could do hacking just fine.


The Mythos system card shows massive improvements over Opus in hacking (e.g., 0.8% -> 72% on "Firefox shell exploitation"). If you thought Opus was already human-professional-level, well.


What's the professional human baseline?


It's true, though, that the cybersecurity skills firmly put these models in the "weapons" category. I can't imagine China and other major powers not scrambling to get their own equivalent models ASAP and at any cost; it's almost existential at this point. So a proper arms race between superpowers has begun.


It's a little funny that "system/model card" has progressively been stretched to the point where it's now a 250-page report and no one makes anything of it.


I was thinking the same thing, but then I decided to embrace the frustration of the image. It's reminding us that the pictures we have in our heads are kind of fragile. They don't prepare us for a live encounter with Earth from some random angle in space.


This is kind of profound.


If you're confused what you're looking at, turn it upside down.


West Africa & Gibraltar strait


From the article:

> The Earth appears upside down

Which I find to be a rather interesting take.


Hmm. Folks trying to discover the elegant core of data frame manipulation by studying... pandas usage patterns. When R's dplyr solved this over a decade ago, mostly by respecting SQL and following its lead.

The pandas API feels like someone desperately needed a wheel and had never heard of a wheel, so they made a heptagon, and now millions of people are riding on heptagon wheels. Because it's locked in now, everyone uses heptagon wheels, what can you do? And now a category theorist comes along, studies the heptagon, and says hey look, you could get by on a hexagon. Maybe even a square or a triangle. That would be simpler!

No. Stop. Data frames are not fundamentally different from database tables [1]. There's no reason to invent a completely new API for them. You'll get within 10% of optimal just by porting SQL to your language. Which dplyr does, and then closes most of the remaining optimality gap by going beyond SQL's limitations.

You found a small core of operations that generates everything? Great. Also, did you know Brainfuck is Turing-complete? Nobody cares. Not all "complete" systems are created equal. A great DSL is not just about getting down to a small number of operations. It's about getting down to meaningful operations that are grammatically composable. The relational algebra that inspired SQL already nailed this. Build on SQL. Don't make up your own thing.

Like, what is "drop duplicates"? What are duplicates? Why would anyone need to drop them? That's a pandas-brained operation. You want the distinct rows defined by a chosen set of key columns, like SQL and dplyr provide.
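If you want that in runnable form, here's a sketch (table and column names made up) of keyed distinct via duckdb from Python:

  # Sketch: distinct rows by chosen key columns, SQL-style.
  # Table and column names are made up for illustration.
  import duckdb
  import pandas as pd

  df = pd.DataFrame({
      "user_id": [1, 1, 2],
      "day": ["mon", "mon", "tue"],
      "clicks": [3, 5, 2],
  })

  # duckdb can query a local DataFrame by name; DISTINCT ON keeps one
  # row per (user_id, day) key.
  out = duckdb.sql(
      "SELECT DISTINCT ON (user_id, day) * FROM df ORDER BY user_id, day"
  ).df()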

Who needs a separate select and rename? Select is already using names, so why not do your name management there? One flexible select function can do it all. Again, like both SQL and dplyr.
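In the same vein, a sketch (made-up names) of one SELECT doing projection and renaming at once:

  # Sketch: projection and renaming in a single SELECT (made-up names).
  import duckdb
  import pandas as pd

  df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
  out = duckdb.sql("SELECT a AS x, b FROM df").df()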

Who needs a separate difference operation? There's already a type of join, the anti-join, that gets that done more concisely and flexibly, and without adding a new primitive, just a variation on the concept of a join. Again, like both SQL and dplyr.
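For contrast, pandas has no anti-join verb; the usual workaround is the merge-indicator dance (a sketch with made-up frames):

  # Sketch: anti-join in pandas via the merge indicator -- keep rows of
  # `orders` whose user_id has no match in `banned`. Frames are made up.
  import pandas as pd

  orders = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10, 20, 30]})
  banned = pd.DataFrame({"user_id": [2]})

  anti = (orders.merge(banned, on="user_id", how="left", indicator=True)
                .query("_merge == 'left_only'")
                .drop(columns="_merge"))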

Props to pandas for helping so many people who have no choice but to do tabular data analysis in Python, but the pandas API is not the right foundation for anything, not even a better version of pandas.

[1] No, row labels and transposition are not a good enough reason to regard them as different. They are both just structures that support pivoting, which is vastly more useful, and again, implemented by both R and many popular dialects of SQL.


I guess I have pandas brain, because I definitely want to drop duplicates. 100% of the time I'm worried about duplicates, and 99% of the time the only thing I want to do with duplicates is drop them. When you've got 19 columns it's _really fucking annoying_ if the tool you're using doesn't have an obvious way to say `select distinct on () from my_shit`. A close second, at say 98% of the time, is wanting to get a count of duplicates as a sanity check, because I know to expect a certain amount of them. Pandas makes that easy too, in a way SQL makes really fucking annoying. There are a lot of parts of pandas that made me stop using it long ago, but first-class duplicates handling is not among them.

And the API is vastly superior to SQL in some respects from a user perspective, despite being all over the place in others. Dataframe select/filtering, e.g. df = df[df.duplicated(keep='last')], is simple, expressive, obvious, and doesn't result in bleeding fingers. The main problem is that the rest of the language around it, with all the indentations, newlines, loops, functions and so on, can be too terse or too dense, and much harder to read than SQL.
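Concretely, both duplicate habits are one-liners in pandas (a sketch on a made-up frame):

  # Sketch: the two duplicate habits described above, on a made-up frame.
  import pandas as pd

  df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

  n_dupes = df.duplicated().sum()  # sanity-check count of duplicate rows
  df = df.drop_duplicates()        # then drop them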


Duplicates in source data are almost always a sign of bad data modeling, or of analysts and engineers disregarding a good data model. But I agree that this ubiquitous antipattern that nobody should be doing can still be usefully made concise. There should be a select distinct * operation.

And FWIW I personally hate writing raw SQL. But the problem with the API is not the data operations available, it's the syntax and lack of composability. It's English rather than ALGOL/C-style. Variables and functions, to the extent they exist at all, are second-class, making abstraction high-friction.


Oooh buddy how's the view from that ivory tower??

But seriously, I'm not always in control of upstream data. I get stuff thrown over to my side of the fence by an organization that just needs data jiggled around for one-off ops purposes. They're communicating with me via CSV files scraped from Excel files in their Shared Drive, that kind of thing.


Do what you gotta do, but most of my job for the past decade has been replacing data pipelines that randomly duplicate data with pipelines that solve duplication at the source, and my users strongly prefer it.

Of course, a lot of one-off data analysis has no rules but get a quick answer that no one will complain about!


I updated my OG comment for context. As an org we also help clients come up with pipelines but it's just unrealistic to do a top-down rebuild of their operations to make one-off data exports appeal to my sensibilities.


I agree, sometimes data comes to you in a state that is beyond the point where rigor is helpful. And for some people that kind of data is most of their job!


Duplicates are a sign of reality. Only where you have the resources to have dedicated people clean and organize data do you have well modeled data. Pandas is a power tool for making sense of real data.


> Duplicates in source data are almost always a sign of bad data modeling

Nope. Duplicates in source data (the INPUT) are natural and correct, and MUST be supported, or almost all data becomes impossible to work with.

The actual problem is the OUTPUT. Duplicates in the OUTPUT need to be controlled and explicit. In general, we need each OUTPUT row to be unique by an N-column key, but we probably don't need it to be unique across the rest; so, in the relational model, you need uniqueness for a combination of columns (rarely for ALL of them).


You articulate your case well, thank you!

I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.


When I'm in pandas (or was, I don't use it anymore) I'm always downstream of some weird data process that ultimately exported to a CSV from a team that I know has very lax standards for data wrangling, or it is just not their core competency. I agree that duplicates are a smell but they happen often in the use-cases that I'm specifically reaching to pandas for.


Exactly. It's not that getting rid of duplicates is bad; it's that they may be a symptom of something worse, e.g. incorrect aggregation logic.


On reflection I think it's possible I may have missed the potential positive value of the post a bit. Maybe analyzing pandas gets you down to a set of data frame primitives that is helpful to build any API. Maybe the API you start with doesn't matter. I don't know. When somebody works hard to make something original, you should try to see the value in it, even if the approach is not one you would expect to be helpful.

I stand by my warnings against using pandas as a foundation for thinking about tabular data manipulation APIs, but maybe the work has value regardless.


> There's no reason to invent a completely new API for them

Yes there is: SQL is one of many possible ways to interact with tabular data, why should it be the only one? R data frames literally pioneered an alternative API. Dplyr is fantastic for many reasons, one of those being that people like the verb-based approach

Furthermore I argue that dplyr is not particularly similar to SQL in the way you actually use it and how it's actually interpreted/executed.

As for the rest I feel like you're just stating your preferences as fact.


Amen.

The author takes the 4 operations below and discusses some 3-operation thing from category theory. Not worth it, and not as clear as dplyr.

> But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?


> just by porting SQL to your language

You make it sound like writing an SQL parser and query engine is a trivial task. Have you ever looked at the implementation of a query engine to see what’s actually involved? You can’t just ‘build on SQL’, you have to build a substantial library of functions to build SQL on top of.


Also it's not like dplyr is anything close to a "port" of SQL. You could in theory collect dplyr verbs and compile them to SQL, sure. That's what ORMs typically do, and what the Spark API does (and its descendants such as Polars).

"Porting" SQL to your language usually means inventing a new API for relational and/or tabular data access that feels ergonomic in the host language, and then either compiling it to SQL or executing it in some kind of array processing backend, or DataFusion if you're fancy like that.


dplyr straightforwardly transpiles to SQL through the dbplyr package, so it's semantically pretty close to a port, even though the syntax is a bit different (better).


I couldn’t agree more. But at the same time I try to stay quiet about it because SQL is the diamond in the rough that 95% of engineers toss into the trash. And I want minimal competition in a tight job market.


"The only tool I'm willing to use is a hammer, and by god I'll turn everything into nails."


SQL only works on well defined data sets that obey relational calculus rules. Pandas is a power tool for dealing with data as you find it. Without Pandas you are stuck with tools like Excel.

