hiya, anders from dbt here. cool project -- I especially love the branching and budgeting options you've built in. both are things that I'd love for the dbt standard to include one day. was it dbt's lack of those features that inspired you to start this project? It also seems you have an aversion to Jinja, which, believe me, I get!
FYI dbt-fusion [1] is going GA next week (though GA for Databricks will come later). Most of it is source-available and ELv2-licensed, but there's a number of crates that are Apache 2.0, namely: dbt-xdbc, dbt-adapter, dbt-auth, dbt-jinja, dbt-agate. We also have plans to OSS more as time goes on (stay tuned).
I just wanted to call out the OSS crates in case you'd rather focus on "making your beer taste better" than have to re-build foundations. I'd love to hear if any of those crates come in handy for you (even more so if they don't work for you).
Feel free to reach out on LinkedIn or dbt community Slack if you ever want to chat more!
Hey Anders! Thanks a lot for dropping a comment and showing interest in Rocky.
Yes! I'm not going to lie, Jinja is one of the things that gives me some itches :). But it wasn't the major reason for starting to build Rocky, though.
It all started with the need to auto-generate dbt models from the FiveTran connections I integrate with, then hot-reload the code location in Dagster to discover the new assets -- all in a zero-touch data pipeline: FiveTran connections are discovered as they're created, and assets are materialized as those connections sync.
Auto-generating these dbt models, keeping the manifest aligned between Dagster code-location reloads, and spinning up EKS pods for each Dagster run that relies on these auto-generated models all have an impact on overall performance -- not only in production, but also on DX in developers' local environments.
Rocky wasn't born with a "dbt replacement" in mind at all; it was born to solve a real issue I'm facing. I made sure it integrates well with dbt, as it's in my plans to leverage the awesome work available as dbt packages for FiveTran.
I'll definitely have a look at the crates you mentioned!
Thank you!
> Auto-generating these dbt models, keeping the manifest aligned between Dagster code-location reloads
I just added you on LinkedIn. If you accept my connection there, I can DM you a private preview document that you might find very interesting, related to dbt project metadata (and way less painful than `manifest.json`).
ok, this is definitely up my alley. color me nerd-sniped and forgive the onslaught of questions.
my questions are less about the syntax, which i'm largely familiar with from knowing both SQL and ggplot.
i'm more interested in the backend architecture. Looking at the Cargo.toml [1], I was surprised to not see a visualization dependency like D3 or Vega. Is this intentional?
I'm certainly going to take this for a spin and I think this could be incredible for agentic analytics. I'm mostly curious right now what "deployment" looks like, both currently and in a utopian future.
utopia is easier -- what if databases supported it directly?!? but even then I think I'd rather have databases spit out an intermediate representation (IR) that could be handed to a viz engine, similar to how vega works. or perhaps the SQL is the IR?!
another question that arises from the question of composability: how distinct would a ggplot IR be from a metrics layer spec? could i use ggsql to create an IR that I then use R's ggplot to render (or vice versa maybe?)
as for the deployment story today, I'll likely learn most by doing (with agents).
My experiment will be to kick off an agent to do something like: extract this dataset to S3 using dlt [2], model it using dbt [3], then use ggsql to visualize.
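for the extract step, the rough shape in dlt would be something like this (pipeline name, bucket, and sample rows are all made up):

    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="hn_experiment",
        destination="filesystem",  # the filesystem destination writes to S3
        dataset_name="raw",        # when bucket_url points at an s3:// URI
    )

    # credentials/bucket come from dlt config, e.g.:
    #   export DESTINATION__FILESYSTEM__BUCKET_URL=s3://my-bucket
    pipeline.run([{"id": 1}, {"id": 2}], table_name="events")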
p.s. @thomasp85, I was a big fan of tidygraph back in the day [4]. love how small our data world is.
ggsql is modular by design. It consists of various reader modules that take care of connecting to different data backends (currently we have a DuckDB, an SQLite, and an ODBC reader), a central plot module, and various writer modules that take care of the rendering (currently only Vega-Lite, but I plan to write my own renderer from scratch).
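To make the shape concrete, here's a conceptual sketch in Python (ggsql itself is Rust, and all names below are illustrative, not our actual API):

    from dataclasses import dataclass

    @dataclass
    class PlotSpec:
        data: list     # rows returned by a reader
        mapping: dict  # e.g. {"x": "date", "y": "revenue"}
        mark: str      # e.g. "line"

    class DuckDBReader:
        # one of several interchangeable readers
        def read(self, query: str) -> list:
            import duckdb
            return duckdb.sql(query).df().to_dict("records")

    class VegaLiteWriter:
        # one of several interchangeable writers
        def write(self, spec: PlotSpec) -> dict:
            return {
                "data": {"values": spec.data},
                "mark": spec.mark,
                "encoding": {k: {"field": v} for k, v in spec.mapping.items()},
            }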
As for deployment, I can only talk about a utopian future, since this alpha release doesn't provide much that's tangible in that area. The ggsql Jupyter kernel already allows you to execute ggsql queries in Jupyter and Quarto notebooks, so deployment of reports should kinda work already, though we are still looking at making it as easy as possible to move database credentials along with the deployment. I also envision deployment of single .ggsql files that result in embeddable visualisations you can reference on websites etc. Our focus in this area will be Posit Connect in the short term.
I'm afraid I don't know what IR stands for - can you elaborate?
Ah - yes, in theory you could create a "ggplot2 writer" which renders the plot object to an R file you can execute. It is not too far away from the current Vega-Lite writer we use. The other direction (ggplot2 -> ggsql) is not really feasible.
TIL about Verse. Looks cool, I'll have to check it out.
> SQL is not a pipeline, it is a graph.
Maybe it's both? and maybe there will always be hard-to-express queries in SQL, and that's ok?
the RDBMS's relational model is certainly a graph and joins accordingly introduce complexity.
For me, just as creators of the internet regret that subdomains come before domains, I really wish we could go back in time and have `FROM` be the first clause and not `SELECT`. This is much more intuitive and lends itself to the idea of a pipeline: a table scan (FROM) that is piped into a projection (SELECT).
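For what it's worth, DuckDB already supports this order; a quick sketch through its Python API:

    import duckdb

    duckdb.sql("CREATE TABLE my_table AS SELECT 1 AS a, 2 AS b")

    # conventional order: the projection comes before the scan
    duckdb.sql("SELECT a, b FROM my_table")

    # DuckDB's FROM-first form: the scan leads and pipes into the projection
    duckdb.sql("FROM my_table SELECT a, b")

    # FROM on its own implies SELECT *
    duckdb.sql("FROM my_table")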
I'm as big a SQL stan as the next person and I'm also very skeptical anytime anyone says that SQL needs to be replaced.
At the same time, it's challenging that SQL cannot be iteratively improved and experimented upon.
IMHO, PRQL is a reasonable approach to extending SQL without replacing SQL.
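For flavor, here's a tiny PRQL pipeline compiled through the prql-python bindings (table and columns are made up):

    import prql_python as prql

    # a three-step pipeline; each line hands its result to the next
    sql = prql.compile("""
    from employees
    filter age > 30
    select {name, age}
    """)
    print(sql)  # prints an ordinary SELECT ... FROM ... WHERE ... statement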
But what I'd love to see is projects like Google's zeta-sql [1] and Substrait [2] get more traction. It would provide a more stable, standardized foundation upon which SQL could be improved, which would make the case for "SQL forever" even stronger.
I agree that CTEs help solve the problem of being able to read a SQL query from top to bottom, but I wouldn't say they're a panacea!
Personally, it's weird to me that `FROM` (scan) comes after `SELECT` (projection). IMHO the datasource should come first!
CTEs don't solve this problem; they just let you chain multiple SELECTs together.
A real use case is that it would allow intellisense to kick in a lot earlier!
Instead you have to write `SELECT * FROM my_table`, and only after that can you edit the `*` and get auto-complete suggestions for the columns of `my_table`.
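A sketch of both points via DuckDB's Python API (assuming a `my_table` with a column `a` already exists):

    import duckdb

    # CTEs chain SELECTs top to bottom, but within each step the projection
    # still comes before the source it reads from:
    duckdb.sql("""
        WITH base AS (SELECT * FROM my_table),
             filtered AS (SELECT a FROM base WHERE a > 0)
        SELECT a FROM filtered
    """)

    # FROM-first (DuckDB supports it) names the source up front, so an
    # editor already knows the columns by the time you reach SELECT:
    duckdb.sql("FROM my_table SELECT a")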
if I could go back to 2015 me, who had just found the feather library and was using it to power my unhinged topic-modeling-for-PowerPoint-slides work, and explain what feather would become (arrow) and the impact it would have on the data ecosystem, he would have looked at 2026 me like I was a crazy person.
Yet today I feel it was 2016 dataders who is the crazy one lol
Indeed. feather was a library to exchange data between R and pandas dataframes. People tend to bash pandas, but its creator (Wes McKinney) has changed the data ecosystem for the better with the learnings that came out of pandas.
I know pandas has a lot of technical warts and shortcomings, but I'm grateful for how much it empowered me early in my data/software career, and the API still feels more ergonomic to me due to the years of usage - plus GeoPandas layering on top of it.
Really, I prefer DuckDB SQL these days for anything that needs to perform well, and I feel like SQL is easier to grok than Python code most of the time.
> Really, I prefer DuckDB SQL these days for anything that needs to perform well, and I feel like SQL is easier to grok than Python code most of the time.
I switched to this as well, and it's mainly because explorations would need to be translated to SQL for production anyway. If I start with pandas I just have to do all the work twice.
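For example, DuckDB's Python API resolves local DataFrames by variable name, so the exploration query is already the SQL you'd promote to production (toy data below):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"region": ["eu", "us", "eu"], "amount": [10, 20, 30]})

    # duckdb sees the local `df` by name; no registration step needed
    duckdb.sql("SELECT region, sum(amount) AS total FROM df GROUP BY region")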
chdb's new DataStore API looks really neat (a drop-in pandas replacement) and is exactly how I envisioned a faster pandas could work without sacrificing its ergonomics.
Do people bash pandas? If so, it reminds me of Bjarne's quip that the two types of programming languages are the ones people complain about and the ones nobody uses.
He missed talking about the poor extensibility of pandas. It's missing some pretty obvious primitives to implement your own operators without whipping out slow for loops and appending to lists manually.
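A toy illustration of what that workaround looks like in practice (the custom operator is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})

    # the workaround being described: a Python-level loop plus list appends
    out = []
    for _, row in df.iterrows():
        out.append(row["a"] * 2)  # stand-in for some custom operator
    df["b"] = out

    # built-in ops vectorize fine; the gripe is that there's no comparable
    # fast path for operators you define yourself
    df["b"] = df["a"] * 2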
Yes (mostly) is the answer. You can use arrow as a backend, and I think with v3 (recently released) it's the default.
The harder thing to overcome is that pandas has historically had a pretty "say yes to things" culture. That's probably a huge part of its success, but it means there are now about 5 ways to add a column to a dataframe.
Adding support for arrow is a really big achievement, but shrinking an oversized API is even more ambitious.
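Both points are easy to show concretely (`data.csv` and its column `a` are made up; the `dtype_backend` argument is real as of pandas 2.0):

    import pandas as pd

    # arrow-backed dtypes via the dtype_backend argument (pandas >= 2.0)
    df = pd.read_csv("data.csv", dtype_backend="pyarrow")

    # a handful of the real ways to add a column:
    df["c"] = df["a"] + 1                # item assignment
    df = df.assign(c=df["a"] + 1)        # method chaining
    df.loc[:, "c"] = df["a"] + 1         # .loc assignment
    df.insert(len(df.columns), "c2", 0)  # positional insert (new name only)
    df.eval("c3 = a + 1", inplace=True)  # expression-string assignment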
I also use polars in new projects. I think Wes McKinney uses it too; if I remember correctly, I saw him commenting on some polars memory-related issues on GitHub. But a good chunk of polars' success can be attributed to Arrow, which McKinney co-created. All the gripes people have with pandas, he had them too, and he built something powerful to overcome them.
I saw Wes speak in the early days of pandas, in Berkeley. He solved problems that others had just worked around for decades. His solutions are quirky but the work was very solid. His career advanced a lot, IMHO for substantial reasons.. Wes personally marched through swamps and reached the other side.. others complain and do what they have always done.. I personally agree with the criticisms of the syntax, but pandas is real and it was not easy to build.
[1]: https://github.com/dbt-labs/dbt-fusion