Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> I had to implement an ETL

Buddy, google DuckDB, it's like OLAP SQLite. You'll thank me later.



Does someone here use DuckDB in production? Is it as stable as SQLite?

The hosting company I have to work with has a very old version of SQLite installed on the server and they don't want to update it. So I was looking at whether I could replace it with DuckDB since it seems to be easy to install with pip.


Answering the question you didn't ask:

pysqlite3-binary is a Linux only package in PyPI that includes a recent version of SQLite

apsw is a cross-platform package, it brings in a modern version as well, and additionally exposes everything SQLite can do (useful for me, you can write VFS drivers in python for it). The version of apsw in pypi is hopelessly out of date, the homepage has more details on how to install the latest version.


Thanks for the recommendations, I will have a look at pysqlite3-binary and apsw.


I very, very much like the idea of DuckDB, but have had many headaches with what I would consider "low-medium" amounts of log data (400 million rows, 180GB+, after stripping out any un-needed text) which works without issue using SQLite, btw, other than the expected slowness of row-oriented storage on the column-aggregate queries that I needed.

I have tried many different ways of importing the data with no luck. Usually it turned into a memory issue. This was irrespective of how I tried to import, whether it was in batches through a script, or using the built-in import functionality, or breaking the data into chunks prior to import, or anything else I tried. I occasionally keep trying to find a way, because I like the idea and features of DuckDB so much.

Of note, MonetDB slurped it right in without issue, but I would much rather use something not so "heavy" like DuckDB.

Also, regarding SQLite, I have DBs with billions of rows that work flawlessly, though not as fast as postgres.


You could try Clickhouse. It's a bit heavier than DuckDB and the default mode is server-client. But you can also use the client (a single binary) without a server to directly query data from csv or parquet files.

edit: added a better link, the stand-alone mode is called clickhouse-local

https://clickhouse.tech/docs/en/interfaces/cli

https://altinity.com/blog/2019/6/11/clickhouse-local-the-pow...


Interesting! I did not know about the standalone cli feature. Thanks! Will check it out.


One of the DuckDB authors here. Can you open an issue wrt the import problem? Should not happen. Try setting a database file and a memory limit.


DuckDB is designed for query processing, not updating data - I guess you would not want to use it for the transaction processing workloads just because it's easier to install. If you are doing mostly complex queries, then DuckDB is great, but I think there is going to be little usage directly in production, I'd expect it's more used for data processing


Yes I'm doing very few update or insert, but I run a lot of queries, so I guess it should be more than okay for DuckDB.


And how great is SQLite for transactional workloads? Every transaction locks up the entire db. There is no page or even table level locking. DuckDB can't be worse than that.


SQLite in WAL mode allows single concurrent writer and unlimited readers. Which is completely fine for a wide range of production use cases.


In practice, I've found that most SQLite write operations take low-milliseconds to complete - so even under a very heavy write load (assuming WAL mode) you're unlikely to see any contention between writes.

For my own stuff I maintain an in-memory Python queue of write operations and apply them in order against a single write connection - while running numerous other read-only connections to serve SELECTs.


SQLite is not at the top of the charts, but there's a looong ways to drop below it among the wide variety of desirable properties that it offers.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: