phiresky's comments | Hacker News

I'm a bit disappointed that this only solves the "find index of file in tar" problem, but not at all the "partially read a tar.gz file" problem. So really you're still reading the whole file into memory - why not just extract the files properly while you are doing that? It takes the same amount of time (O(n)) and less memory.

The gzip random-access problem is a lot more difficult because gzip has internal state. But in any case, solutions exist! Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant. https://github.com/mxmlnkn/ratarmount does this, apparently using https://github.com/pauldmccarthy/indexed_gzip internally. zlib even has an example of this method in its own source tree: https://github.com/gcc-mirror/gcc/blob/master/zlib/examples/...
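As a rough sketch of how building such an index looks with the indexed_gzip library mentioned above (the exact parameter names are from memory, so treat them as assumptions):

    # Sketch: one O(n) pass to build seek points for a .tar.gz, then save them.
    # Assumes indexed_gzip's IndexedGzipFile / build_full_index / export_index API.
    import indexed_gzip as igzip

    with igzip.IndexedGzipFile('archive.tar.gz', spacing=1024 * 1024) as f:
        f.build_full_index()                      # walks the whole compressed stream once
        f.export_index('archive.tar.gz.gzidx')    # ~32kB of decompressor state per seek point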

All depends on the use case of course. Seems like the author here has a pretty specific one - though I still don't see what the advantage of this is vs extracting in JS and adding all files individually to memfs. "Without any copying" doesn't really make sense, because the only difference is copying ONE 1MB tar blob into a Uint8Array vs 1000 1kB file blobs.

One very valid constraint the author makes is not being able to touch the source file. If you can do that, there's of course a thousand better solutions to all this - like using zip, which compresses each file individually and always has a central index at the end.



This is very cool. Worth a submission by itself.

> Apparently the internal state is only 32kB

Exactly. And often this state is either highly compressible or non-compressible but only sparsely used. The latter can then be made compressible by replacing the unused bytes with zeros.

Ratarmount uses indexed_gzip, and when parallelization makes sense, it also uses rapidgzip. Rapidgzip implements the sparsity analysis to increase compressibility and then simply uses the gztool index format, i.e., compresses each 32 KiB using gzip itself, with unused bytes replaced with zeros where possible.

indexed_gzip, gztool, and rapidgzip all support seeking in gzip streams, but all have some trade-offs; e.g., rapidgzip is parallelized but, because of that, has much higher memory usage than indexed_gzip or gztool. It might be possible to compile either of these to WebAssembly if there is demand.


> Each seek point is accompanied by a chunk (32KB) of uncompressed data which is used to initialise the decompression algorithm, allowing us to start reading from any seek point.

> Apparently the internal state is only 32kB, so if you save this at 1MB offsets, you can reduce the amount of data you need to decompress for one file access to a constant.

You may need to revisit the definition of a constant. The 1/32 of additional data is small, but it still grows the more data you're trying to process (we call that O(n), not O(1)). Specifically it's ~3%, and you generally want to target 1% for this kind of stuff (one seek point every 3 MiB).

And the process still has to read through the entire gzip once to build that index.


I think you're looking at a different perspective than me. At _build time_ you need to process O(n), yes, and generate O(n) additional data. But I said "The amount of data you need to decompress is a constant". At _read time_, you need to do exactly three steps:

1. Load the file index - this one scales with the number of files unless you do something else smart and get it down to O(log(n)). This gives you an offset into the file. *That same offset, divided by the seek-point spacing, is an offset into your gzip index.*

2. Take that offset and load 32kB into memory (constant - does not change with the number of files, total size, or anything else apart from the actual file you are looking at)

3. Decompress a 1MB chunk (or more if necessary)

So yes, it's a constant.
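A rough sketch of that read path, again with indexed_gzip (the per-file tar index here is a hypothetical JSON mapping, not something the library provides):

    # Sketch: O(1) decompression work per file access, assuming the prebuilt
    # gzip seek-point index plus a hypothetical {name: [offset, size]} tar index.
    import json
    import indexed_gzip as igzip

    with open('archive.tar.gz.files.json') as idx:
        tar_index = json.load(idx)                 # step 1: file name -> uncompressed offset
    offset, size = tar_index['some/file.txt']

    with igzip.IndexedGzipFile('archive.tar.gz') as f:
        f.import_index('archive.tar.gz.gzidx')     # step 2: seek points incl. their 32kB states
        f.seek(offset + 512)                       # skip the 512-byte tar header
        data = f.read(size)                        # step 3: decompress only the needed chunk(s)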


My bad. Yes, from the decompression perspective you have O(1) ancillary data to initiate decompression at a seek point.

This is how seeking can work in encrypted data btw, without the ancillary data: you just increment the IV every N bytes, so there's a deterministic mapping for deriving the IV of any block, and you're only bounded by how much extra you need to encrypt/decrypt to do a random byte-range access of the plaintext.
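One common concrete instance of this is plain AES-CTR, where the block counter itself provides that mapping. A minimal sketch with the `cryptography` package (key/nonce handling is simplified and assumed; a real design also needs authentication, e.g. per-chunk AEAD):

    # Sketch: random access into AES-CTR ciphertext by deriving the counter for the
    # block a byte offset falls into. Simplified; not a hardened design.
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def decrypt_range(key: bytes, initial_counter: bytes, ciphertext: bytes, byte_offset: int) -> bytes:
        """Decrypt a ciphertext slice that starts at plaintext position byte_offset."""
        block = byte_offset // 16                                    # AES block containing the offset
        ctr = (int.from_bytes(initial_counter, 'big') + block) % (1 << 128)
        dec = Cipher(algorithms.AES(key), modes.CTR(ctr.to_bytes(16, 'big'))).decryptor()
        skip = byte_offset % 16                                      # keystream bytes to discard
        return dec.update(b'\x00' * skip + ciphertext)[skip:]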

But none of this is unique to gzip. You can always do this for any compression algorithm provided you can snapshot the state - the state at a seek point is always going to be fairly small.


I actually first thought this wasn't possible at all because I'm used to zstd which by default uses a 128MB window and I usually set it to the max (2GB window). 32kB is _really_ tiny in comparison. On the other hand though, zstd also compresses in parallel by default and has tools built in to handle these things, so seekable zstd archives are fairly common.

If anyone knows a similar solution for zstd, I'm very interested. I'm doing streaming decompression to disk and I'd like to be able to do resumable downloads without _also_ storing the compressed file.

https://github.com/martinellimarco/indexed_zstd

https://github.com/martinellimarco/libzstd-seek

Note, however, that this can only seek to frames, and zstd still only creates files containing a single frame by default. pzstd did create multi-frame files, but it is not being developed anymore. Other alternatives for creating seekable zstd files are: zeekstd, t2sz, and zstd-seekable-format-go.
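If you control the compression side, the simplest workaround is to compress fixed-size chunks into independent frames yourself; concatenated frames are a valid zstd file and each frame is a seek target. A sketch with the `zstandard` Python package (chunk size and file names are arbitrary assumptions, and a real seekable file would also want a proper seek table appended):

    # Sketch: write a multi-frame .zst by compressing independent chunks.
    import zstandard

    CHUNK = 4 * 1024 * 1024          # arbitrary 4 MiB chunks
    cctx = zstandard.ZstdCompressor(level=19)
    frame_offsets = []               # could be stored alongside as a crude seek table

    with open('input.bin', 'rb') as src, open('output.zst', 'wb') as dst:
        while chunk := src.read(CHUNK):
            frame_offsets.append(dst.tell())   # where this frame starts in the output
            dst.write(cctx.compress(chunk))    # one independent frame per chunk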


Thanks, this is helpful. I might just end up using content defined chunking in addition/instead, but it's good to know that there is a path forward if I stick with the current architecture.

Tar doesn't need to imply gzip (or bzip2, or zstd, etc). Tar's default operation produces uncompressed archives.

There's a bit of an issue with the linked deployment (in my opinion). In the most zoomed-out view you should see the first layer of blocks - very big blocks titled "English language", "French language", "German language". Maybe see https://phiresky.github.io/isbn-visualization/ instead - that makes it a bit easier to read.

The point of the visualization is showing different attributes of books in the space of ISBNs. ISBNs correlate with country, publisher, and release date; that's why using them as a space is useful. You can clearly see the history of when blocks were created, which blocks are rarer than others (present in fewer libraries), and (on the AA hosting) which blocks are more present in AA vs not.

In any case though, yes ISBNs as spatial data are clearly not perfect. Do you have any suggestions that would order the 100 million data points better?


Here's my article on how I built it - and also an instance hosted on GitHub pages if the AA domain is blocked for you: https://phiresky.github.io/blog/2025/visualizing-all-books-i...

Happy to answer questions as always :)


Love this!


Netdata used to be really impressively minimal, performant, and packed with functionality. Fully GPL open source. You ran one install command and it started a web UI at localhost:19999 in a few seconds. The UI loaded instantly and had hundreds of graphs. You could tell the author was a single opinionated person obsessed with getting the maximum of monitoring out of the minimum footprint.

It auto-detects many programs like Docker, nginx and PostgreSQL and automatically creates dashboards for them. It also has many dashboards about system internals I didn't even know were worth monitoring, so it taught me a lot - for example, seeing a CPU pinned at 100% processing interrupts because of a network interface overload, or seeing time frames with high IOwait during a SQL query, clearly meaning there are some larger seq scans happening.

You also needed zero configuration, no login, etc.

Then they added multi-instance monitoring purely client-side - the browser remembers other instance domains and links between them - pretty neat and completely non-invasive.

Then they introduced their cloud login, where you can monitor multiple instances remotely/together. They had a `--no-cloud` flag though if you did not want it. But by now they've removed that flag and they say patching out the cloud functionality is bypassing their license [1]. Some functionality is locked behind premium upgrades, and you get prevented from adding more than N metrics or M instances. It's still _possible_ to use netdata without going through their cloud but you have to go through a nag window every time you try to open the local UI. It's clear they don't want you to use it anymore, and I don't really feel comfortable about their default auto-updating local install any more either.

Now it's still impressive and useful, but it's much more an enterprise-focused tool than an "I have this server I want to monitor" tool.

Of course I understand they need to make money, but what used to be trivial to understand (it hooks into everything in your system that it can and opens a single port to display it) has become a whole huge integrated ecosystem, and for me personally it's now competing in a space where I'd probably rather spend the time to make a proper Prometheus/Grafana setup instead.

[1] https://github.com/netdata/netdata/discussions/17594#discuss...


Totally agree. I was quite impressed with the UI and bling bling when one of our employees installed it to monitor one of our servers (university team of 15 people). However, it turned out to be a total cognitive overload, and it was very hard to pare it down to what's strictly necessary to maintain the server. Then the cloud login stuff was a showstopper. We moved from Nagios via Icinga to Checkmk, but we still haven't found good metrics-based monitoring we're happy with (we had Munin at some point). A lot of the solutions seem like overkill or oversimplify alarm states, leading to a lot of false positives or duplicate notifications.


I used to love netdata, but right around dashboard v3 I switched to Prometheus+grafana

I still miss the no-fuss configuration and the anomaly detection was well done, but I just can't do the required cloud thing it does now


A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.


> A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.

I'm not convinced.

I assume someone at Netflix has thought about this, because if that were true and as simple as you say, Netflix would simply just buy Hetzner.

I think there are lots of reasons you could have this experience, and it still wouldn't be Netflix's experience.

For one, big applications tend to get discounts. A decade ago, when I (the company I was working for) was paying Amazon a mere $0.2M a month, I was getting much better prices from my account manager than were posted on the website.

There are other reasons (mostly from my own experiences pricing/costing big applications, but also due to some exotic/unusual Amazon features I'm sure Netflix depends on), but this one is probably big enough: volume gets discounts, and at Netflix size I would expect spectacular discounts.

I do not think we can estimate the factor better than 1.5-2x without a really good example/case-study of a company someplace in-between: How big are the companies you're thinking about? If they're not spending at least $5m a month I doubt the figures would be indicative of the kind of savings Netflix could expect.


We run our own infrastructure, sometimes with our own financing (4), sometimes external (3). The cost is in the tens of millions per year.

When I used to compare to AWS, egress alone at list price cost as much as my whole infra hosting. All of it.

I would be very interested to understand why Netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising costs for creating original content, rather than the cloud bill.


> I would be very interested to understand why Netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising costs for creating original content, rather than the cloud bill.

I invest in Netflix, which means I'm giving them some fast cash to grow that business.

I'm not giving them cash so that they can have cash.

If they share a business plan that involves them having cash to do X, I wonder why they aren't just taking my cash to do X.

They know this. That's why on the investors calls they don't talk about "optimising costs" unless they're in trouble.

I understand self-hosting and self-building saves money in the long-long term, and so I do this in my own business, but I'm also not a public company constantly raising money.

> When I used to compare to AWS, egress alone at list price cost as much as my whole infra hosting. All of it.

I'm a mere 0.1% of your spend, and I get discounts.

You would not be paying "list price".

Netflix definitely would not be.


Of course Netflix is optimising costs, otherwise it would not be a business; I just think they put much more effort elsewhere. They could be using other words, like "financial discipline" :)

My point is that even if I get a 20x discount on egress, it's still nowhere close, since I have to buy everything else - compute and storage are more expensive, and even with 5-10x discounts from list price it's not worth it.

(Our cloud bills are in the millions as well; I am familiar with what discounts we can get.)


Even then you can just call err.downcast_ref::<std::io::Error>() to get the underlying io::Error, no?


I'm happy to answer any questions! Nice to see this here again :)


This is great! Especially the DB sync part, because that happens before a user interaction, so you actually have to wait for it (the update itself can run in the background).

It always felt like such a waste to me how the DB sync always downloads tens of megabytes of data when likely only 1kB has changed. I mean, I also really appreciate the beauty of how simple it is. But I'd bet even a delta against a monthly baseline file would reduce the data by >90%.

Also, it would be interesting to see how zstd --patch-from compares to the delta library used here. It is very fast (as fast as normal zstd) and the code is already there within pacman.

For the recompression issue, there are some hard-to-find libraries that can do byte-exact reproducible decompression (https://github.com/microsoft/preflate-rs), but I don't know of any that work for zstd.


There's an extension to ISO8601 that fixes this and is starting to become supported in libraries:

    2019-12-23T12:00:00-02:00[America/Sao_Paulo]
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


That seems a bit excessive to sandbox a command that really just downloads arbitrary code you are going to execute immediately afterwards anyways?

Also I can recommend pnpm, it has stopped executing lifecycle scripts by default so you can whitelist which ones to run.
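If I remember the pnpm configuration correctly (treat the exact field name as an assumption), the whitelist lives in package.json, something like:

    {
      "pnpm": {
        "onlyBuiltDependencies": ["esbuild", "sharp"]
      }
    }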


At work we're currently looking into firejail and bubblewrap a lot though, and within the ops team we're looking at ways to run as much as possible, if not everything, through these tools tbh.

Because the counter-question could be: Why would anything but ssh or ansible need access to my ssh keys? Why would anything but firefox need access to the local firefox profiles? All of those can be mapped out with mount namespaces from the execution environment of most applications.

And sure, this is a blacklist approach, and a whitelist approach would be even stronger, but the blacklist approach to secure at least the keys to the kingdom is quicker to get off the ground.


firejail, bubblewrap, direct chroot, sandbox-run ... all have been mentioned in this thread.

There's a gazillion tools out there, a list that can give someone analysis paralysis. Here's my simple suggestion: all of your backend team already knows (or should learn) Docker for production deployments.

So, why not rely on the same? It might not be the most efficient, but then dev machines are mostly underutilized anyway.


> Also I can recommend pnpm, it has stopped executing lifecycle scripts by default so you can whitelist which ones to run.

Imagine you are in a 50-person team that maintains 10 JavaScript projects. Which one is easier?

  - Switch all projects to `pnpm`? That means switching CI, and deployment processes as well
  - Change the way *you* run `npm` on your machine and let your colleagues know to do the same
I find the second to be a lot easier.


I don't get your argument here. 10 isn't a huge number in my book, but of course I don't know what else that entails. I would opt for a secure process change over a soft local workflow restriction that may or may not be followed by all individuals. And I would definitely protect my CI system in the same way as local machines. Depending on the nature of the CI, these machines can have broad access rights. This really depends on how you do CI and how lax security is.


I'll do soft local workflow restriction right away.

The secure process change might take anywhere from a day to months.


There are a great many extra perks to switching to pnpm though. We switched our projects over a while back and haven't looked back.


Yeah, I'd just take the time to convert the 10 projects rather than try to get 50 people to change their working habits, plus new staff coming in etc.

Switch your projects once, done for all.


So, switching to pnpm does not entail any work habit changes?


Am I missing something? Don't you also need to change how CI and deployment processes call npm? If my CI server and then also my deployment scripts are calling npm the old insecure way, and running infected install scripts/whatever, haven't I just still fucked myself, just on my CI server and whatever deployment system(s) are involved? That seems bad.


Your machine has more projects, data, and credentials than your CI machine, as you normally don't log into Gmail on your CI. So, just protecting your machine is great.

Further, you are welcome to use this alias on your CI as well to enhance the protection.


Attacking your CI machines means poisoning the artifacts you ship and the systems they get deployed to, and getting access to all the source the CI builds and can access (often more than you have locally) and all the infrastructure it can reach.

CI machines are very much high-value targets of interest.


> Further, you are welcome to use this alias on your CI as well to enhance the protection.

Yes, but if I've got to configure that across the CI fleet as well as in my deploy system(s) in order to not get malware (and also not distribute it), what's the difference between having to do that vs switching to pnpm in all the same places?

Or more explicitly, your first point is invalid. Whether you ultimately choose to use Docker to run npm or switch to pnpm, it doesn't count to half-ass the fix and only tell your one friend on the team to switch; you have to get all developers to switch AND fix your CI system, AND also your deployment system(s) (if they are exposed).

This comment proffers no opinion on which of the two solutions should be preferred, just that the fix needs to be made everywhere.


You have the logic backwards here. I would go for a single person dealing with the pnpm migration and CI, rather than instructing the other 10 and hoping everyone does the right thing. And think about what happens when the next person comes in... so I'd go for the first option for sure.

And npm can be configured to prevent install scripts from being run anyway:

> Consider adding ignore-scripts to your .npmrc project file, or to your global npm configuration.
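For reference, that's a one-line setting (keep in mind it also disables your own project's lifecycle scripts):

    # .npmrc
    ignore-scripts=true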

But I do like your option to isolate npm for local development purposes.


> which one is easier?

> Switch all projects to `pnpm`?

Sorry; I am out of touch. Does pnpm not have these security problems? Do they only exist for npm?


pnpm doesn't execute lifecycle scripts by default, so it avoids the particular attack vector of "simply downloading and installing an NPM package allows it to execute malicious code."

As phiresky points out, you're still "download[ing] arbitrary code you are going to execute immediately afterwards" (in many/most cases), so it's far from foolproof, but it's sufficient to stop many of the attacks seen in the wild. For example, it's my understanding that last month's Shai-Hulud worm depended on postinstall scripts, so pnpm's restriction of postinstall scripts would have stopped it (unless you whitelist the scripts). But last month's attack on chalk, debug, et al. only involved runtime code, so measures like pnpm's would not have helped.


Exactly, so you should still execute all JS code in a container.


> That seems a bit excessive to sandbox a command that really just downloads arbitrary code you are going to execute immediately afterwards anyways?

I won't execute that code directly on my machine. I will always execute it inside the Docker container. Why do you want to run commands like `vite` or `eslint` directly on your machine? Why do they need access to anything outside the current directory?


I get this but then in practice the only actually valuable stuff on my computer is... the code and data in my dev containers. Everything else I can download off the Internet for free at any time.


No.

Most valuable data on your system for a malware author is login cookies and saved auth tokens of various services.


Maybe keylogging for online services.

But it is true that work and personal machines have different threat vectors.


Yes, but I'm willing to bet most workers don't follow strict digital-life hygiene and cross-contaminate all the time.


You don't have any stored passwords? Any private keys in your `.ssh/`? DB credentials in some config files? And the list goes on and on.


I don't store passwords (that always struck me as defeating the purpose) and my SSH keys are encrypted.


This kind of mentality, and "seems a bit excessive to sandbox a command that really just downloads arbitrary code", is why the JS ecosystem is so prone to credential theft. It's actually insane to read stuff like that said out loud.


Right, but the opposite mentality winds up putting so many of the eggs in the basket of the container that it defeats a lot of the purpose of the container.


It's weird that it's downvoted because this is the way


Maybe I'm misunderstanding the "why run anything on my machine" part. Is the container on the machine? Isn't that running things on your machine?

Is he just saying always run your code in a container?


> is the container on the machine?

> is he just saying always run your code in a container?

yes

> isn't that running things on your machine?

in this context where they're explicitly contrasted, it isn't running things "directly on my machine"


It annoys me that people fully automate things like type checkers and linting into post-commit hooks, or worse, outsource them entirely to CI.

Because it means the hygiene is thrown over the fence in a post-commit manner.

AI makes this worse, because AI tools also run them "over the fence".

However you run it, I want a human to hold accountability for the mainline committed code.


I run linters like eslint on my machine inside a container. This reduces attack surface.

How does this throw hygiene over the fence?


Yes, in a sibling reply I was able to better understand your comment to mean "run stuff on my machine in a container".


pnpm has lots of other good attributes: it is much faster, and it also keeps a central store of your dependencies, reducing disk usage and download time, similar to what Java/Maven does.


> command that really just downloads arbitrary code you are going to execute immediately afterwards anyways?

By default it directly runs code as part of the download.

With isolation there is at least a chance to do some form of review/inspection.


I've tried using pnpm to replace npm in my project. It really sped up installing dependencies on the host machine, but it was much slower in the CI containers, even after configuring the cache volume. That made me go back to npm.


> That seems a bit excessive to sandbox a command that really just downloads arbitrary code you are going to execute immediately afterwards anyways?

I don't want to stereotype, but this logic is exactly why the JavaScript supply chain is in the mess it's in.

