Jennifer Aniston and Friends Cost Us 377GB and Broke Ext4 Hardlinks

replooda · 2026-04-10T13:55:05 1775829305

In short: Deduplication efforts frustrated by hardlink limits per inode — and a solution compatible with different file systems.

UltraSane · 2026-04-10T14:21:42 1775830902

The real problem is they aren't deduplicating at the filesystem level like sane people do.

otterley · 2026-04-10T15:42:35 1775835755

From the article:

> [W]e shipped an optimization. Detect duplicate files by their content hash, use hardlinks instead of downloading each copy.

UltraSane · 2026-04-10T15:54:16 1775836456

I meant TRANSPARENT filesystem level dedupe. They are doing it at the application level. filesystem level dedupe makes it impossible to store the same file more than once and doesn't consume hardlinks for the references. It is really awesome.

mmh0000 · 2026-04-10T15:59:33 1775836773

Filesystem/file level dedupe is for suckers. =D

If the greatest filesystem in the world were a living being, it would be our God. That filesystem, of course, is ZFS.

Handles this correctly:

https://www.truenas.com/docs/references/zfsdeduplication/

UltraSane · 2026-04-10T16:01:37 1775836897

I was talking about block level dedupe.

mmh0000 · 2026-04-10T16:08:14 1775837294

I thought you might be.

I just wanted to mention ZFS.

Have I mentioned how great ZFS is yet?

otterley · 2026-04-10T18:19:58 1775845198

ZFS is great! However, it's too complicated for most Linux server use cases (especially with just one block device attached); it's not the default (root filesystem); and it's not supported for at least one major enterprise Linux distro family.

vmilner · 2026-04-11T08:30:26 1775896226

It's not as good as ed: https://www.gnu.org/fun/jokes/ed-msg.html

burnt-resistor · 2026-04-10T20:07:56 1775851676

File system dedupe is expensive because it requires another hash calculation that cannot be shared with application-level hashing, is a relatively rare OS-fs feature, doesn't play nice with backups (because files will be duplicated), and doesn't scale across boxes.

A simpler solution is application-level dedupe that doesn't require fs-specific features. Simple scales and wins. And plays nice with backups.

Hash = sha256 of file, and abs filename = {{aa}}/{{bb}}/{{cc}}/{{d}} where

aa = hash 2 hex most significant digits

bb = hash next 2 hex digits

cc = hash next 2 hex after that

d = remaining hex digits

UltraSane · 2026-04-10T20:23:44 1775852624

All good backup software should be able to do deduped incremental backups at the block level. I'm used to veeam and commvault

burnt-resistor · 2026-04-11T12:22:38 1775910158

That costs even more, unreuseable time and effort. It's simpler to dedupe at the application level rather than shift the burden onto N things. I guess you don't understand or appreciate simplicity.

UltraSane · 2026-04-11T21:35:56 1775943356

This article shows it really isn't that simple and is easy to mess up. Who cares if your storage and backup software both dedupe?

otterley · 2026-04-10T23:28:54 1775863734

For ZFS, at least, `zfs send` is the backup solution. And it performs incremental backups with the `-i` argument.

UltraSane · 2026-04-11T01:58:01 1775872681

zfs send is really awesome when combined with dedupe and incremental

dj_rock · 2026-04-10T14:00:28 1775829628

We were on a break...of your filesystem!

uticus · 2026-04-10T14:05:08 1775829908

And I thought this was a reference to a Win95 problem https://www.slashgear.com/1414245/jennifer-aniston-matthew-p...

mingus88 · 2026-04-10T14:44:39 1775832279

Yeah Block level dedupe has been an industry standard for decades. Tracking file hashes? Why?

And I see above that this is a self-hosted platform and I still don’t get it. I was running terabytes of ZFS with dedupe=on on cheap supermicro gear in 2012

zulux · 2026-04-10T15:49:56 1775836196

File hashes are great to get two systems to work together to dedupe themselves. I have a Windows backup that sends hashes to a backup server, so we don't back up crud we already have.

niobe · 2026-04-11T00:16:24 1775866584

Completely Claude written FWIW. I recongise the style.

trixn86 · 2026-04-10T14:50:37 1775832637

The Problem. The fix. The Limit.

Is it just me or is everybody else just as fed up with always the same AI tropes?

I've reached a point where I just close the tab the moment I read a headline "The problem". At least use tropes.fyi please

colejohnson66 · 2026-04-10T16:25:58 1775838358

Doesn’t read like AI to me

snickerbockers · 2026-04-10T17:40:38 1775842838

Let that sink in.

otterley · 2026-04-10T15:41:21 1775835681

Another reason to use XFS -- it doesn't have per-inode hard link limits.

(Some say ZFS as well, but it's not nearly as easy to use, and its license is still not GPL-friendly.)

burnt-resistor · 2026-04-10T20:11:45 1775851905

xfs on mdraid is what I use on my homelab NAS across several giant RAID arrays. While it lacks some integrity and CoW features, it's really, really stable. I had ZoL ZFS troubles that the maintainers shrugged off requiring transferring everything to another volume.. so I won't ever use or recommend ZFS unless it's Sun-Oracle.

bravetraveler · 2026-04-10T14:02:28 1775829748

As is always the case, short vs long term... but I think I'd put effort into migrating to a filesystem that is aware of duplication instead of trying to recreate one with links [while retaining duplicates, just fewer].

Effectiveness is debatable, this approach still has duplication. An insignificant amount, I'll admit. The filesystem handling this at the block level is probably less problematic/prone to rework and more efficient.

edit: Eh, ignore me. I see this is preparing for [whatever filesystem hosts chose] thanks to 'ameliaquining' below. Originally thought this was all Discourse-proper, processing data they had.

ameliaquining · 2026-04-10T14:28:40 1775831320

Discourse is self-hostable; they can't require their users to use a filesystem that supports deduplication. (Or, well, they could, but it would greatly complicate installation and maintenance and whatnot, and also there would need to be some kind of story for existing installations.)

bravetraveler · 2026-04-10T14:32:06 1775831526

Fair, I am/was confused by the hosting model and presentation. This is a nice User-preparation/consideration, I guess. I still maintain a backup filesystem unaware of duplication at the block level is a mistake.

I completely overlooked the shipping-of-tarballs. Links make sense, here. I had 'unpacked' and relatively-local data in mind. Absolutely would not go as far to suggest their scheme pick up 'zfs {send,receive}'/equivalent, lol.

ameliaquining · 2026-04-10T19:03:57 1775847837

They do also offer it as multi-tenant hosted SaaS, and the post is about their experience running backups on that. But whatever solution they use has to also work with the self-hosted version, which imposes some constraints.

bravetraveler · 2026-04-10T19:27:28 1775849248

Sweet

UltraSane · 2026-04-10T14:20:23 1775830823

This makes them look rather incompetent. Storing the exact same file 246,173 times is just stupid. Dedupe at the filesystem level and make your life easier.