Hacker News

From the article:

> [W]e shipped an optimization. Detect duplicate files by their content hash, use hardlinks instead of downloading each copy.
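The optimization described in the quote might look something like this, a minimal sketch (function names are mine, not from the article):

```python
import hashlib
import os

def file_hash(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_with_hardlinks(paths):
    """Replace duplicate files with hardlinks to the first copy seen."""
    seen = {}  # content hash -> canonical path
    for path in paths:
        digest = file_hash(path)
        if digest in seen:
            os.remove(path)
            os.link(seen[digest], path)  # hardlink to the canonical copy
        else:
            seen[digest] = path
```

After this runs, duplicates share one inode, so the content is stored once per filesystem rather than once per path.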




I meant TRANSPARENT filesystem-level dedupe; they are doing it at the application level. Filesystem-level dedupe makes it impossible to store the same file more than once, and it doesn't consume hardlinks for the references. It is really awesome.

Filesystem/file level dedupe is for suckers. =D

If the greatest filesystem in the world were a living being, it would be our God. That filesystem, of course, is ZFS.

Handles this correctly:

https://www.truenas.com/docs/references/zfsdeduplication/


I was talking about block level dedupe.

I thought you might be.

I just wanted to mention ZFS.

Have I mentioned how great ZFS is yet?


ZFS is great! However, it's too complicated for most Linux server use cases (especially with just one block device attached); it's not the default (root filesystem); and it's not supported for at least one major enterprise Linux distro family.


Filesystem dedupe is expensive: it requires another hash calculation that cannot be shared with application-level hashing, it's a relatively rare OS/filesystem feature, it doesn't play nice with backups (deduplicated files are re-duplicated when copied off the filesystem), and it doesn't scale across boxes.

A simpler solution is application-level dedupe that doesn't require fs-specific features. Simple scales and wins. And plays nice with backups.

Hash = SHA-256 of the file, and abs filename = {aa}/{bb}/{cc}/{d}, where:

aa = the 2 most significant hex digits of the hash

bb = the next 2 hex digits

cc = the 2 hex digits after that

d = the remaining hex digits
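The layout above can be sketched in a few lines (a hypothetical helper; the function name is mine):

```python
import hashlib

def content_path(data: bytes) -> str:
    """Map file contents to a sharded storage path: aa/bb/cc/<rest>."""
    digest = hashlib.sha256(data).hexdigest()  # 64 hex digits
    aa, bb, cc, d = digest[:2], digest[2:4], digest[4:6], digest[6:]
    return f"{aa}/{bb}/{cc}/{d}"
```

Identical contents always map to the same path, so a write can simply check whether the path already exists; the three two-digit levels keep any one directory from holding millions of entries.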


All good backup software should be able to do deduplicated incremental backups at the block level. I'm used to Veeam and Commvault.

That costs even more unreusable time and effort. It's simpler to dedupe at the application level than to shift the burden onto N other things. I guess you don't understand or appreciate simplicity.

This article shows it really isn't that simple and is easy to mess up. Who cares if your storage and backup software both dedupe?

For ZFS, at least, `zfs send` is the backup solution. And it performs incremental backups with the `-i` argument.

zfs send is really awesome when combined with dedupe and incremental sends.
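For reference, an incremental snapshot-based backup with `zfs send` might look like this (pool, dataset, and snapshot names are placeholders):

```shell
# Take snapshots at two points in time
zfs snapshot tank/data@monday
zfs snapshot tank/data@tuesday

# Full send of the first snapshot to a backup pool
zfs send tank/data@monday | zfs receive backup/data

# Incremental send: only the blocks that changed between the two snapshots
zfs send -i tank/data@monday tank/data@tuesday | zfs receive backup/data
```

The stream can also be piped over ssh to a remote host instead of a local pool, which is the usual off-box backup setup.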


