
Sounds like `git annex` is file-level deduplication, whereas this tool is block-level, but with some intelligent, context-specific way of defining how to split up the data (i.e. Content-Defined Chunking). For data management/versioning, that's usually a big difference.


XetHub Co-founder here. Yes, one illustrative example of the difference is:

Imagine you have a 500MB file (lastmonth.csv) where every day 1MB is changed.

With file-based deduplication, 500MB will be uploaded every day, and all clones of the repo will need to download 500MB.

With block-based deduplication, only roughly the 1MB that changed is uploaded and downloaded.
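To make that concrete, here is a minimal Python sketch of content-defined chunking with a gear-style rolling hash. This is a generic illustration, not XetHub's actual implementation; the chunk sizes and names are made up. Because the boundary test depends only on nearby bytes, an edit in the middle of a file only disturbs the chunks it touches, and every other chunk hashes to the same value and never needs to be re-uploaded:

    import hashlib, random

    _rng = random.Random(0)
    GEAR = [_rng.getrandbits(64) for _ in range(256)]  # per-byte random values for the rolling hash

    BOUNDARY_MASK = (1 << 13) - 1  # cut when the low 13 bits are zero -> ~8 KiB average chunk
    MIN_CHUNK = 2 * 1024           # avoid pathologically small chunks
    MAX_CHUNK = 64 * 1024          # force a cut eventually

    def chunks(data: bytes):
        """Yield content-defined chunks of `data`."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            # Gear-style rolling hash: older bytes shift out of the 64-bit word,
            # so the boundary decision depends only on recent content.
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
            length = i + 1 - start
            if ((h & BOUNDARY_MASK) == 0 and length >= MIN_CHUNK) or length >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def bytes_to_upload(data: bytes, store: dict) -> int:
        """Pretend-upload `data`; only chunks whose hash is not in `store` count as new."""
        new = 0
        for chunk in chunks(data):
            key = hashlib.sha256(chunk).digest()
            if key not in store:
                store[key] = len(chunk)
                new += len(chunk)
        return new

Feeding yesterday's lastmonth.csv and then today's version through bytes_to_upload with the same store dict should show the second call returning roughly the size of the edited region plus a chunk or two of overhead, not the full 500MB.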


I combine git-annex with the bup special remote[1], which lets me still externalize big files while benefiting from block-level deduplication. Or, depending on your needs, you can just use a tool like bup[2] or borg directly. Bup actually uses the git pack file format and git metadata.

I actually wrote a script, which I'm happy to share, that makes this much easier and even lets you mount your bup repo over .git/annex/objects for direct access.

[1]: https://git-annex.branchable.com/walkthrough/using_bup/

[2]: https://github.com/bup/bup


Have you tested this out with Unreal Engine blueprint files? If you all can do block-based diffing on those and other binary assets used in game development, it'd be huge.

I have a couple of ~1TB repositories I've had the misfortune of working with in Perforce in the past.


Last time I used Perforce in anger it did pretty decently with an ~800GB repo (checkout + history).

I keep expecting someone to come along and dethrone it, but as far as I can tell it hasn't been done yet. The combination of specific filetree views, drop-in proxies, and a UI-forward, checkout-based workflow that works well with unmergeable binary assets still leaves Git LFS and other solutions in the dust.

+1 on testing this against a moderate-size gamedev repo; that usually has some of the harder constraints, where code + assets can be coupled and the art portion of a sync can easily top a couple hundred GB.


1TB of checkout is the kind of repo I'm talking about; I have two such repos checked out on this box currently. I'm not sure I've ever checked out a repo of this scale locally with history. I'd love to have the local history.


Not yet. Would be happy to try - can you point me to a project to use?

Do you have a repo you could try us out with?

We have tried a couple of Unity projects (41% smaller due to deduplication) but not much from Unreal projects yet.


Most of my examples of that size are AAA game source that I can't share. However, I think this Unreal-based project uses similar files and should show whether there is any benefit: https://github.com/CesiumGS/cesium-unreal-samples, where the .umap binaries have been updated. And in this example the .uasset blueprints have been updated: https://github.com/renhaiyizhigou/Unreal-Blueprint-Project


Does that work equally well whether the changes are primarily row-based or primarily column-based?


HashBackup author here. Your question is (I think) about how well block-based dedup functions on a database, whether rows are changed or columns are changed. This answer describes how most block-based dedup software, including HashBackup, works.

Block-based dedup can be done either with fixed block sizes or variable block sizes. For a database with fixed page sizes, a fixed block size matching the page size is most efficient. For a database with variable page sizes, a variable block size will work better, assuming the dedup "chunking" algorithm is fine-grained enough to detect the database page size. For example, if the db used a 4-6K variable page size and the dedup algo used a 1M variable block size, it could not save just the single modified db page; it would instead save on the order of 200 db pages surrounding the modified page.

Your column vs row question depends on how the db stores data, whether key fields are changed, etc. The main dedup efficiency criteria are whether the changes are physically clustered together in the file or whether they are dispersed throughout the file, and how fine-grained the dedup block detection algorithm is.
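As a rough sketch of the fixed-block case (not HashBackup's actual code; the page size and function names are assumptions): if the dedup block size matches the database page size, comparing per-page hashes shows exactly which pages, and therefore which blocks, have to be stored again.

    import hashlib

    PAGE_SIZE = 4096  # assumed: dedup block size chosen to match the db page size

    def changed_pages(old: bytes, new: bytes):
        """Return the page numbers whose contents differ between two file versions.

        With page-aligned fixed blocks, only these pages become new dedup blocks.
        Assumes both versions are the same length, for brevity.
        """
        def page_hashes(data: bytes):
            return [hashlib.sha256(data[i:i + PAGE_SIZE]).digest()
                    for i in range(0, len(data), PAGE_SIZE)]
        return [n for n, (a, b) in enumerate(zip(page_hashes(old), page_hashes(new))) if a != b]

This is where the clustered-vs-dispersed point bites: an update that touches a column in every row dirties almost every page, so nearly every block hash changes and dedup saves little, while updates confined to a few pages produce only a few new blocks.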


Yes, see this for more details on how XetHub deduplication works: https://xethub.com/assets/docs/xet-specifics/how-xet-dedupli...


"Sounds like `git annex` is file-level deduplication, whereas this tool is block-level ..."

I am not a user of git annex but I do know that it works perfectly with an rsync.net account as a target:

https://git-annex.branchable.com/forum/making_good_use_of_my...

... which means that you could do a dumb mirror of your repo(s) - perhaps just using rsync - and then let the ZFS snapshots handle the versioning/rotation, which would give you the benefits of block-level diffs.

One additional benefit, beyond more efficient block-level diffs, is that the ZFS snapshots are immutable/read-only, as opposed to your git- or git-annex-produced versions, which could be destroyed by Mallory ...


> let the ZFS snapshots handle the versioning/rotation which would give you the benefits of block level diffs

Can you explain this a bit? I don't know anything about ZFS, but it sounds as though it creates snapshots based on block level differences? Maybe a git-annex backend could be written to take advantage of that -- I don't know.


ZFS does snapshots (very lightweight and quick), and separately it can do deduplication. It has a lot of nice features; I'd recommend looking into it if you find it interesting. It's quite practical these days (I think it even comes with Ubuntu) and it's saved my butt a time or two.


No, that is not correct: git-annex uses a variety of special remotes[2], some of which support deduplication, as mentioned in another comment[1].

When you have checked something out and fetched it, it consumes space on disk, but that is true of git-lfs and most other tools like it. It does NOT consume any space in the git object files.

I regularly use a git-annex repo that contains about 60G of files, which I can use with GitHub or any git host; it uses about 6G in its annex and 1M in the actual git repo itself. I chain git-annex to an internal bup repo, so I can keep track of file locations and benefit from dedup.

I honestly have not found anything that comes close to the versatility of git-annex.

[1]: https://news.ycombinator.com/item?id=33976418

[2]: https://git-annex.branchable.com/special_remotes/


If git-annex stores large files uncompressed, you could use filesystem block-level deduplication in combination with it.


Can you be more specific here? Very interested.


There are filesystems that support inline or post-process deduplication. btrfs[1] and zfs[2] come to mind as free ones, but there are also commercial ones like WAFL etc.

It's always a tradeoff. Deduplication is a CPU-heavy process, and if it's done inline it is also memory-heavy, so you're basically trading CPU and memory for storage space. Whether it's worth it heavily depends on the use case (and the particular FS / deduplication implementation); a rough sketch of where the memory goes follows the links below.

[1]: https://btrfs.wiki.kernel.org/index.php/Deduplication

[2]: https://docs.oracle.com/cd/E36784_01/html/E39134/fsdedup-1.h...
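To make the memory cost concrete, here is a toy Python sketch of an inline-dedup write path. It is an illustration under assumed names and block sizes, not how btrfs, ZFS, or WAFL actually implement it: every incoming block is hashed and looked up in an index of all blocks already stored, and that index is what wants to sit in RAM.

    import hashlib

    BLOCK = 128 * 1024  # assumed dedup block size

    class InlineDedupStore:
        """Toy inline-dedup write path."""

        def __init__(self):
            self.index = {}   # sha256 digest -> block id; one entry per unique block (the RAM cost)
            self.blocks = []  # stand-in for on-disk block storage

        def write(self, data: bytes) -> list:
            """Store `data`, returning block ids; duplicate blocks are only referenced."""
            refs = []
            for off in range(0, len(data), BLOCK):
                block = data[off:off + BLOCK]
                digest = hashlib.sha256(block).digest()  # the CPU cost, paid on every write
                block_id = self.index.get(digest)
                if block_id is None:                     # unseen block: actually store it
                    block_id = len(self.blocks)
                    self.blocks.append(block)
                    self.index[digest] = block_id
                refs.append(block_id)                    # duplicates just add a reference
            return refs

The index grows with the number of unique blocks in the pool, which is why inline dedup on large pools needs a lot of RAM, while post-process dedup can scan the pool at its leisure and trade time for memory instead.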


One problem is if you need to support Windows clients. Microsoft charges $1600 for deduplication support or something like that: https://learn.microsoft.com/en-us/windows-server/storage/dat...


Deduplication is included with every version and edition of Windows Server since 2012. You need to license Windows Server properly, of course, but there is no add-on cost for deduplication.


there exists an open-source btrfs filesystem driver for Windows...


Yeah, which is great for storage but doesn't help over the wire.


ZFS at least supports sending a deduplicated stream.


Right, and btrfs can send a compressed stream as well, but we aren't sending raw filesystem data via VCS.


zbackup is a great block-level deduplication trick.



