
As a heavy user of ZFS and Linux, what else is there that even comes close to what ZFS offers?

I want cheap and reliable snapshots, export & import of file systems like ZFS datasets, simple compression, caching facilities (like SLOG and ARC), and decent performance.
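For concreteness, here's roughly what that wish list looks like in ZFS terms (pool, dataset, and device names are made up for illustration):

```shell
# Build a pool with an Optane SLOG and an SSD read cache (L2ARC)
zpool create tank raidz2 sda sdb sdc sdd \
    log nvme0n1 \
    cache nvme1n1

# Cheap snapshots and simple compression
zfs set compression=lz4 tank
zfs snapshot tank/home@before-upgrade

# Export/import, including replicating a dataset to another machine
zfs send tank/home@before-upgrade | ssh backup zfs receive pool/home
zpool export tank
zpool import tank
```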



Bcachefs is probably the only thing that will get there. The codebase is clean and well maintained, built from solid technology (bcache), and will include most of the ZFS niceties. I just wish more companies would sponsor the project and stop wasting money on BTRFS.


Yes, I’m eagerly waiting for Bcachefs to get there at some point, but it is several years away (rightly so, because it is hard and the developer is doing an amazing job) if my understanding of its state is correct.

I have heard of durability issues with btrfs, and do not want to touch it if it fails with its primary job.


Which is why ZFS is still a thing today - there are no other alternatives. Everything is coming "soon" while ZFS is actually here and clocking up a stability track-record.


>Bcachefs is probably the only thing that will get there.

Or Bcachefs is probably the only thing that might get there.

The amount of engineering hours that went into ZFS is insane. It is easy to get a project that has 80% similarity on the surface, but then you spend the same amount of time you spent going from 0 to 80% on the last 20% and the edge cases. ZFS has been battle tested by many. Rsync.net is on ZFS.

The amount of petabytes stored safely in ZFS over the years gives peace of mind.

Speaking of rsync, normally a ZFS topic on HN will have him resurface. Haven't seen any reply from him yet.


I’m looking forward to bcachefs becoming feature complete and upstreamed. We finally have a good chance of having a modern and reliable FS in the Linux kernel. My wish list includes snapshots and per volume encryption.


What if the main purpose of BTRFS is to have something "good enough" so no one starts working on a project that can compete with large commercial storage offerings?

Does anyone remember the parity patches they rejected in 2014?

> Your work is very very good, it just doesn’t fit our business case.

I haven't followed it much. Does it have anything more than mirroring (that's stable) these days?


>stop wasting money on BTRFS

You're saying they should stop supporting a project that was considered stable by the time the other started being developed. Why do that? What makes Bcachefs a better choice?


Take a cursory look at both codebases, and at the stability of every feature at launch and in maintenance. It's not hard to see BTRFS is a doomed project. Bcachefs is more like PostgreSQL: the developer doesn't add features until he has a solid, well-thought-out design. Hence why he hasn't implemented snapshots yet.

I don't think too many people consider it stable enough for production, either. (Unless you count a very limited subset of its functionality).

I'd rather run Bcachefs today than Btrfs, by a mile. At least with Bcachefs I won't lose my data.


If you are on BTRFS and you encounter an unrecoverable bug (which seems to be reasonably common), the developers will most likely recommend you wipe the drive and restore from backups (because you had backups, right?)

Even if the data is still on the drive and a bugfix would make the filesystem recoverable again, they don't have the time/knowledge/resources to untangle that codebase and make fixes. Even BTRFS developers don't trust the filesystem with their own data.

If you are on Bcachefs and you encounter an unrecoverable bug, the developer will ask for some logs, or reproduction steps, or potentially even remote debugging access to your corrupt filesystem.

And then he will fix the bug, releasing a new version that can read/repair your filesystem. He knows his codebase like the back of his hand.

In my research, I couldn't find any examples of someone actually losing data due to Bcachefs. All the bugs appeared to be of the "data has been written to the drive, but a bug prevented reading it back" variety.

While I would still hesitate to trust Bcachefs, I would trust it way more than BTRFS.


Just want to note that bcachefs looks great (I was sort of tangentially aware but hadn't dedicated significant attention to it).

Definitely something to try out (backing up my home servers is just about to reach viability for me, so I'd definitely consider switching to it in that use case).

Thanks!


Btrfs is the only FS I used that resulted in complete FS corruption losing nearly all data on disk, not once, but 3 times.

After that, none of the features like compression, snapshots, COW or checksums meant anything to me. I'm much happier with ext4 and xfs on lvm.


Anecdote, I know, but I have about a dozen machines with BtrFS volumes, all active with varying loads and never experienced data loss. It seems some features are more mature than others - only two of the volumes span more than one disk and none has files that are larger than a physical volume (even though one of the multi-device volumes is striped).


In the 26 years or so I have used Linux, I have had corrupted filesystems with reiserfs, XFS, btrfs, and ext[23]. In the case of reiserfs and XFS it was practically impossible to recover the filesystem (IIRC reiserfs would reattach anything that resembled a B-tree). For ext[23], it was surprisingly easy to get back most of the data. Never had any corruption with ZFS or ext4. I didn't try to fix the btrfs filesystem, since it was a machine that had to be repurposed anyway.


My experience with recovering btrfs is that you get back most of your files, but with the content replaced with random gibberish. Which is not too useful.

In a way, I would rather it bomb out and declare a total loss than to keep sinking more time into it as it leads you along.


When was it that XFS got corrupted on you? Since Red Hat has embraced XFS, I assume it's quite good now.


Somewhere between being merged in mainline and 2009.


Funny, the other day on another HN thread someone was saying btrfs is good. I said Red Hat had abandoned the btrfs ship, but then he said Facebook had been using it heavily.

But seeing how so many people had lost data using it, I will never use btrfs...


I don't think BTRFS has ever been considered stable.

I think they just said: "The on-disk data structure is stable" and lots of people misinterpreted that as "the whole thing is stable"

A stable on-disk data structure just means it's been frozen and can't be changed in non-backwards compatible ways. It says nothing about code quality, feature completeness or if the frozen data structure was any good.


The finalization of the on-disk data structure came soon after Btrfs was announced and happened before 2010. I meant that by 2010s when Bcachefs started development, Btrfs was considered a supported filesystem for "big name" server distros such as Oracle and SUSE.


Snapshots don't seem to be done yet.


Kent has admitted (many times) that snapshots are one of the more difficult features to add in a reliable and safe way, and will require significant work to do right, especially for what he wants to see them do (I assume "really damn fast and low overhead" is a major one, plus some other tricks he has up his sleeve.) So he has intentionally not tackled them yet, instead going after a slew of other features first. Reflink, full checksum, replication, caching, compression, native encryption, etc. All of that works today.

Snapshots are a huge feature for sure, but it's not like bcachefs is completely incapable without them.

There was a very recent update he gave in late December (2019) that mentioned he's actively chipping away at the roadblocks for snapshots.


They're being worked on ATM: Dec 29, 2019 "Just finished a major rework that gets us a step closer to snapshots: the btree code is incrementally being changed to handle extents like regular keys." https://www.patreon.com/posts/towards-32698961


That's exactly why I said it's probably the only one that will get there.


Heh, BTRFS deja vu. Been hearing about the ZFS alternative "not quite there, but catching up" for about as long as high-speed rail. I wonder which will arrive first :)


BTRFS is never going to become stable. Ever. Just take a quick dive into the codebase.

Bcachefs has never had an unrecoverable data error AFAIK, even though it isn't even considered stable enough to merge into the kernel. The bcachefs on-disk format won't be considered stable until he merges his code into mainline, though he doesn't expect he will need to adjust it further.

Features that currently work:

- Full data checksumming
- Compression
- Multiple device support
- Tiering/writeback caching
- RAID1/RAID10

All of these are stable, tested, and mostly bug free. Honestly, once the code gets mainlined you'll be able to start using it very quickly.
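As a sketch of what that feature set looks like in practice, option names as I understand them from bcachefs-tools (check `bcachefs format --help` before trusting any of them):

```shell
# Two-device filesystem: SSD as writeback cache tier, HDD as backing store,
# with checksumming on by default, lz4 compression, and two replicas
bcachefs format \
    --compression=lz4 \
    --replicas=2 \
    --label=ssd.ssd1 /dev/sdb \
    --label=hdd.hdd1 /dev/sdc \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
mount -t bcachefs /dev/sdb:/dev/sdc /mnt
```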

Main issue right now is performance, as it is about as slow as BTRFS, which isn't inspiring. However, the author has stated that he's going for correctness first, and then he'll begin optimizing.


I know this isn't an option for everyone, but this is part of why I run FreeBSD instead of Linux for servers where I need ZFS.


This isn't why I started running FreeBSD, but it is also one of the reasons I continue to run FreeBSD.


Yes, I run Linux for business but keep a FreeBSD machine personally, so I'm used to it in case I need ZFS for business.


I agree that ZFS has a lot to offer. But the legal difficulties in merging ZFS support into the mainline kernal are understandable. It's a shame but I think he is making the right call.


Merging into the mainline kernel is not what the person he is replying to was even asking for. All they were asking is for Linux to stop putting APIs behind DRM that prevents non-GPL modules like ZFS from using them. That doesn't mean ZFS must be bundled with Linux.

I think everyone is in agreement that ZFS can't be included in the mainline kernel. The question is just if users should be able to install and use it themselves or not.


Thanks, I should have read more into this.

The follow up actually clears things up pretty well. https://www.realworldtech.com/forum/?threadid=189711&curpost...


Kernal? If you can merge zfs support into 8KB kernal then you are not a mere mortal, so no need to worry about any legal difficulties.


XFS on an LVM thin-pool LV should give you a very robust fs, cheap CoW snapshots, and multi-device support. If you want, you can put the thin pool on RAID via LVM RAID.

For export/import, IIRC XFS has support for it (xfsdump/xfsrestore), and you can dump from LV snapshots to get atomicity.

For caching there is LVM cache, which again should be possible to combine with the thin pool & RAID. Or you can use it separately for a normal LV.

All this is functionality tested by years of production use.

For compression/deduplication, that is AFAIK work in progress upstream based on the open sourced VDO code.
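A rough sketch of that setup (device, VG, and LV names are all hypothetical):

```shell
# Pool two disks into a VG and carve a thin pool out of it
pvcreate /dev/sda /dev/sdb
vgcreate vg0 /dev/sda /dev/sdb
lvcreate --type thin-pool -L 500G -n pool0 vg0

# Thin LV with XFS on top; -V may overcommit the pool
lvcreate --type thin -V 200G --thinpool pool0 -n data vg0
mkfs.xfs /dev/vg0/data
mount /dev/vg0/data /srv/data

# Cheap CoW snapshot of the thin LV (thin snapshots need no size argument)
lvcreate -s -n data_snap vg0/data
```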


An interesting combination of tools; I have used them independently but never as a replacement for my beloved ZFS.

Never made snapshots with LVM. Always used LVM as a way to carve up logical storage from a pool of physical devices, but nothing more. I need to RTFM on how snapshotting would work there: could I restore just a few files from an hour ago while leaving everything else as it is?

With ZFS, I use RAM as a read cache (ARC) and an Optane disk as a sync write cache (SLOG). I wonder if LVM cache would let me do such a thing. Again, a pointer for more manual reading for me.

Compression is a nice to have for me at this moment. Good to know that it is being worked on at the LVM layer.


IIRC you can mount any of the snapshots and copy files from it without influencing the others or the thin LV itself. As for RAM caching, I'm not sure LVM would allow an LVM cache residing on a RAM-disk PV, but isn't the regular transparent Linux page cache for FS access actually sufficient?

For some reading about LVM thin provisioning:

http://man7.org/linux/man-pages/man7/lvmthin.7.html

https://access.redhat.com/documentation/en-us/red_hat_enterp...
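To answer the "restore a few files" question concretely, something like this should work (names hypothetical, assuming a thin snapshot vg0/data_snap of an XFS volume exists):

```shell
# Thin snapshots carry the "activation skip" flag by default; -K ignores it
lvchange -ay -K vg0/data_snap

# An XFS snapshot shares the UUID of its origin, hence -o nouuid
mount -o ro,nouuid /dev/vg0/data_snap /mnt/snap

# Copy back just the files you need; the origin LV is untouched
cp -a /mnt/snap/home/user/important-file /home/user/
umount /mnt/snap
```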


Call me when somebody like a major cloud provider has used this system to drive millions of hard drives. I'm not going to patch my data security together like that.

There is a difference between 'all these tools have been used in production' and 'this is an integrated tool that has been used for 15+ years in the biggest storage installations in the world'.


Yes! The problem with the LVM approach to replicating anything ZFS does is that you have to use a myriad of different tools. And then you have to pray that they all work correctly together; if one has a bug, you could possibly lose all your data, because of all the data corruption that may emerge from it.


Honestly asking, how does Btrfs compare to ZFS?

There's also Lustre but it's a different beast altogether for a different scenario.


On the surface, btrfs is pretty close to zfs.

Once you actually use them, you discover all the ways that btrfs is a pain and zfs is a (minor) joy:

- snapshot management

- online scrub

- data integrity

- disk management

I lost data from perfectly healthy-appearing btrfs systems twice. I've never lost data on maintained zfs systems, and I now trust a lot more data to zfs than I ever have to btrfs.


At least disk management is far easier with btrfs. You can restripe at will while zfs has severe limitations around resizing, adding and removing devices.

Granted, at enterprise scale this hardly matters because you can just send-receive to rebuild pools if you have enough spares, but for consumer-grade deployments it's a non-negligible annoyance.


Restriping is a source of unsafety, though. A lot of ZFS's data safety comes from the fact that it doesn't support overwriting anything, making it so that normal operation can't introduce unrecoverable corruption. In fact, all writes are done through snapshots.


ZFS wanted to have that too (the mythical block pointer rewrite), but it never happened; instead they added clunky workarounds like indirection tables.


It was treated more like "ok, yet another person complaining about it - here's what you need to implement, and why you won't".

The indirection tables are survivable for fixing short term mistakes, though.


Actually, this matters a lot in many enterprises. Bean counters hate excess capacity, so there are never enough spares and everything is always almost full.

Maybe SV is different...


Since the plural of anecdote is data, I'll provide mine here. ZFS is the only file-system from which I've lost data on hardware that was functioning properly, though that does come with a caveat.

Twice btrfs ended up in a non-mountable situation, but both times it was due to a known issue and #btrfs on freenode was able to walk me through getting it working again.

With ZFS, I ended up in a non-mountable system, and the responses in both #zfs and #zfsonlinux to my posting the error message were, "that sucks, hope you had backups." Since I both had backups and it was my laptop 2000 miles from home that was my only computing device, I didn't dig deeper to see if I could discover the problem. FWIW, I've been using ZFS on that same hardware for almost 2 years since with no issues.


Thanks for your answer and sorry for your data loss.

> I lost data from perfectly healthy-appearing btrfs systems twice.

I still consider btrfs as beta-level software. This is why I never looked into it very seriously and asked this question.

Looks like btrfs needs something around five more years to be considered serious at the scale where ZFS is just starting to warm up.


The one thing I can't understand about btrfs is the lack of a firm answer to the question "How much disk space do I have left?". I don't get how that can be a "this much, maybe" answer.


# btrfs filesystem usage /

Overall:

    Device size:         142.86GiB
    Device allocated:     48.05GiB
    Device unallocated:   94.81GiB
    Device missing:          0.00B
    Used:                 37.75GiB
    Free (estimated):    103.94GiB  (min: 103.94GiB)
    Data ratio:               1.00
    Metadata ratio:           1.00
    Global reserve:       82.20MiB  (used: 0.00B)


"Free (estimated)"


btrfs is such a mess that for a database or VM image to be even marginally stable, you have to disable the CoW featureset for those files with the +C attribute. It's nowhere near a serious solution.
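For reference, No_COW only takes effect on files created after the attribute is set, so the usual trick is to flag the directory before the database ever writes there (path is just an example):

```shell
# New files inherit the directory's No_COW attribute at creation time
mkdir /var/lib/mysql-nocow
chattr +C /var/lib/mysql-nocow
lsattr -d /var/lib/mysql-nocow   # the 'C' flag should now be listed
```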


Btrfs has eaten my data, and once that happens I will never, ever, literally ever go back to that system. It's unacceptable to me that a system eats data, especially after multiple rounds of 'it's stable now'.

But in the end it always turns out that it's only not going to eat your data if you 'use' it correctly.

I used ZFS for far longer and had far fewer issues.


Stratis and VDO have a lot of promise, although it's still a little early. The approach that Stratis has taken is refreshing. It's very simple and reuses lots of already existing stuff so by the time it's released it will already be mature (since the underlying code has been running for many years).

Once a little more guidance comes out about how to properly use VDO and Stratis together, I'll move my personal stuff to it.


So besides the obvious btrfs answer, what about ceph as clustered storage with very fast connectivity?

There is also BeeGFS, I haven't used it but /r/datahoarders sometimes touts it.

Not for Linux, but I have been keeping an eye on Matthew Dillon's DragonFly BSD, where he has been working on HAMMER2, which is very interesting.

I don't know much but bcachefs has been making more waves lately also.

I think the bottom line is that people need to have good backup in place regardless.


Does btrfs meet your requirements?


I've tried btrfs without much luck.

btrfs still has a write hole for RAID5/6 (the kind I primarily use) [0] and has had one since at least 2012.

For a filesystem to have a bug leading to data loss go unpatched for over 8 years is just plain unacceptable.

I've also had issues even without RAID, particularly after power outages. Not minor issues but "your filesystem is gone now, sorry" issues.

[0]: https://btrfs.wiki.kernel.org/index.php/RAID56


It's not a bug, but an unimplemented feature. They never made any promise that raid5 is production-ready.

Pretty much all software RAID systems suffer from it unless they explicitly patch over it via journaling. Hardware RAID gets away with it if it has battery backup; if it doesn't, it suffers from exactly the same problem.


... hence the desire to use ZFS, which skips trying to present a single coherent block device and performs parity at the file (chunk) level.


My home NAS runs btrfs in RAID 5. The key is to use software RAID / LVM to present a single block device to btrfs. That way you never use btrfs's screwed-up RAID 5/6 implementation.
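In other words, something like this (device names hypothetical); btrfs only ever sees a single block device, so its own RAID code is never involved:

```shell
# mdadm handles the RAID 5 parity; btrfs sits on top as a plain single-device fs
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[abcd]
mkfs.btrfs /dev/md0
mount /dev/md0 /srv/nas
```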


If you use LVM/mdadm for RAID, it's not possible for btrfs to correct checksum mismatches (i.e. protect against bitrot).


That's a good point, though Synology (my brand of NAS) claims that they've developed analogous corruption checks operating at the LVM level, so you get the benefits of btrfs (including checksum checks and RAID scrubbing) without having to actually use its RAID implementation.

https://www.synology.com/en-global/knowledgebase/DSM/help/DS...


I wasn't actually able to find any real documentation on how Synology's SHR works.

Their recovery documentation [0] indicates that SHR is just plain mdadm + LVM and a couple of NAS recovery sites [1,2] indicate the same.

In the end I got a Reddit post [3] with a response from a Synology representative who says that the btrfs filesystem will request a read from a redundant copy from mdadm in order to correct checksum errors.

I wonder whether this is unique to Synology or whether the change has been upstreamed into the main Linux kernel.

[0]: https://www.synology.com/en-global/knowledgebase/DSM/tutoria...

[1]: https://support.reclaime.com/kb/article/8-synology-shr-raid/

[2]: http://www.nas-recovery.com/kb_hybrydraid.php

[3]: https://www.reddit.com/r/DataHoarder/comments/5yb13m/anyone_...


Why use RAID5/6? RAID10 is much safer, because you drastically reduce the chance of a cascading resilvering failure. Yes, you get less capacity per drive, but drives are (relatively) cheap.

I thought I wanted RAID5, but after reading horror stories of drives failing when replacing a failed drive, I decided it just wasn't worth the risk.

I currently run RAID1, and when I need more space, I'll double my drives and set up RAID10. I don't need most of the features of ZFS, so BTRFS works for me.


I use RAID6 because it gives me highly efficient utilization of my available storage capacity while still giving me some degree of redundancy should a disk fail. My workload is also mostly sequential, so random read/write performance isn't too important to me.

If a disk fails and resilvering causes a cascading failure, I can restore from a backup.

I think you might be mistaking RAID for a backup. RAID is very much not a backup or any kind of substitute for a backup. A backup ensures durability and integrity of your data by providing an independent fallback should your primary storage fail. RAID ensures availability of your data by keeping your storage online when up to N disks fail.

RAID won't protect you from an accidental "rm -Rf /", ransomware or other malware, bugs in your software or many other common causes of data loss.

I might consider RAID10 if I were running a business-critical server where availability was paramount, or where I needed decent random read/write performance but even so I'd still want a hot-failover and a comprehensively tested backup strategy.


btrfs is not at all reliable, so if you care about your files staying working files, it probably doesn't meet your requirements. It is like the MongoDB 0.1 of filesystems.


Seems pretty reliable these days. Are you commenting based upon personal experience? If so, when was it that you used btrfs?


When it comes to file systems, "pretty reliable these days" does not sound very good. Reliability has to be a fundamental requirement in the design of a file system. If not, it sounds like putting lipstick on a pig.

Red Hat throwing in the towel on their support for its development does not instill confidence either.

Nothing personally against Btrfs. Just an end user making a file system choice saying what I care about.


re Redhat deprecating btrfs:

> People are making a bigger deal of this than it is. Since I left Red Hat in 2012 there hasn't been another engineer to pick up the work, and it is _a lot_ of work.

https://news.ycombinator.com/item?id=14909843


I have a laptop running opensuse, with root on btrfs. Twice I have had to reinstall because it managed to corrupt the file system.


btrfs + dm-cache? throw in dm-raid if you want raid5.


Hardware RAID controllers can do most if not all of these things.


I've lost more data with hardware RAID than with ZFS, but I have lost data in both.

Hardware RAID has very poor longevity. Vendor support and battery-backup replacement collide badly with BIOS and host management.

Disclaimer: I work on Dell rackmounts, which means rather than native SAS I get 'Dell's hack on SAS', which is a problem, and I know it's possible to 'downgrade' back to native.


Yeah we started ordering the ones with the supercap so we didn’t have to replace batteries anymore.

Somewhat recently I dealt with LSI and Dell cards. Longevity seemed just fine for a normal 3 year server lifecycle. The only time we had an issue is when the power went down in the data center. The power spike fried a few of the cards. Luckily we had spares.

Way, way back I dealt with the Compaq/HP SmartArrays. Those were awful. Also, anything consumer grade is awful.


The problem with most of these is you have to bring the system down to do maintenance. You can do a scrub on zfs while it's up.
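For reference, a scrub runs against the live, mounted pool (pool name made up):

```shell
# Walks every checksummed block and repairs from redundancy while online
zpool scrub tank
zpool status tank   # shows scrub progress and any corrected errors
```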


Most non-hobbyist RAID hardware does online scrub just fine (not that I would recommend wasting money on such hardware).

Btw, a ZFS scrub is not only a RAID block check but also a partial fsck, so it's not really comparable.


We used the LSI 9286CV-8e (or dell equivalent) which was somewhere between $1000-$1500 back in the day. Worth it compared to babysitting any software RAID IMO.


Pay more for less safety, and put all your data into the hands of the guy who wrote the firmware for that thing. I'm sure that software is well-maintained open source code.



