Any remarks/experiences about the "record size" of ZFS, maybe especially in relation to RAIDZx? I don't fully understand it.
I have a RAIDZ1 of 4 HDDs in a NAS, on which I set a recordsize of 1MB. It's currently 50% full and so far performance has been good with both big and small files.
I'll probably create a RAIDZ2/3 in the future using ~8 HDDs and I'll test various recordsizes, but I just wanted to know if anybody already has any positive/negative experiences with some combination of recordsize and RAIDZx.
The recordsize setting is a maximum; ZFS can use smaller blocks in some cases. The best value depends on the data you're writing and the ashift of the pool, so testing is best. Large recordsizes are helpful for large files (less metadata overhead).
I did read Arstechnica's article in the past but I did not feel comfortable with their results... (I'm not challenging them, I'm just not sure if they're relevant for me or not).
So, I just did a test (ashift 12, RAIDZ1 with 4× 8TB HDDs) and I got better performance in both cases with a 1MB recordsize vs. the default 128KB (all sequential I/O).
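If anyone wants to run a similar comparison, recordsize is a per-dataset property, so two datasets on the same pool can be tested side by side. A minimal sketch (pool/dataset names here are made up, adjust to your setup; write more data than you have RAM to limit ARC caching effects):

```shell
# Hypothetical pool "tank"; create one dataset per recordsize to compare.
zfs create -o recordsize=128K tank/rs128k
zfs create -o recordsize=1M   tank/rs1m

# Simple sequential write test (~16 GiB each; fdatasync forces the data
# to disk before dd reports its throughput).
dd if=/dev/zero of=/tank/rs128k/testfile bs=1M count=16384 conv=fdatasync
dd if=/dev/zero of=/tank/rs1m/testfile   bs=1M count=16384 conv=fdatasync
```

/dev/zero compresses perfectly, so disable compression on the test datasets (or use a random-data file as the source) if you want numbers representative of real workloads.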
> Maybe a small recordsize can have some benefits when overwriting parts of the files...mmmhhh...?
Right. People who've done more testing than me reckon 16KB is a good recordsize for transaction-processing database work, where tables see lots of small inserts and updates. (You might think matching the database's block size would be ideal, e.g. Postgres writes 8KB pages, but the rationale here is that you tend to get better compression at a 16KB recordsize than at 8KB, and that benefit outweighs the extra write amplification.)
But if database update performance isn't a big deal for you then you can probably just ignore this.
I've not done any testing of my own at the 1MB size, but I don't think I'd be inclined to try it unless I was fairly confident that there weren't going to be many small writes to big files.
In short: use a large recordsize where you think you've got a good case for it, and likewise a small one. Otherwise, just stick with the default.
Yeah, in my case the DBs "ClickHouse" and "MariaDB+MyRocks" might fit the 1MB case well (as they both never update existing files but keep writing new files, not just for inserts but for updates as well; ClickHouse barely supports update/delete anyway, heh).
On the other hand "PostgreSQL" and maybe "MariaDB+TokuDB" as well might need a small recordsize -> I'll have to test it. And anyway, giving each single DB its own dataset seems like a great idea :)
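The per-database split could look something like this (dataset names and the exact values are just illustrative assumptions, not recommendations; test against your own workload):

```shell
# Hypothetical layout: one dataset per database engine, each with a
# recordsize matched to its write pattern.
zfs create -o recordsize=1M  tank/db/clickhouse  # append-only data parts
zfs create -o recordsize=1M  tank/db/myrocks     # LSM: sequential SST writes
zfs create -o recordsize=16K tank/db/postgres    # 8K pages; 16K for compression
```

Since recordsize only applies to newly written blocks, set it before loading data; changing it later won't rewrite existing files.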
In some cases -- when the file is smaller.
For small files ZFS uses the smallest possible block size that can accommodate the file. Once the file grows beyond the recordsize (maximum block size), it uses recordsize-sized blocks.
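That behavior is easy to see on a live system: recordsize is a ceiling, not a fixed allocation, so a tiny file on a 1M-recordsize dataset still occupies only a small block (paths here assume the hypothetical dataset from earlier in the thread):

```shell
# A few bytes written to a dataset with recordsize=1M...
echo "small" > /tank/rs1m/tiny.txt

# ...still only allocates a small block on disk (size depends on
# ashift/compression), nowhere near 1M.
du -h /tank/rs1m/tiny.txt
```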
Thx