ZFS


This page is currently full of schizobabble as I work on creating it, please ignore for the time being.


Described as "the last word in filesystems" (cite), ZFS is a combined filesystem and volume manager originally developed at Sun Microsystems.

Frequently used on servers, but not typically on workstations (leads into the license issues, and BTRFS's claim to fame)


ZFS does block-level compression/redundancy vs file level as in BTRFS

Everything is built in: no layering block devices with mdadm/dm-integrity (some guy on /g/ said it's bad?) the traditional way (link to page on software RAID; should have a little diagram showing how block-device layering works, also figure out the correct terminology)

bitrot protection (every block is checksummed, so silent corruption gets detected on read)

error correction (self-healing: when a redundant copy exists, ZFS repairs the bad block automatically)

how scrubs work (a scrub walks every block in the pool, verifies it against its checksum, and repairs it from redundancy where possible)
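
As a concrete example of kicking off and checking a scrub (the pool name "zpool1" is just a placeholder here):

zpool scrub zpool1     # read every block in the pool and verify/repair it against its checksum
zpool status zpool1    # shows scrub progress and any checksum errors found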

FOSS / License Status

released by Sun under the CDDL, which is widely considered not GPL-compatible (hence no mainline Linux kernel inclusion); Oracle stopped releasing the source after acquiring Sun, and open development continues as OpenZFS

RAID Levels

RAID0 = striped vdev

RAID1 = mirror vdev

RAID10 = a bunch of mirror vdevs striped together in one pool

RAID5 = RAIDZ1 (one drive per vdev can fail)

RAID6 = RAIDZ2 (two drives per vdev can fail)

(no traditional equivalent) = RAIDZ3 (three drives per vdev can fail)


https://en.wikipedia.org/wiki/ZFS#RAID_(%22RAID-Z%22)
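
For reference, here is roughly how each of those layouts is spelled at pool creation time (a sketch only: "tank" and the sdX names are placeholders, and in practice you'd use /dev/disk/by-id paths as in the examples further down):

zpool create tank sda sdb                          # striped (RAID0), no redundancy
zpool create tank mirror sda sdb                   # mirror (RAID1)
zpool create tank mirror sda sdb mirror sdc sdd    # striped mirrors (RAID10)
zpool create tank raidz1 sda sdb sdc               # RAIDZ1 (~RAID5)
zpool create tank raidz2 sda sdb sdc sdd           # RAIDZ2 (~RAID6)
zpool create tank raidz3 sda sdb sdc sdd sde       # RAIDZ3 (triple parity)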


ZFS is pretty flexible with pool layouts; it's vdevs that aren't.

you can add/remove vdevs from a pool, but you can't add/remove disks from a vdev (well, until recently)

with four disks, for example, you'd put them into two RAID1 ("mirrored") vdevs in one single pool

ZFS abstracts storage into a few different logical levels: disks -> vdevs -> zpools -> datasets

on the LUG shell server, we have 14 raw disks in total, each in a mirrored vdev with one other disk, all in one single zpool (named "zhome"), which in turn has an individual dataset for each user's home directory (so each home directory can create/revert snapshots independently)

each dataset is then mounted at its respective path on the filesystem; in this example it's always /home/<user>
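
As a sketch of how that could be set up (the dataset and user names here are just illustrative):

zfs set mountpoint=/home zhome     # child datasets inherit /home/<name> as their mountpoint
zfs create zhome/noah              # creates the dataset and mounts it at /home/noah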

So for the plan of starting with two disks and adding two more later: put the first two into a mirrored vdev in one pool, then once you get the other two disks, simply add them as a new mirrored vdev in the same pool

one caveat: adding the second vdev later leaves the pool unbalanced, since the existing data already sits entirely on the first two disks while the new vdev is empty, so ZFS will favor the emptier vdev for new writes instead of striping evenly across both

This causes a performance penalty, but it will function just fine (and there may be a way to manually rebalance a pool that has this issue, though I haven't had to look into it)

So to start out, the pool would look like this:

DISK  |    VDEV    |  ZPOOL  | DATASET
DISK1 -> mirror-0 -|         |
DISK2 /            | zpool1 -| zpool1/


then once you add the two new disks:

DISK  |    VDEV    |  ZPOOL  | DATASET
DISK1 -> mirror-0 -|         |
DISK2 /            | zpool1 -| zpool1/
DISK3 -> mirror-1 -|         |
DISK4 /            |         |

the command to create the initial pool would be (these are /dev/disk/by-id names; substitute your own disks'):

zpool create zpool1 mirror ata-APPLE_HDD_ST500LM012_S33MJ9AF303107 ata-WDC_WD5000BPVT-22HXZT3_WD-WXF1EB0NC891

and the command to add the two new disks as a second mirror would be:

zpool add zpool1 mirror ata-WDC_WD5000BEVT-75A0RT0_WD-WX81A70N2643 ata-WDC_WD5000LPVX-75V0TT0_WXB1AC41L1DY

You can put whatever vdevs you want into a pool; for example, this is my home NAS:

   DISK   |    VDEV    |  ZPOOL  | DATASET
10TB DISK -> raidz2-0 -|         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK -> raidz2-1 -|  JBOD  -| JBOD/Data
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK /            |         |
10TB DISK -> hotspare -|         |

a RAIDZ2 vdev can lose two disks (within that vdev) without losing data, and since there are two of them striped together, the layout is effectively a RAID60

I wouldn't recommend this layout though; I plan to transfer all my data off and rebuild the pool as RAID10, because RAID60 is slooooooooooow (a RAIDZ vdev has roughly the random I/O performance of a single disk)

Steal Stevens diagram in Storage_management on LVM and adjust for HDD/vdev/zpool/dataset layers to use as visual guide



ZFS just abstracts devices a bit more than a normal RAID does

you have disks -> vdevs -> pools -> datasets, almost like a pseudo OSI model for storage

you group multiple disks into a vdev, then multiple vdevs into a pool

for example, six disks grouped into three mirrored vdevs in one pool: since the vdevs are mirrors, that's a RAID10 (a bunch of mini RAID1s striped together in one big RAID0)

the home NAS above is a RAID60 for the same reason: two RAIDZ2 ("RAID6") vdevs striped in one pool

datasets are what sit on top of a pool, and they let you set quotas and take snapshots separately


On LUG's shell for example, the zpool configuration looks like this:

    DISK    |   VDEV    |  ZPOOL  | DATASET
500GB DISK -> mirror-0 -|         | zhome/adam
500GB DISK /            |         | zhome/alhirzel
500GB DISK -> mirror-1 -|         | zhome/allen
500GB DISK /            |         | zhome/avonyx
500GB DISK -> mirror-2 -|         | zhome/chefy
500GB DISK /            |         | zhome/dane
500GB DISK -> mirror-3 -| zhome - | zhome/david
500GB DISK /            |         | zhome/dev
500GB DISK -> mirror-4 -|         | zhome/jhstiebe
500GB DISK /            |         | zhome/noah
500GB DISK -> mirror-5 -|         | zhome/ron
500GB DISK /            |         | zhome/ryan
500GB DISK -> mirror-6 -|         | zhome/saladin
500GB DISK /            |         | etc etc


RAID 10 vs RAID 01:

the difference shows up when rebuilding after a disk failure: in RAID10 (a stripe of mirrors) the replacement only needs to read from its single mirror partner, while in RAID01 (a mirror of stripes) the rebuild has to re-read the entire surviving stripe (e.g. 1 disk vs 3 disks spun up in a six-disk array)

so no one uses RAID01: more disk wear for essentially the same result


Compression

lz4 (link Backblaze testing showing lz4 is more beneficial than no compression, due to reduced disk seek times)

zstd

gzip

lzjb (the ZFS-exclusive one no one uses anymore)


can be applied per-dataset, or set on the pool's root dataset so that all child datasets inherit it
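
For example (dataset names are placeholders), compression is just a dataset property:

zfs set compression=lz4 zhome                  # inherited by all child datasets (e.g. every homedir)
zfs set compression=zstd zhome/noah            # or overridden per-dataset
zfs get compression,compressratio zhome/noah   # check what's enabled and how well it's working

Note that changing the property only affects newly written blocks; existing data keeps whatever compression it was written with.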


Below is an example from the LUG shell server, where each user's home directory is a ZFS dataset with lz4 compression enabled:

check a file's size on disk:

noah@shell:~/hnarchive $ du -h hnarchive.db
13G hnarchive.db

check the actual, uncompressed file size (useful for webservers, to tell how many bytes will go over the wire when a client downloads the file):

noah@shell:~/hnarchive $ du -h --apparent-size hnarchive.db
22G hnarchive.db

this data is very compressible because it is a database of text comments left by users on a social media platform (Hacker News); here lz4 cuts it from 22G down to 13G on disk, roughly a 1.7x compression ratio

Encryption

send/receive is kinda broken with native encryption in OpenZFS, user beware (sending from one encrypted pool to another encrypted pool with differing keys has reportedly caused corruption; reportedly no issues when sending from or receiving into an unencrypted pool)

you can encrypt the whole pool, so all future child datasets inherit encryption, or encrypt individual datasets one at a time
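
A rough sketch of both approaches (names are placeholders; note that encryption can only be chosen at creation time, not enabled on existing data):

zpool create -O encryption=on -O keyformat=passphrase zpool1 mirror sda sdb   # root dataset encrypted, children inherit it
zfs create -o encryption=on -o keyformat=passphrase zpool1/private            # or encrypt a single dataset in an otherwise plain pool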

script I use on FreeBSD to auto-decrypt and mount the root pool at boot (what it does, and why it's not needed on systemd distros)
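
The core of such a script boils down to a couple of commands (a minimal sketch, assuming passphrase-keyed datasets; systemd-based distros ship ZFS units that take care of key loading and mounting for you):

#!/bin/sh
# prompt for the passphrase(s) and load keys for every encrypted dataset, then mount everything
zfs load-key -a
zfs mount -a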

Deduplication

don't use it (for most workloads the memory cost outweighs the space savings)


ECC memory (ZFS wants..?)

the "ZFS is a memory hog" reputation is mostly a misconception; it really only applies when enabling dedupe (which you shouldn't, most of the time)

dedupe requires the deduplication table (DDT) to be kept in RAM for acceptable performance; a commonly cited guide is roughly 5GB of memory per 1TB of deduplicated data
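
If you want to know whether dedupe would even pay off before enabling it, zdb can simulate it against an existing pool (pool name is a placeholder; this walks a lot of the pool, so it takes a while):

zdb -S zpool1    # prints a simulated dedup table histogram and the estimated dedup ratio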

BTRFS apparently does offline (out-of-band) dedupe instead and doesn't have the same memory requirement

Snapshots

pretty neat

snapshots occur at the block-level (like the rest of ZFS)

take up more storage the more 'different' the dataset becomes from the time you took the snapshot

e.g. creating a snapshot of a dataset with 50G of data initially takes up zero space, but if you delete the 50G of data then the snapshot now occupies 50G of space.

Adding data works differently: if the dataset has 50G and you copy over 100G more, the snapshot still occupies essentially no extra space, because the original 50G of blocks are still shared with the live dataset (the dataset now takes up 150G in total). A snapshot only grows as the blocks it references get overwritten or deleted.
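
A few illustrative commands (the dataset and snapshot names are placeholders):

zfs snapshot zhome/noah@before-cleanup     # create a snapshot (instant, initially takes ~no space)
zfs list -rt snapshot zhome/noah           # list snapshots and how much unique space each one holds
zfs rollback zhome/noah@before-cleanup     # revert the dataset to the snapshot
zfs destroy zhome/noah@before-cleanup      # delete the snapshot, freeing its unique blocks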

Usage / Tips and Tricks / Guide

this should be the biggest section: just paste common commands (maybe steal some from tldr) along with their output, in code-block formatted text.


zfs set quota=150G zhome/$username
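
As a starting point, a handful of commonly used commands (pool/dataset names taken from the examples above):

zpool status zpool1                       # health, layout, and any read/write/checksum errors
zpool list                                # capacity and usage per pool
zfs list -o name,used,avail,mountpoint    # datasets, their usage, and where they're mounted
zfs get compressratio zhome/noah          # check a single property on a dataset
zpool scrub zpool1                        # kick off a scrub
zpool history zpool1                      # every zpool/zfs command ever run against the pool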

Performance Optimizations (/ "Tuning")

ashift: 512-byte vs 4096-byte sectors (ashift=9 vs ashift=12); it's fixed per vdev at creation time, and too small a value on 4K-sector drives hurts performance

Write/read cache: SLOG (separate intent log, a "write cache" that only helps synchronous writes) and L2ARC (a second-level read cache) (https://www.reddit.com/r/freenas/comments/70h8tf/slog_write_cache_question/dn3aau4/)
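
Sketches of what these look like in practice (device names are placeholders):

zpool create -o ashift=12 zpool1 mirror sda sdb    # force 4K sectors at pool creation (can't be changed later)
zpool add zpool1 log nvme0n1                       # add an SLOG device for synchronous writes
zpool add zpool1 cache nvme1n1                     # add an L2ARC read-cache device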


https://www.youtube.com/watch?v=l55GfAwa8RI

https://www.youtube.com/watch?v=Hjpqa_kjCOI

same presentation, but some more modern than others?

https://www.cs.utexas.edu/~dahlin/Classes/GradOS/papers/zfs_lc_preso.pdf

https://pages.cs.wisc.edu/~dusseau/Classes/CS736/CS736-F13/Lectures/1_zfs_overview.pdf

https://www.racf.bnl.gov/Facility/TechnologyMeeting/Archive/Apr-09-2007/zfs.pdf

https://www.snia.org/sites/default/orig/sdc_archives/2008_presentations/monday/JeffBonwick-BillMoore_ZFS.pdf