This page is currently full of schizobabble as I work on creating it, please ignore for the time being.
Described as "the last word in filesystems" (cite), ZFS is a combined filesystem and volume manager with built-in checksumming, redundancy, compression, and snapshots
Frequently used on servers, not typically on workstations (leads into the license issues, and BTRFS's claim to fame)
ZFS does block-level compression/redundancy vs file level as in BTRFS
Everything is built in: no layering filesystems on top of mdadm/dm-integrity (some guy on /g/ said dm-integrity is bad?) like the traditional way (link to the page on software RAID, should have a little diagram showing how block-device layering works, also figure out the correct terminology)
bitrot protection: every block is checksummed, so silent corruption gets detected on read
error correcting: with mirror or RAIDZ redundancy, a bad block is automatically repaired from a good copy ("self-healing")
how scrubs work: a scrub reads every block in the pool, verifies it against its checksum, and repairs anything it can (see the sketch below)
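rough sketch of the scrub workflow (pool name "zpool1" is just a placeholder):
# kick off a scrub, which reads every block in the pool and verifies checksums
zpool scrub zpool1
# check progress and whether any errors were found/repaired
zpool status zpool1
# stop a running scrub if it's hammering the disks at a bad time
zpool scrub -s zpool1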
FOSS / License Status
CDDL-licensed (written at Sun, now owned by Oracle); the CDDL is considered incompatible with the GPL, which is why OpenZFS lives out-of-tree on Linux instead of in the mainline kernel
RAID Levels
RAID0 = striped
RAID1 = mirror
RAID10 = a stripe across a bunch of mirrors
RAID5 = RAIDZ1 (1 drive per vdev can fail)
RAID6 = RAIDZ2 (2 drives per vdev can fail)
(no traditional equivalent) = RAIDZ3 (3 drives per vdev can fail)
https://en.wikipedia.org/wiki/ZFS#RAID_(%22RAID-Z%22)
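for reference, the vdev types above map onto zpool create roughly like this (pool name "tank" and the short disk names are placeholders; real setups usually use the /dev/disk/by-id names shown further down):
# striped pool (RAID0): just list the disks with no vdev keyword
zpool create tank sda sdb
# mirrored vdev (RAID1)
zpool create tank mirror sda sdb
# single, double, and triple parity
zpool create tank raidz1 sda sdb sdc
zpool create tank raidz2 sda sdb sdc sdd
zpool create tank raidz3 sda sdb sdc sdd sde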
ZFS is pretty flexible with pool layouts; it's the vdevs that aren't
you can add/remove vdevs from a pool, but you can't add/remove disks from a vdev (well, until recently)
for four disks, you'd put them into two RAID1 ("mirrored") vdevs in one single pool
ZFS abstracts storage into a few logical levels: disks -> vdevs -> zpools -> datasets
on the LUG shell server, for example, there are 14 raw disks, each in a mirrored vdev with one other disk, all in a single zpool (named "zhome"), which in turn has an individual dataset for each user's homedir (so each homedir can create/revert snapshots independently)
each dataset is then mounted at its respective path on the filesystem; in this example that's always /home/<user>
So for the plan with the disks: put the first two into a mirrored vdev in one pool, then later, once you get the other two disks, simply add them as a new mirrored vdev in the same pool
expanding a pool like this isn't really recommended, since by then the existing data will already sit on the first two disks, so the new vdev has more free space and new writes will be shifted onto it more often
this causes a performance penalty but it will function just fine (and there may be a way to manually rebalance/resilver a pool that has this issue, though I haven't had to look into it)
So for your pool, it'd probably look like this to start out:
DISK | VDEV | ZPOOL | DATASET
DISK1 -> mirror-0 -| |
DISK2 / | zpool1 -| zpool1/
then once you add the two new disks:
DISK | VDEV | ZPOOL | DATASET
DISK1 -> mirror-0 -| |
DISK2 / | zpool1 -| zpool1/
DISK3 -> mirror-1 -| |
DISK4 / | |
the command to create the initial pool would be (substituting your own disks' /dev/disk/by-id names):
zpool create zpool1 mirror ata-APPLE_HDD_ST500LM012_S33MJ9AF303107 ata-WDC_WD5000BPVT-22HXZT3_WD-WXF1EB0NC891
and the command to add the two new disks later would be:
zpool add zpool1 mirror ata-WDC_WD5000BEVT-75A0RT0_WD-WX81A70N2643 ata-WDC_WD5000LPVX-75V0TT0_WXB1AC41L1DY
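either way, the resulting layout can be sanity-checked with zpool status, which prints the same disk -> vdev -> pool tree as the tables above:
# show the vdev tree and the health of every disk in it
zpool status zpool1
# show overall capacity and usage
zpool list zpool1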
You can put whatever vdevs you want into a pool, for example this is my home NAS:
DISK | VDEV | ZPOOL | DATASET
10TB DISK -> raidz2-0 -| |
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK -> raidz2-1 -| JBOD -| JBOD/Data
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK / | |
10TB DISK -> hotspare -| |
a RAIDZ2 vdev means two disks in that vdev can fail without losing data, and since I have two of them it's effectively a RAID60
I wouldn't recommend it though; I plan to transfer all my data off and rebuild the pool as RAID10, because RAID60 is slow for random I/O (each RAIDZ vdev only delivers roughly one disk's worth of IOPS, while a pool of mirrors scales much better)
Steal Stevens diagram in Storage_management on LVM and adjust for HDD/vdev/zpool/dataset layers to use as visual guide
ZFS just abstracts devices a bit more than a normal RAID
so you have disks->vdevs->pools->datasets
almost like a pseudo OSI-model for storage
you can group multiple disks into a vdev, then multiple vdevs into a pool
so for example, 6 disks grouped into 3 vdevs, in one pool
and if those vdevs are mirrors, that makes it RAID10 (a bunch of mini-RAID1s in one big RAID0)
the setup on my home NAS is RAID60, i.e. two RAIDZ2 (RAID6-equivalent) vdevs in one pool
and datasets are just what sits on top of a pool; they let you set quotas and take snapshots independently
On LUG's shell for example, the zpool configuration looks like this:
DISK | VDEV | ZPOOL | DATASET
500GB DISK -> mirror-0 -| | zhome/adam
500GB DISK / | | zhome/alhirzel
500GB DISK -> mirror-1 -| | zhome/allen
500GB DISK / | | zhome/avonyx
500GB DISK -> mirror-2 -| | zhome/chefy
500GB DISK / | | zhome/dane
500GB DISK -> mirror-3 -| zhome - | zhome/david
500GB DISK / | | zhome/dev
500GB DISK -> mirror-4 -| | zhome/jhstiebe
500GB DISK / | | zhome/noah
500GB DISK -> mirror-5 -| | zhome/ron
500GB DISK / | | zhome/ryan
500GB DISK -> mirror-6 -| | zhome/saladin
500GB DISK / | | etc etc
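adding a new user on a layout like this is just one more dataset (the username "newuser" is a placeholder):
# create the dataset and mount it under /home
zfs create -o mountpoint=/home/newuser zhome/newuser
# it inherits the parent's properties (compression etc.) unless overridden
zfs get compression zhome/newuser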
RAID 10 vs RAID 01:
due to how many disks need to be spun up to rebuild a failed drive: 1 surviving mirror partner with RAID10, vs the entire other stripe (e.g. 3 disks in a 6-disk array) with RAID01
basically no one uses RAID01; it means more disk wear during rebuilds for essentially the same capacity and performance
Compression
lz4 (link the Backblaze testing showing lz4 is more beneficial than no compression due to reduced disk seek times)
zstd
gzip
lzjb, the ZFS-exclusive one no one uses anymore
can be applied at dataset or pool level
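turning it on is just a property set; only data written after enabling gets compressed (dataset name "zpool1/data" is a placeholder):
# enable lz4 on a single dataset
zfs set compression=lz4 zpool1/data
# or set it on the pool's root dataset so child datasets inherit it
zfs set compression=lz4 zpool1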
below example from the LUG Shell server, where each user's home directory is a ZFS dataset with lz4 compression enabled:
check a file's size on disk:
noah@shell:~/hnarchive $ du -h hnarchive.db
13G hnarchive.db
check the apparent (uncompressed) file size, which is useful e.g. for webservers to tell how many bytes will go over the wire when a client downloads the file:
noah@shell:~/hnarchive $ du -h --apparent-size hnarchive.db
22G hnarchive.db
this data is very compressible because it is a database of text comments left by users on a social media platform (Hacker News)
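ZFS also tracks the ratio itself, so instead of comparing du outputs you can just ask for it (using the zhome/noah dataset from the table above):
# report the achieved compression ratio for the dataset
zfs get compressratio zhome/noah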
Encryption
send/receive is kinda broken with native encryption in OpenZFS, user beware (send/receive from one encrypted pool to another encrypted pool with differing keys has caused corruption; reportedly no issues when sending from or receiving to an unencrypted pool)
encrypt the whole pool (and thus all future child datasets, via inheritance), or encrypt individual datasets one by one
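a minimal sketch of both approaches (pool/dataset names and the passphrase key format are just example choices):
# encrypt from the start: -O sets the property on the pool's root dataset, so children inherit it
zpool create -O encryption=on -O keyformat=passphrase zpool1 mirror sda sdb
# or encrypt just one dataset inside an otherwise-unencrypted pool
zfs create -o encryption=on -o keyformat=passphrase zpool1/secrets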
script I use on FreeBSD to auto-decrypt and mount the root pool (explain what it does, and why it's not needed with systemd)
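(not that script, but the gist of what such a script has to do is just load the keys and mount, roughly:)
#!/bin/sh
# prompt for / load the keys of every encrypted dataset, then mount everything
zfs load-key -a
zfs mount -a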
Deduplication
don't use it (almost never worth it)
ECC memory (often recommended for ZFS, but not a hard requirement)
"memory hog" is mostly a misconception (only really an issue when enabling dedupe, and you shouldn't most of the time)
dedupe requires the dedup table (DDT) to be kept in RAM for decent performance; roughly 1GB of memory per 1TB of stored data is the usual guideline
BTRFS apparently does offline/out-of-band dedupe instead, which doesn't have the same memory requirement
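if you're curious whether dedupe would even pay off, you can simulate it without enabling anything (pool name is a placeholder; this walks the whole pool, so it can take a while):
# print a simulated dedup table histogram and estimated dedup ratio
zdb -S zpool1
# and if you really do want it anyway, it's a per-dataset property
zfs set dedup=on zpool1/data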
Snapshots
pretty neat: because ZFS is copy-on-write, taking a snapshot is instant and initially free
snapshots occur at the block-level (like the rest of ZFS)
take up more storage the more 'different' the dataset becomes from the time you took the snapshot
e.g. creating a snapshot of a dataset with 50G of data initially takes up zero extra space, but if you then delete (or overwrite) that 50G, the snapshot now pins 50G of space on its own, because the old blocks can't actually be freed until the snapshot is destroyed.
Data written after the snapshot was taken isn't referenced by it, so new writes just grow the live dataset as usual on top of whatever space the snapshot is pinning.
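the basic snapshot workflow, using the zhome/noah dataset from the LUG shell table as an example (the snapshot name is arbitrary):
# take a snapshot of one user's homedir dataset
zfs snapshot zhome/noah@before-cleanup
# list snapshots and how much unique space each one pins
zfs list -t snapshot
# roll the dataset back to the snapshot (throws away changes made after it)
zfs rollback zhome/noah@before-cleanup
# or destroy the snapshot to release the space it pins
zfs destroy zhome/noah@before-cleanup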
Usage / Tips and Tricks / Guide
should be the biggest section: paste common commands (maybe steal some from tldr) along with their output, as code-block formatted text; a starter list follows below.
zfs set quota=150G zhome/$username
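starter list of everyday commands to expand on later (pool/dataset names are placeholders; real output should get pasted in eventually):
# pool health, vdev tree, and per-disk read/write/checksum error counters
zpool status
# capacity and usage per pool
zpool list
# usage, free space, and mountpoints per dataset
zfs list
# create a dataset and cap its size
zfs create zhome/newuser
zfs set quota=150G zhome/newuser
# show every property (compression, quota, mountpoint, ...) set on a dataset
zfs get all zhome/newuser
# snapshot and roll back
zfs snapshot zhome/newuser@nightly
zfs rollback zhome/newuser@nightly
# start a scrub and check on it
zpool scrub zhome
zpool status zhome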
Performance Optimizations (/ "Tuning")
ashift: 512-byte vs 4096-byte sector alignment (ashift=9 vs ashift=12); it's set per-vdev at creation time and can't be changed afterwards, and getting it wrong on 4K-sector drives hurts performance
Write cache (SLOG) / read cache (L2ARC) (https://www.reddit.com/r/freenas/comments/70h8tf/slog_write_cache_question/dn3aau4/)
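rough examples of those knobs (device names are placeholders; a SLOG only speeds up synchronous writes, and an L2ARC only helps reads that don't fit in RAM):
# ashift=12 forces 4096-byte alignment; set it at pool/vdev creation, it can't be changed later
zpool create -o ashift=12 zpool1 mirror sda sdb
# add a fast device as a separate intent log (SLOG) for synchronous writes
zpool add zpool1 log nvme0n1
# add a fast device as an L2ARC read cache
zpool add zpool1 cache nvme1n1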
https://www.youtube.com/watch?v=l55GfAwa8RI
https://www.youtube.com/watch?v=Hjpqa_kjCOI
same presentation, but some more modern than others?
https://www.cs.utexas.edu/~dahlin/Classes/GradOS/papers/zfs_lc_preso.pdf
https://pages.cs.wisc.edu/~dusseau/Classes/CS736/CS736-F13/Lectures/1_zfs_overview.pdf
https://www.racf.bnl.gov/Facility/TechnologyMeeting/Archive/Apr-09-2007/zfs.pdf