
Solaris ZFS


ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004. For a humorous introduction to ZFS's features, see the presentation given by Pawel at EuroBSDCon 2007: http://youtube.com/watch?v=o3TGM0T1CvE.

When ZFS first got started, the outlook for file systems in Solaris was rather dim. UFS was already nearing the end of its usefulness in terms of file system size and performance. Many Solaris enterprise customers paid substantial sums of money to Veritas to run VxFS instead. In a strange way VxFS had become the de facto enterprise filesystem for Solaris, and most job announcements for Solaris sysadmin positions listed knowledge of it as a prerequisite.

Solaris needed a new file system. Jeff Bonwick decided to solve the problem and started the ZFS project with the organizing metaphor of the virtual memory subsystem: why can't disk be as easy to administer and use as memory? ZFS was based in part on NetApp's successful Write Anywhere File Layout (WAFL) system. It has evolved quite far from WAFL and now differs from it in many ways; a published comparison table lists some of the differences (please read the blog replies, which correct some errors in that table).

The central on-disk data structure was the slab - a chunk of disk divided up into the same size blocks, like that in the SLAB kernel memory allocator, which he also created. Instead of extents, ZFS would use one block pointer per block, but each object would use a different block size - e.g., 512 bytes, or 128KB - depending on the size of the object. Block addresses would be translated through a virtual-memory-like mechanism, so that blocks could be relocated without the knowledge of upper layers. All file system data and metadata would be kept in objects. And all changes to the file system would be described in terms of changes to objects, which would be written in a copy-on-write fashion.

ZFS organizes everything on disk into a tree of block pointers, with different block sizes depending on the object size. ZFS checksums and reference-counts variable-sized blocks. It writes out changes to disk using copy-on-write: extents or blocks in use are never overwritten in place; they are always copied somewhere else first.

When it was released with Solaris 10 ZFS broke records in scalability, reliability, and flexibility. 

Although performance is not usually cited as a ZFS advantage, ZFS is faster than most users realize, especially in environments that involve large numbers of "typical" files smaller than 5-10 megabytes. The native integration of a volume manager in ZFS is also quite interesting. That, together with copy-on-write semantics, provides snapshots, which are really important for some applications and for security.

ZFS is one of the few Unix filesystems that can go neck and neck with Microsoft NTFS in performance. Among its important features are pooled storage, end-to-end checksumming, copy-on-write snapshots and clones, and integrated RAID.

ZFS is also supported in FreeBSD and Mac OS X (Leopard). There are rumors that Apple may be preparing to adopt it as the default filesystem, replacing the aging HFS+, in the future.

A good overview is available in the BigAdmin feature article "ZFS Overview and Guide."

ZFS organizes physical devices into logical pools called storage pools. Both individual disks and array logical unit numbers (LUNs) visible to the operating system may be included in a ZFS pool.

...Storage pools can be sets of disks striped together with no redundancy (RAID 0), mirrored disks (RAID 1), striped mirror sets (RAID 1 + 0), or striped with parity (RAID Z). Additional disks can be added to pools at any time but they must be added with the same RAID level. For example, if a pool is configured with RAID 1, disks may be added only to the pool in mirrored sets in the same number as was used when the pool was created. As disks are added to pools, the additional storage is automatically used from that point forward.

Note: Adding disks to a pool causes data to be written to the new disks as writes are performed on the pool. Existing data is not redistributed automatically, but is redistributed when modified.
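
For instance, here is a minimal sketch (the pool name data and the device names c6t0d0 and c6t1d0 are made up for illustration) of growing a mirrored pool by one more mirrored pair and confirming the new capacity:

# zpool add data mirror c6t0d0 c6t1d0   # attach another mirrored pair as a new top-level vdev
# zpool list data                       # the extra space is available immediately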

Several issues should be considered when organizing disks into pools.

Note: RAID-Z is a special implementation of RAID-5 for ZFS allowing stripe sets to be more easily expanded with higher performance and availability.

Storage pools perform better as more disks are included. Include as many disks in each pool as possible and build multiple file systems on each pool.

ZFS File System

ZFS offers a POSIX-compliant file system interface to the operating system. In short, a ZFS file system looks and acts exactly like a UFS file system except that ZFS files can be much larger, ZFS file systems can be much larger, and ZFS will perform much better when configured properly.

Note: It is not necessary to know how big a file system needs to be to create it.

ZFS file systems will grow to the size of their storage pools automatically.

ZFS file systems must be built in one and only one storage pool, but a storage pool may have more than one defined file system. Each file system in a storage pool has access to all the unused space in the storage pool. As any one file system uses space, that space is reserved for that file system until the space is released back to the pool by removing the file(s) occupying the space. During this time, the available free space on all the file systems based on the same pool will decrease.

ZFS file systems are not necessarily managed in the /etc/vfstab file. Special, logical device files can be constructed on ZFS pools and mounted using the vfstab file, but that is outside the scope of this guide. The common way to mount a ZFS file system is to simply define it against a pool. All defined ZFS file systems automatically mount at boot time unless otherwise configured.

Finally, the default mount point for a ZFS file system is based on the name of the pool and the name of the file system. For example, a file system named data1 in pool indexes would mount as /indexes/data1 by default. This default can be overridden either when the file system is created or later if desired.
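
As a sketch, using the hypothetical indexes/data1 file system mentioned above, the mount point can be changed from its default either at creation time or afterwards with the mountpoint property:

# zfs create indexes/data1
# zfs set mountpoint=/export/data1 indexes/data1   # relocate it from the default /indexes/data1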

Command-Line Interface

The command-line interface consists primarily of the zfs and zpool commands. Using these commands, all the storage devices in any system can be configured and made available. A graphical interface is available through the Sun Management Center. Please see the SMC documentation at docs.sun.com for more information.

For example, assume that a new server named proddb.mydomain.com is being configured for use as a database server. Tables and indexes must be on separate disks but the disks must be configured for highly available service resulting in the maximum possible usable space. On a traditional system, at least two arrays would be configured on separate storage controllers, made available to the server by means of hardware RAID or logical volume management (such as Solaris Volume Manager) and UFS file systems built on the device files offered from the RAID or logical volume manager. This section describes how this same task would be done with ZFS.

Planning for ZFS

Tip 2: Use the format command to determine the list of available devices and to address configuration problems with those devices.

Several steps must be performed prior to configuring ZFS on a new system; all commands must be issued by root or by a user with root authority.

Additional planning information can be found at docs.sun.com.

In the running example, two JBOD arrays ("just a bunch of disks," that is, non-RAID managed storage) are attached to the server. Though there is no reason to avoid hardware RAID systems when using ZFS, this example is clearer without them. The following table lists the physical devices presented from the attached storage.

Controller 2    Controller 4    Controller 3    Controller 5
c2t0d0          c4t0d0          c3t0d0          c5t0d0
c2t1d0          c4t1d0          c3t1d0          c5t1d0
c2t2d0          c4t2d0          c3t2d0          c5t2d0
c2t3d0          c4t3d0          c3t3d0          c5t3d0

Based on the need to separate indexes from data, it is decided to use two pools named indexes and tables, respectively. In order to avoid controller contention, all the disks from controllers 2 and 4 will be in the indexes pool and those from controllers 3 and 5 will be in the tables pool. Both pools will be configured using RAID-Z for maximum usable capacity.

Creating a Storage Pool

Storage pools are created with the zpool command. Please see the man page, zpool (1M), for information on all the command options. However, the following command syntax builds a new ZFS pool:

#  zpool create <pool_name> [<configuration>] <device_files>

The command requires the user to supply a name for the new pool and the disk device file names without path (c#t#d# as opposed to /dev/dsk/c#t#d#). In addition, if a configuration flag, such as mirror or raidz, is used, the list of devices will be configured using the requested configuration. Otherwise, all disks named are striped together with no parity or other highly available features.

Tip 3: Check out the -m option for defining a specific mount point or the -R option for redefining the relative root path for the default mount point.

Continuing the example, the zpool commands to build two RAID-Z storage pools of eight disks, each with minimum controller contention, would be as follows:

# zpool create indexes raidz c2t0d0 c2t1d0 c2t2d0 \
  c2t3d0 c4t0d0 c4t1d0 c4t2d0 c4t3d0
# zpool create tables raidz c3t0d0 c3t1d0 c3t2d0 \
  c3t3d0 c5t0d0 c5t1d0 c5t2d0 c5t3d0

The effect of these commands is to create two pools named indexes and tables, respectively, each with RAID-Z striping and data redundancy. A ZFS pool can be given any name that starts with a letter, except the reserved strings mirror, raidz, and spare, and any name starting with c followed by a digit (0 through 9). ZFS pool names can include only letters, digits, dashes, underscores, and periods.
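
The result can be checked at once; the following is a sketch of the typical verification commands rather than an exact transcript:

# zpool status indexes   # shows the raidz vdev and the state of each member disk
# zpool list             # one-line size, usage, and health summary for every pool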

Creating File Systems

If the default file system that is created is not adequate to suit the needs of the system, additional file systems can be created using the zfs command. Please see the man page, zfs (1M), for detailed information on the command's options.

Suppose, in the running example, two databases were to be configured on the new storage and for management purposes, each database needed to have its own mount points in the indexes and tables pools. Use the zfs command to create the desired file systems as follows:

# zfs create indexes/db1
# zfs create indexes/db2
# zfs create tables/db1
# zfs create tables/db2

Note: Be careful when naming file systems. It is possible to reuse the same name for different file systems in different pools, which might be confusing.

The effect is to add a separate mount point for db1 and db2 under each of /indexes and /tables; these new mounts appear in the output of the mount command.

The space available to /indexes, /indexes/db1, and /indexes/db2 is all of the space defined in the indexes pool. Likewise, the space available to /tables, /tables/db1, and /tables/db2 is all of the space defined in the tables pool. The file systems db1 and db2 in each pool are mounted as separate file systems in order to provide distinct control and management interfaces for each defined file system.

Tip 4: Check out the set options of the zfs command to manipulate the mount point and other properties of each file system.
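
For example (a sketch; the quota value and mount point are arbitrary), properties are changed with zfs set and inspected with zfs get:

# zfs set quota=50G tables/db1                 # cap how much pool space this file system may consume
# zfs set mountpoint=/oradata/db1 tables/db1   # move the file system to a more convenient path
# zfs get quota,mountpoint tables/db1          # confirm the new settings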

Displaying Information

Information on the pools and file systems can be displayed using the list commands for zpool and zfs. Other commands exist as well. Please read the man pages for zfs and zpool for the complete list.

# zpool list

NAME            SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
indexes         240M    110K    240M     0%  ONLINE     -
tables          240M    110K    240M     0%  ONLINE     -

# zfs list

NAME            USED    AVAIL  REFER  MOUNTPOINT
indexes         107K    208M   25.5K  /indexes
indexes/db1     24.5K   208M   24.5K  /indexes/db1
indexes/db2     24.5K   208M   24.5K  /indexes/db2
tables          107K    208M   25.5K  /tables
tables/db1      24.5K   208M   24.5K  /tables/db1
tables/db2      24.5K   208M   24.5K  /tables/db2

Monitoring

Though a detailed discussion of monitoring is out of this document's scope, this overview would be incomplete without some mention of the ZFS built-in monitoring. As with management, the command to monitor the system is simple:

# zpool iostat <pool_name> <interval> <count>

This command works very much like the iostat command found in the operating system. If the pool name is not specified, the command reports on all defined pools. If no count is specified, the command reports until stopped. A separate command was needed as the iostat command in the operating system cannot see the true reads and writes performed by ZFS; it can see only those submitted to and requested from file systems.

The command output is as follows:

# zpool iostat test_pool 5 10

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
test_pool     80K  1.52G      0      7      0   153K
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0
test_pool     80K  1.52G      0      0      0      0

Other commands can be used to contribute to an administrator's understanding of the status, performance, options, and configuration of running ZFS pools and file systems. Please read the man pages for zfs and zpool for more information.
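
For instance (a sketch, not an exhaustive list), overall pool health and individual file system settings can be checked with:

# zpool status -x          # report only pools that have problems
# zfs get all tables/db1   # list every property of a single file system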

There is ZFS-centered litigation between NetApp and Sun. In 2007 NetApp alleged that Sun violated seven of its patents and demanded that Sun remove its ZFS file system from the open-source community and from storage products, limiting its use to computing devices. Please note that Sun indemnifies all its customers against IP claims.

In October 2007 Sun counter-sued, saying NetApp infringed 22 of its patents, which put NetApp in the crossfire typical of such suits. Sun requested the removal of all NetApp products from the marketplace. It was a big present to EMC, as the letter below suggests:

To NetApp Employees and Customers on Sun’s Lawsuit

[Note: This is an e-mail that I sent internally to our employees, with the expectation that they might also share it with customers. Some of it repeats previous posts, but other parts are different. In the spirit of openness, I decided to post here as well.]

To: everyone-at-netapp
Subject: Sun's Lawsuit Against NetApp

This morning, Sun filed suit seeking a “permanent injunction against NetApp” to remove almost all of our products from the market place. That’s some pretty scary language! It seems designed to make NetApp employees wonder, Do I still have a job? And customers to wonder, Is it safe to buy NetApp products?

I’d like to reassure you. Your job is safe. Our products are all still for sale.

Can you ever remember a Fortune 1000 company being shut down by patents? It just doesn’t happen! Even for the RIM/Blackberry case, which is the closest I can think of to a big company being shut down, it took years and years to get to that point, and was still averted in the end. I think it’s safe to say the odds of Sun fulfilling their threat are near zero.

If you are a customer, you can be confident buying NetApp products.

If you are an employee, just keep doing your job! Even if your job is to partner with Sun, keep doing your job. Here's an ironic story. When James and I received the IEEE Storage Systems Award for our work in WAFL and appliances "which has revolutionized storage", it was a Sun employee who organized the session where the award was presented. He was friendly, we were friendly, and we didn't talk about the lawsuit. You can do it too. The first minute or two might feel odd, but then you'll get over it. We have many joint customers to take care of.

NetApp also landed on the wrong side of the open source debate, which will cost it a lot both in goodwill and in actual customers. The old proverb "Those who live in glass houses should not throw stones" is very relevant here. The shadow of SCO hanging over NetApp is a very real threat to its viability in the marketplace. The open source community, as a powerful movement, is a factor that weighs into all such deliberations.

I think NetApp made a mistake here: for companies of approximately equal size, the courtroom really doesn't work well as a venue for battles in the storage industry. Lawsuits over software patent infringement claims (usually with broad, absurd generalizations included in the patent) are extremely risky ventures that can backfire. Among near-equals, the costly patent enforcement game is essentially a variant of MAD (mutually assured destruction).
Biotech companies learned this long ago, when they realized that it makes little sense to sue each other over drug-enabling technology before FDA approval, which is the true gating function that confers the desired monopoly and knocks the other out of the ring. In the case of software, a prior-art defense can work wonders against most so-called patents.

After Oracle's acquisition of Sun, NetApp's claims should be reviewed, as politically NetApp cannot go after Oracle -- the vendor of the main database running on NetApp storage appliances. Any attempt to extract money from Oracle would mean a lot of lost revenue for NetApp. Unless, of course, Oracle does not care about the existence of open source ZFS, which is also a possibility.



Old News ;-)

[May 17, 2010] Solaris Filesystem Choices Comments

Some interesting discussion that, among other things, outlines the position of Linuxoids toward ZFS. Envy is a great motivator for finding faults ;-)
ZFS

ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves. ZFS is not a magic bullet, but it is very cool. I like to think that if UFS and ext3 were first generation UNIX filesystems, and VxFS and XFS were second generation, then ZFS is the first third generation UNIX FS.

ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. The integration of these two functionalities is a main source of the flexibility of ZFS. It is also, in part, the source of the famous "rampant layering violation" quote which has been repeated so many times. Remember, though, that this is just one developer's aesthetic opinion. I have never seen a layering violation that actually stopped me from opening a file.

Being a hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a one to one mapping of filesystems to disk partitions, or alternately, you have a one to one mapping of filesystems to logical volumes, each of which is made up of one or more disks. In ZFS, all disks participate in one storage pool. Each ZFS filesystem has the use of all disk drives in the pool, and since filesystems are not mapped to volumes, all space is shared. Space may be reserved, so that one filesystem can't fill up the whole pool, and reservations may be changed at will. However, if you don't want to decide ahead of time how big each filesystem needs to be, there is no need to, and logical volumes never need to be resized. Growing or shrinking a filesystem isn't just painless, it is irrelevant.
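
A small sketch of what the author describes (the dataset name tank/home/alice is made up): a reservation guarantees space to one filesystem, a quota keeps it from swallowing the whole pool, and both can be changed or removed at any time:

# zfs set reservation=20G tank/home/alice   # guarantee 20 GB to this filesystem
# zfs set quota=100G tank/home/alice        # never let it grow past 100 GB
# zfs set quota=none tank/home/alice        # lift the limit again later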

ZFS provides the most robust error checking of any filesystem available. All data and metadata is checksummed (SHA256 is available for the paranoid), and the checksum is validated on every read and write. If it fails and a second copy is available (metadata blocks are replicated even on single disk pools, and data is typically replicated by RAID), the second block is fetched and the corrupted block is replaced. This protects against not just bad disks, but bad controllers and fibre paths. On-disk changes are committed transactionally, so although traditional journaling is not used, on-disk state is always valid. There is no ZFS fsck program. ZFS pools may be scrubbed for errors (logical and checksum) without unmounting them.
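
For instance (a sketch with a made-up pool name), a scrub is started on a live pool and its progress checked like this:

# zpool scrub tank       # walk every block and verify its checksum while the pool stays in use
# zpool status -v tank   # shows scrub progress and any files with unrecoverable errors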

The copy-on-write nature of ZFS provides for nearly free snapshot and clone functionality. Snapshotting a filesystem creates a point in time image of that filesystem, mounted on a dot directory in the filesystem's root. Any number of different snapshots may be mounted, and no separate logical volume is needed, as would be for LVM style snapshots. Unless disk space becomes tight, there is no reason not to keep your snapshots forever. A clone is essentially a writable snapshot and may be mounted anywhere. Thus, multiple filesystems may be created based on the same dataset and may then diverge from the base. This is useful for creating a dozen virtual machines in a second or two from an image. Each new VM will take up no space at all until it is changed.
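
A minimal sketch of that virtual machine workflow (the dataset names are invented): snapshot a golden image once, then clone it for each new VM:

# zfs snapshot tank/vm/gold@base              # point-in-time, read-only image
# zfs clone tank/vm/gold@base tank/vm/vm01    # writable clone; uses no space until it diverges
# zfs clone tank/vm/gold@base tank/vm/vm02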

These are just a few interesting features of ZFS. ZFS is not a perfect replacement for traditional filesystems yet - it lacks per-user quota support and performs differently than the usual UFS profile. But for typical applications, I think it is now the best option. Its administrative features and self-healing capability (especially when its built in RAID is used) are hard to beat.

Note to future contestants

by sbergman27 on Mon 21st Apr 2008 20:10 UTC


Member since:
2005-07-24

ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves.

It would be advisable to stay on topic and edit out any snipey and unprofessional off-topic asides like the above quoted material. This article is supposed to be about "Solaris Filesystem Choices". Please talk about Solaris filesystems.

Aside from some understandable concerns about layering, I think most "Linux folks" recognize that ZFS has some undeniable strengths.

I hope that this Article Contest does not turn into a convenient platform from which authors feel they can hurl potshots at others.



RE: Note to future contestants

by jwwf on Mon 21st Apr 2008 21:05 UTC in reply to "Note to future contestants"

Member since:
2006-01-19

Both of those quoted sentences are factual, and I think it's important to understand that technology and politics are never isolated subjects.

However, I understand the spirit of your sentiment. In my defense, I wrote the article both to educate and to entertain. If a person just wants to know about Solaris filesystems, the Sun docs are way better than anything I might write.


RE[2]: Note to future contestants

by anomie on Tue 22nd Apr 2008 17:03 UTC in reply to "RE: Note to future contestants"

Member since:
2007-02-26

Both of those quoted sentences are factual...

Let's not confuse facts with speculation.

You wrote: "It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves."

In interpretive writing, you can establish that "[ZFS] has gotten some derision from Linux folks" by providing citations (which you did not provide, actually).

But appending "... who are accustomed to getting that hype themselves" is tacky and presumptuous. Do you have references to demonstrate that Linux advocates deride ZFS specifically because they are not "getting hype"? If not, this is pure speculation on your part. So don't pretend it is fact.

Moreover, referring to "Linux folks" in this context is to make a blanket generalization.

Member since:
2006-01-19

"Both of those quoted sentences are factual...

Let's not confuse facts with speculation.

You wrote: "It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves."

In interpretive writing, you can establish that "[ZFS] has gotten some derision from Linux folks" by providing citations (which you did not provide, actually).

But appending "... who are accustomed to getting that hype themselves" is tacky and presumptuous. Do you have references to demonstrate that Linux advocates deride ZFS specifically because they are not "getting hype"? If not, this is pure speculation on your part. So don't pretend it is fact.

Moreover, referring to "Linux folks" in this context is to make a blanket generalization. "


+1. The author of this article is clearly a tacky, presumptuous speculator, short on references and long on partisanship.

Seriously, I know I shouldn't reply here, but in the light of the above revelation, I will. It is extremely silly to turn this into some semantic argument on whether I can find documentation on what is in someone's heart. If I could find just two 'folks' who like linux and resent non-linux hype relating to ZFS, it would make my statement technically a fact. Are you willing to bet that these two people don't exist?

Yet, would this change anything? No, it would be complete foolishness. Having spent my time in academia, I am tired of this kind of sophistry of "demonstrating facts". I read, try things, form opinions, write about it. You have the same opportunity.


RE[4]: Note to future contestants

by sbergman27 on Tue 22nd Apr 2008 18:33 UTC in reply to "RE[3]: Note to future contestants"

Member since:
2005-07-24

I figure that with popularity comes envy of that popularity. And with that comes potshots. Ask any celebrity. As Morrissey sings, "We Hate It When Our Friends Become Successful".

http://www.oz.net/~moz/lyrics/yourarse/wehateit.htm

It's probably best to simply expect potshots to be taken at Linux and Linux users and accept them with good grace. Politely pointing out the potshots is good form. Drawing them out into long flame-threads (as has not yet happened here) is annoying to others and is thus counterproductive. It just attracts more potshots.



RE[5]: Note to future contestants

by jwwf on Tue 22nd Apr 2008 19:16 UTC in reply to "RE[4]: Note to future contestants"

Member since:
2006-01-19

I figure that with popularity comes envy of that popularity. And with that comes potshots. Ask any celebrity. As Morrissey sings, "We Hate It When Our Friends Become Successful".

http://www.oz.net/~moz/lyrics/yourarse/wehateit.htm

It's probably best to simply expect potshots to be taken at Linux and Linux users and accept them with good grace. Politely pointing out the potshots is good form. Drawing them out into long flame-threads (as has not yet happened here) is annoying to others and is thus counterproductive. It just attracts more potshots.

Certainly this happens. On the other hand, who would be better than a celebrity to demonstrate the "because I am successful, I must be brilliant" fallacy we mere mortals are susceptible to. I think we would both agree that the situation is complicated.

Myself, I believe that a little bias can be enjoyable in a tech article, if it is explicit. It helps me understand the context of the situation--computing being as much about people as software.


RE[4]: Note to future contestants

by anomie on Tue 22nd Apr 2008 19:52 UTC in reply to "RE[3]: Note to future contestants"

Member since:
2007-02-26

The author of this article is clearly a tacky, presumptuous speculator

Don't twist words. My comments were quite obviously in reference to a particular sentence. (I'd add that I enjoyed the majority of your essay.)

If I could find just two 'folks' who like linux and resent non-linux hype relating to ZFS, it would make my statement technically a fact.

Facts are verifiable through credible references. This is basic Supported Argument 101.

Having spent my time in academia, I am tired of this kind of sophistry of "demonstrating facts".

Good god, man. What academic world do you come from where you don't have to demonstrate facts? You're the one insisting that your statements are fact.

Sure you're entitled to your opinion. But don't confuse facts with speculation. That is all.

edit: added comment

RE[2]: ZFS is a dead end.

by Arun on Wed 23rd Apr 2008 23:13 UTC in reply to "RE: ZFS is a dead end."

Member since:
2005-07-07

Can't say I disagree. The layering violations are more important than some people realise, and what's worse is that Sun didn't need to do it that way. They could have created a base filesystem and abstracted out the RAID, volume management and other features while creating consistent looking userspace tools.

Please stop parroting one Linux developer's view. Go look at the ZFS docs. ZFS is layered. Linux developers talk crap about everything that is not Linux. Classic NIH syndrome.

ZFS was designed to make volume management and filesystems easy to use and bulletproof. What you and linux guys want defeats that purpose and the current technologies in linux land illustrate that fact to no end.

The all-in-one philosophy makes it that much more difficult to create other implementations of ZFS, and BSD and Apple will find it that much more difficult to do - if at all really. It makes coexistence with other filesystems that much more difficult as well, with more duplication of similar functionality. Despite the hype surrounding ZFS by Sun at the time of Solaris 10, ZFS still isn't Solaris' main filesystem by default. That tells you a lot.

That's just plain wrong. ZFS is working fine on BSD and OS X. ZFS doesn't make coexistence with other filesystems difficult. On my Solaris box I have UFS and ZFS filesystems with zero problems. In fact I can create a zvol from my pool and format it with UFS.

RE[2]: Comment by agrouf

It's not just that. It's maintainability. When features get added to the wrong layer, it means code redundancy, wasted developer effort, wasted memory, messy interfaces, and bugs that get fixed in one filesystem, but remain in the others.

It does make a difference just how many filesystems you care about supporting. The Linux philosophy is to have one that is considered standard, but to support many. If Sun is planning for ZFS to be the "be all and end all" filesystem for *Solaris, it is easy to see them coming to a different determination regarding proper layering. Neither determination is wrong. They just have different consequences.

Perhaps btrfs will someday implement all of ZFS's goodness in the Linux Way. I confess to being a bit impatient with the state of Linux filesystems today. But not enough to switch to Solaris. I guess one can't expect to have everything.

This is a good, balanced explanation. I think the question is whether the features provided by ZFS are best implemented in a rethought storage stack. In my opinion, the naming of ZFS is a marketing weakness. I would prefer to see something like "ZSM", expanding to "meaningless letter storage manager". Calling it a FS makes it easy for people to understand, but usually to understand incorrectly.

I see ZFS as a third generation storage manager, following partitioned disks and regular LVMs. Now, if the ZFS feature set can be implemented on a second generation stack, I say, more power to the implementors. But the burden of proof is on them, and so far it has not happened.

I too am impatient with the state of Linux storage management. For better or worse, I just don't think it is a priority for the mainline kernel development crew, or Red Hat, which, like it or not, is all that matters in the commercial space. I think ext3 is a stable, well-tuned filesystem, but I find LVM and MD to be clumsy and fragile. Once ext4 is decently stable, I would love to see work on a Real Volume Manager (tm).

Understanding and managing NFSv4 ACLs

April 16, 2009 | E O N

Using EON/OpenSolaris and ZFS for storage will at some point cause you to cross paths with NFSv4 Access Control Lists. The control available through ACLs is really granular and powerful, but ACLs are also hard to manage and a bit confusing. Here I'll share my methods of handling ACLs, which require some prerequisite reading to help understand the compact access codes:
add_file w, add_subdirectory p, append_data p, delete d, delete_child D, execute x, list_directory r, read_acl c, read_attributes a, read_data r, read_xattr R, write_xattr W, write_data w, write_attributes A, write_acl C, write_owner o
Inheritance compact codes (remember: i on a directory causes recursive inheritance):
file_inherit f, dir_inherit d, inherit_only i, no_propagate n
ACL set codes:
full_set = rwxpdDaARWcCos = all permissions
modify_set = rwxpdDaARWc--s = all permissions except write_acl, write_owner
read_set = r-----a-R-c--- = read_data, read_attributes, read_xattr, read_acl
write_set = -w-p---A-W---- = write_data, append_data, write_attributes, write_xattr
If I create a file/folder (foo) via a Windows client on a SMB/CIFS share, the permissions typically resemble the following:
eon:/deep/tank#ls -Vd foo
d---------+  2 admin    stor           2 Apr 20 14:12 foo
         user:admin:rwxpdDaARWcCos:-------:allow
   group:2147483648:rwxpdDaARWcCos:-------:allow
This works fine for the owner (admin), but in a case where multiple people (family) use the storage, adding user access and more control over sharing is usually required. So how do I simply add the capability needed? If I wish to modify this (above), I always start by going back to default values:
eon:/deep/tank#chmod A- foo
eon:/deep/tank#ls -Vd foo
d---------   2 admin    stor           2 Apr 20 14:12 foo
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:rwxp----------:-------:deny
             group@:--------------:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
I then copy and paste them directly into a terminal or script (vi /tmp/bar) for trial and error and simply flip the bits I wish to test on or off. Note I'm using A=, which will wipe and replace with whatever I define. With A+ or A-, it adds or removes the matched values. So my script will look like this after the above is copied:
chmod -R A=\
owner@:rwxp----------:-------:deny,\
owner@:-------A-W-Co-:-------:allow,\
group@:rwxp----------:-------:deny,\
group@:--------------:-------:allow,\
everyone@:rwxp---A-W-Co-:-------:deny,\
everyone@:------a-R-c--s:-------:allow \
foo
Let's modify group:allow to have write_set = -w-p---A-W----
chmod -R A=\
owner@:rwxp----------:-------:deny,\
owner@:-------A-W-Co-:-------:allow,\
group@:--------------:-------:deny,\
group@:-w-p---A-W----:-------:allow,\
everyone@:rwxp---A-W-Co-:-------:deny,\
everyone@:------a-R-c--s:-------:allow \
foo
Running the above
eon:/deep/tank#sh -x /tmp/bar
+ chmod -R A=owner@:rwxp----------:-------:deny,owner@:-------A-W-Co-:-------:allow,group@:--------------:-------:deny,group@:-w-p---A-W----:-------:allow,everyone@:rwxp---A-W-Co-:-------:deny,everyone@:------a-R-c--s:-------:allow foo
eon:/deep/tank#ls -Vd foo/
d----w----+  2 admin    stor           2 Apr 20 14:12 foo/
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
Adding a user (webservd) at layer 5, 6 with full_set permissions
eon:/deep/tank#chmod A+user:webservd:full_set:d:allow,user:webservd:full_set:f:allow foo
eon:/deep/tank#ls -Vd foo
d----w----+  2 admin    stor           2 Apr 20 14:12 foo
      user:webservd:rwxpdDaARWcCos:-d-----:allow
      user:webservd:rwxpdDaARWcCos:f------:allow
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
Ooops, that's level 1, 2, so let's undo this by simply repeating the command with A- instead of A+. Then let's fix it by repeating the command with A5+ instead of plain A+:
eon:/deep/tank#chmod A-user:webservd:full_set:d:allow,user:webservd:full_set:f:allow foo
eon:/deep/tank#ls -Vd foo
d----w----+  2 admin    stor           2 Apr 20 14:12 foo
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
          everyone@:------a-R-c--s:-------:allow
eon:/deep/tank#chmod A5+user:webservd:full_set:d:allow,user:webservd:full_set:f:allow foo
eon:/deep/tank#ls -Vd foo
d----w----+  2 admin    stor           2 Apr 20 14:12 foo
             owner@:rwxp----------:-------:deny
             owner@:-------A-W-Co-:-------:allow
             group@:--------------:-------:deny
             group@:-w-p---A-W----:-------:allow
          everyone@:rwxp---A-W-Co-:-------:deny
      user:webservd:rwxpdDaARWcCos:-d-----:allow
      user:webservd:rwxpdDaARWcCos:f------:allow
          everyone@:------a-R-c--s:-------:allow
This covers adding, deleting, modifying, and replacing NFSv4 ACLs. Hope that provides some guidance in case you have to tangle with NFSv4 ACLs. The more exercise you get with NFSv4 ACLs, the more familiar you'll be with getting them to do what you want.

ZFS ACLs by Mark Shellenbaum

Nov 16, 2005 | Mark Shellenbaum's Weblog

The ZFS file system uses a pure ACL model that is compliant with the NFSv4 ACL model. What is meant by a pure ACL model is that every file always has an ACL, unlike file systems such as UFS, where a file has either an ACL or just permission bits. All access control decisions are governed by a file's ACL. All files still have permission bits, but they are constructed by analyzing the file's ACL.

NFSv4 ACL Overview
The ACL model in NFSv4 is similar to the Windows ACL model. The NFSv4 ACL model supports a rich set of access permissions and inheritance controls. An ACL in this model is composed of an array of access control entries (ACEs). Each ACE specifies the permissions, access type, inheritance flags, and to whom the entry applies. In the NFSv4 model the "who" argument of each ACE may be either a username or a groupname. There is also a set of commonly known names, such as "owner@", "group@", and "everyone@". These abstractions are used by UNIX-variant operating systems to indicate whether the ACE is for the file owner, the file group owner, or the world. The everyone@ entry is not equivalent to the POSIX "other" class; it really is everyone. The complete description of the NFSv4 ACL model is available in Section 5.11 of the NFSv4 protocol specification.
NFSv4 Access Permissions

Permission - Description
read_data - Permission to read the data of the file
list_data - Permission to list the contents of a directory
write_data - Permission to modify the file's data anywhere in the file's offset range. This includes the ability to grow the file or write to an arbitrary offset.
add_file - Permission to add a new file to a directory
append_data - The ability to modify the data, but only starting at EOF.
add_subdirectory - Permission to create a subdirectory to a directory
read_xattr - The ability to read the extended attributes of a file or to do a lookup in the extended attributes directory.
write_xattr - The ability to create extended attributes or write to the extended attributes directory.
execute - Permission to execute a file
delete_child - Permission to delete a file within a directory
read_attributes - The ability to read basic attributes (non-ACLs) of a file. Basic attributes are considered the stat(2) level attributes.
write_attributes - Permission to change the times associated with a file or directory to an arbitrary value
delete - Permission to delete a file
read_acl - Permission to read the ACL
write_acl - Permission to write a file's ACL
write_owner - Permission to change the owner, or the ability to execute chown(1) or chgrp(1)
synchronize - Permission to access a file locally at the server with synchronous reads and writes.

NFSv4 Inheritance Flags

Inheritance Flag - Description
file_inherit - Can be placed on a directory and indicates that this ACE should be added to each new non-directory file created.
dir_inherit - Can be placed on a directory and indicates that this ACE should be added to each new directory created.
inherit_only - Placed on a directory, but does not apply to the directory itself, only to newly created files and directories. This flag requires file_inherit and/or dir_inherit to indicate what to inherit.
no_propagate - Placed on directories and indicates that ACL entries should only be inherited to one level of the tree. This flag requires file_inherit and/or dir_inherit to indicate what to inherit.

NFSv4 ACLs vs POSIX

The difficult part of using the NFSv4 ACL model was trying to still preserve POSIX compliance in the file system. POSIX allows for what it calls "additional" and "alternate" access methods. An additional access method is defined to be layered upon the file permission bits, but it can only further restrict the standard access control mechanism. The alternate file access control mechanism is defined to be independent of the file permission bits and, if enabled on a file, may either restrict or extend the permissions of a given user. Another major distinction between the additional and alternate access control mechanisms is that any alternate file access control mechanism must be disabled after the file permission bits are changed with a chmod(2). Additional mechanisms do not need to be disabled when a chmod is done.

Most vendors that have implemented NFSv4 ACLs have taken the approach of "discarding" ACLs during a chmod(2). This is a bit heavy-handed, since a user went through the trouble of crafting a bunch of ACLs, only to have chmod(2) come through and destroy all of their hard work. It was this single issue that was the biggest hurdle to POSIX compliance in implementing NFSv4 ACLs in ZFS. In order to achieve this, Sam, Lisa, and I spent far too long trying to come up with a model that would preserve as much of the original ACL as possible while still being useful. What we came up with is a model that retains additional access methods and disables, but doesn't delete, alternate access controls. Sam and Lisa have filed an internet draft which has the details about the chmod(2) algorithm and how to make NFSv4 ACLs POSIX compliant.

So what's cool about this?

Let's assume we have the following directory, /sandbox/test.dir.
Its initial ACL looks like:

% ls -dv test.dir
drwxr-xr-x 2 ongk bin 2 Nov 15 14:11 test.dir
0:owner@::deny
1:owner@:list_directory/read_data/add_file/write_data/add_subdirectory
/append_data/write_xattr/execute/write_attributes/write_acl
/write_owner:allow
2:group@:add_file/write_data/add_subdirectory/append_data:deny
3:group@:list_directory/read_data/execute:allow
4:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr
/write_attributes/write_acl/write_owner:deny
5:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
/read_acl/synchronize:allow


Now if I want to give "marks" the ability to create files, but not subdirectories in this
directory then the following ACL would achieve this.

First let's make sure "marks" can't currently create files or directories:

$ mkdir /sandbox/test.dir/dir.1
mkdir: Failed to make directory "/sandbox/test.dir/dir.1"; Permission denied

$ touch /sandbox/test.dir/file.1
touch: /sandbox/test.dir/file.1 cannot create

Now let's give "marks" add_file permission:

% chmod A+user:marks:add_file:allow /sandbox/test.dir
% ls -dv test.dir
drwxr-xr-x+ 2 ongk bin 2 Nov 15 14:11 test.dir
0:user:marks:add_file/write_data:allow
1:owner@::deny
2:owner@:list_directory/read_data/add_file/write_data/add_subdirectory
/append_data/write_xattr/execute/write_attributes/write_acl
/write_owner:allow
3:group@:add_file/write_data/add_subdirectory/append_data:deny
4:group@:list_directory/read_data/execute:allow
5:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr
/write_attributes/write_acl/write_owner:deny
6:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
/read_acl/synchronize:allow

Now let's see if it works for user "marks":

$ id
uid=76928(marks) gid=10(staff)

$ touch file.1
$ ls -v file.1
-rw-r--r-- 1 marks staff 0 Nov 15 10:12 file.1
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes
/write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:write_data/append_data/write_xattr/execute/write_attributes
/write_acl/write_owner:deny
5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
:allow

Now let's make sure "marks" can't create directories.

$ mkdir dir.1
mkdir: Failed to make directory "dir.1"; Permission denied

The write_owner permission is handled in a special way. It allows for a user to "take" ownership of a file. The following example will help illustrate this. With the write_owner a user can only do a chown(2) to himself or to a group that he is a member of.

We will start out with the following file.

% ls -v file.test
-rw-r--r-- 1 ongk staff 0 Nov 15 14:22 file.test
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes
/write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:write_data/append_data/write_xattr/execute/write_attributes
/write_acl/write_owner:deny
5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
:allow

Now if user "marks" tries to chown(2) the file to himself he will get an error.

$ chown marks file.test
chown: file.test: Not owner

$ chgrp staff file.test
chgrp: file.test: Not owner

Now lets give "marks" explicit write_owner permission.

% chmod A+user:marks:write_owner:allow file.test
% ls -v file.test
-rw-r--r--+ 1 ongk staff 0 Nov 15 14:22 file.test
0:user:marks:write_owner:allow
1:owner@:execute:deny
2:owner@:read_data/write_data/append_data/write_xattr/write_attributes
/write_acl/write_owner:allow
3:group@:write_data/append_data/execute:deny
4:group@:read_data:allow
5:everyone@:write_data/append_data/write_xattr/execute/write_attributes
/write_acl/write_owner:deny
6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
:allow

Now let's see who "marks" can chown the file to.

$ id
uid=76928(marks) gid=10(staff)
$ groups
staff storage
$ chown bin file.test
chown: file.test: Not owner

So "marks" can't give the file away.

$ chown marks:staff file.test

Now lets look at an example to show how a user can be granted special delete permissions. ZFS doesn't create any delete permissions when a file is created, instead it uses write_data/execute for permission to write to a directory and execute to search the directory.

Let's first create a read-only directory and then give "marks" the ability to delete files.

% ls -dv test.dir
dr-xr-xr-x 2 ongk bin 2 Nov 15 14:11 test.dir
0:owner@:add_file/write_data/add_subdirectory/append_data:deny
1:owner@:list_directory/read_data/write_xattr/execute/write_attributes
/write_acl/write_owner:allow
2:group@:add_file/write_data/add_subdirectory/append_data:deny
3:group@:list_directory/read_data/execute:allow
4:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr
/write_attributes/write_acl/write_owner:deny
5:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
/read_acl/synchronize:allow


Now the directory has the following files:

% ls -l
total 3
-r--r--r-- 1 ongk bin 0 Nov 15 14:28 file.1
-r--r--r-- 1 ongk bin 0 Nov 15 14:28 file.2
-r--r--r-- 1 ongk bin 0 Nov 15 14:28 file.3

Now lets see if "marks" can delete any of the files?

$ rm file.1
rm: file.1: override protection 444 (yes/no)? y
rm: file.1 not removed: Permission denied

Now lets give "marks" delete permission on just file.1

% chmod A+user:marks:delete:allow file.1
% ls -v file.1
-r--r--r--+ 1 ongk bin 0 Nov 15 14:28 file.1
0:user:marks:delete:allow
1:owner@:write_data/append_data/execute:deny
2:owner@:read_data/write_xattr/write_attributes/write_acl/write_owner
:allow
3:group@:write_data/append_data/execute:deny
4:group@:read_data:allow
5:everyone@:write_data/append_data/write_xattr/execute/write_attributes
/write_acl/write_owner:deny
6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
:allow


$ rm file.1
rm: file.1: override protection 444 (yes/no)? y

Let's see what a chmod(1) that changes the mode would do to a file with a ZFS ACL.
We will start out with the following ACL, which gives user bin read_data and write_data permission.

$ ls -v file.1
-rw-r--r--+ 1 marks staff 0 Nov 15 10:12 file.1
0:user:bin:read_data/write_data:allow
1:owner@:execute:deny
2:owner@:read_data/write_data/append_data/write_xattr/write_attributes
/write_acl/write_owner:allow
3:group@:write_data/append_data/execute:deny
4:group@:read_data:allow
5:everyone@:write_data/append_data/write_xattr/execute/write_attributes
/write_acl/write_owner:deny
6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize
:allow


$ chmod 640 file.1
$ ls -v file.1
-rw-r-----+ 1 marks staff 0 Nov 15 10:12 file.1
0:user:bin:write_data:deny
1:user:bin:read_data/write_data:allow
2:owner@:execute:deny
3:owner@:read_data/write_data/append_data/write_xattr/write_attributes
/write_acl/write_owner:allow
4:group@:write_data/append_data/execute:deny
5:group@:read_data:allow
6:everyone@:read_data/write_data/append_data/write_xattr/execute
/write_attributes/write_acl/write_owner:deny
7:everyone@:read_xattr/read_attributes/read_acl/synchronize:allow


In this example ZFS has prepended a deny ACE to take away write_data permission. This is an example of disabling "alternate" access methods. More details about how ACEs are disabled are described in the internet draft.

The ZFS admin guide and the chmod(1) manpages have many more examples of setting ACLs and how the inheritance model works.

With the ZFS ACL model access control is no longer limited to the simple "rwx" model that UNIX has used since its inception.

Lisa Week's Weblog

Over the last several months, I've been doing a lot of work with NFSv4 ACLs. First, I worked with Sam to get NFSv4 ACL support into Solaris 10. The major portion of this work involved implementing the pieces to be able to pass ACLs over-the-wire as defined by section 5.11 of the NFSv4 specification (RFC3530) and the translators (code to translate from UFS (or also referred to as POSIX-draft) ACLs to NFSv4 ACLs and back). At that point, Solaris was further along with regard to ACLs than it ever had been, but was still not able to support the full semantics of NFSv4 ACLs. So...here comes ZFS!

After getting the support for NFSv4 ACLs into Solaris 10, I started working on the ZFS ACL model with Mark and Sam. So, you might wonder why a couple of NFS people (Sam and I) would be working with ZFS (Mark) on the ZFS ACL model...well that is a good question. The reason for that is because ZFS has implemented native NFSv4 ACLs. This is really exciting because it is the first time that Solaris is able to support the full semantics of NFSv4 ACLs as defined by RFC3530.

In order to implement native NFSv4 ACLs in ZFS, there were a lot of problems we had to overcome. Some of the biggest struggles were ambiguities in the NFSv4 specification and the requirement for ZFS to be POSIX compliant. These problems have been captured in an Internet Draft submitted by Sam and me on October 14, 2005.

ACLs in the Computer Industry:

What makes NFSv4 ACLs so special...so special to have the shiny, new ZFS implement them? No previous attempt to specify a standard for ACLs has succeeded, therefore, we've seen a lot of different (non-standard) ACL models in the industry. With NFS Version 4, we now have an IETF approved standard for ACLs.

As well as being a standard, the NFSv4 ACL model is very powerful. It has a rich set of inheritance properties as well as a rich set of permission bits outside of just read, write and execute (as explained in the Access mask bits section below). And for the Solaris NFSv4 implementation this means better interoperability with other vendor's NFSv4 implementations.

ACLs in Solaris:

Like I said before, ZFS has native NFSv4 ACLs! This means that ZFS can fully support the semantics as defined by the NFSv4 specification (with the exception of a couple things, but that will be mentioned later).

What makes up an ACL?

ACLs are made up of zero or more Access Control Entries (ACEs). Each ACE has multiple components and they are as follows:

1.) Type component:
The type component of the ACE defines the type of ACE. There
are four types of ACEs: ALLOW, DENY, AUDIT, ALARM.


The ALLOW type ACEs permit access.
The DENY type ACES restrict access.
The AUDIT type ACEs audit accesses.
The ALARM type ACEs alarm accesses.

The ALLOW and DENY type of ACEs are implemented in ZFS.
AUDIT and ALARM type of ACEs are not yet implemented in ZFS.

The possibilities of the AUDIT and ALARM type ACEs are described below. I
wanted to explain the flags that need to be used in conjunction with them before
going into any detail on what they do, therefore, I gave this description its own
section.

2.) Access mask bits component:
The access mask bit component of the ACE defines the accesses
that are controlled by the ACE.

There are two categories of access mask bits:
1.) The bits that control the access to the file
i.e. write_data, read_data, write_attributes, read_attributes
2.) The bits that control the management of the file
i.e. write_acl, write_owner

For an explanation of what each of the access mask bits actually control in ZFS,
check out Mark's blog.

3.) Flags component:
There are three categories of flags:
1.) The bits that define inheritance properties of an ACE.
i.e. file_inherit, directory_inherit, inherit_only,
no_propagate_inherit
Again, for an explanation of these flags, check out Mark's blog.
2.) The bits that define whether or not the ACE applies to a user or group
i.e. identifier_group
3.) The bits that work in conjunction with the AUDIT and ALARM type ACEs
i.e. successful_access_flag, failed_access_flag.
ZFS doesn't support these flags since they don't support AUDIT and
ALARM type ACEs.

4.) who component:
The who component defines the entity that the ACE applies to.

For NFSv4, this component is a string identifier and it can be a user, group or
special identifier (OWNER@, GROUP@, EVERYONE@). An important thing to
note about the EVERYONE@ special identifier is that it literally means everyone
including the file's owner and owning group. EVERYONE@ is not equivalent to
the UNIX other entity. (If you are curious as to why NFSv4 uses strings rather
than integers (uids/gids), check out Eric's blog.)

For ZFS, this component is an integer (uid/gid).

What do AUDIT and ALARM ACE types do?

The AUDIT and ALARM types of ACEs trigger an audit or alarm event upon successful or failed accesses, depending on the presence of the successful/failed access flags (described above), as defined in the access mask bits of the ACE. ACEs of type AUDIT and ALARM don't play a role when doing access checks on a file. They only define an action to happen in the event that a certain access is attempted.

For example, let's say we have the following ACL:

lisagab:write_data::deny
lisagab:write_data:failed_access_flag:alarm
The first ACE affects the access that user "lisagab" has to the file. The second ACE says that if user "lisagab" attempts to access this file for writing and fails, an alarm event is triggered.

One important thing to remember is that what we do in the event of auditing or alarming is still undefined. You can think of it like this: when the access in question happens, auditing could be logging the event to a file, and alarming could be sending an email to an administrator.

How is access checking done?

To quote the NFSv4 specification:
 To determine if a request succeeds, each nfsace4 entry is processed
 in order by the server. Only ACEs which have a "who" that matches
 the requester are considered. Each ACE is processed until all of the
 bits of the requester's access have been ALLOWED. Once a bit (see
 below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer
 considered in the processing of later ACEs. If an ACCESS_DENIED_ACE
 is encountered where the requester's access still has unALLOWED bits
 in common with the "access_mask" of the ACE, the request is denied.
What this means is:

The most important thing to note about access checking with NFSv4 ACLs is that it is very order dependent. If a request for access is made, each ACE in the ACL is traversed in order. The first ACE that matches the who of the requester and defines the access that is being requested is honored.

For example, let's say user "lisagab" is requesting the ability to read the data of file "foo", and "foo" has the following ACL:

everyone@:read_data::allow
lisagab:write_data::deny

lisagab would be allowed the ability to read_data because lisagab is covered by "everyone@".

Another thing that is important to know is that the access determined is cumulative.

For example, let's say user "lisagab" is requesting the ability to read and write the data of file "bar", and "bar" has the following ACL:

lisagab:read_data::allow
lisagab:write_data::allow

lisagab would be allowed the ability to read_data and write_data.

How to use ZFS/NFSv4 ACLs on Solaris:

Many of you may remember the setfacl(1) and getfacl(1) commands. Well, those are still around, but won't help you much with manipulating ZFS or pure NFSv4 ACLs. Those commands are only capable of manipulating the POSIX-draft ACLs as implemented by UFS.

As a part of the ZFS putback, Mark has modified the chmod(1) and ls(1) command line utilities in order to manipulate ACLs on Solaris.

chmod(1) and ls(1) now give us the ability to manipulate ZFS/NFSv4 ACLs. Interestingly enough, these utilities can also manipulate POSIX-draft ACLs, so now there is a one-stop shop for all your ACL needs.
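As a minimal sketch (reusing the user "lisagab" and a file "foo" as placeholders), adding a read ACE and then inspecting the result looks like this:

# chmod A+user:lisagab:read_data:allow foo
# ls -v foo

The A+ form prepends the new entry to the ACL; the indexed forms shown later on this page insert, replace or delete entries at a specific position.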

[Sep 3, 2009] Working with filesystems using NFSv4 ACLs

The NFSv4 (Network File System – Version 4) protocol introduces a new ACL (Access Control List) format that extends other existing ACL formats. NFSv4 ACLs are easy to work with and introduce more detailed file security attributes, making them more secure. Several operating systems like IBM® AIX®, Sun Solaris, and Linux® have implemented NFSv4 ACLs in their filesystems.

Currently, the filesystems that support NFSv4 ACLs in IBM AIX 5L version 5.3 and above are NFSv4, JFS2 with EAv2 (Extended Journaled Filesystem with Extended Attributes format version 2), and General Parallel Filesystem (GPFS). In Sun Solaris, this ACL model is supported by ZFS. In Red Hat Linux, the NFSv4 filesystem supports NFSv4 ACLs.

...ZFS supports the NFSv4 ACL model, and has implemented the commands in the form of new options to the existing ls and chmod commands. Thus, the ACLs can be set and displayed using the chmod and ls commands; no new command has been introduced. Because of this, it is very easy to work with ACLs in ZFS.

ZFS ACL format

ZFS ACLs follow a well-defined format. The format and the entities involved in this format are:


Syntax A
                
ACL_entry_type:Access_permissions/…/[:Inheritance_flags]:deny or allow
      

ACL_entry_type includes "owner@", "group@", or "everyone@".

For example:

group@:write_data/append_data/execute:deny

Syntax B
                
ACL_entry_type: ACL_entry_ID:Access_permissions/…/[:Inheritance_flags]:deny or allow
      

ACL_entry_type includes "user", or "group".

ACL_entry_ID includes "user_name", or "group_name".

For example:

user:samy:list_directory/read_data/execute:allow

Inheritance flags
          
f : FILE_INHERIT
d : DIRECTORY_INHERIT
i : INHERIT_ONLY
n : NO_PROPAGATE_INHERIT
S : SUCCESSFUL_ACCESS_ACE_FLAG
F : FAILED_ACCESS_ACE_FLAG
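As a sketch of how the inheritance flags fit into the chmod syntax (reusing user "samy" and directory "dir.1" from the examples below; the exact spelling of the flags field may vary by Solaris release), an ACE meant to propagate to both new files and new subdirectories carries the f and d flags:

# chmod A+user:samy:read_data/execute:fd:allow dir.1
# ls -dv dir.1

The new entry shows up with file_inherit/dir_inherit in its flags field and is copied to files and directories subsequently created under dir.1.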

Listing ACLs of ZFS files and directories

ACLs can be listed with the ls command using the -v (verbose) or -V (compact) option. For listing directory ACLs, add the -d option.

Operation                            Command
Listing ACL entries of files         ls -[v|V] <file_name>
Listing ACL entries of directories   ls -d[v|V] <dir_name>

Example for listing ACLs of a file
ls -v file.1
-rw-r--r-- 1 root root 2703 Nov 4 12:37 file.1
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
       write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
       write_acl/write_owner:deny
5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow

Example for listing ACLs of a directory
        
# ls -dv dir.1
drwxr-xr-x 2 root root 2 Nov 1 14:51 dir.1
0:owner@::deny
1:owner@:list_directory/read_data/add_file/write_data/add_subdirectory/
    append_data/write_xattr/execute/write_attributes/write_acl/write_owner:allow
2:group@:add_file/write_data/add_subdirectory/append_data:deny
3:group@:list_directory/read_data/execute:allow
4:everyone@:add_file/write_data/add_subdirectory/append_data/write_xattr /
    write_attributes/write_acl/write_owner:deny
5:everyone@:list_directory/read_data/read_xattr/execute/read_attributes /
    read_acl/synchronize:allow

Example for listing ACLs in a compact format
# ls -Vd dir.1
drwxr-xr-x   2 root     root           2 Sep  1 05:46 d
   owner@:--------------:------:deny
   owner@:rwxp---A-W-Co-:------:allow
   group@:-w-p----------:------:deny
   group@:r-x-----------:------:allow
everyone@:-w-p---A-W-Co-:------:deny
everyone@:r-xp--a-R-c--s:------:allow

In the example above, the ACL is displayed in compact format: access permissions and inheritance flags are shown as masks, and one ACL entry is displayed per line, making the view easier to read.


Modifying ACLs of ZFS files and directories

ACLs can be set or modified using the chmod command. The chmod command uses the ACL-specification, which includes the ACL-format (Syntax A or B), listed earlier.

Operation                                Command
Adding an ACL entry by index_ID          # chmod Aindex_ID+acl_specification filename
Adding an ACL entry for a user           # chmod A+acl_specification filename
Removing an ACL entry by index_ID        # chmod Aindex_ID- filename
Removing an ACL entry by user            # chmod A-acl_specification filename
Removing an ACL from a file              # chmod A- filename
Replacing an ACL entry at index_ID       # chmod Aindex_ID=acl_specification filename
Replacing an ACL of a file               # chmod A=acl_specification filename

Examples of ZFS ACL modifications


List ACL entries
# ls -v a
-rw-r--r--   1 root     root           0 Sep  1 04:25 a
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
5:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow


Add ACL entries
# chmod A+user:samy:read_data:allow a
# ls -v a
-rw-r--r--+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data:allow
1:owner@:execute:deny
2:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
3:group@:write_data/append_data/execute:deny
4:group@:read_data:allow
5:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
6:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow

# chmod A1+user:samy:execute:deny a
# ls -v a
-rw-r--r--+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data:allow
1:user:samy:execute:deny
2:owner@:execute:deny
3:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
4:group@:write_data/append_data/execute:deny
5:group@:read_data:allow
6:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
7:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow

Replace ACL entries
# chmod A0=user:samy:read_data/write_data:allow a
# ls -v
total 2
-rw-r--r--+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data/write_data:allow
1:user:samy:execute:deny
2:owner@:execute:deny
3:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
4:group@:write_data/append_data/execute:deny
5:group@:read_data:allow
6:everyone@:write_data/append_data/write_xattr/execute/write_attributes/
    write_acl/write_owner:deny
7:everyone@:read_data/read_xattr/read_attributes/read_acl/synchronize:allow


# chmod A=user:samy:read_data/write_data/append_data:allow a
# ls -v a
----------+  1 root     root           0 Sep  1 02:01 a
0:user:samy:read_data/write_data/append_data:allow

ACLs can also be modified using the masks instead of specifying complete names.


Modifying ACL entries using masks
# ls -V a
-rw-r--r--+  1 root     root           0 Sep  5 01:50 a
user:samy:--------------:------:deny
user:samy:rwx-----------:------:allow
   owner@:--x-----------:------:deny
   owner@:rw-p---A-W-Co-:------:allow
   group@:-wxp----------:------:deny
   group@:r-------------:------:allow
everyone@:-wxp---A-W-Co-:------:deny
everyone@:r-----a-R-c--s:------:allow

# chmod A1=user:samy:rwxp:allow a

# ls -V a
-rw-r--r--+  1 root     root           0 Sep  5 01:50 a
user:samy:--------------:------:deny
user:samy:rwxp----------:------:allow
   owner@:--x-----------:------:deny
   owner@:rw-p---A-W-Co-:------:allow
   group@:-wxp----------:------:deny
   group@:r-------------:------:allow
everyone@:-wxp---A-W-Co-:------:deny
everyone@:r-----a-R-c--s:------:allow

Remove ACL entries
# ls -v a
-rw-r-----+  1 root     root           0 Sep  5 01:50 a
0:user:samy:read_data/write_data/execute:allow
1:owner@:execute:deny
2:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
3:group@:write_data/append_data/execute:deny
4:group@:read_data:allow
5:everyone@:read_data/write_data/append_data/write_xattr/execute/
    write_attributes/write_acl/write_owner:deny
6:everyone@:read_xattr/read_attributes/read_acl/synchronize:allow

# chmod A- a
# ls -v a
-rw-r-----   1 root     root           0 Sep  5 01:50 a
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:read_data/write_data/append_data/write_xattr/execute/
    write_attributes/write_acl/write_owner:deny
5:everyone@:read_xattr/read_attributes/read_acl/synchronize:allow

# chmod A5- a
# ls -v a
-rw-r-----   1 root     root           0 Sep  5 01:50 a
0:owner@:execute:deny
1:owner@:read_data/write_data/append_data/write_xattr/write_attributes/
    write_acl/write_owner:allow
2:group@:write_data/append_data/execute:deny
3:group@:read_data:allow
4:everyone@:read_data/write_data/append_data/write_xattr/execute/
    write_attributes/write_acl/write_owner:deny

[Sep 2, 2009] Working With ZFS Snapshots (pdf)

Helps to understand the capabilities of ZFS snapshots, a read-only copy of a Solaris ZFS file system. ZFS snapshots can be created almost instantly and are a valuable tool for system administrators needing to perform backups.

You will learn:

After reading this guide, you will have a basic understanding of how snapshots can be integrated into your system administration procedures.

See also

ZFS vs. Linux Raid + LVM

Comparison of ZFS and Linux RAID +LVM

i. ZFS doesn't support RAID 5, but does support RAID-Z, which has better features and fewer limitations.

ii. RAID-Z - a variation on RAID-5 which allows for better distribution of parity and eliminates the "RAID-5 write hole" (in which data and parity become inconsistent after a power loss). Data and parity are striped across all disks within a raidz group. A raidz group with N disks of size X can hold approximately (N-1)*X bytes and can withstand one device failing before data integrity is compromised. The minimum number of devices in a raidz group is 2; the recommended number is between 3 and 9.

iv. A clone is a writable volume or file system whose initial contents are the same as another dataset. As with snapshots, creating a clone is nearly instantaneous and initially consumes no additional space.

v. [Linux] RAID (be it hardware or software) assumes that if a write to a disk doesn't return an error, then the write was successful. Therefore, if your disk corrupts data without returning an error, your data will become corrupted. This is of course very unlikely to happen, but it is possible, and it would result in a corrupt filesystem. http://www.tldp.org/HOWTO/Software-RAID-HOWTO-6.html

[May 12, 2008] ZFS what the ultimate file system really means for your desktop -- in plain English!

Ashton Mills, 21 June 2007
So, Sun's ZFS file system has garnered publicity recently with the announcement of its inclusion in Mac OS X and, more recently, as a module for the Linux kernel. But if you don't read Filesystems Weekly, what is it and what does it mean for you?

Now I may just be showing my geek side a bit here, but file systems are awesome. Aside from the fact our machines would be nothing without them, the science behind them is frequently ingenious.

And ZFS (the Zettabyte File System) is no different. It has quite an extensive feature set just like its peers, but builds on this by adding a new layer of simplicity. According to the official site, ZFS key features are (my summary):

All up, as a geek, it's an exciting file system I'd love to play with -- currently however ZFS is part of Sun's Solaris, and under the CDDL (Common Development and Distribution License), which is actually based on the MPL (Mozilla Public License). As this is incompatible with the GPLv2, the code can't be ported to the Linux kernel. However, this has recently been worked around by porting it across as a FUSE module; being userspace it is slow, though there is hope this will improve. Looks like it's time to enable FUSE support in my kernel!

Of course, (in a few months time) you could also go for Mac OS X where, in Leopard, ZFS is already supported and there are rumours Apple may be preparing to adopt it as the default filesystem replacing the aging HFS+ in the future (but probably not in 10.5).

[Jun 27, 2007] Solaris ZFS and Microsoft Server 2003 NTFS File System Performance - BigAdmin Description

Description: This white paper explores the performance characteristics and differences of ZFS in the Solaris 10 OS and the Microsoft Windows Server 2003 NTFS file system.

[Jun 12, 2007] Apple's Leopard will use ZFS, but not exclusively | Tech news blog ...

Jun 12, 2007

... Apple confirmed statements by Sun's Jonathan Schwartz that Leopard will use ZFS, correcting an executive who Monday suggested otherwise.

[Apr 6, 2007] ZFS committed to the FreeBSD base.

Pawel Jakub Dawidek pjd at FreeBSD.org
Fri Apr 6 02:58:34 UTC 2007
Hi.

I'm happy to inform that the ZFS file system is now part of the FreeBSD
operating system. ZFS is available in the HEAD branch and will be
available in FreeBSD 7.0-RELEASE as an experimental feature.

Commit log:

  Please welcome ZFS - The last word in file systems.
  
  ZFS file system was ported from OpenSolaris operating system. The code
  is under CDDL license.
  
  I'd like to thank all SUN developers that created this great piece of
  software.
  
  Supported by:	Wheel LTD (http://www.wheel.pl/)
  Supported by:	The FreeBSD Foundation (http://www.freebsdfoundation.org/)
  Supported by:	Sentex (http://www.sentex.net/)

Limitations.

  Currently ZFS is only compiled as kernel module and is only available
  for i386 architecture. Amd64 should be available very soon, the other
  archs will come later, as we implement needed atomic operations.

Missing functionality.

  - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
    iSCSI is also not supported at this point. This should be fixed in
    the future, we may also add support for sharing ZVOLs over ggate.
  - There is no support for ACLs and extended attributes.
  - There is no support for booting off of ZFS file system.

Other than that, ZFS should be fully-functional.

Enjoy!

-- 
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

[Apr 2, 2007] ZFS Overview and Guide - Features

[Aug 14, 2006] Techworld.com - ZFS - the future of file systems ? By Chris Mellor

Techworld

ZFS - the Zettabyte File System - is an enormous advance in capability on existing file systems. It provides greater space for files, hugely improved administration and greatly improved data security.

It is available in Sun's Solaris 10 and has been made open source. The advantages of ZFS look so great that its use may well spread to other UNIX distributions and even, possibly and eventually, to Windows.

Techworld has mentioned ZFS before. Here we provide a slightly wider and more detailed look at it. If you want to have even more information then the best resource is Sun's own website.

Why is ZFS a good thing?
It possesses advantages compared to existing file systems in these areas:-

- Scale
- Administration
- Data security and integrity

The key area is file system administration, followed by data security and file system size. ZFS started from a realisation that the existing file system concepts were hardly changed at all from the early days of computing. Then a computer knew about a disk which had files on it. A file system related to a single disk. On today's PCs the file systems are still disk-based with the Windows C: drive - A: and B: being floppy drives - and subsequent drives being D:, E:, etc.

To provide more space and bandwidth a software abstraction was added between the file system and the disks. It was called a volume manager and virtualised several disks into a volume.

Each volume has to be administered and growing volumes and file systems takes effort. Volume Manager software products became popular. The storage in a volume is specific to a server and application and can't be shared. Utilisation of storage is poor with any unused blocks on disks in volumes being unusable anywhere else.

ZFS starts from the concept that desktops and servers have many disks and that a good place to start abstracting this is at the operating system:file system interface. Consequently ZFS delivers, in effect, just one volume to the operating system. We might imagine it as disk:. From that point ZFS delivers scale, administration and data security features that other file systems do not.

ZFS has a layered stack with a POSIX-compliant operating system interface, then data management functions and, below that, increasingly device-specific functions. We might characterise ZFS as being a file system with a volume manager included within it, the data management function.

Data security
Data protection through RAID is clever but only goes so far. When data is written to disk it overwrites the current version of the data. According to ZFS people, there are instances of stray or phantom writes, misdirected writes, DMA parity errors, disk driver bugs and accidental overwrites that the standard checksum approach won't detect.

The checksum is stored with the data block and is valid for that data block, but the data block shouldn't be there in the first place. The checksum is a disk-only checksum and doesn't cover against faults in the I/O path before that data gets written to disk.

If disks are mirrored then a block is simultaneously written to each mirror. If one drive or controller suffers a power failure then that mirror is out of synchronisation and needs re-synchronising with its twin.

With RAID if there is a loss of power between data and parity writes then disk contents are corrupted.

ZFS does things differently.

First of all it uses copy-on-write technology so that existing data blocks are not over-written. Instead new data blocks are written and their checksum stored with the pointer to them.

When a file write has been completed then the pointers to the previous blocks are changed so as to point to the new blocks. In other words the file write is treated as a transaction, an event that is atomic and has to be completed before it is confirmed or committed.

Secondly ZFS checks the disk contents looking for checksum/data mismatches. This process is called scrubbing. Any faults are corrected and a ZFS system exhibits what IBM calls autonomic computing capacity; it is self-healing.
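Scrubbing can be kicked off and monitored from the command line; a minimal sketch, assuming a pool named tank:

# zpool scrub tank        (walk every block in the pool, verify its checksum, repair from redundancy)
# zpool status -v tank    (shows scrub progress and any errors found)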

Scale
ZFS uses a 128-bit addressing scheme and can store 256 quadrillion zettabytes. A zettabyte is 2 to the power 70 bytes, or a billion TB. ZFS capacity limits are so far away as to be unimaginable. This is eye-catching stuff, but since even 64-bit file system capacity limits are unlikely to be reached for decades, it is not a deciding factor today.

Administration
With ZFS all storage enters a common pool, called a zpool. Every disk or array added to ZFS disappears into this common pool. ZFS people characterise this storage pool as being akin to a computer's virtual memory.

A hierarchy of ZFS file systems can use that pool. Each can have its own attributes set, such as compression, a growth-limiting quota, or a set amount of space.

I/O characteristics
ZFS has its own I/O system. I/Os have a priority with read I/Os having a higher priority than write I/Os. That means that reads get executed even if writes are queued up.

Write I/Os have both a priority and a deadline. The deadline is sooner the higher the priority. Writes with the same deadline are executed in logical block address order so that, in effect, they form a sequential series of writes across a disk, which reduces head movement to a single sweep across the disk surface. What's happening is that random write I/Os are getting transformed into sets of sequential I/Os to make the overall write I/O rate faster.

Striping and blocksizes
ZFS stripes files automatically. Block sizes are dynamically set. Blocks are allocated from disks based on an algorithm that takes into account space available and I/O counts. When blocks are being written, the copy-on-write approach means that a sequential set of blocks can be used, speeding up write I/O.

ZFS and NetApp's WAFL
ZFS has been based in part on NetApp's Write Anywhere File Layout (WAFL) system. It has moved on from WAFL and now has many differences. This table lists some of them. But do read the blog replies which correct some table errors.

There is more on the ZFS and WAFL similarities and differences here.

Snapshots unlimited and more
ZFS can take a virtually unlimited number of snapshots and these can be used to restore lost (deleted) files. However, they can't protect against disk crashes. For that, RAID and backup to external devices are needed.

ZFS offers compression, encryption is being developed, and an initiative is under way to make it bootable. Compression is applied before data is written, meaning that the write I/O burden is reduced and hence effective write speed is increased further.

We may see Sun offering storage arrays with ZFS. For example we might see a Sun NAS box based on ZFS. This is purely speculative, as is the idea that we might see Sun offer clustered NAS ZFS systems to take on Isilon and others in the high-performance, clustered, virtualised NAS area.

So what?
There is a lot of software engineering enthusiasm for ZFS, and the engineers at Sun say that ZFS outperforms other file systems, for example the existing Solaris file system (UFS). It is faster at file operations and, other things being equal, a ZFS Solaris system will out-perform a non-ZFS Solaris system. Great, but will it out-perform other UNIX servers and Windows servers, again with other things being equal?

We don't know. We suspect it might but don't know by how much. Even then the popularity of ZFS will depend upon how it is taken up by Sun Solaris 10 customers and whether ports to Apple and to Linux result in wide use. For us storage people the ports that really matter are to mainstream Unix versions such as AIX, HP-UX and Red Hat Linux, and also SuSE Linux I suppose.

There is no news of a ZFS port to Windows, and Vista's own advanced file system plans have quite recently been scaled back.

If Sun storage systems using ZFS, such as its X4500 'Thumper' server, with ZFS-enhanced direct-attached storage (DAS), and Honeycomb, become very popular and are as market-defining as EMC's Centera product then we may well see ZFS spreading. But their advantages have to be solid and substantial with users getting far, far better file-based application performance and a far, far lower storage system management burden. Such things need proving in practice.

To find out for yourself try these systems out or wait for others to do so.

How to reformat all of your systems and use ZFS.

1. So easy your mom could administer it

ZFS is administered by two commands, zpool and zfs. Most tasks typically require a single command to accomplish. And the commands are designed to make sense. For example, check out the commands below to create a RAID 1 mirrored filesystem and place a quota on its size.
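Something along these lines (a sketch; the pool name tank and the disk names are placeholders):

# zpool create tank mirror c1t0d0 c1t1d0    (RAID 1 mirrored pool)
# zfs create tank/home                      (a filesystem in that pool)
# zfs set quota=10G tank/home               (cap its size at 10 GB)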

2. Honkin' big filesystems

How big do filesystems need to be? In a world where 640KB is certainly not enough for computer memory, current filesystems have reached or are reaching the end of their usefulness. A 64-bit filesystem would meet today's needs, but the estimated lifetime of a 64-bit filesystem is about 10 years. Extending to 128 bits gives ZFS an expected lifetime of 30 years (UFS, for comparison, is about 20 years old). So how much data can you squeeze into a single ZFS filesystem? 16 exabytes, or 18 million terabytes. How many files can you cram into it? 200 million million.

Could anyone use a filesystem that large? No, not really. The topic has roused discussions about boiling the oceans if a real-life storage unit that size was powered on. It may not be necessary to have 128 bits, but it doesn't hurt and we won't have to worry about running out of addressable space.

3. Filesystem, heal thyself

ZFS employs 256-bit checksums end-to-end to validate data stored under its protection. Most filesystems (and you know who you are) depend on the underlying hardware to detect corrupt data and then can only nag about it if they get such a message. Every block in a ZFS filesystem has a checksum associated with it. If ZFS detects a checksum mismatch on a raidz or mirrored filesystem, it will actively reconstruct the block from the available redundancy and go on about its job.

4. fsck off, fsck

fsck has been voted out of the house. We don't need it anymore. Because ZFS data are always consistent on disk, don't be afraid to yank out those power cords if you feel like it. Your ZFS filesystems will never require you to enter the superuser password for maintenance mode.

5. Compress to your heart's content

I've always been a proponent of optional and appropriate compression in filesystems. There are some data that are well suited to compression such as server logs. Many people get ruffled up over this topic, although I suspect that they were once burned by doublespace munching up an important document. When thoughtfully used, ZFS compression can improve disk I/O which is a common bottleneck. ZFS compression can be turned on for individual filesystems or hierarchies with a very easy single command.
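For instance, assuming a filesystem tank/logs already exists, that single command is simply:

# zfs set compression=on tank/logs
# zfs get compression tank/logs    (confirm the setting; children inherit it unless overridden)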

6. Unconstrained architecture

UFS and other filesystems use a constrained model of fixed partitions or volumes, each filesystem having a set amount of available disk space. ZFS uses a pooled storage model. This is a significant departure from the traditional concept of filesystems. Many current production systems may have a single digit number of filesystems and adding or manipulating existing filesystems in such an environment is difficult.

In ZFS, pools are created from physical storage. Mirroring or the new RAID-Z redundancy exists at the pool level. Instead of breaking pools apart into filesystems, each newly created filesystem shares the available space in the pool, although a minimum amount of space can be reserved for it. ZFS filesystems exist in their own hierarchy, children filesystems inherit the properties of their parents, and each ZFS filesystem in the ZFS hierarchy can easily be mounted in different places in the system filesystem.
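A sketch of that pooled hierarchy in practice (the dataset names are placeholders): child filesystems inherit properties such as the mount point from their parent, and a reservation guarantees a dataset a minimum amount of the pool.

# zfs create tank/projects
# zfs set mountpoint=/projects tank/projects
# zfs create tank/projects/alpha               (mounts at /projects/alpha, inherits properties)
# zfs set reservation=5G tank/projects/alpha   (guarantee it at least 5 GB of the pool)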

7. Grow filesystems without green thumb

If your pool becomes overcrowded, you can grow it. With one command. On a live production system. Enough said.
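The one command, assuming a pool named tank and a pair of spare disks:

# zpool add tank mirror c4t0d0 c5t0d0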

8. Dynamic striping

On by default, dynamic striping automatically includes all devices in a pool in writes simultaneously (the stripe width spans all the available media). This will speed up I/O on systems with multiple paths to storage by load balancing the I/O across all of the paths.

9. The term "raidz" sounds so l33t
The new RAID-Z redundant storage model replaces RAID-5 and improves upon it. RAID-Z does not suffer from the "write hole" in which a stripe of data becomes corrupt because of a loss of power during the vulnerable period between writing the data and the parity. RAID-Z, like RAID-5, can survive the loss of one disk. A future release is planned using the keyword raidz2 which can tolerate the loss of two disks. Perhaps the best feature is that creating a raidz pool is crazy simple.
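Roughly this simple, with placeholder device names:

# zpool create tank raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0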

10. Clones with no ethical issues

The simple creation of snapshots and clones of filesystems makes living with ZFS so much more enjoyable. A snapshot is a read-only point-in-time copy of a filesystem which takes practically no time to create and uses no additional space at the beginning. Any snapshot can be cloned to make a read-write filesystem and any snapshot of a filesystem can be restored to the original filesystem to return to the previous state. Snapshots can be written to other storage (disk, tape), transferred to another system, and converted back into a filesystem.
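A sketch of the snapshot and clone workflow (dataset, snapshot and host names are placeholders):

# zfs snapshot tank/home@monday                    (instant read-only point-in-time copy)
# zfs clone tank/home@monday tank/home-test        (writable copy of the snapshot)
# zfs rollback tank/home@monday                    (return tank/home to its Monday state)
# zfs send tank/home@monday | ssh host2 zfs receive tank2/home    (replicate to another system)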

More information

For more information, check out Sun's official ZFS page and the detailed OpenSolaris community ZFS information. If you want to take ZFS out for a test drive, the latest version of Solaris Express has it built in and ready to go. Download it here.

[Sep 07, 2012] Linux Don't Need No Stinkin' ZFS BTRFS Intro & Benchmarks Linux Magazine

sysadmin:
One pretty good intro to ZFS is here: http://opensolaris.org/os/community/zfs/whatis/

It's a glorified bullet list with a brief description of what each feature means.

For example, one could quibble on the "snapshot of snapshots" feature:

ZFS backup and restore are powered by snapshots. Any snapshot can generate a full backup, and any pair of snapshots can generate an incremental backup. Incremental backups are so efficient that they can be used for remote replication - e.g. to transmit an incremental update every 10 seconds.

It also appears you've left some of the ZFS advantages out of the table – I'd encourage your readers to see the original blog posting. I don't know if this is in BTRFS, but it's a lifesaver:

ZFS provides unlimited constant-time snapshots and clones. A snapshot is a read-only point-in-time copy of a filesystem, while a clone is a writable copy of a snapshot. Clones provide an extremely space-efficient way to store many copies of mostly-shared data such as workspaces, software installations, and diskless clients.

mark_w

I am very glad to see this comparison, but I would like to make a few comments.

- the title/summary is poor. The comment "Linux don't need no stinkin' ZFS…“Butter FS” is on the horizon and it’s boasting more features and better performance" is misleading.

According to your summary table, you have one area in which BTRFS can be considered more featured than ZFS (snapshots of snapshots). Considering that you haven't even discussed clones (which may do the same thing or more), that ZFS is not only a filesystem but integrates LVM features into it and its utilities, that it has, right now, a wide selection of RAID modes that are actually functioning, and that it does 'automagic' adaptation to volumes of differing performance, this seems to be a claim that you get nowhere near justifying. Maybe, eventually, it will be true, but the current version doesn't have the features and there is little evidence that the 'first stable release' will either.

(Sorry, I'm not trying to say that BTRFS is 'bad', just that you are accidentally overclaiming; my guess would be that you haven't read much on ZFS and all of its features. I think if you had, you would have realised that BTRFS on its own or in its first stable release isn't going to do it. BTRFS plus on the fly compression, plus a revised MD, plus LVM may well, though, but I'm expecting to wait a bit for all of those to work ideally together.)

- the title leads you to believe that BTRFS will be shown to have a better performance than ZFS. The actual comparison is between BTRFS and ext filesystems with various configurations.

- one of the clever features of ZFS is protection against silent data corruption. This may not be vital to me now, but as filesystem sizes increase…a colleague deals with the situation in which a research project grabs nearly 1G of data every day and they have data going back over 5 years (and they intend to keep capturing data for another 50+ years). As he says, "how do I ensure that the data that we read is the data that we wrote?", and it's a good question. ZFS is good in that, admittedly obscure, situation, but I'm not sure whether BTRFS will be.

- You make (indirectly) the very good point that with the more advanced filesystems, benchmarking performance can be rather sensitive to set-up options. I am very glad that you did that testing and hope that you can do more thorough testing as we get closer to a version 1.0 release. I am not sure that I can help you with your funding submission, though…

It is also the case that the results that you get are heavily dependent on what you measure, so the more different tests you do, the better.

- another clever feature of ZFS (sorry about sounding like a stuck record, but I am impressed by ZFS and would love to be as impressed by BTRFS) is that you can make an array incorporating SSDs and hard disks and have the system automagically sort out the usage of the various storage types to give an aggregate high-performance, low-cost-per-TB array. In a commercial setting this is probably one of the biggest advantages that ZFS has over earlier technology. I'm sure that I don't have to use the 'O' word here, but this is a vital advantage that Sun has, and I'm expecting to see some interesting work on optimising this system, in particular for database accesses, soon.

(BTW, this is so effective that a hybrid array of SSDs and slow hard disks, as used in Amber Road, can be faster, cheaper and much more power-efficient than one using the usual enterprise-class SAS disks, so this can be an interesting solution in enterprise apps)

If you wish to know more about ZFS, you can do worse than read Sun's whitepapers on the subject; search on "sun zfs discovery day" to get more info; the presentation is very readable. (I'm not, btw, suggesting that this is an unbiased comparison of ZFS to anything else; its a description of the features and how-to-use.)

softweyr

One (of the many) mis-understood features of ZFS is how the file metadata works. In most existing filesystems that support arbitrary metadata, the metadata space is limited and essentially allows key/data pairs to be associated with the file. In ZFS, the metadata for each filesystem object is another entire filesystem rooted at the containing file. ZFS is, in essence, a 3D filesystem. Consider Apple's application bundles, currently implemented as a directory that the Finder application recognizes and displays specially. Using ZFS, the application file could be the icon file for the application, all of the component files, including executables, libraries, localization packs, and other app resources, would be stored in the file metadata. To the outside world, the bundle truly appears as a single file.

ZFS has its warts, for instance that storage of small files is quite inefficient, but actually besting it is going to be a long, hard slog. I suspect the big commercial Linux distributions will iron out the licensing issues somehow and include ZFS in their distributions within the next year. Linux is too big a business to be driven by the religious zealots anymore.

stoatwblr

2-and-a-bit years on and ZFS is still chugging along nicely as a working, robust filesystem (and yes, it _does_ have snapshots of snapshots) which I've found impossible to break.

Meantime, Btrfs has again trashed itself on my systems given scenarios as simple as power failure.

That's without even going into the joy of ZFS having working SSD read and write caching out front, which substantially boosts performance while keeping the robustness (even 64GB is more than adequate for a 10TB installation I use for testing).

If there's any way of reconciling CDDL and GPL then the best way forward would be to combine forces. BTRFS has a lot of nice ideas but ZFS has them too and it has the advantage of a decade's worth of actual real world deployment experience to draw on.

Less infighting among OSS devs, please.

Recommended Links


NFSv4 ACLs

WAFL

Reference

If you want to learn more about the theory behind ZFS and find reference material, have a look at the ZFS Administration Guide, OpenSolaris ZFS, ZFS BigAdmin and ZFS Best Practices.

zfs-cheatsheet

ZFS Evil Tuning Guide - Siwiki

Recommended Papers

The Musings of Chris Samuel » Blog Archive » ZFS versus XFS with Bonnie++ patched to use random data

ZFS Tutorial Part 1

managing ZFS filesystems

Humor

For a humorous introduction to ZFS' features, see presentation given by Pawel at EuroBSDCon 2007: http://youtube.com/watch?v=o3TGM0T1CvE.






Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

You can use PayPal to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: September 12, 2017