IBM's JFS was developed in the mid-1990s for AIX; it later found its way to OS/2 and then to Linux. IBM open-sourced it in 1999, and it has been available in the Linux kernel sources since 2002. It is therefore well tested, although the Linux version is rarely used.
IBM introduced JFS with the initial release of AIX 3.1. In May 2001, IBM introduced JFS2. Both filesystem types link their file and directory data to the structure used by the AIX LVM for storage and retrieval. JFS2 is optimized for a 64-bit environment. JFS2 is architected for filesystems of up to four petabytes, but it has currently only been tested up to 16-terabyte filesystems; file sizes are likewise limited to 16 terabytes. The number of inodes that can be created in a filesystem is dynamic, limited only by the amount of free space in the filesystem. JFS2 supports buffered I/O, synchronous I/O (the file is opened with the O_SYNC or O_DSYNC flags), kernel asynchronous I/O (through the Async I/O system calls), Direct I/O (on a per-file basis if the file is opened with O_DIRECT, or on a per-filesystem basis when the filesystem is mounted with the dio mount option), and Concurrent I/O (on a per-file basis if the file is opened with O_CIO, or when the filesystem is mounted with the cio mount option). With AIX, you can use either JFS or JFS2, as both are linked to the LVM; both are journaled, and no third-party filesystems are necessary. In AIX 5L Version 5.1, every filesystem corresponds to a logical volume. To create a journaled filesystem, use the SMIT fastpath smitty crfs, or run crfs from the command line. To increase the size of a filesystem, use the chfs command or SMIT.
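For concreteness, here is a minimal sketch of that workflow on AIX; the volume group name datavg and the mount point /bigdata are made-up examples, and the exact flags should be checked against the man pages for your release:

# Create a 10GB JFS2 filesystem in volume group datavg, mounted at /bigdata
crfs -v jfs2 -g datavg -m /bigdata -a size=10G -A yes
mount /bigdata
# Grow it later by 5GB while it stays mounted
chfs -a size=+5G /bigdata
# Switch the whole filesystem to Direct I/O (or cio for Concurrent I/O)
unmount /bigdata
mount -o dio /bigdata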
JFS is a fully 64-bit filesystem. With the default block size of 4KB, it supports a maximum filesystem size of 4 petabytes (less if you use smaller block sizes). The minimum filesystem size supported is 16MB. The JFS transaction log has a default size of 0.4% of the aggregate size, rounded up to a megabyte boundary; the maximum size of the log is 32MB. One interesting aspect of the on-disk layout is the fsck working space, a small area allocated within the filesystem for keeping track of block allocation when there is not enough RAM to track a large filesystem at boot time.

Here is a comparison table reproduced from Notes about Linux file systems:
| Feature | ext3 | JFS | XFS |
|---|---|---|---|
| Block sizes | 1024-4096 | 4096 | 512-4096 |
| Max fs size | 8TiB (2^43 B) | 32PiB (2^55 B) | 8EiB (2^63 B); 16TiB (2^44 B) on 32-bit systems |
| Max file size | 1TiB (2^40 B) | 4PiB (2^52 B) | 8EiB (2^63 B); 16TiB (2^44 B) on 32-bit systems |
| Max files/fs | 2^32 | 2^32 | 2^32 |
| Max files/dir | 2^32 | 2^31 | 2^32 |
| Max subdirs/dir | 2^15 | 2^16 | 2^32 |
| Number of inodes | fixed | dynamic | dynamic |
| Indexed dirs | option | auto | yes |
| Small data in inodes | no | some | auto |
| fsck speed | slow | fast | fast |
| fsck space | ? | 32B per inode | 2GiB RAM per 1TiB + 200B per inode (half on 32-bit CPU) |
| Redundant metadata | yes | yes | ? |
| Bad block handling | yes | mkfs only | no |
| Tunable commit interval | yes | no | no |
| Supports VFS lock | yes | yes | yes |
| Has own lock/snapshot | no | no | yes |
| Names | 8 bit | UTF-16 or 8 bit | 8 bit |
| noatime | yes | yes | yes |
| O_DIRECT | allegedly | allegedly | yes |
| barrier | yes | no | yes (checked) |
| commit interval | yes | no | no |
| EA/ACLs | both | both | both |
| Quotas | both | patch | both |
| DMAPI | no | patch | option |
| Case insensitive | no | mkfs only | no |
| Supported by GRUB | yes | yes | mostly |
| Can grow | online | online only | online only |
| Can shrink | offline | no | no |
| Journals data | option | no | no |
| Journals what | blocks | operations | operations |
| Journal disabling | yes | yes | no |
| Journal size | fixed | fixed | grow/shrink |
| Resize journal | offline | maybe | offline |
| Journal on another partition | yes | yes | yes |
| Special features or misfeatures | In-place convert from ext2. MS Windows drivers. | Case insensitive option. Low CPU usage. DCE DFS compatible. OS/2 compatible. | Real time (streaming) section. IRIX compatible. Very large write-behind. Superblock on block 0. |
Support for JFS was added to the 2.4.20 and 2.5.6 kernels. JFS itself is no longer used by IBM; on AIX, IBM uses JFS2.
The entire file system space is divided into logical blocks that contain file or directory data. For JFS, the logical blocks are always 4096 bytes (4K) in size, but can be optionally subdivided into smaller fragments (512, 1024 or 2048 bytes).
An i-node is a logical entity that contains information about a file or directory; there is a 1:1 relationship between i-nodes and files/directories. An i-node holds the file type, access permissions, user/group ID (UID/GID, unused on OS/2), and access times, and points to the actual logical blocks where file contents are stored. The maximum file size allowed in JFS is 2TB. Note that the number of i-nodes is fixed: it is determined at file system creation time and depends on the fragment size (which is user selectable). Users could run out of i-nodes, meaning that they would be unable to create more files even if there was enough free space; in practice this is extremely rare.
Fragments were already briefly mentioned in the discussion of logical blocks. The JFS logical block size is fixed at 4K. This is a reasonable default, but it means that the file system cannot allocate less than 4K for file storage. If a file system stores large numbers of small files (under 2K), the wasted disk space becomes significant. We have all come to know and hate this problem from FAT, where a cluster size of 32K leads to massive waste of space, in some cases over 50%. JFS attacks this by allowing fragmentation of logical blocks into smaller units, as small as 512 bytes (the sector size of hard drives; it is not possible to read or write less than 512 bytes from/to disk). However, users should be careful, because fragmentation incurs additional overhead and hence slows down disk access. I would recommend using fragments smaller than 4K only when you know for sure that you will store very large numbers of small files on the file system.
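On AIX (though not on OS/2), the fragment size and inode density are chosen when a JFS filesystem is created. A hedged sketch, with made-up names (frag and nbpi are real crfs attributes for JFS, but check the crfs man page for your level):

# JFS filesystem with 512-byte fragments and one i-node per 1024 bytes,
# sized for a tree of very many small files
crfs -v jfs -g rootvg -m /smallfiles -a size=1G -a frag=512 -a nbpi=1024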
The entire JFS volume space is subdivided into allocation groups. Each allocation group contains i-nodes and data blocks. This enables the file system to store i-nodes and their associated data in physical proximity (HPFS uses a very similar technique). The allocation group size varies from 8MB to 64MB and depends on fragment size and number of fragments it contains.
JFS uses a special log device to implement a circular journal. On AIX, several JFS volumes can share a single log device. I'm not sure this is possible on OS/2; I believe each JFS volume (corresponding to a drive letter) has its own 'inline' log located inside the JFS volume, whose size is selectable at FORMAT time.
It is important to note that JFS does not log (or journal) everything. It only logs changes to file system meta-data. Simply speaking, the log contains a record of changes to everything in the file system except actual file data, i.e., changes to the superblock, i-nodes, directories, and allocation structures. Clearly there must be some overhead here, and indeed performance may suffer when applications are doing lots of synchronous (uncached) I/O or creating and/or deleting many files in a short amount of time. The performance loss is, however, not noticeable in most cases and is well worth the increased security.
The log (or journal) occupies a dedicated area on disk and is written to immediately when any meta-data change occurs. When the disk becomes idle, the actual file system structure is updated according to the log. After a crash, all it usually takes to restore the file system to full consistency is replaying the log, i.e., performing the recorded transactions. Of course, if a process was in the middle of writing a file when the system crashed or power died, the file could be inconsistent (the application might not be able to read it again), but you will lose neither this file nor other files, as is often the case with other file systems.
| Characteristic | Journaled File System (JFS) | 386 High Performance File System (386HPFS) | High Performance File System (HPFS) | FAT File System |
|---|---|---|---|---|
| Max volume size | 2TB | 64GB | 64GB | 2GB |
| Max file size | 2TB | 2GB | 2GB | 2GB |
| Allows spaces and periods in file names | Yes | Yes | Yes | No (8.3 format) |
| Standard directory and file attributes | Within file system | Within file system | Within file system | Within file system |
| Extended Attributes (64KB text or binary data with keywords) | Within file system | Within file system | Within file system | In separate file |
| Max path length | 260 characters 1) | 260 characters | 260 characters | 64 characters |
| Bootable | No 2) | Yes | Yes | Yes |
| Allows dynamic volume expansion | Yes | No | No | No |
| Scales with SMP | Yes | No | No | No |
| Local security support | No | Yes | No | No |
| Average wasted space per file | 256 to 2048 bytes | 256 bytes | 256 bytes | 1/2 cluster (1KB to 16KB) |
| Allocation information for files | Near each file in its i-node | Near each file in its FNODE | Near each file in its FNODE | Centralized near volume beginning |
| Directory structure | Sorted B+tree | Sorted B-tree | Sorted B-tree, must be searched exhaustively | Unsorted linear |
| Directory location | Close to files it contains | Near seek center of volume | Near seek center of volume | Root directory at beginning of volume; others scattered |
| Write-behind (lazy write) | Optional | Optional | Optional | Optional |
| Maximum cache size | Physical memory available | Physical memory available | 2MB | 14MB |
| Caching program | None (parameters set in CONFIG.SYS) | CACHE386.EXE | CACHE.EXE | None (parameters set in CONFIG.SYS) |
| LAN Server access control lists | Within file system | Within file system | In separate file (NET.ACC) | In separate file |
1) JFS stores file and directory names in Unicode. This allows JFS to always maintain proper sort order, regardless of the active codepage.
2) This is not a permanent limitation; it is only that nobody has written a JFS micro- and mini-IFS yet.
It might perhaps interest some users that JFS also seems to have built-in support for DASD limits, although I have never tried to use this feature. DASD limits, a.k.a. the Directory Limits feature of LAN Server, allow administrators to control how much space a directory can take, effectively enabling them to limit the disk space usage of users. Previously this feature only worked on HPFS386 volumes. Obviously this is of no use to home users who have all their disk space to themselves, but it can be very useful for system administrators.
June 2006 | by Shiv Dutta
Note: This article can also be found on the IBM Developerworks Web site (www-128.ibm.com/developerworks/eserver/library/es-aix5l-lvm.html).
When this article was first published in April 2005 under the title Logical Volume Manager in AIX 5L Version 5.3, it discussed a number of features that were introduced in AIX 5L* Version 5.3 to enhance the scope, functionality, and performance of the Logical Volume Manager (LVM). The next major enhancements to AIX 5L were introduced in the 5300-03 maintenance level, which was released in September 2005. This article is an updated and expanded version of the April 2005 publication. While the original content has been retained almost in its entirety, the article has been augmented by including a discussion of some of the LVM enhancements introduced in the 5300-03 maintenance level. Also, its scope has been broadened to cover a number of improvements, introduced both in the original release of the AIX 5L Version 5.3 and the 5300-03 maintenance level, to the Enhanced Journal File System (JFS2). In the following discussions, I use the expression (5300-03) to indicate that the referenced feature is available only for the 5300-03 maintenance level and beyond.
LVM command enhancements
In AIX 5L Version 5.3, changes have been made to the following LVM commands to enhance their performance, so that they require less execution time than their counterparts in prior releases of AIX:
- extendvg
- importvg
- mkvg
- varyonvg
- chlvcopy
- mklvcopy
- lslv
- lspv
Concurrent mode (classical and enhanced)
The classical concurrent mode volume groups (VGs) only supported Serial DASD and SSA disks in conjunction with the 32-bit kernel. AIX 5L Version 5.1 overcame the restriction on supported disk types by introducing the so-called enhanced concurrent mode VG, which extended concurrent mode support to all other disk types. While AIX 5L Version 5.2 did not allow the creation of classical concurrent mode VGs, it still supported them. Support for classical concurrent mode VGs has been completely removed from AIX 5L Version 5.3; when you try to import a classical concurrent mode VG in AIX 5L Version 5.3, an error message informs you to convert the VG to enhanced concurrent mode.
VGs (normal, big, and scalable)
The VG type commonly known as standard or normal allows a maximum of 32 physical volumes (PVs). A standard or normal VG supports no more than 1016 physical partitions (PPs) per PV and has an upper limit of 256 logical volumes (LVs) per VG. Subsequently, a new VG type was introduced, referred to as big VG. A big VG allows up to 128 PVs and a maximum of 512 LVs.
AIX 5L Version 5.3 has introduced a new VG type called scalable volume group (scalable VG). A scalable VG allows a maximum of 1024 PVs and 4096 LVs. The maximum number of PPs applies to the entire VG and is no longer defined on a per-disk basis. This opens up the prospect of configuring VGs with a relatively small number of disks but fine-grained storage allocation options through a large number of small PPs. The scalable VG can hold up to 2,097,152 (2048 K) PPs. As with the older VG types, the PP size is specified in units of megabytes and must be a power of 2. The range of PP sizes starts at 1 (1 MB) and goes up to 131,072 (128 GB): more than two orders of magnitude above the 1024 (1 GB) maximum for both normal and big VG types in AIX 5L Version 5.2. The new maximum PP size provides architectural support for 256-petabyte disks. Table 1 below shows the variation of configuration limits with different VG types. Note that the maximum number of user-definable LVs is the maximum number of LVs per VG minus 1, because one LV is reserved for system use. Consequently, system administrators can configure 255 LVs in normal VGs, 511 in big VGs, and 4095 in scalable VGs.
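Table 1 itself is not reproduced in this copy; reconstructed from the limits quoted above (the per-PV PP limit for big VGs is assumed to match the normal VG's 1016), it looks like this:

| VG type | Max PVs | Max LVs (user-definable) | Max PPs | Max PP size |
|---|---|---|---|---|
| Normal | 32 | 256 (255) | 1016 per PV | 1GB |
| Big | 128 | 512 (511) | 1016 per PV | 1GB |
| Scalable | 1024 | 4096 (4095) | 2,097,152 per VG | 128GB |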
The scalable VG implementation in AIX 5L Version 5.3 provides configuration flexibility with respect to the number of PVs and LVs that can be accommodated by a given instance of the new VG type. The configuration options allow any scalable VG to contain 32, 64, 128, 256, 512, 768, or 1024 disks and 256, 512, 1024, 2048, or 4096 LVs. You do not need to configure the maximum values of 1024 PVs and 4096 LVs at the time of VG creation to account for potential future growth. You can always increase the initial settings at a later date as required.
The System Management Interface Tool (SMIT) and the Web-based System Manager graphical user interface fully support the scalable VG. Existing SMIT panels related to VG management tasks have been changed, and many new panels have been added, to account for the scalable VG type. For example, you can use the new SMIT fast path mksvg to directly access the Add a Scalable VG SMIT menu. The user commands mkvg, chvg, and lsvg have been enhanced to support the scalable VG type.
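A hedged sketch of working with scalable VGs (disk and VG names are examples; chvg -G requires the VG to be varied off, and the flags should be verified against the 5.3 man pages):

# Create a scalable VG with 64MB physical partitions
mkvg -S -y scalevg -s 64 hdisk4 hdisk5
# Convert an existing normal or big VG to the scalable type
varyoffvg datavg
chvg -G datavg
varyonvg datavg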
Striped column support for LVs
AIX 5L Version 5.3 provides striped column support for LVs. This new feature allows extension of a striped LV even if one of the PVs in the disk array becomes full. In previous AIX releases, you could enlarge a striped LV with the extendlv command as long as enough PPs were available within the group of disks that defined the redundant array of independent disks (RAID) array. The only way to expand a striped LV beyond the hard limits imposed by the disk capacities was to rebuild the entire LV: back up and delete the striped LV, recreate it with a larger stripe width, and then restore the LV data. To overcome the disadvantages of this time-consuming procedure, AIX 5L Version 5.3 introduced the concept of striped columns for LVs.
Prior to AIX 5L Version 5.3, the stripe width of a striped LV was determined at LV creation time by either of the following two methods:
- Direct specification of all PV names
- Specification of the maximum number of PVs allocated to the striped LV
Prior versions of AIX 5L also did not allow you to configure a striped LV with an upper bound larger than the stripe width. In AIX 5L Version 5.3, the upper bound can be a multiple of the stripe width; one set of disks, as determined by the stripe width, is considered one striped column. Note that the upper bound value is not related to the number of mirror copies in case you are using a RAID 10 configuration.
If you use the extendlv command to extend a striped LV beyond the physical limits of the first striped column, AIX uses an entirely new set of disks to fulfill the allocation request for additional logical partitions. If you further expand the LV, more striped columns might get added as required, as long as you stay within the upper bound limit. The chlv -u command allows you to increase the upper bound to provide additional headroom for striped LV expansion. You can also use the -u flag of the enhanced extendlv command to raise the upper bound and extend the LV in one operation.
The user commands mklv, chlv, extendlv, and mklvcopy have been enhanced to support the introduction of the striped column feature in AIX 5L Version 5.3.
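A hedged sketch of the striped column workflow (LV and disk names are examples, and the flag spellings should be checked against the man pages):

# Striped LV with stripe width 2 and 64K stripe size; upper bound 4
# leaves headroom for a second striped column
mklv -y stripedlv -S 64K -u 4 datavg 100 hdisk1 hdisk2
# Extending past hdisk1/hdisk2 allocates a new striped column
extendlv stripedlv 50
# Raise the upper bound and extend in one operation (5.3 and later)
extendlv -u 6 stripedlv 50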
Volume group pbuf pools
The LVM uses a construct named pbuf to control a pending disk I/O. A pbuf is a pinned memory buffer. The LVM always uses one pbuf for each individual I/O request, regardless of the amount of data transferred. AIX creates extra pbufs when a new PV is added to a VG. In previous AIX releases, the pbuf pool was a system-wide resource; with AIX 5L Version 5.3, the LVM assigns and manages one pbuf pool per VG. This enhancement supports advanced scalability and performance for systems with a large number of VGs and applies to all VG types. As a consequence of the new pbuf pool implementation, AIX displays and manages additional LVM statistics and tuning parameters.
AIX 5L Version 5.3 now includes the lvmo command, which supports the new pbuf pool-related administrative tasks. You can use the lvmo command to display pbuf and blocked I/O statistics and settings for pbuf tunables, regardless of whether the scope of the entity is system-wide or VG-specific. However, the lvmo command only allows changing the settings of the LVM pbuf tunables that are dedicated to specific VGs. The ioo command continues to manage the sole pbuf tunable with system-wide scope, and the vmstat -v command still displays the system-wide number of I/Os that were blocked due to a lack of free pbufs, as in prior releases of AIX.
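A hedged example of the new command (the VG name is made up, and pv_pbuf_count is the per-VG tunable name as I recall it; verify with lvmo -a on your system):

# Display pbuf statistics and tunables for one VG
lvmo -v datavg -a
# Raise the per-PV pbuf count for that VG only
lvmo -v datavg -o pv_pbuf_count=1024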
Variable logical track group
When the LVM receives a request for an I/O, it breaks the I/O down into what are called logical track group (LTG) sizes before passing the request down to the device driver of the underlying disks. The LTG is the maximum transfer size of an LV and is common to all the LVs in the VG. AIX 5L Version 5.2 accepted LTG values of 128 KB, 256 KB, 512 KB, and 1024 KB. However, many disks now support transfer sizes larger than 1 MB. To take advantage of these larger transfer sizes and get better disk I/O performance, AIX 5L Version 5.3 accepts values of 128 KB, 256 KB, 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, and 16 MB for the LTG size.
In contrast to previous releases, AIX 5L Version 5.3 also allows the stripe size of an LV to be larger than the LTG size in use, and it expands the range of valid stripe sizes significantly: Version 5.3 adds support for 2 MB, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, and 128 MB stripe sizes to complement the 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, and 1 MB stripe size options available in prior releases of AIX. In AIX 5L Version 5.2, the LTG size was set by the -L flag of the chvg or mkvg command. In AIX 5L Version 5.3, it is set by the varyonvg command using the -M flag. The LTG size thus created is called the variable LTG size.
The following command sets the LTG size of the tmpvg VG at 512 KB:
# varyonvg -M512K tmpvg
The LTG size is specified either in K or M units, meaning KB or MB respectively. When the LTG size is set using the -M flag, the varyonvg and extendvg commands might fail if an underlying disk has a maximum transfer size that is smaller than the LTG size. To find out the maximum supported LTG size of your hard disk, you can use the lquerypv command with the -M flag. The output gives the LTG size in KB, as shown in the example below.
# /usr/sbin/lquerypv -M hdisk0
256
The lspv command displays the same value as MAX REQUEST, as shown in Listing 1.
You can list the value of the LTG in use with the lsvg command, as shown in Listing 2.
Note that the LTG size for a VG created in AIX 5L Version 5.3 is displayed as Dynamic in the lsvg command output, as shown in Listing 2. By default, AIX 5L Version 5.3 creates VGs with a variable LTG size. If you want to import such a VG into a previous release of AIX, you first need to disable the variable LTG by using the -I option of mkvg or chvg, and then do a varyoffvg followed by exportvg; otherwise, the importvg command on the previous release fails.
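Listings 1 and 2 are not reproduced in this copy; roughly, and with illustrative values, the relevant output lines look like this:

# lspv hdisk0 | grep "MAX REQUEST"
MAX REQUEST:        256 kilobytes
# lsvg tmpvg | grep "LTG size"
LTG size (Dynamic): 256 kilobyte(s)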
Geographic Logical Volume Manager (GLVM) (5300-03)
It extends the LVM mirroring function, supporting a copy of a logical volume on a remote AIX system connected over a TCP/IP network. A complete copy of application data can be quickly and easily brought back online on a remote system.
The mirscan command (5300-03)
This command searches for and corrects physical partitions that are stale or unable to perform I/O operations. This is useful for the following type of situations:
- A physical partition on the underlying storage is incapable of performing I/O operations but, for a long time, no I/O operations have been attempted for that physical partition. The customer needs a way to detect and correct this condition.
- A disk is about to be replaced. The customer needs to make sure they are not about to remove the last good copy of their data from the system.
Multiple instances of AIX on a single root volume group (multibos) (5300-03)
This feature allows the user to create a new instance of the AIX Base Operating System (BOS) within the running rootvg. This new instance, based on the running rootvg, contains private and shared data. A similar offering already available is Alternate Disk Installation. While somewhat similar, multibos varies in a few very important aspects:
- The new instance is built from the running root volume group (similar to the alt_disk_install clone operation).
- The new instance is housed within the current root volume group (for example, the same disks).
- Certain data within the rootvg might be shared between the instances.
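A hedged sketch of a typical multibos session (flags as I recall them from the 5300-03 documentation; check the multibos man page):

# Create a standby BOS instance inside the running rootvg,
# auto-expanding filesystems as needed
multibos -Xs
# The same operation in preview mode
multibos -Xsp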
Rollback function (available for JFS2 file system only) (5300-03)
Restores an entire file system to a valid point-in-time snapshot (the target snapshot). Rollback attempts to restore the snapshots present at the time of the target snapshot; snapshots taken after the target snapshot are lost.
Disk quotas support for JFS2
AIX 5L Version 5.3 extends JFS2 functionality by implementing disk usage quotas to control the use of persistent storage. Disk quotas might be set for individual users or groups on a per-filesystem basis. Version 5.3 also introduces the concept of Limit Classes: it allows the configuration of per-filesystem limits, provides a method to remove old or stale quota records, and offers comprehensive support through dedicated SMIT panels. It also provides a method to define a set of hard and soft disk block and file allocation limits, as well as the grace periods before the soft limit becomes enforced as the hard limit.
The quota support for JFS2 and JFS can be used on the same system.
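A hedged sketch of enabling JFS2 quotas (the path is an example, and the j2edlimit flag should be verified against the man page):

# Enable user and group quotas on a JFS2 filesystem
chfs -a quota=userquota,groupquota /home
quotaon /home
# Edit the Limit Classes (hard/soft block and file limits, grace periods)
j2edlimit -e /home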
Shrink a file system
AIX 5L Version 5.3 supports shrinking a JFS2 file system dynamically. When the size of the file system is decreased, the LV on which the file system resides is also decreased.
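Shrinking uses the same chfs syntax as growing, but with a negative size delta; a minimal example:

# Shrink a mounted JFS2 filesystem by 2GB; the underlying LV shrinks too
chfs -a size=-2G /data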
JFS2 logredo scalability
AIX 5L Version 5.3 provides enhancements in the area of logredo to improve performance and to support large numbers of file systems.
JFS2 file system check scalability
AIX 5L Version 5.3 enhances the implementation of the helper that performs the file system check for JFS2 file systems. The new code makes better use of system resources and includes algorithms that improve scalability and performance.
JFS2 ACL support for NFS V4
Starting with AIX 5L Version 5.3, the Enhanced Journaled File System now supports
ACLs for NFS version 4. This allows you to establish fine-grained access control
for file system objects and support inheritance features.
Conclusion
AIX 5L Version 5.3 has many more features than have been discussed here. I hope
this article has given you a flavor of the type of enhancements you can expect in
the latest release of AIX.
IBM Systems Magazine is a trademark of International Business Machines Corporation. The editorial content of IBM Systems Magazine is placed on this website by MSP TechMedia under license from International Business Machines Corporation.
©2009 MSP Communications, Inc. All rights reserved.
... JFS dynamically allocates space for disk inodes, freeing the space when it is no longer required, which eliminates the possibility of running out of inodes due to a large number of small files. As far as I can tell, JFS is the only filesystem in the kernel with this feature. For performance and efficiency, the contents of small directories are stored within the directory's inode: up to eight entries are stored in-line within the inode, excluding the self (.) and parent (..) entries. Larger directories use a B+ tree keyed on name for faster retrieval. Internally, JFS uses extents to allocate blocks to files, leading to efficient use of space even as files grow in size. This is also available in XFS, and is a major new feature in ext4.
JFS supports both sparse and dense files. Sparse files allow data to be written to random locations within a file without writing intervening file blocks. JFS reports the file size as the largest used block, while only allocating actually used blocks. Sparse files are useful for applications that require a large logical space but use only a portion of the space. With dense files, blocks are allocated to fill the entire file size, whether data is written to them or not.
In addition to the standard permissions, JFS supports basic extended attributes, such as the immutable (i) and append-only (a) attributes. I was able to successfully set and test them with the lsattr and chattr programs. I could not find definitive information on JFS access control list support under Linux.
Logging
The main design goal of JFS was to provide fast crash recovery for large filesystems, avoiding the long filesystem check (fsck) times of older Unix filesystems. That was also the primary goal of filesystems like ext3 and ReiserFS. Unlike ext3, journaling was not an add-on to JFS, but baked into the design from the start. For high-performance applications, the JFS transaction log file can be created on an external volume if one is specified when the filesystem is first created.
JFS only logs operations on meta-data, maintaining the consistency of the filesystem structure, but not necessarily the data. A crash might result in stale data, but the files should remain consistent and usable.
Here is a list of the filesystem operations logged by JFS:
- File creation (create)
- Linking (link)
- Making directory (mkdir)
- Making node (mknod)
- Removing file (unlink)
- Rename (rename)
- Removing directory (rmdir)
- Symbolic link (symlink)
- Truncating regular file
Utilities
JFS provides a suite of utilities to manage its filesystems. You must be the root user to use them.
| Utility | Description |
|---|---|
| jfs_debugfs | Shell-based JFS filesystem editor. Allows changes to the ACL, uid/gid, mode, time, etc. You can also alter data on disk, but only by entering hex strings -- not the most efficient way to edit a file. |
| jfs_fsck | Replays the JFS transaction log, checks and repairs a JFS device. Should be run only on an unmounted or read-only filesystem. Run automatically at boot. |
| jfs_fscklog | Extracts a JFS fsck service log into a file. jfs_fscklog -e /dev/hda6 extracts the binary log to the file fscklog.new; to view it, use jfs_fscklog -d fscklog.new. |
| jfs_logdump | Dumps the journal log to a plain text file that shows data on each transaction in the log file. |
| jfs_mkfs | Creates a JFS formatted partition. Use the -j journal_device option to create an external journal (1.0.18 or later). |
| jfs_tune | Adjusts tunable filesystem parameters on JFS. I didn't find options that looked like they might improve performance. The -l option lists the superblock info. |

Here is what a dump of the superblock information looks like:

root@slackt41:~# jfs_tune -l /dev/hda6
jfs_tune version 1.1.11, 05-Jun-2006
JFS filesystem superblock:
JFS magic number:       'JFS1'
JFS version:            1
JFS state:              mounted
JFS flags:              JFS_LINUX JFS_COMMIT JFS_GROUPCOMMIT JFS_INLINELOG
Aggregate block size:   4096 bytes
Aggregate size:         12239720 blocks
Physical block size:    512 bytes
Allocation group size:  16384 aggregate blocks
Log device number:      0x306
Filesystem creation:    Wed Jul 11 01:52:42 2007
Volume label:           ''
Crash testing
White papers and man pages are no substitute for the harsh reality of a server room. To test the recovery capabilities of JFS, I started crashing my system (forced power off) with increasing workloads. I repeated each crash twice to see if my results were consistent.
| Crash workload | Recovery |
|---|---|
| Console (no X) running a text editor with one open file | About 2 seconds to replay the journal log. Changes I had not saved in the editor were missing, but the file was intact. |
| X window system with KDE, GIMP, Nvu, and a text editor in xterm, all with open files | About 2 seconds to replay the journal log. All open files were intact; unsaved changes were missing. |
| X window system with KDE, GIMP, Nvu, and a text editor, all with open files, plus a shell script that inserted records into a MySQL (ISAM) table. The script I wrote was an infinite loop, and I let it run for a couple of minutes to make sure some records were flushed to disk. | About 3 seconds to replay the journal log. All open files intact, database intact with a few thousand records inserted, but the timestamp on the table file had been rolled back one minute. |

In all cases, these boot messages appeared:

**Phase 0 - Replay Journal Log
(a spinner appeared for a couple of seconds, then went away)
Filesystem is clean

Throughout the crash testing, I saw no filesystem corruption, and the longest log replay time I experienced was about 3 seconds.
Conclusion
While my improvised crash tests were not a good simulation of a busy server, JFS held up well, and recovery time was fast. All file-level applications I tested, such as tar and rsync, worked flawlessly, and lower-level programs like TrueCrypt also worked as expected.
After 30 days of kicking and prodding, I have a high level of confidence in JFS, and I am content trusting my data to it. JFS may not have been marketed as effectively as other alternatives, but it is a solid choice in the long list of quality Linux filesystems.
This is a summary in my own words of this more detailed description of JFS data structures. But there is a much better PDF version of the same document, with inline illustrations, also available inside this RPM from SUSE.
- Basic entities
- Partition
- A partition is a container, and has merely a size and a sector size, also called a partition block size, which defines I/O granularity (and is usually the same for all partitions on the physical medium); a partition contains nothing but an aggregate.
- Extent
- A contiguous sequence of blocks, wholly contained in one allocation group. The maximum size of an extent is 2^24 - 1 blocks, or almost 64GiB. There are a few types of extents; one of them is ABNR, which describes an extent containing zero bytes only.
- Map
- A map is a collection of extents that contains a B+-tree index rooted in the first extent of the collection; for example, it can be an index of extents for a file body, in which case it is an allocation map, or an index of inode names for a directory, in which case it is called a directory map; the extents in a map are described in the map itself. The root extent of the map is called btree, and the leaf extents are called xtrees (and contain an array of entries called xads) if they are for an allocation map, and dtrees if they are for a directory map.
- File body
- A file body is a sequence of one or more extents, the extents being listed in an allocation map. The extents may be from different allocation groups.
- Inode
- An inode is a 512-byte descriptor for the attributes of a file or directory; it also contains the root of a file body's allocation map, or of a directory map.
- Aggregates
- Aggregate
- An aggregate is about allocating space, and has a size and an aggregate block size, which defines the granularity of allocation of space to files, and currently must be 4096.
- Aggregates have a primary and a backup superblock.
- Aggregates contain one or more allocation groups.
- Aggregates have primary and backup aggregate inode tables, each of which must be exactly one inode table extent (32 inodes) long.
- Aggregates may contain one or more filesets, but currently only one is allowed.
- Aggregates also have some space reserved for use by
jfs_fsck
.- Allocation group
- An allocation group, also known as an AG, is merely a section of an aggregate. There is no data structure associated with an allocation group, all belong either to the aggregate or to a fileset.
- There can be up to 128 AGs in an aggregate, and each must be at least 8192 blocks or 32MiB.
- Each allocation group must contain a number of blocks that is a power-of-2 multiple of the number of block descriptors in a dmap page.
- If multiple files are growing, each allocates extents from a different allocation group if possible.
- Aggregate inode table
- The aggregate inode table is an inode allocation map for the inodes that are used internally by the aggregate and are not user visible (that is, are not part of any fileset). The inodes defined in the table are:
- Number 0 is reserved.
- Number 1 is the aggregate inode table itself.
- Number 2 is the block allocation map file.
- Number 3 is the inline log file.
- Number 4 is the bad blocks file.
- Number 16 is the fileset root file.
Since the aggregate inode table file refers to itself, the first extent of its inode allocation map has a well-known constant address (just after the superblock).
- Block allocation map
- The block allocation map, also called bmap, is a file (not a B+-tree, despite being called a map) divided into 4KiB pages. The first block is the bmap control page, and then there are up to three levels of dmap control pages that point to many dmap pages. Each dmap page contains:
- Two arrays of 2^13 bits, where each bit corresponds to a block of the aggregate and is 1 if the block is in use. Because of the limit of three levels of dmap control pages, there can be at most 2^30 dmap pages, and thus at most 2^43 blocks in an aggregate.
- Some metadata, including a buddy tree that defines a buddy system of the free and allocated blocks. The buddy tree also extends upwards into the dmap control pages.
The block allocation map contains information that is redundant with that of the inode allocation maps, so it can be fully reconstructed, but only with a full scan of the aggregate and fileset inode tables.
- Inline log
- A sequence of blocks towards the end of an aggregate that is used to record intended modifications to aggregate or fileset metadata.
- Bad blocks
- This is a file whose extents cover all the bad blocks discovered by jfs_fsck, if any.
- Inode allocation maps
- Inode allocation map
- An inode allocation map is the file body of an inode table file, not a map. This file body contains as its first 4KiB block a control page called dinomap, and after that a number of extents called inode allocation groups.
The dinomap contains the following structures, which segment the information held in the inode allocation map by allocation group:
- The AG free inode lists array.
- The AG free inode extents lists array.
- The IAG free list.
- The IAG free next.
- AG free inode lists array
- The AG free inode lists array contains a list header for each AG. Each list threads together all the IAGs in that AG that have some free inode entries.
- AG free inode extents lists array
- The AG free inode extents lists array contains a list header for each AG, and each list threads together all the IAGs in an AG that have some free inode extents.
- IAG free list
- The IAG free list array contains a list header for each AG, and each list links together those IAGs in the AG whose inodes are all free.
- IAG free next
- The IAG free next is the number of the next IAG to append (if required) to an inode allocation map, or equivalently the number of IAGs in an inode allocation map plus 1.
- Inode allocation group
- An inode allocation group, also called an IAG, is a 4KiB block that describes up to 128 inode table extents, for a total of up to 4096 inode table entries.
An inode allocation group can be in any allocation group, but all the inode table extents it describes must be in the same allocation group as the first one, unlike the extents of a general-purpose file body, which can be in any allocation group; as soon as its first inode table extent is allocated in an allocation group, the inode allocation group is tied to it, until all such extents are freed.
Once allocated, inode allocation groups are never freed, but their inode table extents may be freed.
- Inode table extent
- Inode table extents are pointed to by inode allocation groups; each must be 16KiB in length and contains 32 inode table entries.
- Filesets
- Fileset
- A fileset is a collection of named inodes. Filesets are defined by a fileset inode table, which is an inode allocation map file. It contains these inodes:
- Number 0 is reserved.
- Number 1 is a file containing extended fileset information.
- Number 2 is a directory which is the root of the fileset naming tree.
- Number 3 is a file containing the ACL for the fileset.
- Number 4 and following are used for the other files or directories in the fileset, all must be reachable from the directory at number 2.
- File
- A file is an inode with an attached (optional) allocation map describing a file body that contains data; a particular case of a file is a symbolic link, where the data in the file is a path name.
- Directory
- A directory is an inode with a list of names and corresponding inode numbers; the list is either contained entirely within the inode, if it is small, or is an attached directory map containing dtree entries.
Older references are not entirely accurate, because things in kernel 2.6 are considerably better than in kernel 2.4, and filesystem maintainers have reacted to older unfavorable benchmarks by tuning their designs. The references below are therefore ordered most recent first.
ext3 FAQ, 2004-10-14 (the source of the feature comparison table reproduced earlier on this page).