(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix |
Designed for educational purposes, the original Linux file system was limited to 64 MB in size and supported file names up to 14 characters.
In 1992, the ext file system was created, and increased the file system size to 2 GB and file name length to 255 characters. However, file access, modification, and creation times were missing from file system data structures and performance tended to be low.
Modeled after the Berkeley Fast File System, the ext2 file system used a better on-disk layout, extended the file system size limit to 4 TB and the file name length to 255 bytes, delivered improved performance, and emerged as the de facto standard file system for Linux environments.
More information on the logging capabilities of the ext3 file system can be found in "EXT3, Journaling File System" by Dr. Stephen Tweedie, located at http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
The ext2 filesystem is supported on Windows through a special driver, Ext2 IFS For Windows. This helps to avoid the 4 GB file size limit of FAT32 when you need to exchange files larger than 4 GB, for example DVD ISO images.
An evolution of the ext2 file system, the ext3 file system added logging capabilities to facilitate fast reboots following system crashes. Key features of the ext3 file system include:
Logging in the ext3 File System
The ext3 file system supports different levels of journaling, which can be specified as mount options. These options can impact data integrity and performance.
Here is the ext3.txt document that comes with the kernel:
Ext3 Filesystem
Ext3 was originally released in September 1999. Written by Stephen Tweedie for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie.
Ext3 is the ext2 filesystem enhanced with journalling capabilities.
Options
When mounting an ext3 filesystem, the following options are accepted: (*) == default
journal=update Update the ext3 file system's journal to the current format.
journal=inum When a journal already exists, this option is ignored. Otherwise, it specifies the number of the inode which will represent the ext3 file system's journal file.
journal_dev=devnum When the external journal device's major/minor numbers have changed, this option allows the user to specify the new journal location. The journal device is identified through its new major/minor numbers encoded in devnum.
noload Don't load the journal on mounting.
data=journal All data are committed into the journal prior to being written into the main file system.
data=ordered (*) All data are forced directly out to the main file system prior to its metadata being committed to the journal.
data=writeback Data ordering is not preserved, data may be written into the main file system after its metadata has been committed to the journal.
commit=nrsec (*) Ext3 can be told to sync all its data and metadata every 'nrsec' seconds. The default value is 5 seconds. This means that if you lose your power, you will lose as much as the latest 5 seconds of work (your filesystem will not be damaged though, thanks to the journaling). This default value (or any low value) will hurt performance, but it's good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance.
barrier=1 This enables/disables barriers. barrier=0 disables it, barrier=1 enables it.
orlov (*) This enables the new Orlov block allocator. It is enabled by default.
oldalloc This disables the Orlov block allocator and enables the old block allocator. Orlov should have better performance - we'd like to get some feedback if it's the contrary for you.
user_xattr Enables Extended User Attributes. Additionally, you need to have extended attribute support enabled in the kernel configuration (CONFIG_EXT3_FS_XATTR). See the attr(5) manual page and http://acl.bestbits.at/ to learn more about extended attributes.
nouser_xattr Disables Extended User Attributes.
acl Enables POSIX Access Control Lists support. Additionally, you need to have ACL support enabled in the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL). See the acl(5) manual page and http://acl.bestbits.at/ for more information.
noacl This option disables POSIX Access Control List support.
reservation
noreservation
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
check=none, nocheck Don't do extra checking of bitmaps on mount.
debug Extra debugging information is sent to syslog.
errors=remount-ro (*) Remount the filesystem read-only on an error.
errors=continue Keep going on a filesystem error.
errors=panic Panic and halt the machine if an error occurs.
grpid, bsdgroups Give new objects the group ID of the directory in which they are created.
nogrpid, sysvgroups (*) New objects have the group ID of their creator.
resgid=n The group ID which may use the reserved blocks.
resuid=n The user ID which may use the reserved blocks.
sb=n Use alternate superblock at this location.
quota, noquota, grpquota, usrquota
bh (*), nobh ext3 associates buffer heads to data pages to (a) cache disk block mapping information and (b) link pages into transactions to provide ordering guarantees. The "bh" option forces use of buffer heads. The "nobh" option tries to avoid associating buffer heads (supported only for "writeback" mode).
Specification
Ext3 shares its entire on-disk layout with the ext2 filesystem and adds transaction capabilities to ext2. Journaling is done by the Journaling Block Device layer.
Journaling Block Device layer
The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed to add journaling capabilities to a block device. The ext3 filesystem code will inform the JBD of modifications it is performing (called a transaction). The journal supports the transactions start and stop, and in case of a crash, the journal can replay the transactions to quickly put the partition back into a consistent state.
Handles represent a single atomic update to a filesystem. JBD can handle an external journal on a block device.
Data Mode
There are 3 different data modes:
* writeback mode
In data=writeback mode, ext3 does not journal data at all. This mode provides a similar level of journaling as that of XFS, JFS, and ReiserFS in its default mode - metadata journaling. A crash+recovery can cause incorrect data to appear in files which were written shortly before the crash. This mode will typically provide the best ext3 performance.
* ordered mode
In data=ordered mode, ext3 only officially journals metadata, but it logically groups metadata and data blocks into a single unit called a transaction. When it's time to write the new metadata out to disk, the associated data blocks are written first. In general, this mode performs slightly slower than writeback but significantly faster than journal mode.
* journal mode
data=journal mode provides full data and metadata journaling. All new data is written to the journal first, and then to its final location. In the event of a crash, the journal can be replayed, bringing both data and metadata into a consistent state. This mode is the slowest, except when data needs to be read from and written to disk at the same time, where it outperforms all other modes.
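As a minimal sketch of how one of the three data modes is selected at mount time, the helper below validates the mode name and prints the matching mount command instead of running it, since mounting requires root. The function name `ext3_mount_cmd`, the device and the mount point are illustrative assumptions and not part of ext3.txt.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the mount command for an ext3 data mode.
ext3_mount_cmd() {
    mode=$1
    dev=$2
    dir=$3
    case $mode in
        journal|ordered|writeback) ;;   # the three modes described above
        *) echo "unknown data mode: $mode" >&2; return 1 ;;
    esac
    echo "mount -t ext3 -o data=$mode $dev $dir"
}

# Example with an assumed device and mount point.
ext3_mount_cmd journal /dev/sdb1 /mnt/data
```

Printing the command keeps the sketch safe to try on any machine; the printed line can then be run as root once the device and mount point have been checked.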
Compatibility
Ext2 partitions can easily be converted to ext3 with `tune2fs -j <dev>`. Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as Ext2.
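The conversion described above can be sketched as a small script. This is an illustration under stated assumptions: the `run` wrapper, the `convert_to_ext3` function and the device name are hypothetical, and the script defaults to a dry run that only prints the commands, because tune2fs needs root and a real device.

```shell
#!/bin/sh
# Sketch: ext2 -> ext3 conversion steps. With DRY_RUN=1 (the default here)
# commands are only printed; set DRY_RUN=0 and run as root to execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

convert_to_ext3() {
    dev=$1
    run tune2fs -j "$dev"   # add a journal; existing data stays in place
    run tune2fs -l "$dev"   # list the superblock; features should include has_journal
}

# Example with an assumed device name.
convert_to_ext3 /dev/sdb1
```

After a real conversion, the file system type in the matching /etc/fstab entry would also need to change from ext2 to ext3 for the journal to be used at the next mount.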
External Tools
See manual pages to learn more.
tune2fs: create an ext3 journal on an ext2 partition with the -j flag.
mke2fs: create an ext3 partition with the -j flag.
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer.
References
kernel source: <file:fs/ext3/> <file:fs/jbd/>
programs:
http://e2fsprogs.sourceforge.net/
http://ext2resize.sourceforge.net
useful links:
http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html
http://www-106.ibm.com/developerworks/linux/library/l-fs7/
http://www-106.ibm.com/developerworks/linux/library/l-fs8/
Main Features
Ext4 uses extents (as opposed to the traditional block mapping scheme used by ext2 and ext3), which improves performance when using large files and reduces metadata overhead for large files.
In addition, ext4 also labels unallocated block groups and inode table sections accordingly, which allows them to be skipped during a file system check. This makes for quicker file system checks, which becomes more beneficial as the file system grows in size.
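The ext4 file system features the following allocation schemes: persistent pre-allocation, delayed allocation, multi-block allocation, and stripe-aware allocation.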
Because of delayed allocation and other performance optimizations, ext4's behavior of writing files to disk is different from ext3.
In ext4, when a program writes to the file system, the data is not guaranteed to be on disk unless the program issues an fsync() call afterwards.
Jan 19, 2019 | www.tecmint.com
by Aaron Kili | Published: January 11, 2019
ext3grep is a simple program for recovering files on an EXT3 filesystem. It is an investigation and recovery tool that is useful in forensic investigations. It shows information about files that existed on a partition and can also recover accidentally deleted files. In this article, we will demonstrate a useful trick that will help you recover accidentally deleted files on ext3 filesystems using ext3grep in Debian and Ubuntu.
Testing Scenario
- Device name: /dev/sdb1
- Mount point: /mnt/TEST_DRIVE
- Filesystem type: EXT3
How to Recover Deleted Files Using ext3grep Tool
To recover deleted files, first install the ext3grep program on your Ubuntu or Debian system using the APT package manager as shown.
$ sudo apt install ext3grep
Once it is installed, we will demonstrate how to recover deleted files on an ext3 filesystem.
First, we will create some files for testing purposes in the mount point /mnt/TEST_DRIVE of the ext3 partition, i.e. /dev/sdb1 in this case. (Brace expansion is used here because a glob such as files[1-5] matches nothing in an empty directory and would create a single file with that literal name.)
$ cd /mnt/TEST_DRIVE
$ sudo touch file{1..5}
$ ls -l
Now we will remove one file called file5 from the mount point /mnt/TEST_DRIVE of the ext3 partition.
$ sudo rm file5
Now we will see how to recover the deleted file using the ext3grep program on the targeted partition. First, we need to unmount it from the mount point above (note that you have to use the cd command to switch to another directory for the unmount operation to work; otherwise the umount command will show the error "target is busy").
$ cd
$ sudo umount /mnt/TEST_DRIVE
Now that we have deleted one of the files (which we'll assume was done accidentally), run ext3grep with the --dump-name option to view all the files that existed on the device (replace /dev/sdb1 with the actual device name).
$ ext3grep --dump-name /dev/sdb1
To recover the deleted file, i.e. file5, we use the --restore-all option as shown.
$ ext3grep --restore-all /dev/sdb1
Once the recovery process is complete, all recovered files will be written to the directory RESTORED_FILES, where you can check whether the deleted file was recovered.
$ cd RESTORED_FILES
$ ls
We may also specify a particular file to recover, for example the file called file5 (or specify the full path of the file within the ext3 device).
$ ext3grep --restore-file file5 /dev/sdb1
OR
$ ext3grep --restore-file /path/to/some/file /dev/sdb1
In addition, we can restore files within a given period of time. For example, simply specify the correct date and time frame as shown (midnight is written as 12:00am, because GNU date rejects 00:00am).
$ ext3grep --restore-all --after `date -d 'Jan 1 2019 9:00am' '+%s'` --before `date -d 'Jan 5 2019 12:00am' '+%s'` /dev/sdb1
For more information, see the ext3grep man page.
$ man ext3grep
That's it! ext3grep is a simple and useful tool to investigate and recover deleted files on an ext3 filesystem. It is one of the best programs to recover files on Linux.
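The --after and --before options in the walkthrough above take times as seconds since the epoch. As a hedged sketch, the helper below converts two readable dates into that form. The function name `window_args` is hypothetical, the `-d` option assumes GNU date (BusyBox date accepts this date format as well), and the time zone is pinned to UTC only to make the result reproducible; drop `TZ=UTC` to use local time as the article does.

```shell
#!/bin/sh
# Hypothetical helper: turn two dates into the epoch-second arguments
# expected by ext3grep's --after and --before options.
window_args() {
    after=$(TZ=UTC date -d "$1" +%s) || return 1
    before=$(TZ=UTC date -d "$2" +%s) || return 1
    echo "--after $after --before $before"
}

# Print the arguments for the time frame used in the article.
window_args '2019-01-01 09:00' '2019-01-05 00:00'
```

The printed arguments can then be pasted into a command such as `ext3grep --restore-all ... /dev/sdb1` in place of the inline date calls.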
Ext4 Filesystem
From Thomas-Krenn-Wiki
This article will provide a brief introduction to Linux File System Ext4, the successor to Linux File System Ext3. Several tips regarding usage as well as additional links will be provided.
Contents
- 1 Evolution
- 2 Advantages
- 3 Compatibility with Ext3
- 4 SSD Optimizations
- 5 Lazy Initialization
- 6 Potential Problems
- 7 References
- 8 Additional Information
Evolution
Ext4 began as a fork of the ext3 source and was developed independently in order to remove several limitations of the earlier Ext2 and Ext3 file systems. Ext4 was accepted into the Linux kernel as of kernel version 2.6.19 and was declared stable as of kernel version 2.6.28.
Current distributions such as Red Hat Enterprise Linux 6 (RHEL), Debian 6.0 (Squeeze) and Ubuntu 10.10 (Maverick Meerkat) provide stable Ext4 support and, in some cases, use it as the default file system.
Advantages
- Improved performance through:
  - Multi-block allocation
  - Extent-based block mapping
  - Delayed allocation
  - Stripe-aware allocation
- Improved file security through:
  - Write barriers
- Time stamps with nanosecond resolution instead of one-second resolution
- Increased e2fsck speed
- Unlimited number of sub-directories (Ext3 allows at most 32,000 sub-directories per directory)
- Journaled quota data, so that a quota check is not needed after a system crash
You can find additional details regarding these advantages in a white paper from Red Hat [1].
Compatibility with Ext3
Because there have been many changes in comparison with Ext3, migration to Ext4 is not as easy as from Ext2 to Ext3.
To take full advantage of the Ext4 file system, Red Hat recommends for RHEL 6 that you back up all of the data, re-create the file system as Ext4 and copy the data into the new Ext4 file system (see also [1]).
The Ext4 driver does support mounting an Ext3 file system, however only with limited functionality. On the other hand, as soon as one uses extent-based mapping, mounting an Ext4 file system as an Ext3 file system becomes impossible.
SSD Optimizations
ATA Trim
Ext4 supports ATA Trim for solid state drives (SSDs):
- Online discard from Kernel 2.6.33
  - The -o discard mount option (for example, mount -o discard /dev/sdb1 /mnt/). For permanent activation, the option must be entered in /etc/fstab, because the discard capability is deactivated by default.[2]
- Batched discard from Kernel 2.6.37
- Accelerated batched discard from Kernel 3.1
- Pre-discard during formatting from mke2fs 1.41.10[3][4][5]
  - Extract from the man page for mke2fs: -E discard: Attempt to discard blocks at mkfs time (discarding blocks initially is useful on solid state devices and sparse / thin-provisioned storage). When the device advertises that discard also zeroes data (any subsequent read after the discard and before write returns zero), then mark all not-yet-zeroed inode tables as zeroed. This significantly speeds up file system initialization. This is set as default.
  - Caution when using this feature with the device mapper with mixed physical volumes: discard_zeros_data is only reported properly as of Kernel 3.0 (see the patch "block: Fix discard topology stacking and reporting").
  - In our test, we observed significantly different time requirements for the discard operation during formatting. However, discard also zeroes data had no significant effect on them (regarding this, see ATA Trim Performance).
  - With discard also zeroes data, hdparm -I displayed "Deterministic read ZEROs after TRIM" (or "Deterministic read data after TRIM"). The following example shows an OCZ Vertex 3 SSD and an Intel 320 Series SSD:
    [root@fedora15 ~]# hdparm -V
    hdparm v9.36
    [root@fedora15 ~]# hdparm -I /dev/sda | grep 'Model\|TRIM'
    Model Number: OCZ-VERTEX3
    * Data Set Management TRIM supported (limit 1 block)
    * Deterministic read data after TRIM
    [root@fedora15 ~]# hdparm -I /dev/sdb | grep 'Model\|TRIM'
    Model Number: INTEL SSDSA2CW160G3
    * Data Set Management TRIM supported (limit 8 blocks)
    * Deterministic read ZEROs after TRIM
    [root@fedora15 ~]#
File System Journal: Yes or No?
Running Ext4 without the file system journal[6][7] can increase file system performance, but it brings disadvantages when the system is not shut down cleanly (for example, during a power failure). Theodore Ts'o, an Ext4 file system developer, determined in tests that the performance cost of journaling was between four and twelve percent. Therefore, the journal should be used. Disabling atime updates is a more advisable way of increasing performance.[8][9]
noatime
Using the noatime mount option improves performance when reading.[8][10][11]
Should stride and stripe-width Parameters be used?
There are a number of different recommendations for the stride and stripe-width parameters[12][13] for using SSDs under Ext4.[14][15]
Whether particular values really provide a benefit cannot be determined with certainty at this time. Individual tests with the respective SSD would be required to make a determination.[16][17]
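To tie together the discard and noatime mount options discussed in this section, here is a minimal sketch that prints an /etc/fstab line for an Ext4 file system on an SSD. The function name `fstab_line`, the device, the mount point and the default option string are assumptions for illustration, not recommendations from the article.

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/fstab line for an ext4 file system.
# Fields: device, mount point, type, options, dump flag, fsck pass number.
fstab_line() {
    dev=$1
    dir=$2
    opts=${3:-defaults,noatime,discard}   # assumed defaults for an SSD
    printf '%s %s ext4 %s 0 2\n' "$dev" "$dir" "$opts"
}

# Example with an assumed device and mount point.
fstab_line /dev/sdb1 /mnt/ssd
```

Passing a third argument overrides the options, for example `fstab_line /dev/sdb1 /mnt/ssd defaults,noatime` for a setup that relies on batched discard (run periodically with a tool such as fstrim) rather than online discard.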
Lazy Initialization
When creating an Ext4 file system, the existing regions of the inode tables must be cleaned (overwritten with nulls, or "zeroed"). The "lazyinit" feature should significantly accelerate the creation of a file system, because it does not immediately initialize all inode tables, initializing them gradually instead during the initial mounting process in background (from Kernel version 2.6.37).[18][19] Regarding this see the extracts from the mkfs.ext4 man pages:[20]
- lazy_itable_init[=<0 to disable, 1 to enable>]: If enabled and the uninit_bg feature is enabled, the inode table will not be fully initialized by mke2fs. This speeds up file system initialization noticeably, but it requires the kernel to finish initializing the file system in the background when the file system is first mounted. If the option value is omitted, it defaults to 1 to enable lazy inode table zeroing.
One should be careful when testing the performance of a freshly created file system. The "lazy initialization" feature may write a lot of information to the hard disk after the initial mounting and thereby invalidate the test results. At first, the "ext4lazyinit" kernel process writes at up to 16,000kB/s to the device and thereby uses a great deal of the hard disk's bandwidth (see also I/O Statistics by Process). In order to prevent lazy initialization, advanced options are offered by the mkfs.ext4 command:[20]
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/mapper/fc-root
By specifying these options, the inodes and the journal will be initialized immediately during creation.
Potential Problems
Poor Performance due to Write Barriers
The integrity of the file system can be guaranteed by write barriers, if the hard disk uses a volatile cache and the contents of the cache are lost due to a power failure. Thereby, data security is improved, however at the price of a small performance disadvantage. By default, write barriers are activated, however they can be deactivated with the nobarrier file system option. However, this is only recommended when the write cache for your RAID array has been secured using a battery backup unit, zero maintenance cache or similar measure.
Data Loss with Applications that do not use fsync() correctly
When Ext4 was first shipped with Linux distributions, reports of sometimes massive data loss began to appear.[21]
The reason for this is delayed allocation, which postpones allocating the necessary storage space for up to 60 seconds. File renaming then came into play: if an application wrote a new file and renamed it, the metadata could already represent the rename while the actual data had not yet been written. After a crash, the file name then pointed to a 0-byte file. However, this only occurs when an application has not used fsync() properly. According to developer Theodore Ts'o, Ext4 precisely implements the POSIX standard for file operations. The problem is that the "safe" behavior of Ext3 was never intended, yet many application developers took it for granted because Ext3 was so widely used.
Initially the "alloc_on_commit" mode was introduced as a workaround, which was replaced by the "auto_da_alloc" mode shortly thereafter in Kernel 2.6.30 (see Kernel Commit [22]). This mode attempts to detect and avoid the most frequent cases of potential data loss. It became the new default mode, but can be deactivated with "noauto_da_alloc".
In general, applications should improve their compliance with the POSIX standard in the future and call fsync() where it is required. Ext4 has subsequently been optimized for this problem because of the behavior applications had come to expect from Ext3. Other file systems using delayed allocation (such as XFS or Btrfs) do not take this into consideration.
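The write-then-rename scenario above can be made safe by flushing the new data before the rename. The sketch below shows that pattern in shell. It assumes GNU coreutils 8.24 or later, where `sync FILE` flushes only the named file through fsync(); the function name `safe_replace` and the demo file are hypothetical.

```shell
#!/bin/sh
# Sketch of the write-to-temp, flush, then rename pattern.
# Assumes a `sync` that accepts file arguments (GNU coreutils >= 8.24).
safe_replace() {
    target=$1
    tmp="$target.tmp.$$"
    cat > "$tmp" || return 1             # new contents are read from stdin
    sync "$tmp" || return 1              # force the data to disk first ...
    mv -f "$tmp" "$target" || return 1   # ... then rename over the old file
    sync "$(dirname "$target")"          # flush the directory entry as well
}

# Demo in a throwaway directory.
demo=$(mktemp -d)
printf 'version 2\n' | safe_replace "$demo/config"
cat "$demo/config"
```

Because the rename happens only after the flush, a crash leaves either the old file or the complete new one under the target name, not a 0-byte file. Programs written in C would call fsync() on the file descriptor directly instead of using the sync command.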
References
- [1] White Paper: "Is It Time to Migrate Your File Systems to Ext4?" by Krista Guglielmeti (login required)
- [2] http://www.kernel.org/doc/Documentation/filesystems/ext4.txt
- [3] e2fsprogs Release Notes, e2fsprogs 1.41.10 (February 10, 2010): Mke2fs will use BLKDISCARD to pre-discard all blocks on an SSD or thinly-provisioned storage device.
- [4] e2fsprogs Release Notes, e2fsprogs 1.41.13 (December 13, 2010): Mke2fs now understands the extended option "discard" and "nodiscard", and the older option -K is deprecated. The default of whether discards are enabled by default can be controlled by the mke2fs.conf file. See also: mke2fs: Deprecate -K option, introduce discard/nodiscard (commit)
- [5] Man page correction (probably in e2fsprogs 1.41.15; as of July 3, 2011, 1.41.14 is the most current version): mke2fs: Simple man page nodiscard option correction (commit)
- [6] Ext4: "No Journaling" mode (kernelnewbies.org)
- [7] ext4: Allow ext4 to run without a journal (Kernel Commit)
- [8] SSDs, Journaling, and noatime/relatime (Theodore Ts'o's blog, 01.03.2009)
- [9] Re: Ext4 on SSD Intel X25-M (Linux Ext4 Mailing List, 12.11.2009)
- [10] Linux: Replacing atime with relatime (kerneltrap.org)
- [11] Does noatime imply nodiratime? (lwn.net)
- [12] Ext4 Wiki, see s_raid_stride, s_raid_stripe_width
- [13] Creating and Tuning Ext4 Partitions (blog.peacon.co.uk)
- [14] Optimizing Linux for SSD usage (searchenterpriselinux.techtarget.com)
- [15] http://www.nuclex.org/blog/personal/80-aligning-an-ssd-on-linux
- [16] Re: -E stride and stripe-width necessary for best performance of SSDs? (Linux Ext4 Mailing List, 01.07.2011)
- [17] Re: -E stride and stripe-width necessary for best performance of SSDs? (Linux Ext4 Mailing List, 01.07.2011)
- [18] Kernel Log: What Does 2.6.37 Add? Two File Systems (heise.de, 05.12.2010)
- [19] ext4: add support for lazy inode table initialization (git.kernel.org, 28.10.2010)
- [20] mkfs.ext4 man page (linux.die.net)
- [21] Potential Data Loss with Ext4 (heise.de)
- [22] GIT Kernel Commit for Ext4 auto_da_alloc Mode
Additional Information
- Ext4 (en.wikipedia.org)
- Ext4 Wiki (ext4.wiki.kernel.org)
- Chapter 9. The Ext4 File System (Red Hat Enterprise Linux 6 Storage Administration Guide)
October 24, 2012 | The H Open
Much ado about an ultimately irrelevant issue: a bug in the ext4 filesystem has turned out only to be exposed when several exotic options are combined. Apparently, the problem has only affected a single user.
A bug report from a user called "Nix" had caused a big stir last week; the user had lost data on his ext4 filesystem. Although the problem soon turned out to be an isolated case that involved a combination of several critical options, the public searching for the causes generated a lot of publicity and considerable commotion in the Linux world. The following article will, therefore, attempt to provide some clarification of the issue.
Ext4 is the current standard filesystem for Linux, and is considered robust, mature and well tested (see also "The Ext4 Linux filesystem" from The H Open). Robust, because the journal that was introduced with its predecessor, Ext3, guarantees the filesystem's integrity even if there is a power cut during a write operation; mature, because Ext4 is the result of many years of development that began with Ext2 almost 20 years ago; and well tested, because since then the vast majority of Linux systems have been installed on Ext2, Ext3 or Ext4, and mass deployment and practical use are still the best way of testing.
Lazy unmount
Normally, a filesystem that's mounted under Linux can't be unmounted while a process is accessing a file within the filesystem, even if it's only a shell whose working directory is located in the filesystem. In this case, umount will report that the "device is busy". The usual solution is to use fuser or lsof to find the process that is blocking the unmount operation and then terminate that process.
However, Linux processes can get stuck in such a way that they can no longer be terminated; or the unmount operation might be blocked by some other bug (such as an unresponsive NFS server) that can't easily be fixed. In such cases, a "lazy unmount" can be requested using the -l option of umount: the filesystem will be unmounted immediately, and the system will attempt to fix the resulting chaos (file handles without files, etc.) at a later stage. That umount -l is an option for special circumstances and may cause data loss should be quite obvious, really.
Has all of that now gone down the drain because of the recent Ext4 bug? Of course not; quite the opposite, in fact.
Any program code as complex as the code that's required for a modern, powerful filesystem will contain bugs; the Ext4 code consists of approximately 40,000 lines of source code. That a combination of several mount options (nobarrier, journal_checksum and journal_async_commit), none of which are used by default, as well as an added "lazy unmount" (see box) are required to trigger an error is an argument for, not against, the robustness of Ext4 and the quality of its code testing. That the Ext4 developers immediately began to investigate the bug even in a situation such as this is further testament to Ext4's maturity; otherwise, the developers would have had more important things to do than check out some esoteric bug that only seems to have affected a single user. Even Ext4 lead developer Theodore Ts'o has so far been unable to reproduce the bug on his systems.
We can, therefore, confidently assume that our data is safe on Ext4, or at least that it is safer than it would be on other Linux filesystems: the development of ReiserFS/Reiser4 has fizzled out, the Btrfs "Next Generation Filesystem" hasn't really become suitable for production use yet, and XFS, which was originally developed for Irix, has become stuck on the sidelines. Filesystem bugs also exist in other operating systems, but they aren't such public discussion topics.
It is commendable that, despite this, the Ext4 developers want to draw useful conclusions from their handling of the Ext4 bug, although they could have done this much earlier: a year ago, Ted Ts'o had already warned at LinuxCon Europe in Prague that some mount options may cause problems with Ext4 because they haven't been extensively tested. After all, these options, some of which were only ever intended for developers, will now disappear from the production code, or they will at least trigger a warning. Also a kind of bug fix ...
See also Apparent serious progressive ext4 data corruption bug in 3.6.3 (and ot
Oct 24, 2012 | Theodore Ts'o
UPDATE: See update below; it now looks like this was caused by a very esoteric case indeed:
I suppose I should be glad the potential ext4 corruption that two users have reported potentially impacting v3.6.2 and v3.6.3 has promoted some minor increase in revenue (due to advertising hits) for Slashdot and Phoronix, but really, people should chill out.
First of all, it appears that my initial analysis was wrong; it appears that it's probably related to an unclean shutdown, and probably some other required exacerbating factor, since I haven't been able to trivially reproduce it. Secondly, Fedora 17 is currently using 3.6.2. So if this was an easily triggered bug, (a) lots of people would have been complaining, since the results of this bug are not subtle, and (b) Eric and I probably would have found a reliable repro for the problem by now.
extcarve is a ext2/ext3/ext4 file recovery and semantic file carving tool. It can recover a range of file formats, including PNG, JPG, GIF, PDF, C/C++ programs, PHP, and HTML.
hi about numa 0.2
Hi all,
numa 0.2 recovers gif as well as jpeg (files of less than 48 KB).
It's still in alpha. Please provide feedback.
Thanks
Everyday Linux Howtos
So you've been futzing round in the file system, and been over vigorous with the rm command and deleted a crucial file that you or (more scarily) a significant other, can't live without. What to do? After that initial hot flush has died down, you must be calm, and work fast. You have three options:
- if the file is still open in a running application
- using extundelete
- using photorec
if the file is still open in a running application
To be honest, I can't imagine this happening a lot, but it's possible. The scenario is this: you're editing/using a file in an application, and whilst the application is still open, you delete the file. If this happens to you, then recovery of the file is pretty simple using a tool that should already be installed on your system. So:
step1: Do not close any applications!
step2: Open a terminal, and type:
lsof | grep "/path/to/file"
If the file is being used by a running application, then you should get something like this:
progname 5559 user 22r REG 8,5 1282410 1294349 /path/to/file
If you get no output, then the file isn't being used by a running application, and you'll need to use one of the other methods.
step3: Looking at the output from step2, you'll need the numbers from the 2nd and 4th columns (i.e. 5559 and 22r). With the second number, drop the "r" (leaving "22"). Now type the following command:
cp /proc/5559/fd/22 /path/to/restored.file
And voila, your file should be back in the form of /path/to/restored.file.
Easy.
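Step 3 above, picking out the process ID and descriptor number by hand, can be sketched as a small filter. The function name `proc_fd_path` is hypothetical, and the sketch assumes the column layout shown in the sample lsof line (PID in column 2, descriptor plus mode letter in column 4); some lsof versions print extra columns, so check the output first.

```shell
#!/bin/sh
# Hypothetical helper: read one line of lsof output on stdin and print the
# /proc path through which the still-open file can be copied.
proc_fd_path() {
    awk '{
        fd = $4
        sub(/[^0-9]+$/, "", fd)          # drop the trailing mode letter, e.g. "r"
        print "/proc/" $2 "/fd/" fd
    }'
}

# Sample line taken from the text above.
echo 'progname 5559 user 22r REG 8,5 1282410 1294349 /path/to/file' | proc_fd_path
```

The printed path is what goes into the cp command from step 3, for example `cp "$(lsof | grep "/path/to/file" | head -n 1 | proc_fd_path)" /path/to/restored.file`.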
using extundelete
Extundelete is a very useful little program for restoring deleted files. It should be able to restore files from ext2, ext3 and ext4 partitions. Restoring files from ext3 and ext4 is a very difficult process, because when a file is deleted the journalling file system deletes the information that points to where the file's data is on the disk, unlike the ext2 system, which leaves it in place. But fortunately for us, very often that pointer information is still in the file system's journal, and therefore it can be used to find the data on the disk and restore it. So, with that said, on with the show.
First you'll need to install extundelete. If your distro doesn't have a pre-built package, then you'll have to compile it from source. You can get the source from here. If you need help compiling a tarball, check out this page.
step1: As soon as possible, you need to mount the relevant partition as read-only. The sooner you do this the better, so that there is no risk of another program writing data over the top of your recently deleted file. To do this, open a terminal, and type:
mount -o remount,ro /dev/partition
... ... ...
Once you've mounted it read-only, you can relax a bit. At least you won't be risking making the situation worse.
step2: Now you can use the extundelete command to restore your file. extundelete can be used to restore all deleted files from a partition, or a specific file or directory. In this example we'll just be restoring a specific file, but for more options, look here. To restore the deleted file, type the following into a terminal (extundelete needs the partition as an argument, and the file path is given relative to the root of that partition):
extundelete /dev/partition --restore-file path/to/deleted/file
step3: extundelete (if it works) should restore the file to a subdirectory in the current directory called RECOVERED_FILES. Voila, problem solved.
using photorec
If extundelete doesn't work, then you can try photorec. Photorec works in a different way to extundelete. Instead of trying to find the information that points to where the deleted file data is on the disk, it tries to find the data by parsing the data itself to identify files. This method is less targeted than extundelete, but may still work if the information pointing to the file has been deleted from the journal. Photorec can find deleted files of a particular type. So if you've deleted a .mp3 file, you would get Photorec to find all deleted .mp3 files and hope that it finds the file you're looking for. Photorec seems to be well supported on the main distros, so you should be able to find a package for easy install.
Rather than list the process in detail here, there is a very good explanation of the steps on the photorec wiki.
That's it. Happy undeleting, and remember that prevention is always better than the cure. Back up.
The fourth extended file system, or ext4, is the next generation of journaling file systems, retaining backward compatibility with the previous file system, ext3. Although ext4 is not currently the standard, it will be the next default file system for most Linux distributions. Get to know ext4, and discover why it will be your new favorite file system.
EXT3 filesystem recovery in LVM2
This is the bugzilla bug I started on the Fedora bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=142737
It is a very good idea to do something like the following, so that you have a copy of the partition you're trying to recover, in case something bad happens:
dd if=/dev/hda2 bs=1024k conv=noerror,sync,notrunc | reblock -t 65536 30 | ssh remote.host.uci.edu 'cat > /recovery/damaged-lvm2-ext3'
e2salvage died with "Terminated". I assume it OOM'd.
e2extract gave a huge list of 0 length files. Doesn't seem right, and it was taking forever, so I decided to move on to other methods. But does anyone know if this is normal behavior for e2extract on an ext3?
I wrote a small program that searches for ext3 magic numbers. It's finding many, e.g. 438, 30438, 63e438 and so on (hex). The question is, how do I convert from that to an fsck -b number?
Running the same program on a known-good ext3, the first offset was the same, but others were different. However, they all ended in hex 38...
I'm now running an "fsck -vn -b" with the -b argument ranging from 0 to 999999. I'm hoping this will locate a suitable -b for me via brute force.
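The notes above leave the offset-to-block question open. In the ext2/ext3 on-disk layout the magic number sits 56 (hex 38) bytes into the superblock, and the primary superblock starts at byte 1024, which would explain both the common first hit at hex 438 and the fact that every hit ends in hex 38. Under that reading, a candidate -b value is (offset - 0x38) / blocksize. The sketch below is an illustration only: magic_to_block is a hypothetical helper, the block size is a guess that has to be supplied, and a hit may still be a stray byte pattern rather than a real superblock.

```shell
#!/bin/sh
# magic_to_block OFFSET BLOCKSIZE  (hypothetical helper, not from the original notes)
# OFFSET is the byte offset at which the ext2/ext3 magic number was found
# (decimal, or hex with a 0x prefix); BLOCKSIZE is a guessed file system
# block size in bytes. Prints the candidate block number for "fsck -b",
# or returns 1 if the offset cannot start a superblock at that block size.
magic_to_block() {
    offset=$(( $1 ))
    blocksize=$2
    # The magic number lives 56 (0x38) bytes into the superblock.
    start=$(( offset - 56 ))
    if [ "$start" -lt 0 ]; then
        return 1
    elif [ "$start" -eq 1024 ] && [ "$blocksize" -gt 1024 ]; then
        # Primary superblock: always at byte 1024, which is inside
        # block 0 when blocks are larger than 1 KB.
        echo 0
    elif [ $(( start % blocksize )) -eq 0 ]; then
        echo $(( start / blocksize ))
    else
        return 1
    fi
}

magic_to_block 0x438 1024    # prints 1
magic_to_block 0x438 4096    # prints 0
```

Trying each plausible block size (1024, 2048, 4096) against each hit narrows the search far more quickly than stepping -b from 0 to 999999; the guessed block size can be passed to e2fsck with -B.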
Sent a post to gmane.linux.kernel 2004-12-16.
Robin Green very helpfully provided the following instructions, which appear to be getting somewhere:
1) Note down what the root= device is that appears on the kernel command line (this can be found by going to boot from hard drive and then examining the kernel command line in grub, or by looking in /boot/grub/grub.conf )
2) Be booted from rescue disk
3) Sanity check: ensure that the nodes /dev/hda, /dev/hda2 etc. exist
4) Start up LVM2 (assuming it is not already started by the rescue disk!) by typing:
lvm vgchange --ignorelockingfailure -P -a y
Looking at my initrd script, it doesn't seem necessary to run any other commands to get LVM2 volumes activated - that's it.
5) Find out which major/minor number the root device is. This is the slightly tricky bit. You may have to use trial-and-error. In my case, I guessed right first time: (no comments about my odd hardware setup please ;)
[root@localhost t]# ls /sys/block
dm-0 dm-2 hdd loop1 loop3 loop5 loop7 ram0 ram10 ram12 ram14 ram2 ram4 ram6 ram8
dm-1 hdc loop0 loop2 loop4 loop6 md0 ram1 ram11 ram13 ram15 ram3 ram5 ram7 ram9
[root@localhost t]# cat /sys/block/dm-0/dev
253:0
[root@localhost t]# devmap_name 253 0
Volume01-LogVol02
In the first command, I listed the block devices known to the kernel. dm-* are the LVM devices (on my 2.6.9 kernel, anyway). In the second command, I found out the major:minor numbers of /dev/dm-0. In the third command, I used devmap_name to check that the device mapper name of the node with major 253 and minor 0 is the same as the name of the root device from my kernel command line (cf. step 1). Apart from a slight punctuation difference, it is the same, therefore I have found the root device.
I'm not sure if FC3 includes the devmap_name command. According to fr2.rpmfind.net, it doesn't. But you don't really need it, you can just try all the LVM devices in turn until you find your root device. Or, I can email you a statically-linked binary of it if you want.
6) Create the /dev node for the root filesystem if it doesn't already exist, e.g.:
mknod /dev/dm-0 b 253 0
using the major-minor numbers found in step 5.
Please note that for the purpose of _rescue_, the node doesn't actually have to be under /dev (so /dev doesn't have to be writeable) and its name does not matter. It just needs to exist somewhere on a filesystem, and you have to refer to it in the next command.
7) Do what you want to the root filesystem, e.g.:
fsck /dev/dm-0
mount /dev/dm-0 /where/ever
As you probably know, the fsck might actually work, because a fsck can sometimes correct filesystem errors that the kernel filesystem modules cannot.
8) If the fsck doesn't work, look in the output of fsck and in dmesg for signs of physical drive errors. If you find them, (a) think about calling a data recovery specialist, (b) do NOT use the drive!
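Steps 5 and 6 can be partly automated when devmap_name is not available. The sketch below is an assumption-laden illustration, not something from the original post: list_dm_nodes is a hypothetical helper, and it relies only on the /sys/block/dm-*/dev files shown in step 5.

```shell
#!/bin/sh
# list_dm_nodes [SYSFS_BLOCK_DIR]  (hypothetical helper, not from the original post)
# Prints "name major minor" for every device-mapper device the kernel
# knows about, reading the same sysfs files used by hand in step 5.
list_dm_nodes() {
    sysdir=${1:-/sys/block}
    for dir in "$sysdir"/dm-*; do
        [ -r "$dir/dev" ] || continue    # also skips an unmatched glob
        IFS=: read -r major minor < "$dir/dev"
        printf '%s %s %s\n' "${dir##*/}" "$major" "$minor"
    done
}

list_dm_nodes
```

Each output line supplies the arguments for one mknod call as in step 6 (for example, "dm-0 253 0" corresponds to mknod /dev/dm-0 b 253 0), after which the devices can be tried in turn.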
On FC3's rescue disk, what I actually did was:
1) Do startup network interfaces
2) Don't try to automatically mount the filesystems - not even readonly
3) lvm vgchange --ignorelockingfailure -P -a y
4) fdisk -l, and guess which partition is which based on size: the small one was /boot, and the large one was /
5) mkdir /mnt/boot
6) mount /dev/hda1 /mnt/boot
7) Look up the device node for the root filesystem in /mnt/boot/grub/grub.conf
8) A first tentative step, to see if things are working: fsck -n /dev/VolGroup00/LogVol00
9) Dive in: fsck -f -y /dev/VolGroup00/LogVol00
10) Wait a while... Be patient. Don't interrupt it
11) Reboot
Are these lvm1 or lvm2?
lvmdiskscan -v
vgchange -ay
vgscan -P
vgchange -ay -P
jeeves:~# lvm version
LVM version: 2.01.04 (2005-02-09)
Library version: 1.01.00-ioctl (2005-01-17)
Driver version: 4.1.0
I think you are making a potentially very dangerous mistake!
Type 8e is a partition type. You don't want to use resize2fs on the PARTITION, which is not an ext2 partition, but an lvm partition. You want to resize the filesystem on the logical VOLUME.
And yes, resize2fs is appropriate for logical volumes. But resize the VOLUME (e.g. /dev/VolGroup00/LogVol00), not the partition or volume group.
On Fri, Mar 04, 2005 at 06:35:31PM +0000, Robert Buick wrote:
> I'm using type 8e, does anyone happen to know if resize2fs is
> appropriate for this type; the man page only mentions type2.
A method of hunting for two text strings in a raw disk, after files have been deleted. The data blocks of the disk are read once, but grep'd twice.
seki-root> reblock -e 75216016 $(expr 1024 \* 1024) 300 < /dev/mapper/VolGroup00-LogVol00 | mtee 'egrep --binary-files=text -i -B 1000 -A 1000 dptutil > dptutil-hits' 'egrep --binary-files=text -i -B 1000 -A 1000 dptmgr > dptmgr-hits'
stdin seems seekable, but file length is 0 - no exact percentages
Estimated filetransfer size is 77021200384 bytes
Estimated percentages will only be as accurate as your size estimate
Creating 2 pipes
popening egrep --binary-files=text -i -B 1000 -A 1000 dptutil > dptutil-hits
popening egrep --binary-files=text -i -B 1000 -A 1000 dptmgr > dptmgr-hits
(estimate: 0.1% 0s 56m 11h) Kbytes: 106496.0 Mbits/s: 13.6 Gbytes/hr: 6.0 min: 1.0
(estimate: 0.2% 9s 12m 12h) Kbytes: 214016.0 Mbits/s: 13.3 Gbytes/hr: 5.8 min: 2.0
(estimate: 0.3% 58s 58m 11h) Kbytes: 257024.0 Mbits/s: 13.5 Gbytes/hr: 5.9 min: 2.4
...
references:
http://stromberg.dnsalias.org/~strombrg/reblock.html
http://stromberg.dnsalias.org/~strombrg/mtee.html
egrep --help
Performing the above reblock | mtee, my Fedora Core 3 system got -very- slow. If I were to suspend the pipeline above, performance would be great. If I resumed it, very quickly, performance would be bad again. This command seems to have left my system a little bit jerky, but it's -far- more usable now, despite the pipeline above still pounding the SATA drive my home directory is on.
seki-root> echo deadline > scheduler
Wed Mar 09 17:56:58
seki-root> cat scheduler
noop anticipatory [deadline] cfq
Wed Mar 09 17:57:00
seki-root> pwd
/sys/block/sdb/queue
Wed Mar 09 17:58:31
BTW, I looked into tagged command queuing for this system as well, but apparently VIA SATA doesn't support TCQ on linux 2.6.x.
Eventually the reblock | mtee egrep egrep gave:
egrep: memory exhausted
...using GNU egrep 2.5.1.
...so now I'm trying something closer to my classical method:
seki-root> reblock -e 75216016 $(expr 1024 \* 1024) 300 < /dev/mapper/VolGroup00-LogVol00 | mtee './bgrep dptutil | ./ranges > dptutil-ranges' './bgrep dptmgr | ./ranges > dptmgr-ranges'
Creating 2 pipes
popening ./bgrep dptutil | ./ranges > dptutil-ranges
popening ./bgrep dptmgr | ./ranges > dptmgr-ranges
stdin seems seekable, but file length is 0 - no exact percentages
Estimated filetransfer size is 77021200384 bytes
Estimated percentages will only be as accurate as your size estimate
(estimate: 1.3% 16s 12m 1h) Kbytes: 1027072.0 Mbits/s: 133.6 Gbytes/hr: 58.7 min: 1.0
(estimate: 2.5% 36s 16m 1h) Kbytes: 1913856.0 Mbits/s: 124.5 Gbytes/hr: 54.7 min: 2.0
(estimate: 3.7% 10s 17m 1h) Kbytes: 2814976.0 Mbits/s: 122.1 Gbytes/hr: 53.6 min: 3.0
(estimate: 4.9% 10s 17m 1h) Kbytes: 3706880.0 Mbits/s: 120.6 Gbytes/hr: 53.0 min: 4.0
...
I've added a -s option to reblock, which makes it sleep for an arbitrary number of (fractions of) seconds between blocks. Between this and the I/O scheduler change, seki has become very pleasant to work on again, despite the hunt for my missing palm memo. :)
From Bryan Ragon
Here is a detailed list of steps that worked:
;; first backed up the first 512 bytes of /dev/hdb
# dd if=/dev/hdb of=~/hdb.first512 count=1 bs=512
1+0 records in
1+0 records out
;; zero them out, per Alasdair
# dd if=/dev/zero of=/dev/hdb count=1 bs=512
1+0 records in
1+0 records out
;; verified
# blockdev --rereadpt /dev/hdb
BLKRRPART: Input/output error
;; find the volumes
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "media_vg" using metadata type lvm2
# pvscan
PV /dev/hdb VG media_vg lvm2 [111.79 GB / 0 free]
Total: 1 [111.79 GB] / in use: 1 [111.79 GB] / in no VG: 0 [0 ]
# lvmdiskscan
/dev/hda1 [ 494.16 MB]
/dev/hda2 [ 1.92 GB]
/dev/hda3 [ 18.65 GB]
/dev/hdb [ 111.79 GB] LVM physical volume
/dev/hdd1 [ 71.59 GB]
0 disks
4 partitions
1 LVM physical volume whole disk
0 LVM physical volumes
# vgchange -a y
1 logical volume(s) in volume group "media_vg" now active
;; /media is a defined mount point in fstab, listed below for future archive searches
# mount /media
# ls /media
graphics lost+found movies music
Success!! Thank you, Alasdair!!!!
/etc/fstab
/dev/media_vg/media_lv /media ext3 noatime 0 0
home blee has:
hdc1 ext3 /big wdc
sda5 xfs /backups
00/00 ext3 hda ibm fc3: too hot?
00/01 swap hda ibm
01/00 ext3 hdd maxtor fc4
01/01 swap hdd maxtor
hdb that samsung dvd drive that overheats
Q. How can I recover a bad superblock from a corrupted ext3 partition to get back my data? I'm getting the following error:
/dev/sda2: Input/output error
mount: /dev/sda2: can't read superblock
How do I fix this error?
A. The Linux ext2/ext3 file system keeps backup copies of the superblock at several locations, so it is possible to get data back from a corrupted partition.
WARNING! Make sure the file system is UNMOUNTED.
If your system will give you a terminal, type the following command; otherwise boot the Linux system from a rescue disk (boot from the 1st CD/DVD and, at the boot: prompt, type the command linux rescue).
Mount partition using alternate superblock
Find out superblock location for /dev/sda2:
# dumpe2fs /dev/sda2 | grep superblock
Sample output:
Primary superblock at 0, Group descriptors at 1-6
Backup superblock at 32768, Group descriptors at 32769-32774
Backup superblock at 98304, Group descriptors at 98305-98310
Backup superblock at 163840, Group descriptors at 163841-163846
Backup superblock at 229376, Group descriptors at 229377-229382
Backup superblock at 294912, Group descriptors at 294913-294918
Backup superblock at 819200, Group descriptors at 819201-819206
Backup superblock at 884736, Group descriptors at 884737-884742
Backup superblock at 1605632, Group descriptors at 1605633-1605638
Backup superblock at 2654208, Group descriptors at 2654209-2654214
Backup superblock at 4096000, Group descriptors at 4096001-4096006
Backup superblock at 7962624, Group descriptors at 7962625-7962630
Backup superblock at 11239424, Group descriptors at 11239425-11239430
Backup superblock at 20480000, Group descriptors at 20480001-20480006
Backup superblock at 23887872, Group descriptors at 23887873-23887878
Now check and repair the Linux file system using alternate superblock # 32768:
# fsck -b 32768 /dev/sda2
Sample output:
fsck 1.40.2 (12-Jul-2007)
e2fsck 1.40.2 (12-Jul-2007)
/dev/sda2 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #241 (32254, counted=32253).
Fix? yes
Free blocks count wrong for group #362 (32254, counted=32248).
Fix? yes
Free blocks count wrong for group #368 (32254, counted=27774).
Fix? yes
..........
/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda2: 59586/30539776 files (0.6% non-contiguous), 3604682/61059048 blocks
Now try to mount the file system using the mount command:
# mount /dev/sda2 /mnt
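If the first backup superblock is also damaged, the others from the dumpe2fs listing can be tried in turn. The following sketch shows one way to do that; backup_superblocks and the DEVICE variable are hypothetical names introduced here, and the loop assumes that an e2fsck exit status below 8 means the superblock could at least be read (8 and above are documented as operational or usage errors).

```shell
#!/bin/sh
# backup_superblocks: hypothetical helper (not from the original answer).
# Reads "dumpe2fs | grep superblock" style output on stdin and prints the
# block number of each backup superblock, one per line.
backup_superblocks() {
    sed -n 's/^ *Backup superblock at \([0-9][0-9]*\),.*/\1/p'
}

# Assumed usage: try each backup read-only (-n) until one can be read.
# DEVICE is a placeholder; nothing below runs unless it is set.
if [ -n "${DEVICE:-}" ]; then
    for sb in $(dumpe2fs "$DEVICE" 2>/dev/null | backup_superblocks); do
        rc=0
        e2fsck -n -b "$sb" "$DEVICE" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -lt 8 ]; then
            echo "backup superblock $sb looks usable"
            break
        fi
    done
fi
```

Because -n opens the file system read-only and answers "no" to every question, the loop changes nothing on disk; the actual repair is still the fsck -b command shown earlier.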
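You can also mount the partition using a backup superblock directly, via the sb mount option. The sb value is counted in 1 KB units, so on this file system, which has 4 KB blocks, block 32768 becomes sb=131072:
# mount -o sb={alternative-superblock} /dev/device /mnt
# mount -o sb=131072 /dev/sda2 /mnt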
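The mount(8) manual page describes the sb option as counting in 1 KB units regardless of the file system's block size, whereas fsck -b takes file system block numbers. A minimal sketch of the conversion follows; sb_mount_value is a hypothetical helper name, and the block size must be known or guessed.

```shell
#!/bin/sh
# sb_mount_value BLOCK BLOCKSIZE  (hypothetical helper, not from the original answer)
# Converts a file system block number, as reported by dumpe2fs, into the
# value expected by "mount -o sb=", which counts in 1 KB units.
sb_mount_value() {
    echo $(( $1 * $2 / 1024 ))
}

sb_mount_value 32768 4096    # prints 131072
sb_mount_value 8193 1024     # prints 8193
```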
Try to browse and access the file system:
# cd /mnt
# mkdir test
# ls -l
# cp file /path/to/safe/location
You should always keep a backup of all important data, including configuration files.
Linux Magazine Online
Modern filesystems make forensic file recovery much more difficult. Tools like Foremost and Scalpel identify data structures and carve files from a hard disk image.
IT experts and investigators have many reasons for reconstructing deleted files. Whether an intruder has deleted a log to conceal an attack or a user has destroyed a digital photo collection with an accidental rm -rf, you might someday face the need to recover deleted data. In the past, recovery experts could easily retrieve a lost file because an earlier generation of filesystems simply deleted the directory entry. The meta information that described the physical location of the data on the disk was preserved, and tools like The Coroner's Toolkit (TCT [1]) and The Sleuth Kit (TSK [2]) could uncover the information necessary for restoring the file. Today, many filesystems delete the full set of meta information, leaving only the data blocks. Putting these pieces together correctly is called file carving: forensic experts carve the raw data off the disk and reconstruct the files from it. The more fragmented the filesystem, the harder this task becomes.
04 Jun 2008
You can define journaling file systems in many ways, but let's get right to the point. Journaling file systems are for people who tire of watching the boot-time fsck, or file system consistency check process. (Journaling file systems are also for anyone who likes the idea of a fault-resilient file system.) When a system using a traditional, non-journaling file system is improperly shut down, the operating system detects this and performs a consistency check using the fsck utility. This utility scans the file system (which can take a considerable amount of time) and fixes any issues that can be safely corrected. In some cases, the file system can be in such bad shape that the operating system boots into single user mode to allow the user to further the repair process.
To add insult to injury, the fsck process can be initiated automatically by the operating system at mount time to ensure that the file system metadata is correct (even if no corruption is detected). Therefore, removing the need for file system consistency checks is an obvious area for improvement.
So, now you know for whom journaling file systems were created, but how do they obviate the need for fsck? In general, journaling file systems avoid file system corruption by maintaining a journal. The journal is a special file that logs the changes destined for the file system in a circular buffer. At periodic intervals, the journal is committed to the file system. If a crash occurs, the journal can be used as a checkpoint to recover unsaved information and avoid corrupting file system metadata.
To sum up, journaling file systems are fault-resilient file systems that use a journal to log changes before they're committed to the file system to avoid metadata corruption (see Figure 1). But like many Linux solutions, more than one option is available to you. Let's take a short walk through journaling file system history, and then review the file systems available and how they differ.
... ... ...
Fourth extended file system
The fourth extended journaling file system (ext4fs) is the evolution of ext3fs. The ext4 file system is designed as a backward- and forward-compliant replacement for ext3fs but with many new advanced features (some of which break the compatibility). This means that you can mount an ext4fs partition as ext3fs or vice versa.
First, ext4fs is a 64-bit file system and is designed to support very large volumes (1 exabyte). It has also been designed to use extents, but if this is used, then compatibility with ext3fs is lost. Like XFS and Reiser4, ext4fs includes delayed allocation to allocate blocks on the disk only when needed (which reduces fragmentation). The contents of the journal are also checksummed to make the journal more reliable. Instead of the standard B+ or B* tree, ext4fs uses a variation of the B tree, called the H tree, which allows many more subdirectories per directory (ext3 was limited to 32,000).
Although the delayed allocation method reduces fragmentation, over time, a large file system can become fragmented. An online defragmentation tool (e4defrag) has been developed to address this. You can use the tool to defragment individual files or an entire file system.
Another interesting difference between ext3fs and ext4fs is the date resolution for files. In ext3, the minimum resolution for timestamp was one second. Ext4fs is looking toward the future: Where processor and interface speeds continue to increase, better resolution is needed. For this reason, the minimum timestamp resolution in ext4 is 1 nanosecond.
Ext4fs has been in the Linux kernel since 2.6.19 but is yet to be called stable. Development continues on this next generation; given its heritage, it will be the next generation in Linux journaling file systems.
Resources
- The list of file systems on Wikipedia ranges from the earliest DEC file systems of the 1960s to the latest BufferFS from Oracle. To round out your file system knowledge, also check out this file system reading list, which covers a wide range of file system topics.
- JFS (and its successor, JFS2) were the earliest journaled file systems. They continue to be used in Linux and the AIX operating systems.
- XFS was the earliest journaling file system that focused on high performance. Learn more about the development and future of XFS at the SGI home page.
- The current leader in Linux journaling file systems (as far as deployments go) is the third extended file system (successor to the second extended file system). Read more about the transformation of ext2 to ext3 in the interesting paper, "Journaling the Linux ext2fs Filesystem" (PDF), or in this talk given by the ext3fs designer, Dr. Stephen Tweedie.
- Tim's "Anatomy of the Linux file system" (developerWorks, Oct 2007) introduces you to the VFS and its major structures. The Linux VFS layer provides an abstraction using a common application program interface (API) to the various supported underlying file systems.
- The future of journaling file systems is ext4fs. The paper, "The new ext4 filesystem: current status and future plans" (PDF), along with the presentation, "Ext4: The Next Generation of Ext2/3 Filesystem" (PDF), provide a wealth of technical details for ext4fs. Finally, you can learn more about the development of ext4fs from the development wiki and also about the online defragmentation (PDF) approach.
- Read all of Tim's Anatomy of... articles on developerWorks.
- Read all of Tim's Linux articles on developerWorks.
To the first few replies to this article, have you ever had to build a multi-GB/s filesystem that can handle arbitrary workloads and stay up at least 99.9% of the time? Henry has. I have complaints about his article, but he brings up good points:
- Other filesystems besides ext3 and XFS aren't supported or tested as well as necessary for the types of loads he is describing (and ZFS/Fuse is alpha code). XFS is very good, but the biggest Linux vendor (Red Hat) doesn't even support it.
- Yes, you can fix Linux code yourself, but filesystems are hard. You can't expect some random code jockey to pick up the kernel source and understand filesystems. What about the XFS+NFS bug that existed in Linux around 2004 (http://www.linux.sgi.com/archives/xfs/2004-06/msg00100.html)? It took over a year for SGI to fix the problem. Open source worked, because the original patch came from a guy at Sony in Japan, but for a year there was silent corruption on any XFS filesystem that was exported via NFS.
He says that an LTO-4 tape drive can push 240 MB/s. Now put 20 of those drives in your system (4.8 GB/s). Now, design your filesystem so that you have extra capacity so that you can interact with the filesystem while the tape drives are banging away (4.8x2=9.6GB/s). This is much more than your home software-raid setup to store your mp3s and pictures.
This is high-performance I/O. This is a very common setup at many of the HPC centers around the world (except they may not be using LTO drives, but enterprise drives). Expecting Linux to push data at that rate is a stretch.
However, it is getting better. I would still rather use Linux than bring in one Solaris server for which I would have to hire or retrain staff, because we migrated to Linux a decade ago.
I am frequently asked by potential customers with high I/O requirements if they can use Linux instead of AIX or Solaris.
No one ever asks me about high-performance I/O - high IOPS or high streaming I/O - on Windows or NTFS because it isn't possible. Windows and the NTFS file system, which hasn't changed much since it was released almost 10 years ago, can't scale given its current structure. The NTFS file system layout, allocation methodology and structure do not allow it to efficiently support multi-terabyte file systems, much less file systems in the petabyte range, and that's no surprise since it's not Microsoft's target market.
And what was Linux's initial target market? A Microsoft desktop replacement, of course. Linux has since moved from the desktop to run on many large SMP servers from Sun, IBM and SGI. But can Linux as an operating system and Linux file systems meet the challenge of high-performance I/O?
You may think you don't need high-performance I/O, but every server needs this type of I/O performance for something as simple as backup and restoration. Current LTO-4 tape drives can operate at 120 MB/sec without compression and can support data rates up to 240 MB/sec with compression. If your file system cannot support I/O at these streaming data rates, then the time to backup and restore will take much longer than expected. For large environments with multiple tape drives, not being able to use the tape drives at their full data rate might require additional tape drives to meet the backup time window, which affects restoration too. Therefore, it seems to me that everyone should be interested in the performance of Linux file systems, if only for backup and restore.
Can Linux file systems, which I will define as ext-4, XFS and xxx, match the performance of file systems on other UNIX-based large SMP servers such as IBM and Sun? Some might also inquire about SGI, but SGI has something called ProPack, which has a number of optimizations to Linux for high-speed I/O, and SGI also has its own proprietary Linux file system called CxFS, which is not part of standard Linux distributions. Because SGI ProPack and CxFS are not part of standard Linux distributions, we won't consider them here. We'll stick to standard Linux because that is what most people use.
We'll focus on two areas:
- Linux as an operating system, and
- Linux file systems.
Linux Operating System Issues
We'll set aside what might happen with Linux in the future and instead focus on what is available today. Linux has a number of features that match the I/O performance of AIX and Solaris, such as direct I/O, but the bottom line is that Linux wasn't designed around high-performance multi-threaded I/O.
There are a number of areas that limit performance in Linux, such as page size compared with other operating systems, the restrictions Linux places on direct I/O and page alignment, and the fact that Linux does not allow direct I/O automatically by request size - I have seen Linux kernels break large (greater than 512 KB) I/O requests into 128 KB requests. Since the Linux I/O performance and file system were designed for a desktop replacement for Windows, none of this comes as much of a surprise.
Linux has other issues, as I see it; for starters, the lack of someone to take charge or responsibility. With Linux, if you find a problem, groups of people are going to have to agree to fix it, and the people writing Linux might not necessarily be responsive to the problems you're facing. If a large vendor of Linux agrees with your problem and provides a fix, that doesn't mean it will be accepted - or accepted anytime soon - by the Linux community. And getting a patch for your problem could pose maintenance problems.
The goals for Linux file systems and the Linux kernel design seem to address a completely different set of problems than AIX or Solaris, and IBM and Sun are far more directly responsible than the Linux community if you have a problem. If you run AIX or Solaris and complain to IBM or Sun, they can't say "we have no control."
Linux File Systems
Remember that most Linux file systems were designed around replacing NTFS, not some of the high-performance file systems such as GPFS (IBM), StorNext (Quantum) or QFS (Sun). These file systems were designed for streaming I/O, which we now know is important for everyone and for some high-speed IOPS, and in some cases for database access.
The Linux file systems that are commonly used today (ext-3 today and likely soon ext-4 and xfs) have not had huge structural changes in a long time. Ext-4 improves upon ext-3 and ext-2 for some improved allocation, but simple things like alignment of the superblock to the RAID stripe and the first metadata allocation are not considered.
Additionally, things like alignment of additional file system metadata regions to RAID stripe value are not considered, nor are simple things like indirect allocations (see File Systems and Volume Managers: History and Usage), which are fixed values so with the small allocations supported (4 KB maximum), large numbers of allocations are required. Take a 200 TB file system, which will require 53.7 billion allocations to represent the 200 TB using the largest allocation size of 4 KB supported by ext-3. Using 8 MB, which is feasible on enterprise file systems, it becomes a manageable 26.2 million allocations. The bitmap or allocation map could even fit in memory for this number of allocations! The xfs file system has very similar characteristics to ext-3. Yes, allocations can be larger, up to 64 KB if the Linux page size is 64 KB, but the alignment issues for the superblock, metadata regions and other issues still exist.
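The allocation counts quoted above can be checked with shell arithmetic. This is only a sketch for verifying the figures: alloc_count is a hypothetical helper, it assumes binary units (1 TB taken as 2^40 bytes, 1 KB as 1024 bytes), which is what reproduces the article's numbers, and it needs a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# alloc_count SIZE_TB ALLOC_KB  (hypothetical helper for checking the arithmetic)
# Number of allocation units needed to map a file system of SIZE_TB
# terabytes (binary) with allocations of ALLOC_KB kilobytes each.
alloc_count() {
    # SIZE_TB * 2^30 is the file system size expressed in KB.
    echo $(( $1 * 1024 * 1024 * 1024 / $2 ))
}

alloc_count 200 4       # prints 53687091200 (about 53.7 billion)
alloc_count 200 8192    # prints 26214400 (about 26.2 million)
```

Going from 4 KB to 8 MB allocations shrinks the count by a factor of 2048, which is the whole of the difference between the two figures.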
Linux Has Its Place
That's not to say I am anti-Linux, just as I am not pro-AIX or pro-Solaris. I am not even anti-Windows, since I use a Windows laptop as my main computer. But I do believe that the default Linux file systems are not yet up to the task of replacing the high-performance, highly scalable SMP file systems. Computers are tools, and operating systems and file systems are also tools in the toolbox. No one uses a chainsaw in place of a jigsaw, and the same analogy can be used for operating systems, file systems and the hardware they run on.
Many of the people I deal with daily use MS Word, MS Excel, MS PowerPoint and MS Visio. I could run some if not all of these applications on a Windows emulator from someone, but I routinely get incompatibilities with fonts, and I just decided long ago to live with Windows until someone can prove to me that it all works together with no problems. My point here is that every computer is a tool and has its use. Currently no single computer or file system can meet all application requirements. This should not come as a surprise. Linux has a place, but as far as I can tell, that place does not support single instances of large file systems and scaling well from large to small file systems with high-performance requirements. And I don't see this changing anytime soon.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 27 years experience in high-performance computing and storage. See more articles by Henry Newman.
Table 1. Current and upcoming features of ext4 that provide advantages over ext3
Feature: Advantage
Larger file systems: Ext3 tops out at 32 tebibyte (TiB) file systems and 2 TiB files, but practical limits may be lower than this depending on your architecture and system settings, perhaps as low as 2 TiB file systems and 16 gibibyte (GiB) files. Ext4, by contrast, permits file systems of up to 1024 pebibyte (PiB), or 1 exbibyte (EiB), and files of up to 16 TiB. This may not be important (yet!) for the average desktop computer or server, but it is important to users with large disk arrays.
Extents: An extent is a way to improve the efficiency of on-disk file descriptors, reducing deletion times for large files, among other things.
Persistent preallocation: If an application needs to allocate disk space before actually using it, most file systems do so by writing 0s to the not-yet-used disk space. Ext4 permits preallocation without doing this, which can improve the performance of some database and multimedia tools.
Delayed allocation: Ext4 can delay allocating disk space until the last moment, which can improve performance.
More subdirectories: If you've ever felt constrained by the fact that a directory can only hold 32,000 subdirectories in ext3, you'll be relieved to know that this limit has been eliminated in ext4.
Journal checksums: Ext4 adds a checksum to the journal data, which improves reliability and performance.
Online defragmentation: Although ext3 isn't prone to excessive fragmentation, files stored on it are likely to become at least a little fragmented. Ext4 adds support for online defragmentation, which should improve overall performance.
Undelete: Although it hasn't been implemented yet, ext4 may support undelete, which, of course, is handy whenever somebody accidentally deletes a file.
Faster file system checks: Ext4 adds data structures that permit fsck to skip unused parts of the disk in its checks, thus speeding up file system checks.
Nanosecond timestamps: Most file systems, including ext3, include timestamp data that is accurate to a second. Ext4 extends the accuracy of this data to a nanosecond. Some sources also indicate that the ext4 timestamps support dates through April 25, 2514, versus January 18, 2038, for ext3.
Copyright (c) 2005 Peter Gordon
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license can be found here.
Overview
I'm a big fan of the Third Extended ("ext3") filesystem. Its in-kernel and userspace code has been tried, tested, fixed, and improved upon more than almost every other Linux-compatible filesystem. It's simple, robust, and extensible. In this article I intend to explain some tips that can improve both the performance and the reliability of the filesystem.
In the document, /dev/hdXY will be used as a generic partition. You should replace this with the actual device node for your partition, such as /dev/hdb1 for the first partition of the primary slave disk or /dev/sda2 for the second partition of your first SCSI or Serial ATA disk.
I: Using The tune2fs and e2fsck Utilities
Before we begin, we need to make sure you are comfortable with using the tune2fs utility to alter the filesystem options of an ext2 or ext3 partition. Please make sure to read the tune2fs man page:
Code: $ man tune2fs
It's generally a good idea to run a filesystem check using the e2fsck utility after you've completed the alterations you wish to make on your filesystem. This will verify that your filesystem is clean and fix it if needed. You should also read the manual page for the e2fsck utility if you have not yet done so:
Code: $ man e2fsck
WARNING: Make sure any filesystems are cleanly unmounted before altering them with the tune2fs or e2fsck utilities! (Boot from a LiveCD such as Knoppix if you need to.) Altering or tuning a filesystem while it is mounted can cause severe corruption! You have been warned!
II: Using Directory Indexing
This feature improves file access in large directories or directories containing many files by using hashed binary trees to store the directory information. It's perfectly safe to use, and it provides a fairly substantial improvement in most cases; so it's a good idea to enable it:
Code: # tune2fs -O dir_index /dev/hdXY
This will only take effect with directories created on that filesystem after tune2fs is run. In order to apply this to currently existing directories, we must run the e2fsck utility to optimize and reindex the directories on the filesystem:
Code: # e2fsck -D /dev/hdXY
Note: This should work with both ext2 and ext3 filesystems. Depending on the size of your filesystem, this could take a long time. Perhaps you should go get some coffee.
III: Enable Full Journaling
By default, ext3 partitions mount with the 'ordered' data mode. In this mode, all data is written to the main filesystem and its metadata is committed to the journal, whose blocks are logically grouped into transactions to decrease disk I/O. This tends to be a good default for most people. However, I've found a method that increases both reliability and performance (in some situations): journaling everything, including the file data itself (known as 'journal' data mode). Normally, one would think that journaling all data would decrease performance, because the data is written to disk twice: once to the journal then later committed to the main filesystem, but this does not seem to be the case. I've enabled it on all nine of my partitions and have only seen a minor performance loss in deleting large files. In fact, doing this can actually improve performance on a filesystem where much reading and writing is to be done simultaneously. See this article written by Daniel Robbins on IBM's website for more information:
http://www-106.ibm.com/developerworks/linux/library/l-fs8.html#4
In fact, putting /usr/portage on its own ext3 partition with journal data mode seems to have decreased the time it takes to run emerge --sync significantly. I've also seen slight improvements in compile time.
There are two different ways to activate journal data mode. The first is by adding data=journal as a mount option in /etc/fstab. If you do it this way and want your root filesystem to also use it, you should also pass rootflags=data=journal as a kernel parameter in your bootloader's configuration. In the second method, you will use tune2fs to modify the default mount options in the filesystem's superblock:
Code: # tune2fs -O has_journal -o journal_data /dev/hdXY
Please note that the second method may not work for older kernels. Especially Linux 2.4.20 and below will likely disregard the default mount options on the superblock. If you're feeling adventurous you may also want to tweak the journal size. (I've left the journal size at the default.) A larger journal may give you better performance (at the cost of more disk space and longer recovery times). Please be sure to read the relevant section of the tune2fs manual before doing so:
Code: # tune2fs -J size=$SIZE /dev/hdXY
IV: Disable Lengthy Boot-Time Checks
WARNING: Only do this on a journalling filesystem such as ext3. This may or may not work on other journalling filesystems such as ReiserFS or XFS, but has not been tested. Doing so may damage or otherwise corrupt other filesystems. You do this AT YOUR OWN RISK.
Hmm... It seems that our ext3 filesystems are still being checked every 30 mounts or so. This is a good default for many because it helps prevent filesystem corruption when you have hardware issues, such as bad IDE/SATA/SCSI cabling, power supply failures, etc. One of the driving forces for creating journalling filesystems was that the filesystem could easily be returned to a consistent state by recovering and replaying the needed journalled transactions. Therefore, we can safely disable these mount-count- and time-dependent checks if we are certain the filesystem will be quickly checked to recover the journal if needed to restore filesystem and data consistency. Before you do this please make sure your filesystem entry in /etc/fstab has a positive integer in its 6th field (pass) so that it is checked at boot time automatically. You may do so using the following command:
Code: # tune2fs -c 0 -i 0 /dev/hdXY
V: Checking The Filesystem Options Using tune2fs
Well, now that we've tweaked our filesystem, we want to make sure those tweaks are applied, right? Surprisingly, we can check these options quite easily using the tune2fs utility. To list all the contents of the filesystem's superblock, we can pass the "-l" (lowercase "L") option to tune2fs:
Code: # tune2fs -l /dev/hdXY
Unlike the other tune2fs calls, this can be run on a mounted filesystem without harm, since it doesn't access or attempt to change the filesystem at such a low level.
This will give you a lot of information about the filesystem, including the block/inode information, as well as the filesystem features and default mount options, which we are looking for. If all goes well, the relevant part of the output should include "dir_index" and "has_journal" flags in the Filesystem features listing, and should show a default mount option of "journal_data".
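The check described above can be scripted. The following is a minimal sketch, not part of the original guide: check_ext3_tweaks is a hypothetical helper that reads `tune2fs -l` output on standard input and assumes the "Filesystem features:" and "Default mount options:" field labels printed by e2fsprogs.

```shell
# check_ext3_tweaks: hypothetical helper (not part of e2fsprogs).
# Reads `tune2fs -l` output on stdin and reports which of the tweaks
# from this guide are missing. Exit status is 0 only if all are present.
check_ext3_tweaks() {
    awk '
        /^Filesystem features:/   { features = " " $0 " " }
        /^Default mount options:/ { defaults = " " $0 " " }
        END {
            status = 0
            if (features !~ / dir_index /)    { print "missing: dir_index";    status = 1 }
            if (features !~ / has_journal /)  { print "missing: has_journal";  status = 1 }
            if (defaults !~ / journal_data /) { print "missing: journal_data"; status = 1 }
            if (status == 0) print "all tweaks present"
            exit status
        }
    '
}
```

A possible invocation would be `tune2fs -l /dev/hdXY | check_ext3_tweaks`. The match on whole words means that a default mount option of journal_data_writeback is not mistaken for journal_data.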
This concludes this filesystem tweaking guide for now. Happy hacking!
Comments
Nice thing! Just one thing: wouldn't it be:
Code:
tune2fs -O dir_index,has_journal /dev/hdXY
tune2fs -o journal_data /dev/hdXY
in place of:
Code:
tune2fs -O dir_index /dev/hdXY
tune2fs -o has_journal,journal_data /dev/hdXY
Edited for clarification: notice that the 'has_journal' parameter is valid for the '-O' (upper case 'o') modifier, not for '-o' (lower case 'o').
====
jetsaredim wrote: Is there a way to list the options that have been set for a particular ext3 fs?
`tune2fs -l` will list the current contents of the filesystem's superblock.
===
Is there any way to do defragmentation on ext3? My experience is that heavily used ext3 partitions become slower and slower while the number of files in the filesystem and the used disk space don't significantly change.
====
Q: when does one benefit from using orlov? and what exactly does commit=9999 mean?
A: You do not need to use orlov as a mount option, since, according to the mount(8) man page, it is the default if neither oldalloc nor orlov is specified. This option would tell ext2/ext3 whether to use the old inode allocator or the new Orlov inode allocator.
I also highly recommend against using commit=9999. This mount option specifies how often (in seconds) data is synced to disk. Setting it too high may cause excessive usage of memory and possibly CPU/swap resources. It really is not needed and, in my experience, will not give you a large performance increase at all.
===
What command do you use to set the immutable attribute under reiserfs? Man pages for chattr and lsattr indicate only functioning with ext2,3.
A:
Code:
hera etc # grep /dev/hdc3 /etc/fstab
/dev/hdc3 / reiserfs noatime,notail,acl 0 0
hera etc # chattr +i /etc/shadow
hera etc # lsattr /etc/shadow
----i-------- /etc/shadow
hera etc #
Another little tidbit that needs to be discussed when talking about these immutable bits is the following.
Now I can simply chattr -i /etc/shadow anytime I want as root and it'll be like it never happened. However with seclvl (A linux implementation of BSD Secure Levels) the behavior mimics that of BSD. So when using these attributes remember to echo "2" > /sys/seclvl/seclvl if you have this support built into your kernel.
The hardened kernel series I know supports this and is always a great idea to use in any secure server implementation.
Ok, acl and immutable are completely different. An ACL is an access control list; that's just a major enhancement to rwx. The file system flags maintained by chattr are just like the BSD ones that interact with the secure level. Now, if you chattr +i a file it is immutable, but a simple chattr -i makes that entire concept null. Using the BSD secure level implementation for Linux actually enforces the rules you set on the file.
xenoterracide Says: May 16th, 2007 at 3:08 pm
data journaling on ext3 is much better than writeback.
will tell you how to do it, it's not specific to gentoo either it will work on any linux distribution, that supports ext3, which I believe is all of them (unless they are using really, really, old kernels).
Features of ext3 File System
The ext3 file system is essentially an enhanced version of the ext2 file system. These improvements provide the following advantages:
Availability
After an unexpected power failure or system crash (also called an unclean system shutdown), each mounted ext2 file system on the machine must be checked for consistency by the e2fsck program. This is a time-consuming process that can delay system boot time significantly, especially with large volumes containing a large number of files. During this time, any data on the volumes is unreachable.
The journaling provided by the ext3 file system means that this sort of file system check is no longer necessary after an unclean system shutdown. The only time a consistency check occurs using ext3 is in certain rare hardware failure cases, such as hard drive failures. The time to recover an ext3 file system after an unclean system shutdown does not depend on the size of the file system or the number of files; rather, it depends on the size of the journal used to maintain consistency. The default journal size takes about a second to recover, depending on the speed of the hardware.
Data Integrity
The ext3 file system provides stronger data integrity in the event that an unclean system shutdown occurs. The ext3 file system allows you to choose the type and level of protection that your data receives. By default, most Linux distributions configure ext3 volumes to keep a high level of data consistency with regard to the state of the file system.
Speed
Despite writing some data more than once, ext3 has a higher throughput in most cases than ext2 because ext3's journaling optimizes hard drive head motion. You can choose from three journaling modes to optimize speed, but doing so means trade-offs with regard to data integrity.
Easy Transition
It is easy to change from ext2 to ext3 and gain the benefits of a robust journaling file system without reformatting. See the Section called Converting to an ext3 File System for more on how to perform this task.
Q ~ Tuning ext3 reads down
Hello, I'm using the ext3 file system and it schedules a "read" from my disk every 5 seconds, which is annoying. How do you adjust this to something like once every minute? Will tune2fs do this? TIA.
February 27th, 2007
#2 ChrisNiemy
Re: Q ~ Tuning ext3 reads down
You could try adding the "noatime" option to your /etc/fstab.
Before:
Code:
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
# /dev/hdc8
UUID=e8980666-9087-478a-89bb-710a00186e25 / ext3 defaults,errors=remount-ro 0 1
# /dev/hdc10
UUID=0fd4d361-6065-4c84-8dc0-04d07743e0ad /home ext3 defaults 0 2
# /dev/hdc7
UUID=5d241d90-4001-4a53-aede-b4cd76a4eef7 none swap sw 0 0
/dev/hdd /media/cdrom0 udf,iso9660 user,noauto 0 0
After:
Code:
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
# /dev/hdc8
UUID=e8980666-9087-478a-89bb-710a00186e25 / ext3 defaults,noatime,errors=remount-ro 0 1
# /dev/hdc10
UUID=0fd4d361-6065-4c84-8dc0-04d07743e0ad /home ext3 defaults,noatime 0 2
# /dev/hdc7
UUID=5d241d90-4001-4a53-aede-b4cd76a4eef7 none swap sw 0 0
/dev/hdd /media/cdrom0 udf,iso9660 user,noauto 0 0
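The options field in fstab is a comma-separated list, so adding noatime comes down to appending one entry if it is not already there. A minimal sketch of that rule follows; add_mount_option is a hypothetical helper written for illustration, and it only builds the new option string rather than editing /etc/fstab.

```shell
# add_mount_option OPTIONS NEW: hypothetical helper for illustration.
# Prints OPTIONS with NEW appended, unless NEW is already in the list.
add_mount_option() {
    case ",$1," in
        *",$2,"*) printf '%s\n' "$1" ;;       # already present: unchanged
        *)        printf '%s\n' "$1,$2" ;;    # append as a new entry
    esac
}

add_mount_option "defaults,errors=remount-ro" noatime
add_mount_option "defaults,noatime" noatime
```

The first call prints defaults,errors=remount-ro,noatime and the second prints defaults,noatime unchanged. To try the option before committing it to fstab, a filesystem can usually be remounted in place with `mount -o remount,noatime /`.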
Hi there!
Here's the solution (I guess):
The 5 seconds are the commit interval. This is the standard behaviour. You can check this in your syslog. Here is the relevant part of ext3.txt from the kernel documentation (<kernel dir>/Documentation/filesystems/ext3.txt):
Quote:
(...)
commit=nrsec Ext3 can be told to sync all its data and metadata every 'nrsec' seconds. The default value is 5 seconds. This means that if you lose your power, you will lose as much as the latest 5 seconds of work (your filesystem will not be damaged though, thanks to the journaling). This default value (or any low value) will hurt performance, but it's good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance.
(The following mini-HOWTO adds more performance options. If you don't want them, add only the "commit=seconds" option, in the same place.)
1st step
Take your /etc/fstab and add these options for your root (and/or /home etc.) partition:
Code:
(previous options...),noatime,nodiratime,nobh,data=writeback,commit=100
I guess you will also be very happy with the "data=writeback" and "nobh" options. This works for ext3. I guess for reiser also, but please check this first.
2nd step
To make data=writeback and the new commit interval work, open your /boot/grub/menu.lst, find the "defoptions=" line and add (e.g. after "ro quiet splash"):
Code:
quiet splash rootflags=data=writeback,nobh,commit=100
Also add (only) "rootflags=data=writeback" to the altoptions= line! Then run:
Code:
sudo update-grub
3rd step
For data=writeback, the last step before rebooting is (works with a mounted filesystem):
Code:
sudo tune2fs -o journal_data_writeback /dev/hd(...)
Do this for all your partitions, e.g. if you have root and /home separated.
Finally...
Then do a reboot. However, the specific option you were looking for is the "commit=sec" option. The value is measured in seconds.
Caution! I had several crashes (not Linux's fault) and my data is still there, although these options increase the risk of data loss! Note: You are not disabling journaling with this, so it's still pretty safe. (However, at your own risk.)
Appendix
PS: My posting seems quite confusing, I guess. So here are the specific example lines/files:
/etc/fstab:
Code:
/dev/hdc2 / ext3 defaults,errors=remount-ro,data=writeback,noatime,nodiratime,nobh,commit=100 0 1
(do the same if you have a separate /home)
/boot/grub/menu.lst:
Code:
(...)
## additional options to use with the default boot option, but not with the
## alternatives
## e.g. defoptions=vga=791 resume=/dev/hda5
# defoptions=quiet splash rootflags=data=writeback,nobh,commit=100
(...)
## altoption boot targets option
## multiple altoptions lines are allowed
## e.g. altoptions=(extra menu suffix) extra boot options
## altoptions=(recovery) single
# altoptions=(recovery mode) single rootflags=data=writeback
#### for the alt options only the data=writeback option is necessary
(...)
Don't forget to run "sudo update-grub"! Be sure to have e.g. a live CD to access the system in case you make a typing error in one of these config files.
WARNING (again): possible severe data loss. Do this at your own risk. This is recommended for laptops and/or desktop systems. Don't do this on servers!
DON'T MAKE A TYPING ERROR BY MIXING UP tune2fs with mke2fs!!! This happened to me once, and it will erase all your data.
More information
Kernel documentation (mostly <directories to kernel>/Documentation/filesystems/ext3.txt), very interesting
manpages: tune2fs
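To see what commit interval a given options string asks for, the rule from the ext3.txt quote above (default 5 seconds, 0 meaning the default) can be written down as a short sketch. commit_interval is a hypothetical helper, not an existing tool, and it only inspects the text of the options field.

```shell
# commit_interval OPTIONS: hypothetical helper for illustration.
# Prints the commit= value found in a comma-separated mount-option string.
# If the option is absent or set to 0, prints 5, the default quoted from ext3.txt.
commit_interval() {
    value=$(printf '%s\n' "$1" | tr ',' '\n' \
        | sed -n 's/^commit=\([0-9][0-9]*\)$/\1/p' | tail -n 1)
    if [ -z "$value" ] || [ "$value" -eq 0 ]; then
        value=5
    fi
    printf '%s\n' "$value"
}

commit_interval "defaults,errors=remount-ro,data=writeback,noatime,nodiratime,nobh,commit=100"
commit_interval "defaults,errors=remount-ro"
```

With the fstab line from the appendix the function prints 100; with the unmodified options it prints 5. If the option appears more than once, the last occurrence is used, which is an assumption of this sketch rather than documented behaviour.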
Solaris ZFS and Red Hat Enterprise Linux Ext3 File System Performance White Paper
data=writeback While the writeback option provides lower data consistency guarantees than the journal or ordered modes, some applications show very significant speed improvement when it is used. For example, speed improvements can be seen when heavy synchronous writes are performed, or when applications create and delete large volumes of small files, such as delivering a large flow of short email messages. The results of the testing effort described in Chapter 3 illustrate this topic.
When the writeback option is used, data consistency is similar to that provided by the ext2 file system. However, file system integrity is maintained continuously during normal operation in the ext3 file system.
In the event of a power failure or system crash, the file system may not be recoverable if a significant portion of data was held only in system memory and not on permanent storage. In this case, the filesystem must be recreated from backups. Often, changes made since the file system was last backed up are inevitably lost.
How to make ext3 or reiserfs use journal data writeback
First, back up your fstab file using the following command
sudo cp /etc/fstab /etc/fstab.orig
Edit the /etc/fstab file using the following command
sudo vi /etc/fstab
Add data=writeback to the options of your fstab root mount line.
/dev/hda1 / ext3 defaults,errors=remount-ro,atime,auto,rw,dev,exec,suid,nouser,data=writeback 0 1
Save that file and exit
Next, back up the GRUB menu file using the following command
sudo cp /boot/grub/menu.lst /boot/grub/menu.lst.orig
Now you need to edit the grub menu list file using the following command
sudo vi /boot/grub/menu.lst
Look for the following two lines
# defoptions=quiet splash
# altoptions=(recovery mode) single
and change them to
# defoptions=quiet splash rootflags=data=writeback
# altoptions=(recovery mode) single rootflags=data=writeback
Save that file and exit
Now you need to update grub using the following command
sudo update-grub
The added flags will automatically be added to the kernel line and stay there in case of a kernel update
Changes to Ext3 FileSystem Only
Note: tune2fs only works for ext3. Reiserfs can't change the journal method
Before rebooting, change the filesystem manually to writeback using the following command
sudo tune2fs -o journal_data_writeback /dev/hda1
Check that the option has been set using the following command
sudo tune2fs -l /dev/hda1
Remove update of access time for files
Having the modified time change is understandable, but having the system update the access time every time a file is accessed is not to my liking. According to the manual, the only thing that might happen if you turn this off is that make might need that information when compiling certain things.
To change this do the following
sudo vi /etc/fstab
Replace atime with noatime in the options
/dev/hda1 / ext3 defaults,errors=remount-ro,noatime,auto,rw,dev,exec,suid,nouser,data=writeback 0 1
Now reboot and enjoy a much faster system
[Aug 7, 2007] Linux Replacing atime
August 7, 2007 | KernelTrap | Last updated 01/02/2020 02:49:04
Submitted by Jeremy on August 7, 2007 - 9:26am.
In a recent lkml thread, Linus Torvalds was involved in a discussion about mounting filesystems with the noatime option for better performance, "'noatime,data=writeback' will quite likely be *quite* noticeable (with different effects for different loads), but almost nobody actually runs that way." He noted that he set O_NOATIME when writing git, "and it was an absolutely huge time-saver for the case of not having 'noatime' in the mount options. Certainly more than your estimated 10% under some loads."
The discussion then looked at using the relatime mount option to improve the situation, "relative atime only updates the atime if the previous atime is older than the mtime or ctime. Like noatime, but useful for applications like mutt that need to know when a file has been read since it was last modified." Ingo Molnar stressed the significance of fixing this performance issue, "I cannot over-emphasize how much of a deal it is in practice. Atime updates are by far the biggest IO performance deficiency that Linux has today. Getting rid of atime updates would give us more everyday Linux performance than all the pagecache speedups of the past 10 years, _combined_." He submitted some patches to improve relatime, and noted about atime:
"It's also perhaps the most stupid Unix design idea of all times. Unix is really nice and well done, but think about this a bit: 'For every file that is read from the disk, lets do a ... write to the disk! And, for every file that is already cached and which we read from the cache ... do a write to the disk!'"
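The relatime rule quoted above (update atime only if the previous atime is older than the mtime or ctime) is easy to express as a comparison of three timestamps. The sketch below follows only that quoted rule; relatime_updates is a hypothetical helper, and real kernels may apply further conditions that are not modelled here.

```shell
# relatime_updates ATIME MTIME CTIME: hypothetical helper for illustration.
# All arguments are timestamps in seconds since the epoch.
# Succeeds (exit 0) if, under the quoted relatime rule, a read would
# update atime, i.e. the previous atime is older than mtime or ctime.
relatime_updates() {
    [ "$1" -lt "$2" ] || [ "$1" -lt "$3" ]
}

# File modified after it was last read: atime would be updated.
if relatime_updates 100 200 150; then echo "update atime"; fi
# File read after its last change: atime is left alone.
if ! relatime_updates 300 200 250; then echo "skip atime update"; fi
```

On a system with GNU coreutils, the three values for a real file could be obtained with `stat -c '%X %Y %Z' FILE` (atime, mtime and ctime in that order).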
Linux Ext2 filesystem for Windows NT driver
Ext2 0.04 for NT4 read-write
Contacts and feedback: Andrey Shedel [email protected]
Primary site: http://www.chat.ru/~ashedel
CAUTION!!! This is an NT kernel-mode driver and you are using it at your own risk. It is highly recommended to use the sync utility to flush regular volumes first.
>> You should be aware of the fact that ext2.sys might <<
>> damage the data stored on your hard disks. <<
If you cannot agree to these conditions, you should NOT use ext2.sys!
Installation (you should be a member of the administrators group):
- copy ext2.sys to your %systemroot%\system32\drivers directory
- merge ext2.reg file
- reboot to update driver information
- edit go.cmd to point to your Linux drive
- run go.cmd
Known features:
- Non-regular files are converted to regular at first write attempt.
Mounting partitions:
NT4: Instead of loading the driver manually or automatically (by setting startup mode to 1) you can use fs_rec.sys (recognizer driver). This driver is a superset of the recognizer that comes with NT4 and can be used instead of it. In addition to CDFS, NTFS and FAT (standard set for NT4) it includes recognition modules for HPFS (for pinball.sys from NT3.51), FAT32-enabled fastfat and Ext2. It is not recommended to use this recognizer on NT5 because support for UDFS is not included. Unfortunately even in this case you still have to set persistent links in DosDevices namespace UNLESS YOU ARE USING NT5. For example:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\DOS Devices]
"E:"="\\Device\\Harddisk0\\Partition2"
NT5: On NT5 you can use Disk Management utility to assign drive letter.
Files included:
- readme.txt - this file
- ext2.sys - driver
- dosdev.exe - Define/RemoveDosDevice utility.
- kloader.exe - utility to load kernel-mode driver.
- ln.exe - hardlink creation utility.
- SYNC.EXE - Flush write-behind cache utility.
- fs_rec.sys - Recognizer driver.
Changes:
0.04:
- pagefile support
- initial security implementation.
[Dec 12, 1999] Slashdot Ask Slashdot EXT3
A great way to follow kernel development is to read the excellent kernel mailing list synopses written by Zack Brown at:
Ext3fs is a journaled version of ext2fs written by Stephen Tweedie. It's in beta form right now but works pretty well. Stephen and Ted Ts'o talked about ext3fs at our Linux Storage Management Workshop in Darmstadt, Germany (you can get the slides for this workshop at ftp://linux.msede.com/lsmws_talks/)
The ext3 filesystem, of which early alphas are ready (version 0.0.2c, the excitement !!). Development is on the linux-fsdevel mailing list, archived here.
Hello, I've been running ext3 on my laptop computer for about two months now. It works great. Just sync the disks and turn it off. No shutdown. No data loss either.
If you look at e.g. Solaris disk-suite you are able to control where you store your metadata. Say that you want to journal file data as well; this normally slows the system down. But if you can specify that all file metadata should be on a separate solid-state disk (naturally mirrored for safety), then journaling of file data will be quick and swift. This is in my view quite important. If I understand everything correctly you can do that with ext3.
One of the major problems with ext2fs (IMHO) is that it doesn't resize well. This is because there is a copy of every group descriptor in every group [a g.d. contains metadata for a group of blocks/inodes, typically 8M in size]. Therefore enlarging or shrinking the drive causes a major reshuffle of ALL the data; so far, the only utility I know that can do this is resiz2fs, which comes with Partition Magic (there are no doubt others now).
This redundancy is good in theory (backups), but keeping a copy of a constant number of group descriptors (perhaps the previous and next 32) in a given group would still give you a lot of redundancy plus make resizing simpler.
Granted, resizing isn't something you do a lot, but having had my system lock up and die while resizing and having to recover using Turbo C++ and the ext2fs spec (code and info on my ext2fs page), it would be nice if ext3fs (or XFS) made this easier.
The Reiser Filesystems by Hans Reiser, a very ambitious project to not only improve performance and add journaling, but to redefine the filesystem as a storage repository for arbitrarily complex objects. Reiserfs is faster than ext2/3 because it uses balanced trees for its directory structures. The project is now released for 2.2.11 - 2.2.13. Mailing list archive here.
The Xfs site has some docs. The work to unencumber the code is accelerating, and February is the target date for source code release. XFS is the one that I think has the most potential. It's a full logging filesystem from the ground up, not an extension (not that EXT3 or DTFS are bad or misguided efforts). I'm betting it will be the highest performance filesystem for Linux when it goes gold. I think the tight integration of the log could be a huge plus. It's been a while since filesystem 101 but I would think that there are a ton of ways to optimize performance with log write-back tricks and usage optimizations. You could include a hit counter in metadata and have an optimizer that moves higher hit files closer to the log in the center of the disk, making your more frequently used files closer to where the head is supposed to be. Those kinds of optimizations (if practical, maybe I'm full of it) wouldn't be nearly as easy with ext3 since the FS doesn't have any knowledge of the log. Plus xfs has ACLs and big file support already.
Hi, ext3fs is a journaled version of ext2fs written by Stephen Tweedie. It's in beta form right now but works pretty well. Stephen and Ted Ts'o talked about ext3fs at our Linux Storage Management Workshop in Darmstadt, Germany (you can get the slides for this workshop at ftp://linux.msede.com/lsmws_talks/)
Stephen also gave a talk on ext3fs at the Linux Kongress in Augsburg, Germany. He is predicting Summer 2000 for production use of ext3fs. Nice features include the fact that ext3fs is backwards compatible with older versions of ext2. In addition, ext3fs uses asynchronous journaling, which means the performance will be as good or better than ext2fs.
I am involved with the SGI effort to port XFS to Linux. The work to unencumber the code is accelerating, and February is the target date for source code release. The read path is working at this time. More work remains however, so stay tuned to -
From Slashdot
Q: I hate these "/dev/hda5 has reached maximal mount count; check forced". I hope they too go away with journaling...
A: Easy fix: raise the max-mount-counts and interval-between-checks for the filesystem with tune2fs.
Example: tune2fs -c 200 /dev/sda1 -i 700
The -l flag will show you, among other things, the current settings. Be aware you are defeating a built-in safeguard to protect your data.
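Since the -l listing is long, it can help to filter it down to the lines that the -c and -i options change. This is a small sketch, assuming the "Mount count:", "Maximum mount count:" and "Check interval:" labels printed by `tune2fs -l`; show_check_settings is a hypothetical helper name.

```shell
# show_check_settings: hypothetical helper for illustration.
# Filters `tune2fs -l` output on stdin down to the mount-count and
# check-interval settings that -c and -i change.
show_check_settings() {
    grep -E '^(Mount count|Maximum mount count|Check interval):'
}
```

A possible invocation would be `tune2fs -l /dev/sda1 | show_check_settings`, run before and after the tune2fs call above to confirm the new values.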
Recommended Links
Sites
Internal
ext3 - Wikipedia, the free encyclopedia
Filesystems HOWTO Extended filesystems (Ext, Ext2, Ext3)
Ext2 IFS For Windows FAQ
Gentoo Forums View topic - Some ext3 Filesystem Tips
Solaris ZFS and Red Hat Enterprise Linux Ext3 File System Performance White Paper
Explores the performance characteristics and differences of Solaris ZFS and the ext3 file system through a series of benchmarks based on use cases derived from common scenarios, as well as the IOzone File System Benchmark (IOzone benchmark) which tests specific I/O patterns.
Ext2fs Home Page Design and Implementation of the EXT/2 Filesystem - by Rémy Card - HTML
Linux is a Unix-like operating system, which runs on PC-386 computers. It was implemented first as extension to the Minix operating system [Tanenbaum 1987] and its first versions included support for the Minix filesystem only. The Minix filesystem contains two serious limitations: block addresses are stored in 16 bit integers, thus the maximal filesystem size is restricted to 64 mega bytes, and directories contain fixed-size entries and the maximal file name is 14 characters.
We have designed and implemented two new filesystems that are included in the standard Linux kernel. These filesystems, called ``Extended File System'' (Ext fs) and ``Second Extended File System'' (Ext2 fs) raise the limitations and add new features.
In this paper, we describe the history of Linux filesystems. We briefly introduce the fundamental concepts implemented in Unix filesystems. We present the implementation of the Virtual File System layer in Linux and we detail the Second Extended File System kernel code and user mode tools. Last, we present performance measurements made on Linux and BSD filesystems and we conclude with the current status of Ext2fs and the future directions.
Analysis of the Ext2fs structure - Table of Contents Copyright (C) 1994 Louis-Dominique Dubeau.
- Introduction
- Blocks and Fragments
- Groups
- Superblock
- Group Descriptors
- Bitmaps
- Inodes
- Directories
- Allocation algorithms
- Error Handling
- Formulae
- Invariants
- References
- Concept Index
A Non-Technical Look Inside the EXT2 File System Issue 21
Ext2fs Undeletion of Directory Structures mini-HOWTO
Transparent compression for the ext2 filesystem
EXT2 Futures -- by Theodore Ts'o
These slides focus specifically on the EXT/2 filesystem, talking about its evolution, philosophy, planned new features, and relation to other linux filesystems.
Design and Implementation of the Second Extended Filesystem -- an excellent paper
A tour of the Linux VFS
ext2fs Utilities (utilities for the ext2 file system - Linux' primary fs) by Theodore Ts'o ext2 Partitions Re ext2 partitions
Filesystem Hierarchy Standard - This page is the home of two standards, the Linux Filesystem Standard (FSSTND) and its successor, the Filesystem Hierarchy Standard (FHS).
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...
You can use PayPal to buy a cup of coffee for the authors of this site.
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google, please disable Javascript for this site. This site is perfectly usable without Javascript.
Last modified: January 02, 2020