BS' Blog: Filesystem Fundamentals and Practices

- Microsoft (MS) DOS 1.0

MS-DOS 1.0 was a direct (and illegal) port of Digital Research's CP/M from the 8080 to the 8088. There is a long history behind that (MS bought it for $50,000 from Seattle Computer Products, the original pirates, and IBM later settled out of court with DR for $800,000). But the limitations of CP/M were clear: no directories, only 1,024 files in the filesystem, and filesystem reference by drive letter (e.g., A:, B:, C:).

The File Allocation Table (FAT) approach was simple, but effective. The filesystem was a simple set of sectors with two (2) file allocation tables, one original and one backup. The allocation tables tracked the allocation of sectors. If a file took up only one sector, the corresponding FAT entry for that sector was marked as the end of the file. If the file took up more than one sector, the initial FAT entry noted the next sector of the file. Each file is thus a chain of FAT entries, each referencing the next sector.
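
The chain-following idea can be sketched in a few lines of Python. This is only a toy model: the list layout, the EOF and FREE markers, and the function name follow_chain are assumptions made for illustration, not the on-disk FAT12 encoding.

```python
# Toy FAT: fat[i] holds the number of the next sector of the file,
# EOF when sector i is the last sector, or FREE when it is unallocated.
# These markers are hypothetical; real FAT12 uses reserved 12-bit values.
EOF = -1
FREE = None


def follow_chain(fat, first_sector):
    """Return the ordered list of sectors that make up one file."""
    chain = []
    seen = set()
    current = first_sector
    while current != EOF:
        if current in seen:
            raise ValueError(f"loop in FAT chain at sector {current}")
        if fat[current] is FREE:
            raise ValueError(f"chain runs into free sector {current}")
        seen.add(current)
        chain.append(current)
        current = fat[current]  # the entry names the next sector
    return chain


# Two files: one spans sectors 2 -> 3 -> 7, the other fits in sector 5.
fat = [FREE] * 10
fat[2], fat[3], fat[7] = 3, 7, EOF
fat[5] = EOF

print(follow_chain(fat, 2))  # [2, 3, 7]
print(follow_chain(fat, 5))  # [5]
```

The one-sector file is simply a chain of length one, which is the "end of file" case described above.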

The FAT references were 12-bit, allowing up to 4,096 sectors to be addressed. With sector sizes of 512, 1,024 or 2,048 bytes, FAT12 could handle up to an 8MiB device. With up to 1,024 filenames, the FAT of a FAT12 filesystem only took up 1.5KiB (12,288 bits) of space.
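
The arithmetic behind those figures is easy to check. The helper names below are made up for this sketch; the only inputs are the 12-bit entry width, the sector sizes, and the 1,024-entry count quoted above.

```python
FAT_ENTRY_BITS = 12


def max_volume_bytes(sector_size, entry_bits=FAT_ENTRY_BITS):
    """Largest device addressable when every entry names one sector."""
    return (2 ** entry_bits) * sector_size


def table_bytes(entries, entry_bits=FAT_ENTRY_BITS):
    """Space taken by a table holding the given number of entries."""
    return entries * entry_bits // 8


for sector_size in (512, 1024, 2048):
    mib = max_volume_bytes(sector_size) // 2 ** 20
    print(f"{sector_size}-byte sectors: {mib} MiB")

print(f"1,024 entries: {table_bytes(1024)} bytes")
```

This gives 2, 4 and 8 MiB for the three sector sizes, and 1,536 bytes (1.5KiB) for a table of 1,024 entries.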

NTFS is, in essence, a modified version of FAT. It still uses a FAT design, but has far fewer limitations (e.g., no more 8.3 limitations) and uses a more intelligent approach. For one, the FATs are located closer to the middle of the filesystem to reduce seek times (FAT filesystems allocate them at the start). There are also now formal approaches to discover which copies of the FAT are correct when they differ. Lastly, like HPFS, NTFS marks and forces filesystem integrity checks when the system is not properly shut down and the filesystem taken off-line (and uses a CHKDSK.EXE program that, although it shares the name, is radically different from the legacy DOS program). NTFS one-ups HPFS by adding journaling, which reduces the recovery time required for bringing the filesystem on-line in a consistent state.

Windows Millennium Edition (ME) was a Microsoft experiment to remove a lot of the legacy DOS 20-3Fh services and various interface options, to force its own software application developers and independent software vendors (ISVs) to stop using the legacy DOS interfaces and start using the native NT/Win32 filesystem interfaces (among others). It was an utter failure, as it did little to force change, all while destroying compatibility.

Most people think the lack of filesystem journaling (i.e., recording transactions and ensuring they are completed, somewhat like the "Atomicity" -- the first part of ACID in a good database design -- in a filesystem) is the issue, but it's actually not at all. Even non-journaling filesystems in UNIX have at least a mechanism to not only ensure consistency, but force a check to make the filesystem consistent. All of Microsoft's FAT filesystems, even in NT5 (200x/XP), never force the user to fully check a filesystem for consistency before mounting. In UNIX, we only allow filesystems to be mounted "read-only," if at all, until they are checked for consistency -- so no changes can occur on a possibly inconsistent filesystem.

NT systems have no concept of a "read-only" mount. During start-up, NT expects everything but the "System" volume (the "System" volume is BIOS fixed disk 80h with the MBR, NTLDR, BOOT.INI, optional NTBOOTDD.SYS 3rd party disk driver, etc...) to be read/write, including the "Boot" volume (the "Boot" volume, which ironically comes after the "System" volume/stage, is the volume with \WINNT or \WINDOWS). This is due to the fact that many NT services expect the filesystem to be writeable during boot, including before any filesystem integrity check and/or journal replay of NTFS is made -- which could result in corruption.

Adding insult to injury, NTFS is very, very aggressive in its journal playback. Unless forced otherwise by explicit user option, NTFS often replays its journal. In rare, but eventually probable cases over extended usage and time, NTFS will self-destruct and leave itself unable to recover even with a manual CHKDSK. Therefore, it is important that NT system administrators regularly force a manual CHKDSK at next boot to enforce regular, full filesystem integrity checks and minimize the chance of a future, improper journal replay.

- inode filesystems

UNIX systems use inode filesystems. Each filesystem entry, typically a directory or file, has an inode that both stores meta-data and points to the data blocks. A key difference and mindshift from a FAT design is that FAT has a dedicated allocation table with a 1:1 reference to data blocks -- whereas inode filesystems actually use two different block types, the data blocks and the inode blocks that point to them. FAT uses a dedicated allocation table of all possible blocks that could be allocated; inodes do not. In fact, some filesystems (those that pre-allocate inodes) can "run out of inodes" when a filesystem contains lots of small files and there is not a 1:1 inode to data block ratio (e.g., run "df" and "df -i" and note the actual data blocks and inodes used).
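
The same comparison that "df" and "df -i" give can be made from Python through os.statvfs, which is available on UNIX-like systems. This is a rough sketch: the function names are made up here, and the percentages will not match df exactly because df bases its figure on the space available to non-root users.

```python
import os


def usage_percent(total, free):
    """Percentage in use; 0.0 when the filesystem reports no total."""
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def block_and_inode_usage(path="/"):
    """Return (block usage %, inode usage %) for the filesystem at path."""
    st = os.statvfs(path)
    blocks = usage_percent(st.f_blocks, st.f_bfree)
    inodes = usage_percent(st.f_files, st.f_ffree)
    return blocks, inodes
```

Calling block_and_inode_usage("/var") returns the two percentages side by side. A filesystem full of small files shows the second number climbing much faster than the first.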

Pretty much every data block in an inode filesystem has an inode pointer using it, or reserving it (although designs differ), except the rare superblocks. The superblocks contain the core filesystem information (basic filesystem values, location of key inodes, free blocks, etc...) and are only a few kilobytes in size (typically one data block, 4KiB is commonplace). Several redundant copies are spread all over the disk (typically at fixed locations that are easy for experienced administrators to find in case a filesystem can and should be mounted with an alternate superblock).

A FAT filesystem makes it easy to check for free blocks. Inode filesystems are more arbitrary, and a filesystem consistency check is always recommended on a regular basis to ensure the number of free blocks, as well as the lists of blocks that are available or have been freed, are consistent. A FAT filesystem also makes it much easier to check for cross-linked files, whereas all inodes need to be inspected to see if the same data block is referenced by more than one inode (although there is an interesting bonus to this, as we'll discuss). Inode operations definitely make checks longer and more involved, although there are bonuses to this in consistency (which we'll discuss).
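
To see why the cross-link check has to visit every inode, here is a toy version of it. The dictionary of inode numbers to block lists is an assumption made for this sketch; a real fsck reads these pointers from the on-disk inodes and indirect blocks.

```python
from collections import defaultdict


def find_cross_links(inode_blocks):
    """Map each block claimed by more than one inode to its claimants.

    inode_blocks: dict of inode number -> list of data block numbers.
    """
    owners = defaultdict(set)
    for inode, blocks in inode_blocks.items():
        for block in blocks:
            owners[block].add(inode)
    return {
        block: sorted(inodes)
        for block, inodes in owners.items()
        if len(inodes) > 1
    }


# Inodes 11 and 13 both claim block 101.
table = {11: [100, 101], 12: [102], 13: [101, 103]}
print(find_cross_links(table))  # {101: [11, 13]}
```

Nothing can be concluded until the last inode has been read, which is the extra cost compared with scanning a single allocation table.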

A big one to start is actually a surprise to many. Most Windows users assume that the "root inode" makes an inode filesystem more susceptible to corruption than a FAT filesystem, because it points to everything else. In recovery, it's actually the opposite, a major benefit. FAT filesystems separate the allocation entries (in the fixed FAT) from the directory references (in special data blocks), which means there are 2 different points where a failure could destroy the same data. Again, this is because the FAT design comes from MS-DOS 1.0, before directories were added in MS-DOS 2.0. Although NTFS improves somewhat on this separation, by allocating directories separately from files, there are still 2 points where either can cause the same, severe damage. Anyone who has had even a CHKDSK on an NTFS filesystem result in unknown "FILE####.CHK" files that no one knows about has experienced this issue first hand.

In an inode filesystem, if directory links are severed between the root inode, or any parent directory inodes below, at least the inodes below that point are now their own tree. This is because inodes store both the directory tree and pointers to data blocks in one structure, the inode itself. If a filesystem integrity check results in a portion of the tree being "severed," that portion typically shows up as its own, self-contained tree under the typical "lost+found" directory -- names, directories, subdirectories, etc... intact. It all depends on the locality of the corruption or other fixed inconsistency, but if there were only a few points of actual "corruption," inode filesystems tend to be much easier to "piece back together" than FAT designs as a result of the "reference and allocation information as one" inode design.
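
A toy model shows how a severed subtree can be found and kept whole. Everything here is an illustrative assumption (the dictionary layout, the inode numbers, the function name); it is not how any particular fsck is implemented.

```python
def orphan_subtree_roots(directories, root):
    """Find the top directories of subtrees no longer reachable from root.

    directories: dict of directory inode -> {name: child inode}.
    Inodes that never appear as keys are treated as plain files.
    """
    reachable = set()
    stack = [root]
    while stack:
        inode = stack.pop()
        if inode in reachable:
            continue
        reachable.add(inode)
        stack.extend(directories.get(inode, {}).values())

    orphans = set(directories) - reachable
    # An orphan that another orphan still points to is not a subtree root.
    referenced = {
        child for d in orphans for child in directories[d].values()
    }
    return sorted(orphans - referenced)


# The link from the root tree to directory 40 has been lost.
dirs = {
    2: {"home": 10},
    10: {"alice": 20},
    20: {"notes.txt": 30},
    40: {"src": 41, "README": 42},
    41: {"main.c": 43},
}
print(orphan_subtree_roots(dirs, root=2))  # [40]
```

Only inode 40 needs to be reattached under "lost+found"; its subdirectory and file names come along unchanged because each directory still holds its own entries.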

Most UNIX/Linux filesystems do many things to reserve usage of a filesystem. A big one is the common 2-10% (Ext2/3 use 5% by default) reservation of a filesystem. When a filesystem reaches 90-98% full (95% full on Ext2/3 by default), the kernel will prevent any further writing to the filesystem by anyone but root. Neither regular users nor most processes run as root, so the disk effectively stops allowing writes at 90-98%. At first this seems foolish and, in fact, many people complain about it, but it is there for one very big reason -- fragmentation.
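
As a worked example of the reservation, the sketch below computes how many blocks a non-root user can still allocate. The function name and the block counts are hypothetical; the 5% default is the Ext2/3 figure quoted above.

```python
def writable_by_users(total_blocks, used_blocks, reserved_percent=5.0):
    """Blocks a non-root user can still allocate on the filesystem."""
    reserved = int(total_blocks * reserved_percent / 100)
    return max(0, total_blocks - reserved - used_blocks)


total = 1_000_000  # hypothetical filesystem size in blocks

print(writable_by_users(total, 940_000))  # 10000 blocks left for users
print(writable_by_users(total, 960_000))  # 0: only root can still write
```

At 96% full the filesystem still has 40,000 free blocks, but all of them sit inside the reserve.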

Fragmentation increases exponentially as a filesystem fills up. This reservation is a long taught, long learned lesson for UNIX/Linux administrators that should be well respected. Anyone who has filled up a Windows server volume should appreciate this given how poorly a Windows server performs afterwards, which is the same problem a UNIX server would suffer if it allowed it, too. But unlike Windows servers, almost all UNIX/Linux distributions and filesystems (with a few notable exceptions) enact this reservation -- to combat the sudden and horrendous fragmentation that occurs as a filesystem becomes nearly full.

I never make just one data filesystem, I make at least two. That way, if a full fsck is required on one, I can bring the server up and let half my users work while the other half waits 15-60 minutes, instead of having all my users wait on the 30-120 minute fsck required on one, big data filesystem.

I leave it up to individual sysadmins to decide for themselves, but I encourage you not to avoid learning LDM and LVM/LVM2, because there are sound reasons for learning them.

In fact, the common physical volume (pv), volume group (vg), logical volume (lv) 3-level approach in Linux's LVM is basically ubiquitous across a host of UNIX flavors and their various platforms. Learning the elementary terminology, and how to do basic, harmless operations like allocating new space, is highly recommended. If you don't know your way around any UNIX LVM, then learn it so you are ready to deal with most implementations.

I absolutely and positively will not tolerate /tmp and /var on anything else. If I am absolutely hurting for space, I will symlink /tmp -> /var/tmp. I like to avoid putting even /tmp on the root filesystem. When implementing these "bare minimum" Linux filesystems, I assume /home will be mounted remotely. If not, then a separate /home filesystem is of great consideration, although there are one or two workarounds (see /usr/local below).

ReiserFS continues to build a revolutionary filesystem that lacks the traditional UNIX inode layout and interfaces, which is why ReiserFS lacks a lot of kernel feature compatibility, and not all of the Linux Virtual Filesystem (VFS) layers can abstract these features to a filesystem that just isn't of the same, traditional design. This prevents me from using ReiserFS. As an additional consideration, by his own admission, Hans Reiser has stated that filesystems should be redesigned every 5 years. As much as I've seen ReiserFS handle dynamic changes without incident, as much as I've never seen ReiserFS make a journal misplay, the fact remains that with a continually fluid design, or significant changes on a regular basis, the off-line tools continue to lag the on-line kernel implementation. So while I might be okay as long as a ReiserFS filesystem is matched against the proper kernel, the second ReiserFS properly does not trust its journal replay, I'm at the mercy of the off-line tools. And so far, I've had horrendous luck when that happens.

I adopted Ext3 in early 2000 for kernel 2.2 when it still had only the "[full data] journaling" mode. It was little more than a simple "double-buffer" commit. It was easily converted to Ext2, as well as back, and it did the job of drastically reducing fsck times. Probably the biggest sell for Ext3 was the ability to drop into a full fsck when necessary -- something that saved me dearly when a physical disk error occurred (and my RAID card firmware and driver were not compatible -- long story). To use a trusted fsck of 10 years on a filesystem whose structure had not changed in the same period of time was convincing enough.

Since then, I have only trusted my "essential" filesystems to Ext3 without reservation. I have never lost an Ext3 filesystem, and I have had no unexpected data loss with either "journal" or "ordered writes" mode. I purposely avoid "write back" mode due to its inherent issue that it could affect files that are not being modified. With newer directory indexing features, I find the performance of Ext3 to be more than adequate for filesystems under 100GB. It should be noted that I purposely avoid using Ext3 on filesystems greater than 1TB (even though newer versions support up to 8.8TB/8TiB).

The Ext3 base feature set -- full NFS compatibility, plus most other standard Linux features in mid-to-late 2.4 (quotas, POSIX EAs/ACLs, etc...) -- was sufficient for most operations, especially in the early days of Ext3 back in kernel 2.2.

The major, key differentiation of XFS is that it is built upon its existing, proven, stable structure on Irix. That included the full suite of off-line tools with 5+ years of deployment -- xfs_repair, xfsdump/xfsrestore, xfs_growfs, etc... The off-line repair tool was very trusted. The dump/restore pair not only benefited from the native storage of any EA/ACL information directly in the inode, but could also be safely run against a mounted XFS filesystem and did not require a snapshot or other volume management "freeze" (unlike Ext3). XFS already had the ability to be grown, managed, reorganized (defragmenter), etc... with an existing suite of tools, not tools that were merely being promised for development.

For data filesystems, I was sold on XFS and started using it immediately. I tested XFS for other filesystems as well, but quickly stopped considering it, both because the performance of "temporary" filesystems was not optimal and because I had two /var filesystems get hit by the XFS 1.0 bug. The bug was an oversight in the design of the one additional requirement for the Linux port, the paging facility that was previously tied to Irix -- something that has long been fixed and is now trusted (especially in 2.6, where the paging facilities are part of the stock kernel code).

The only issue of major concern with Ext3 is the pre-allocation of inodes. The ratio of blocks to inodes is typically 8-16 to 1 or so (one inode for every 32-64KB on the typical filesystem with 4KB blocks). On the /var filesystem, or another temporary filesystem with lots of small files -- possibly a mail or news spooler (although not nearly as much in the case of mail these last few years with MS-TNEF flying around ;-), this is not ideal. It is very often the case that a "df -i" will show twice as much inode usage as block usage -- although newer logging defaults in most distributions/services are not nearly as bad as of late. So using the "-i" or "-T" option to "mke2fs" is recommended when creating Ext3 /var and /var/spool filesystems. E.g. (1:1 inode-to-data block assuming a default data block is 4KB):
# mke2fs -i 4096 -j -L var /dev/vg00/lv04
# mke2fs -j -L var -T news /dev/vg00/lv04

See "man 8 mke2fs" for more information.
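
To get a feel for what the bytes-per-inode value means, the sketch below estimates the inode count for a hypothetical 10GiB /var. The function names are made up for illustration, and the result is approximate since mke2fs rounds the count to fit its block group layout.

```python
def inode_count(fs_bytes, bytes_per_inode):
    """Approximate number of inodes created for a filesystem."""
    return fs_bytes // bytes_per_inode


def blocks_per_inode(bytes_per_inode, block_size=4096):
    """How many data blocks there are for every inode."""
    return bytes_per_inode / block_size


fs_bytes = 10 * 2 ** 30  # hypothetical 10GiB logical volume

for bpi in (4096, 32768, 65536):
    print(
        f"-i {bpi}: about {inode_count(fs_bytes, bpi):,} inodes, "
        f"{blocks_per_inode(bpi):g} block(s) per inode"
    )
```

With one inode per 4KB the filesystem gets roughly 2.6 million inodes, against about 164,000 to 328,000 at one inode per 32-64KB.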

However, I typically deploy XFS on user data filesystems, and the rare, large service directory (e.g., database, IMAP spool, etc...). On user data filesystems, I typically wish to take full advantage of Extended Attributes (EAs) like Quotas, ACLs, SELinux, etc... support. The default 256 byte size of an XFS inode is not ideally suited for storing POSIX ACLs, as less than 64 bytes are typically left for EAs. Should an inode need more space, a full data block (typically 4KB) would be allocated, which is not always ideal, plus it means not all of the meta-data is stored in a single inode. So when using ACLs and/or SELinux, increasing the inode size in XFS to 512 bytes (possibly 1024 bytes when using both heavily) is recommended, at only a small disk penalty overall (a tad more noticeable with 1024 bytes). The option to use a larger inode size when creating an XFS filesystem is "-i size=value" such as follows:
# mkfs.xfs -i size=512 -L engr_unclass /dev/vg01/lv01
# mkfs.xfs -i size=1024 -L engr_secret /dev/vg02/lv01

1. Kernel 2.4

I have _never_ used the XFS backport to kernel 2.4. Frankly, I don't trust it. Not because of XFS, but because of kernel 2.4, and because it doesn't come directly from SGI, tested and blessed.

I have only used the official XFS releases for kernel 2.4, largely XFS 1.2 for Red Hat 7.x, with limited use of 1.3 on Red Hat Linux 9. In fact, I kept deploying only Red Hat Linux 7.3 and, to a lesser extent, Red Hat Linux 9 with XFS until late last year (once Fedora Core 3 came out), tapping FedoraLegacy.ORG for updates. Again, this means I'm back at kernel 2.4.20 -- and I really never trusted newer 2.4 kernels anyway! I have not had the NFS issues others have complained about, and I've leaned heavily on xfsdump for backups including ACL information, as well as on quota support.

3. LVM/MD Usage

I limit my use of LVM to volume slicing. Let me start by saying that I'm a huge fan of volume management. I use both LVM and LVM2 for flexible, on-line additions/modifications of logical volumes. In a nutshell, I largely use it to slice my disks with more flexibility -- reserving space, creating new volumes as necessary and the occasional expansion (although I typically try to stick to new mounts/symlinks).

But with that said, let it be known that I don't trust LVM and especially not LVM2 with snapshots, more complex resizing and definitely not any RAID operations. I do not trust DeviceMapper (DM) with either LVM2 or EVMS right now. Why? All I keep reading about is race condition after race condition after race condition. And in each case, it's not limited to XFS.

When it comes to MD, I really avoid it. I always have. I've seen a lot of people talk about how software RAID is better, faster, etc... I've seen people state that it allows them to use different disk controllers and other hardware, and not be tied to a vendor. They also claim it's more flexible and gives them more options. While I believe they are sincere, I can quickly and easily point out they are not comparing software RAID to solid, proven hardware RAID products from select vendors. I've just had a different set of hardware RAID experiences.

First off, I've limited myself to only 3Ware and select LSI Logic (including former Mylex) products over the last 5 years. 3Ware uses an ASIC-driven "storage switch" and I have only deployed LSI Logic (and former Mylex) products that are XScale (which is based on StrongARM). These are very, very high performing -- able to move a lot of data with not only little CPU overhead, but more importantly, without the extensive use and duplication of data streams through the CPU-memory interconnect. I.e., it's not the XORs that get you, but the duplicated data streams tying up the interconnect that data services could be using. It's the same reason why hardware switches/routers are better networking equipment than PCs -- these "storage switch - I/O processors" are the same. Their on-board RAID intelligence is self-contained, meaning their drivers are simple, GPL block drivers. Even Intel is moving to put its XScale I/O Processors (IOP) on Xeon mainboards, possibly in the I/O Controller Hub (ICH) directly -- to off-load these unnecessary operations for today's network/storage (RAID, layer 2/3/4 frames/packets/transports, iSCSI overhead, etc...) from the CPU-memory interconnect, which is not designed for them (and which only unnecessarily duplicates data streams, taking time away from actual data processing).
