Presented by Benjamin J. McMillan on September 25, 2002
Table of Contents
- 1. Definitions
- 1.1. What is a filesystem?
- 1.2. Metadata
- 1.3. I-nodes
- 1.4. Directories
- 1.5. B-Trees (and B+trees, B*trees)
- 1.6. Journaling (only for 2.4.x+ kernels)
- 2. Filesystem Types
- 3. Conclusion
1. Definitions
1.1. What is a filesystem?
Most hard drives contain numerous tracks, each track
containing many sectors (or blocks), each sector/block
containing 512 bytes, each byte containing 8 bits. That’s a lot
of 1s and 0s! These 1s and 0s are useless, however, if they are
not written to the hard drive in order and with some
organization. Filesystems facilitate not only the storage, but
also the location, retrieval and manipulation of data. By
telling where the data should be stored and how it should be
read and manipulated, filesystems enable us to make good use of
those expensive hunks of metal, glass, ceramic, silicon, and
more.
1.2. Metadata
Metadata is crucial to filesystems. Metadata, "data about
data," is information stored in reference to, but not a part of,
the main data written to the disk. For instance, whenever a 4
kilobyte file, "foo.bar," is written to the disk, the metadata
records its size, position on the disk, name, and more.
1.3. I-nodes
Metadata is stored in sections of the hard disk referred to
as i-nodes. I-nodes also contain block maps, which record
exactly where a file’s data is on the disk. Unfortunately, not
all files occupy one contiguous run of blocks. Rather, a file
might be split into many separate sections scattered all over
the disk (aka fragmentation).
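To make the block map idea concrete, here is a minimal sketch in Python. The Inode class, its fields, and the block numbers are all invented for illustration; a real i-node holds much more and is laid out very differently on disk.

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    """Hypothetical stand-in for an i-node: some metadata plus a block map."""
    size: int                                      # file size in bytes
    block_map: list = field(default_factory=list)  # disk block numbers, in file order

def contiguous_runs(block_map):
    """Group a block map into (first, last) runs of consecutive block numbers."""
    runs = []
    for block in block_map:
        if runs and block == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], block)   # extend the current run
        else:
            runs.append((block, block))       # a gap: start a new run
    return runs

# A 4 KB file stored in eight 512-byte blocks, split across three places.
foo = Inode(size=4096, block_map=[100, 101, 102, 103, 500, 501, 900, 901])
print(contiguous_runs(foo.block_map))  # [(100, 103), (500, 501), (900, 901)]
```

Three runs instead of one means the drive has to seek three times to read the whole file, which is why fragmentation hurts.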
1.4. Directories
Directories are simply containers for any number of files
(units of data). Directories can be easily stored as linear
lists, each entry containing at least the name of a file and
that file’s inode. Of course, directories can also contain
sub-directories, so an entry in one list can point to another
list. And aha! Linked lists. These linked lists therefore form
hierarchical structures, which should be familiar to most
computer users.
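As a rough sketch of how such lists chain together, the Python below resolves a path by scanning one directory list per path component. The inode numbers, the names, and the lookup function are made up for the example.

```python
# Each directory is a linear list of (name, inode number) entries,
# keyed here by the directory's own inode number. All numbers are invented.
directories = {
    2:  [("home", 11), ("etc", 12)],   # inode 2 plays the role of "/"
    11: [("ben", 21)],
    12: [],
    21: [("foo.bar", 42)],
}

def lookup(path, root=2):
    """Walk the path one component at a time, scanning each list linearly."""
    inode = root
    for name in [part for part in path.split("/") if part]:
        if inode not in directories:
            raise NotADirectoryError(name)
        for entry_name, entry_inode in directories[inode]:
            if entry_name == name:
                inode = entry_inode     # follow the entry to the next list
                break
        else:
            raise FileNotFoundError(path)
    return inode

print(lookup("/home/ben/foo.bar"))  # 42
```

Every lookup is a linear scan, which is fine for small directories and painful for huge ones.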
1.5. B-Trees (and B+trees, B*trees)
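More advanced filesystems utilize B-trees (or sometimes hash
tables) to store the directory’s contents. B-trees keep the
entries in a sorted, balanced structure, which makes indexing
and searching much easier and faster, because a lookup only has
to visit a few nodes rather than scan every entry. They are
also relatively compact (each node holds many keys, so the tree
stays shallow) and scalable.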
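Here is a small sketch of a B-tree lookup in Python, assuming a hand-built two-level tree that maps file names to invented inode numbers. It only searches; insertion and rebalancing, which are the hard parts, are left out.

```python
from bisect import bisect_left

class Node:
    """One B-tree node: sorted keys, matching values, and (for inner
    nodes) exactly len(keys) + 1 children. A leaf has no children."""
    def __init__(self, keys, values, children=None):
        self.keys = keys
        self.values = values
        self.children = children or []

def search(node, key):
    """Return (value, nodes_visited); value is None if the key is absent."""
    visited = 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)            # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i], visited
        node = node.children[i] if node.children else None
    return None, visited

root = Node(
    ["foo.bar", "notes.txt"], [42, 57],
    [
        Node(["a.out", "core"], [10, 11]),
        Node(["hello.c", "main.c"], [20, 21]),
        Node(["readme", "todo"], [30, 31]),
    ],
)
print(search(root, "main.c"))  # (21, 2)
```

Eight names are found in at most two node visits here. With hundreds of keys per node, a few levels are enough for millions of entries.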
1.6. Journaling (only for 2.4.x+ kernels)
Journaling is the coolest part of most modern filesystems
(IMHO). A relatively recent addition, journaling makes
filesystems more reliable and less prone to corruption. If
your system ever crashes (not that Linux would ever crash, but
if you experience a power failure, or worse, you’re using
Windows), a journaling filesystem will be able to repair
itself. Metadata is the essential component of all journaling
filesystems, because it is the metadata that these
filesystems use to make sure the data is accurate and/or
complete. Recent changes are recorded in a log (the journal),
so during bootup after a mishap, only the metadata that was
manipulated recently (immediately before the crash) is
analyzed. Therefore, the filesystem is repaired (brought to a
consistent state) very quickly, no matter how big the drive
is! In addition to how filesystems write/read the data to and
from the disk, the manner in which filesystems utilize
journaling makes them different from one another.
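The recovery step can be sketched as a journal replay. This is a toy model in Python: the record format and names are invented, and no real filesystem stores its journal as tuples.

```python
def replay(disk, journal):
    """Apply every committed transaction to `disk`; ignore incomplete ones.
    Returns the ids of transactions that never reached their commit record."""
    pending = {}
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "update":
            _, _, block, value = record
            pending.setdefault(txid, []).append((block, value))
        elif kind == "commit":
            for block, value in pending.pop(txid, []):
                disk[block] = value
    return sorted(pending)

disk = {"inode 42": "size=0", "bitmap": "free"}
journal = [
    ("update", 1, "inode 42", "size=4096"),
    ("update", 1, "bitmap", "used"),
    ("commit", 1),
    ("update", 2, "inode 42", "size=8192"),   # crash hit before the commit
]
discarded = replay(disk, journal)
print(disk, discarded)
```

The work done depends on the length of the journal, not the size of the disk, which is where the quick recovery comes from.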
This presentation is a comparison of how XFS, Ext2/3, and
ReiserFS manage data and metadata.
2. Filesystem Types
2.1. Ext2
An oldie but a goodie. Ext2 is the most used filesystem for Linux, largely because it has been around for a long while. The great quality of Ext2 is that it is really fast. Unlike older Unix filesystems, Ext2 does not organize the disk by cylinder groups; it leaves the physical geometry to the hard drive. Instead, it divides the disk into separate block groups. The disadvantage of Ext2 is that it does not implement any sort of journaling itself. In the rare case that the system reboots unexpectedly, it is up to a filesystem check program (fsck), which has to scan the whole filesystem, to analyze and repair any damage during the next boot. However, one should not immediately turn away from this filesystem. It is supported by most Unix-like operating systems (Linux, FreeBSD, etc.). So if you want compatibility and speed, and are willing to sacrifice the awesome journaling capabilities of the filesystems discussed next, then Ext2 is for you!
Pros: Most used, tried and true, fast, awesome compatibility, well-rounded, "comfortable", solid
Cons: Can’t journal
2.2. Ext3
However, if you do want journaling, maybe you should consider
Ext3. Ext3 uses the same code Ext2 does, so it is just as
compatible as Ext2. The only real difference is the addition of
journaling capabilities. Conversion between Ext2 and Ext3 is
reversible; one can easily upgrade an Ext2 filesystem to Ext3,
and vice versa (but why would ya?).
The people who created Ext3 were a little creative with
journaling techniques. First of all, to ensure the integrity of
both the metadata and data, they journaled changes to both
metadata and data. While some other filesystems like XFS use
logical journaling, Ext3 uses physical journaling. Physical
journaling stores complete replicas of the modified blocks,
which also contain unmodified data. This might seem a little
wasteful, but it has its advantages, discussed below. Logical
journaling, on the other hand, records only the modified spans
of bytes (a partial snapshot). Physical journals are generally
less complex than logical journals. Also, physical journaling
allows some optimization, like being able to write the changes
to disk in one write operation (increasing speed and reducing
CPU overhead). Finally, after all this, both the data and
metadata will be consistent.
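The difference between the two record styles is easy to show in a sketch. The Python below assumes a 4096-byte block and an invented record format; neither Ext3 nor XFS lays out its journal like this.

```python
BLOCK_SIZE = 4096  # assumed block size for the example

def physical_record(block_no, new_block):
    """Physical journaling: log a full copy of the modified block."""
    return ("block", block_no, bytes(new_block))

def logical_record(block_no, old_block, new_block):
    """Logical journaling: log only the span of bytes that changed."""
    changed = [i for i in range(len(old_block)) if old_block[i] != new_block[i]]
    if not changed:
        return ("span", block_no, 0, b"")
    start, end = changed[0], changed[-1] + 1
    return ("span", block_no, start, bytes(new_block[start:end]))

def apply(block, record):
    """Replay either kind of record onto a copy of the old block."""
    out = bytearray(block)
    if record[0] == "block":
        out[:] = record[2]                       # overwrite the whole block
    else:
        _, _, start, data = record
        out[start:start + len(data)] = data      # patch just the span
    return bytes(out)

old = bytes(BLOCK_SIZE)
new = bytearray(old)
new[100:108] = b"newsize!"                       # an 8-byte metadata change
phys = physical_record(7, new)
logi = logical_record(7, old, new)
print(len(phys[2]), len(logi[3]))  # 4096 8
```

Both records rebuild the same block on replay. The physical one is bigger but needs no knowledge of what the bytes mean, which is the simplicity argument made above.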
Unfortunately this is still a bit slow. Recently, Ext3
started using an alternative. This new method journals metadata
only (bear with me). The new driver combines the writes to data
and metadata into one entity, called a transaction. Basically,
each transaction keeps track of the data blocks that correspond
to each journal update, and consists of first writing the data
blocks to disk, then writing the metadata to the journal.
Because metadata is never committed before the data it
describes, this provides the same data/metadata consistency
without the performance sacrifice.
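Why the ordering matters can be shown with a toy crash simulation in Python. The function names, the record format, and the block and inode numbers are all invented; this only models the order of the three writes, not the real Ext3 driver.

```python
def write_file(disk, journal, block_no, data, inode, data_first=True, crash_after=None):
    """Perform the three writes in order, stopping early to simulate a crash."""
    def write_data():
        disk[block_no] = data
    def log_metadata():
        journal.append(("update", inode, block_no))
    def commit():
        journal.append(("commit", inode))

    if data_first:
        steps = [write_data, log_metadata, commit]   # the ordering described above
    else:
        steps = [log_metadata, commit, write_data]   # the risky ordering
    for n, step in enumerate(steps, start=1):
        step()
        if n == crash_after:
            return

def metadata_is_safe(disk, journal):
    """Committed metadata must only point at blocks that were really written."""
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    return all(rec[2] in disk
               for rec in journal
               if rec[0] == "update" and rec[1] in committed)

for data_first in (True, False):
    for crash in (1, 2, 3):
        disk, journal = {}, []
        write_file(disk, journal, 900, b"hello", 42,
                   data_first=data_first, crash_after=crash)
        print(data_first, crash, metadata_is_safe(disk, journal))
```

With the data written first, no crash point leaves committed metadata pointing at garbage. With the other ordering, a crash right after the commit does.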
Pros: All of those of Ext2, plus journaling! Easy to deploy!
Cons: Can be slow, depending on the type of journaling being used,
and creates unnecessary disk activity
2.3. ReiserFS
This filesystem is very often talked about. Hans Reiser, the
creator, wanted to create a filesystem that would meet the
performance and feature needs of its users, without making them
build special solutions like databases on top of the
filesystem (which degrades speed and efficiency). ReiserFS
is very good at handling small files. It does this by using
balanced B*trees, which boost performance and make it more
scalable, flexible, and efficient. Instead of having a fixed
space for inodes set during the creation of the filesystem
(Ext2 does this), ReiserFS dynamically allocates the inodes.
Another cool feature of ReiserFS deals with tails. Tails are files (or ends of files) that are smaller than a filesystem block. Filesystems like Ext2 write these to the disk like all other data, but since Ext2 allocates storage space in whole blocks of 1k or 4k, the rest of that reserved block is wasted. ReiserFS, alternatively, stores these tails directly in the B*tree leaf nodes, instead of storing only the address of the data in the nodes and the data itself elsewhere on the disk, as it does for all other files. This is the trick to increasing small file performance, since both the data and metadata are in one place and can therefore be read in one swoop. ReiserFS also packs the tails together. Not only does the way ReiserFS manages tails make it faster, but it also saves space (typically about a 6% increase in storage capacity over Ext2). Unfortunately, it’s not all great, because whenever files are changed, ReiserFS must repack the tails; this causes a decrease in performance. Tail packing can be turned off, for those speed freaks out there.
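The wasted space is simple arithmetic. The Python sketch below uses four invented file sizes to show how much space goes unused when every file must occupy whole blocks; it does not model how ReiserFS actually packs tails.

```python
def allocated_bytes(file_size, block_size):
    """Space consumed when a file must occupy a whole number of blocks."""
    blocks = -(-file_size // block_size)   # integer ceiling division
    return blocks * block_size

sizes = [100, 700, 4096, 5000]             # invented file sizes in bytes

for block_size in (1024, 4096):
    used = sum(allocated_bytes(size, block_size) for size in sizes)
    wasted = used - sum(sizes)
    print(block_size, used, wasted)
# 1024 11264 1368
# 4096 20480 10584
```

The smaller the files relative to the block size, the worse the waste gets, which is exactly the case tail packing goes after.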
As for journaling, ReiserFS uses logical journaling. Unlike
Ext3, it does not ensure that data is consistent with metadata.
This can potentially create a security risk since (although
rare) recently modified files could contain portions of
previously deleted files.
Pros: fast as hell, stable as of 2.4.18
Cons: not as reliable as Ext3 (in reference to data integrity)
2.4. SGI’s XFS
And finally there is XFS. Written by Silicon Graphics Inc. in
the early 90s, this filesystem was based on the philosophy to
"think big." Accordingly, XFS is the fastest of the 3 journaling
filesystems discussed when dealing with large files. Its speed
comes very close to that of ReiserFS when handling medium to
small files, provided certain optimizing parameters are passed
during the creation and mounting of the filesystem.
XFS also likes to cache data a lot, eliminating the
unnecessary disk activity that ails Ext3.
The really cool characteristic of XFS lies in what SGI refers
to as allocation groups. The block device is split into 8 or
more sections (allocation groups) depending on the size of your
partition, each allocation group acting like its own filesystem
with its own inodes. This allows multiple threads and processes
to work on the filesystem in parallel! Since XFS was designed
for high-end hardware, couple XFS with high-end hardware and
you’ll get really nice speed.
XFS fully utilizes B+trees, because of the incredible speed
and scalability advantages associated with them. In fact, XFS
uses 2 B+trees for each allocation group to track free space:
one contains the free extents ordered by size, and the other
contains the same extents ordered by their starting physical
location. XFS is great at maximizing write performance because
of its ability to locate free space quickly and efficiently.
XFS also uses B+trees to keep track of all the inodes on the
disk, which are allocated dynamically as in ReiserFS, only in
groups of 64.
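Here is a sketch of why two orderings of the same free extents are handy. In this Python example, sorted lists with binary search stand in for the two B+trees; the extents and function names are invented, and this is not how XFS implements its allocator.

```python
from bisect import bisect_left

# Free extents as (start_block, length_in_blocks); the values are invented.
free_extents = [(0, 8), (40, 100), (200, 16), (512, 64)]

by_start = sorted(free_extents)                                      # ordered by location
by_size = sorted((length, start) for start, length in free_extents)  # ordered by size

def best_fit(length):
    """Smallest free extent with at least `length` blocks (by-size index)."""
    i = bisect_left(by_size, (length, -1))
    if i == len(by_size):
        return None
    size, start = by_size[i]
    return (start, size)

def nearest_after(block):
    """First free extent starting at or after `block` (by-start index)."""
    i = bisect_left(by_start, (block, -1))
    return by_start[i] if i < len(by_start) else None

print(best_fit(10))        # (200, 16)
print(nearest_after(41))   # (200, 16)
```

One index answers "where is a hole big enough?" and the other answers "where is free space near this block?", each without scanning the whole list.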
One cool thing worth mentioning is XFS allows the journal to
exist on another block device, which improves speed even
more!
Unlike ReiserFS, when an XFS filesystem recovers from a
crash, it writes nulls (0s) to any unwritten blocks. This fixes
the security issue known to plague ReiserFS (although it isn’t
that frequent or significant).
Another feature of XFS (which is unique to XFS) is delayed
allocation. Instead of writing to the disk immediately, it waits
and saves the data to RAM. Basically, it waits so that it can
optimize the number of actual IO operations it will have to
make. This not only improves speed, but also allows data to be
written contiguously (reducing fragmentation). For instance, if
the data was going to be appended to a single file in the end,
XFS writes this file to one contiguous chunk, instead of that
file being here, there, and everywhere!
Also, delayed allocation can eliminate the need to write
volatile temporary files to disk at all, since they may be
deleted before XFS ever gets around to writing them.
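A minimal sketch of the delayed allocation idea in Python, assuming 4096-byte blocks. The DelayedFile class and its methods are hypothetical and have nothing to do with the real XFS code.

```python
class DelayedFile:
    """Buffer appends in RAM and choose disk blocks only at flush time."""
    def __init__(self):
        self.buffer = bytearray()

    def append(self, data):
        self.buffer += data                   # no disk blocks chosen yet

    def flush(self, next_free_block, block_size=4096):
        """Allocate one contiguous extent sized for everything buffered."""
        blocks = -(-len(self.buffer) // block_size)   # ceiling division
        extent = (next_free_block, blocks)            # (start block, length)
        self.buffer.clear()
        return extent

f = DelayedFile()
for _ in range(10):
    f.append(b"x" * 1000)                     # ten small appends
print(f.flush(next_free_block=2048))          # (2048, 3)
```

Ten appends end up as a single three-block extent. Allocating on every append could have scattered them wherever free blocks happened to be at the time.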
Procrastination pays off! See, that’s exactly why I
procrastinate with my assignments – so I can wait until I can do
all of the assignment in one chunk of time! Maybe it’s a good
habit after all!
Pros: fast, not much disk activity, more secure than ReiserFS (somewhat), smart, scalable
Cons: Slow when deleting files (should be fixed soon via patches), not as reliable as Ext3, Gentoo is starting not to like it that much – they recommend Ext3 or ReiserFS
# mke2fs /dev/hda1
Ext2
# mke2fs -j /dev/hda1
Ext3 (Ext2 w/ journaling)
# mkfs.xfs /dev/hda1
XFS
Options:
- number of allocation groups it creates - default is 1 every 4gb
  (36gb = 9 AGs)
- 32mb is a good size
# mkreiserfs /dev/hda3
ReiserFS
If upgrading ReiserFS to XFS, zero out the partition first.
*Make sure your kernel supports the filesystem(s) you have chosen to
use!
Try to use Ext2 or Ext3 for the boot partition. If you use
ReiserFS, you must mount it with the ‘-o notail’ option, which
disables tail packing.
3. Conclusion
Ext2 = Standard FS
Ext3 = Rugged Journaling FS
ReiserFS = Speedy Journaling FS
XFS = Quick and smart, but Gentoo believes it to be flaky ("fry lots of data" – hmmm)