Linux file system benchmarks

I have a penchant for comparative analysis of file systems. It probably stems from my job and the desire to see a clear winner emerge from the morass of file system choices. I have no illusions that benchmarks are anything more than a game, really. However, in order to divine the state of the art, I recently took matters into my own hands and conducted some benchmarks. I had a spare 2.0GHz dual-core Dell 510s workstation equipped with a middling-quality Western Digital SATA drive (WD7500AAKS-0). Although I have access to much more exotic and high-performance gear at my employer’s lab, I prefer to use something a little more down to Earth for these tests. The point, really, is that all the tests were done on the same partition on the same system with the same kernel (2.6.29). The OS was Debian 5.0 x86 (32-bit). The name of the system is “vmhost” but don’t let that confuse you – it’s a physical system.

Four types of tests were conducted:

  1. A simple “dd” style write test of 4GB of data from /dev/zero. I’m sorry, but yes, this type of test still matters. Synchronous writes are still a huge performance barrier for servers and workstations alike in many situations.
  2. A set of tests using a large tarball containing 25,438 directories and 104,761 small (mostly 1k-4k) files (the pkgsrc-2009Q1 tarball).
  3. A test using the synthetic multi-threaded benchmark “dbench” usually used for benchmarking Samba servers.
  4. Good old bonnie++
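
For anyone who wants to poke at the first kind of test without dd, here is a minimal Python sketch of the same idea: stream blocks of zeros into a file, force them to disk, and time the whole thing. The function name `zero_write_test`, the 16 MiB default, and the temporary target path are placeholders of mine, not the command or sizes behind the numbers in this post.

```python
import os
import tempfile
import time

MIB = 1024 * 1024


def zero_write_test(path, total_mib=16, block_mib=1):
    """Write total_mib MiB of zeros in block_mib-sized chunks, then fsync.

    Returns (bytes_written, elapsed_seconds). Hypothetical helper for
    illustration only; sizes are assumptions, not the post's settings.
    """
    block = bytes(block_mib * MIB)  # a buffer full of zero bytes
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mib // block_mib):
            written += f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the disk before stopping the clock
    elapsed = time.perf_counter() - start
    return written, elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        n, secs = zero_write_test(os.path.join(d, "zeros.bin"))
        print(f"{n / MIB:.0f} MiB in {secs:.3f}s ({n / MIB / secs:.1f} MiB/s)")
```

Point the path at the file system under test and raise `total_mib` well past the amount of RAM in the box, otherwise you are mostly measuring the page cache.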

The tests were conducted under the following conditions:

  1. No special mount options. I.e., how will it work for the average Joe? Tuning for benchmarks is a lame trick I will leave to the vendors, since few users or even admins tune their FS mount options other than the occasional notail or noatime options. Okay, so it’s not such a lame trick if you are a sysadmin and have a clue. However, it makes benchmarks quickly descend into “you didn’t tune it properly” territory. This is a blog entry, not a Gartner whitepaper for Microsoft’s latest feature.
  2. Each test was repeated three times. I didn’t take the time to average the results unless they differed by more than 1%.
  3. The same Debian 5.0 x86 32-bit system was used each time. Perhaps using a 64-bit machine would have radically altered the results. I seriously doubt it.
  4. All tests were performed on the same partition. No short stroking tricks were used.
  5. The kernel was a stock 2.6.29 kernel.org kernel with the reiser4 patch applied (it’s not in the kernel tree yet).

Granted, my tests are not as extensive as the ones done by the Phoronix test suite. However, I was disappointed that they did not test reiserfs version 4 (reiser4), since I’ve been curious about it ever since the row over “sabotage” a couple of years ago. Was it all a bunch of 9/11 truth-style BS? Was there really something to all those source-analysis fingers pointing at Andrew Morton and supposed sabotage and counter fixes? How would my personal favorite Linux file system, XFS, stand up to some empiricism? I’m one of those people who just can’t take the word of others at face value very often. I want to see the evidence myself. I want an honest test.

Well, without further ado, let me present my result summary and I’ll follow up with a post of the raw data.

Dbench (dbench -t 60 4)

Dbench with 4 clients

Well, the results here were unexpected to me. Dbench uses a file from which it replays a lot of transactions from simulated file server clients. It’s generally used as a stand-in to test file server performance for Linux CIFS servers running Samba. It’s a way to understand how your server will perform before you ever start tuning Samba. It does not actually use smbd or nmbd, but rather posits infinite efficiency for the sake of understanding your hardware and software rig before you even mess with Samba. In this case, I let it run for one minute and simulated 4 clients. I tried 10-minute runs but the results were so close, it just wasn’t worth the extra time.

My old reliable XFS file system (I’m a former SGI admin and enthusiast) didn’t do so hot. It wasn’t the worst, but it got its doors blown off by JFS and Reiserfs 3.6. Just to demonstrate my resistance to bias, I have to disclose I’m an IBM AIX hater from way back. I used to work at IBM for about four years some time ago. I was forced to admin AIX systems which I now hate with a passion (though their LVM was nice). I’m irritated by the fact that JFS did so well. See what baggage technology biases are? What is strange to me is how well Reiser 3.6 does when compared to the extremely poor Reiser4 result. I have no good explanation or even a theory for this.

Simple “dd” Write Test

1Mb blocks @ 4.2G

It would seem that the disk’s performance is the largest constraint on this test. Nevertheless, reiser 3.6 and 4.0 are weaker than the rest of the pack. The casual observer might wonder why reiser4 with gzip compression has such disproportionately higher results. The enigma isn’t too tough to solve when you consider the source of the data is /dev/zero. A stream of zeros should be eminently compressible and it would seem reiser4’s algorithms decide that it’d be better to compress it first before it flushes the data to the disk. Thus, a lot less data is actually written, but the job gets done. Your decision to peremptorily write this off as a cheap trick or laud it as a useful feature is probably going to come from the type of data and access patterns you are accustomed to. If you are looking for a file system to store the hard disk images for your virtual machines and you don’t use sparse files or qcow2 files then Reiser4 might be an awesome choice. It could save you significantly on your storage allocation outlay. If you are storing a bunch of multi-gigabyte MP4 movie files, then you might find Reiserfs of any version to be a real dog. Like most questions of performance, it’s relative to your application and data.
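
To see just how lopsided a /dev/zero test is for any compressing file system, here is a small sketch that runs a megabyte of zeros and a megabyte of random bytes through zlib. This only illustrates the property of the data; it is not Reiser4’s code path, and the random buffer is merely my stand-in for already-compressed content such as movie files.

```python
import os
import zlib

MIB = 1024 * 1024

zeros = bytes(MIB)        # the kind of data /dev/zero hands out
noise = os.urandom(MIB)   # stand-in (assumption) for already-compressed data

packed_zeros = zlib.compress(zeros)
packed_noise = zlib.compress(noise)

print(f"zeros: {len(zeros)} -> {len(packed_zeros)} bytes")
print(f"noise: {len(noise)} -> {len(packed_noise)} bytes")
```

The zeros shrink to a tiny fraction of their original size, while the random data does not shrink at all. A benchmark fed from /dev/zero therefore shows the best case for compression, not the typical one.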

Massive Tarball Unpack (smaller is better)

tarball unpack with 25438 directories and 104761 files.

Sheesh. What happened to XFS!? Well, I suspect that its default is synchronous writing à la UFS without logging or something similar. It doesn’t seem to show much ability with default mount options to make use of any write caching. In my experience that makes for a safer file system when power outages and non-battery-backed RAID controllers are concerned. However, who would be dumb enough to use a server in such a state? Wait, don’t answer that…

JFS makes a pretty poor showing here, as well. Btrfs must also be working out the steel with the devil, since it’s running pretty far in the rear. I really suspect that tuning mount options (especially turning on writeback for ext4) might make a huge difference in this test.

From a purely quantitative standpoint, this is probably a terrible test. However, remember that I am testing the _default_ configuration here. No special mount options were employed at all. That’s how the majority of folks are going to deploy a file system. Developers and distribution maintainers should take special care with the default cases. I think Microsoft’s example of bundling IE and the subsequent effect it’s had on the browser market is a good example of the “bad default effect”.

NOTE: Incidentally, while testing Reiser4 I experienced a hang when I tried to run “sync”. Killing the process was successful and a test with “md5sum” showed there was no damage, but it was something that made me a bit nervous. I tried another “sync” afterwards and it was fine.

Inode cost – Smaller is better

Space occupied by Pkgsrc unpack

It seems that when you add up the cost of many thousands of directories and small files those block sizes can add up quickly. The moral of this story is that those who need to store a lot of small files such as Oracle Applications admins, NNTP operators, or maildir users should make an informed decision about their file system. Reiser4 shines here, especially with compression turned on. I’m not sure if I’ll ever trust a compressed file system since my puppy trauma with Stacker back in the DOS 6.22 days. However, this is a test of performance, not reliability. I’ve got separate plans for testing reliability and recoverability.
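
The arithmetic behind “those block sizes can add up quickly” is easy to sketch. The snippet below assumes a file system that hands out whole 4096-byte blocks and does no tail packing or compression; it ignores inodes and directory entries entirely. The file count comes from the tarball, but the uniform 1 KiB file size is a made-up simplification, as is the helper name `allocated_bytes`.

```python
def allocated_bytes(file_sizes, block_size=4096):
    """Space consumed when every file occupies a whole number of blocks."""
    total = 0
    for size in file_sizes:
        blocks = -(-size // block_size)  # ceiling division
        total += blocks * block_size
    return total


# File count from the pkgsrc tarball; the 1 KiB size per file is hypothetical.
sizes = [1024] * 104_761
logical = sum(sizes)
on_disk = allocated_bytes(sizes)

print(f"logical:   {logical / 2**20:.1f} MiB")
print(f"allocated: {on_disk / 2**20:.1f} MiB")
print(f"overhead:  {on_disk / logical:.1f}x")
```

Under those assumptions the data takes four times its logical size on disk, which is the kind of gap that packing small files together or compressing them can close.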

Delete time

Delete time with 25438 directories and 104761 files

Aww, damn. XFS gets killed on this one, too. I guess I’m going to have to give up my bias about it being the best performer out there. I’m consoled only by the fact that both my former employers (IBM and Oracle) take a big Ike Turner palm slap on this test and the unpack/create test. Reiserfs variants clearly deserve the accolades here. I’m surprised that even deleting the files off the compressed Reiser4 file system has snappy results.

After doing a lot of reading I have to make another comment regarding the Wikipedia page for Reiser4, where a KernelTrap article is referenced saying that Ted Ts’o suggests fans of Reiser4 should jump on the Btrfs bandwagon as an alternative to Reiser4 following the conviction of Hans Reiser for first-degree murder. Okay, Ted, thanks for the suggestion. However, it’s clear that Btrfs can’t keep up with Reiser4 in some ways, yet. Time will tell. I suspect Btrfs, being the darling of all the most favored Linux developers, will probably outpace Reiser4 with time. That time has not yet come, but Btrfs has many features which Reiser4 would refer you to LVM or mdadm to attain. Again, it’s all about application and data set.

Bonnie++

Bonnie gives a lot of data to sift through, and a lot of it is very nearly worthless. Who in their right mind would use putc() to write out any significant amount of data at speed? There are certainly applications that make use of this function, but anyone who expects it to be high performance is smoking something expensive. Even the Bonnie run output differentiates writing with putc() from “Writing intelligently” with blocks of data and write() and fwrite(). I’ve never seen any significant performance difference between the POSIX and ANSI open() and fopen() semantics. There is a helluva lot of difference with per-character I/O, though. I’d assert that the sequential block numbers are going to give you the results that will matter for 90% of the applications out there.

Keep in mind that Bonnie is not multi-threaded in any fashion (no fork() or pthreads are used). It’s a pretty simplistic benchmarking approach, but one that’s still not unlike a lot of code out there that doesn’t use threads or fork()’d child workers either. You can run multiple bonnie++ instances at once, but then the onus is on you to collect and average the data. It’s a lot of work considering something like dbench does it for you.

I’m going to select a few benchmarks here that I believe to be the most significant. I’ll say at the outset that my data here is not meant to be fair, it’s meant to be germane to everyday usage in desktops and servers that I have experience with. In general, read speeds on any modern hard disk are pretty good. If you look at the way media is marketed, it’s toward the constraining factor of write speed. I’m going to brave the inevitable flames and assert that it’s write speeds that matter most in a world where read rates are MUCH faster and rarely constrain all but the most demanding applications (Oracle comes to mind). Let’s choose the most common and intelligent writing condition, the sequential block write.

Sequential Write Speeds

Results look pretty similar to the “dd” test results, you say? That’s because it’s pretty much the same I/O pattern and the data is highly compressible.

So, let’s examine another case: the re-write. You have a large database file or virtual machine hard disk image and you want to seek to the middle and re-write a bunch of blocks. This is certainly not an uncommon scenario. Bonnie can be used to test the filesystem cache and the well-localized speed of the implementation to transfer data to the disk without the need of any space allocation. So, here goes:
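
For the curious, the access pattern described above looks roughly like this: seek into an existing file, read a block, dirty it, seek back, and write it over the old copy so that no new space has to be allocated. This is a sketch of the pattern only, not bonnie++’s actual code; the helper name `rewrite_middle` and the 8192-byte block size are my own choices.

```python
import os
import tempfile

BLOCK = 8192  # chunk size chosen for illustration


def rewrite_middle(path, offset, count, block=BLOCK):
    """Read `count` blocks starting at `offset`, dirty each, and write it back in place."""
    with open(path, "r+b") as f:
        for i in range(count):
            pos = offset + i * block
            f.seek(pos)
            chunk = bytearray(f.read(block))
            if not chunk:
                break                # ran off the end of the file
            chunk[0] ^= 0xFF         # dirty the block
            f.seek(pos)              # go back to where the block came from
            f.write(chunk)           # overwrite in place; the file does not grow
        f.flush()
        os.fsync(f.fileno())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        image = os.path.join(d, "disk.img")
        with open(image, "wb") as f:
            f.write(bytes(10 * BLOCK))           # a small pretend disk image
        rewrite_middle(image, offset=4 * BLOCK, count=2)
        print(os.path.getsize(image) == 10 * BLOCK)
```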

Block re-writes

I can only theorize that the lseek() distance is much shorter in practice for the Reiser4 compressed file system. However, I suspect the answer is a lot more complex for why all the competitors are swept from the field in this test. Whatever the reason, the results are impressive. My poor XFS. Damn.

In case you are wondering if I cherry-picked only the results which favored Reiser4, the answer is “not really”. Reiser4 pretty much smoked the others in the bonnie++ results. It lost by a narrow margin on the getc() read test, which is arguably the least real-world of any of the tests. The result for random seeks was probably a cache-hit anomaly, since reiser4 was about 20 times faster than any other file system in that test.

Conclusion

Well, I’m sad that XFS was such a poor performer in all my tests. I’ll have to experiment with ways it can be tuned to see if it’s salvageable for my purposes. I’ll also want to set up OpenSolaris and/or FreeBSD to test ZFS in the same fashion as the others shown here. Being a NetBSD user from way back, I also want to see how the standard UFS2 implementation can stand up. DragonFly BSD also has a shiny new file system called HAMMER and Linux has a few exotics on the horizon like Tux3. Perhaps I’ll trot out results from those as soon as I get a chance to test them.

In these tests, however, I have to give ReiserFS and Btrfs the crown. They both performed very well in nearly all the tests with a few warts here and there. I’m also now convinced that for a lot of folks with special applications Reiser4 is a great choice with unparalleled performance. The whole written-by-a-felon thing doesn’t really faze me, since I can separate the technology from the person easily enough. To make it about Hans, though the FS bears his name, is unfair to the others that continue development. I’m not sure if I believe the conspiracy theories about Morton’s intentional sabotage. The diffs out there clearly show he was responsible for some corruption-causing coding errors. However, to say they were intentional is speculation until proven otherwise. The technical facts are not altered by the politics. The file system delivers results.

My next endeavor is to stress test, disconnect, power down, and otherwise batter the contestants here to determine the extent of their reliability and recoverability. Stay tuned for that…

~ by aliver on June 20, 2009.
