NetBSD and RAIDframe
(written and maintained by Greg Oster <oster@cs.usask.ca>)
RAIDframe is a
framework for rapid prototyping of RAID structures. RAIDframe was
developed by the folks at the Parallel Data Laboratory at
Carnegie Mellon University. RAIDframe, as distributed by CMU,
provides a RAID simulator for a number of different architectures, and
a user-level device driver and a kernel device driver for Digital
Unix.
NetBSD
now supports RAIDframe as a kernel-level device driver. The
only major changes from the original RAIDframe code have been the
addition of a NetBSD kernel interface to the driver, and the fixing of
a few bugs. The RAIDframe functionality is largely the same as in
the original RAIDframe 1.1 distribution from CMU,
with the addition of component labels, hot adding of components, and a
few other things (see below for details).
Some of the features of RAIDframe in NetBSD:
- Support for root filesystems on any RAID level (see the raidctl
example just below this list).
- RAIDframe handles a large number of different RAID levels and
configuration options including RAID 0, 1, 4, 5, 6, hot spares, parity
logging, and a number of other goodies. At this point, unfortunately,
only a subset of these have been extensively tested in a NetBSD
environment. Some of the options, like RAID 6 and parity logging, are
still in a highly developmental stage, and are not suitable for even
experimental use. Most RAID levels (RAID 0, 1, and 5), however, are
well-tested, and are being used in production settings.
-
Independence from lower-level devices. The current driver has been
tested with vnd, IDE, SCSI, and even other RAID devices as the
underlying components. Because the driver uses the VNODE interface to
access its components it gains a great deal of flexibility in this
area.
-
The driver does not have any restrictions on how RAID devices may be
combined. This allows, for example, a number of RAID 5 sets to be
striped using a RAID 0.
- Other functionality:
  - on-demand failing of disks (in-place, or to spares)
  - on-demand parity regeneration
  - on-demand data/parity reconstruction and copyback
  - hot-adding of spare disks
- Currently unexplored areas:
  - parity logging
  - roll-backward transaction commits
  - accounting
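As an example of the root-filesystem support mentioned above, marking
an existing RAID set as auto-configurable and eligible to contain the
root filesystem is done with raidctl (the device name here is
illustrative):
raidctl -A root raid0
On the next reboot the kernel will auto-configure the set and consider
it for use as the root device.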
All RAID levels (with the exceptions of 4 and 6) have received a lot
of testing.
Work on documentation is still in progress. Comments on current
documentation are welcome.
Recent Changes
The most recent changes to the RAIDframe driver in NetBSD, and other
tidbits about the RAIDframe development I've been doing, include:
-
October 31, 2007 - With the help of jnemeth, implemented support for
drvctl in RAIDframe.
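(As a sketch of what this enables: assuming the drvctl support covers
the usual detach operation of drvctl(8), something like
drvctl -d raid0
should now work on a configured set; raid0 is illustrative.)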
-
October 4, 2007 - Believe it or not, every once in a while something
new happens in RAIDframe. Today saw the addition of the ability to
dump kernel cores to RAID 1 sets.
-
February 5, 2005 - Improved the error handling in the case of a
read/write error that occurs during a reconstruction. We go from zero
error handling (and a likely panic if something goes amiss) to
gracefully bailing out and leaving the system in the best usable state
possible.
-
November 16, 2004 - If a read/write fails to a component, and that
failure would make the RAID set completely dead, then don't
immediately mark the component as failed. Instead, retry the IO some
number of times, and only return EIO if the IO still fails.
-
June 27, 2004 - Added a bunch of "emergency IO buffers" to be used
if/when the system runs out of kernel memory. Makes the RAID bits far
more robust in the event of memory resource shortages.
-
May 22, 2004 - Add support for the word "absent" in the "disks" section
of RAID config files. "absent" can be used as a placeholder for a
component that will eventually be added to the set. This makes
configuring RAID sets with missing (or "absent" :) ) disks much easier.
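A minimal sketch of a "disks" section using the placeholder (device
names are illustrative):
START disks
/dev/sd0e
absent
/dev/sd2e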
-
April 9, 2004 - After effectively removing all mallocs from the write
paths in RAIDframe, it is now possible to swap to RAID 5. These
changes make the RAID sets much more robust to low-memory conditions.
-
March 21, 2004 - Clobbered a really nasty bug that would occasionally
rear its head when doing a 'raidctl -f' on a RAID set when IOs were in
flight. This bug has been around for a couple of years (or more!),
but only started really biting me when I was doing some extended
testing of the recent changes.
-
March 20, 2004 - It appears that RAIDframe was removed from FreeBSD a
few days ago. It doesn't work with their GEOM framework, and it
appears that no one stepped forward to convert it.
-
March 19, 2004 - This list hasn't been updated in a while, but I've
actually been hacking on RAIDframe. Of the most interesting fixes,
the I/O code paths are now nearly malloc-free. Most of the mallocs
have been converted to using pooled structures, and the few remaining
mallocs now have a set of "emergency buffers" to use in the event of
low-memory situations.
-
January 2, 2004 - Checked in a fix for the "Failed to create a DAG"
problem. Now the system won't (shouldn't!) panic on a component
failure on a RAID 0 set, or a multi-component failure on a RAID 5
set. Getting this one fixed properly was about as nasty as I figured
it would be, which is why I ignored fixing it for so long.
-
December 31, 2003 - Fix a whole bunch more little things. A lot of
dead code has been removed, and the i386 kernel size has dropped by
14K or so.
-
December 28, 2003 - Check-in a bunch of memory allocation changes.
First kick at removing malloc() from any of the critical paths.
-
December 28, 2003 - Remove most of the "rows" stuff from the
internals. The code was only pretending to support it, and having it
removed makes things easier to read.
-
November 13, 2003 - Happy Anniversary! It's been 5 years since
RAIDframe was officially incorporated into the NetBSD sources!
-
April 12, 2003 - Fixed a bug where components that are not suitable
for use as hot spares were not being closed.
-
November 17, 2002 - More bug fixing. Fortunately the problems being
fixed don't occur with "normal operation" of the RAID set.
-
November 14, 2002 - Fixed a couple of obscure bugs. Some of them are
really just making sure the user doesn't do something stupid. Others
just reduce the risk of lossage in the event of an inopportune reboot
of a machine (e.g. after a reconstruct). Amazingly enough, after
these fixes, there was actually a net reduction in the number of items
on the todo list!
-
November 12, 2002 - Fixed a little bug that was causing nasty problems
with reconstructs. A little more testing to do, but RAIDframe and SMP
should now play nicely together. (They've been working well together
for some time, but a bug that's been there since at least the
RAIDframe 1.1 release was keeping me from declaring things "fixed".)
-
October 11, 2002 - Have 'poolified' quite a few structures. Things
are closer to working with LOCKDEBUG.
-
October 4, 2002 - RAIDframe and LOCKDEBUG don't play well together, so
we need to get that fixed ASAP. This is important for SMP.
Introduced a new rf_RaidIOThread() to take care of calling
rf_DiskIOComplete and the CompleteFunc() for each completed request.
This should result in a more responsive system, as much less work is
being done at splbio().
-
September 23, 2002 - Cleaned up a whole bunch of unused code/variables.
-
September 8, 2002 - When failing a component, close the component.
This lets us hot-swap out the component, or do other things with the
closed component. Also fixed was the issue of being able to initiate a
'reconstruct-in-place' on an already failed (and reconstructed to a
hot-spare) component.
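(For reference, a 'reconstruct-in-place' is initiated with raidctl,
with illustrative component and device names:
raidctl -R /dev/sd2e raid0
which fails /dev/sd2e and immediately reconstructs the data back onto
it in place.)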
-
August 7, 2002 - Checked in a bunch of locking fixes. These take care
of 95% of the LOCKDEBUG problems that I've seen. There are probably
more fixes needed yet, but these changes go a long way to making
things happier for SMP.
-
August 2, 2002 - Ok, so this list hasn't been kept as up-to-date as
one might have liked. But at least I've been hacking on RAID code.
Mostly doing little fixes and nuking unused bits of code.
-
June 12, 2002 - Wow... hasn't been much to report in a long time.
And still nothing more to report today.
-
January 2, 2002 - Finally getting back to more RAID stuff. RAIDframe
is now enabled by default for a number of different GENERIC kernels in
NetBSD. At the same time, a number of little-used RAID types were
turned off, shrinking the RAIDframe-related parts of the kernel down
to a mere 230K (i386).
- October 25, 2001 - Christoph Kaegi wrote a really nice little article about
setting up mirrors with NetBSD-1.5.2 and RAIDframe.
-
October 4, 2001 - disentangled the header files of RAIDframe.
Two new headers (raidframevar.h and raidframeio.h) now contain the
stuff needed by both the kernel and by raidctl (and other userland
bits). "About time."
-
August 29, 2001 - There is a nice little article
by David Kwok appearing in BSD
Today. Gotta like good press :)
-
July 16, 2001 - Turns out that unconfiguring multi-level RAID sets at
shutdown time was a lot easier to do than expected. A half-dozen
lines of real code, and multi-level RAID sets now get unconfigured
the way one might expect them to at shutdown time. Thanks to Bill
Squier for testing.
-
July 10, 2001 - Luke Mewburn added a '-G' option to raidctl. This
option generates a configuration file for the RAID set. This is most
useful for autoconfigured RAID sets where you've somehow managed to
lose the original raid0.conf file. Thanks Luke!
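For example (device and file names are illustrative):
raidctl -G raid0 > /etc/raid0.conf
The output is in the same format used by 'raidctl -c'.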
-
June 14, 2001 - Fixed a long-standing silly condition where even after
a reconstruct you would need to rebuild the parity. On all but RAID
6, if you reconstruct, then your parity bits are known to be correct
afterwards too!
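(For reference, parity is checked and, if necessary, re-written with
raidctl -P raid0
where raid0 is illustrative.)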
-
June 8, 2001 - More press on Daemon News with the story of RAIDframe
for FreeBSD.
-
June 7, 2001 - RAIDframe as a kernel driver has been ported to FreeBSD by Scott Long. More info
can be found here.
This is yet another testament to the portability of the original CMU code.
-
May 24, 2001 - Sheesh.. nothing much got done in April, and May has
been about the same. At least I've had time to look at the RAIDframe
code this past week. Now that I appear to have the kinks worked out
of my main box, maybe I can start doing some serious development again.
-
March 9, 2001 - Wow! Have I ever been slacking off on RAID stuff :-/
And to make matters worse, I haven't been getting much of anything else
done either... At least today I have some excellent performance
numbers to report:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
2000 24566 46.0 33565 23.8 6515 7.8 26492 86.3 43542 41.1 96.9 1.8
These numbers are from a 1GHz AMD Thunderbird on an ABIT KT7A-Raid
motherboard, using three IBM DLTA-307045 45GB UltraDMA/100 drives (two
drives on UDMA100, one on UDMA66, until our driver does UDMA100 on
that channel too :) ). The three drives are pulled into a RAID 5 set
with a stripe width of 32. The frag/block settings are 4096/32768.
I'd like to use smaller values for these, but a) it's *much* slower
(ok, 15MB/sec writes and 38MB/sec reads) and b) it takes forever to
newfs the filesystem. (w/ the above parameters, newfs'ing the 87GB
filesystem takes about a minute.) I played with a bunch of
parameters, but this seems to be about the best I can get, for now.
The filesystem looks like:
/dev/raid3e 89270792 4 84807248 0% 1 5631997 0% /u3
The other difficulty here is the limited number of files I can have on
this set... The 45GB set I had before had some 10 million inodes.
But getting just that many on this set means I have to chop the block
size to 16384, and that eats into my performance too much. So I guess
I just put "big files" (larger than 32K) on this filesystem.
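For anyone wanting to reproduce a setup like this, a partial sketch of
the corresponding config file (device names are illustrative, and the
other sections are omitted) might look like:
START disks
/dev/wd0e
/dev/wd1e
/dev/wd2e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
with the filesystem created by something like:
newfs -b 32768 -f 4096 /dev/rraid3e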
-
Nov. 13, 2000 - It's the 2nd anniversary of RAIDframe being an
official part of NetBSD!
-
Nov. 13, 2000 - after I spent a while trying to figure out what was
wrong, Chuck Silvers pointed out a code ordering problem in
src/sys/uvm/uvm_swap.c which was preventing "swap on RAID" from being
unconfigured correctly. With this code now fixed, RAID sets that have
swap on them can now be unconfigured, and won't be dirty after a
reboot.
-
Nov. 2, 2000 - Matt Thomas just added support for booting off a RAID 1
set on the VAX architecture. Thanks Matt! :)
-
Oct. 30, 2000 - Luke Mewburn just added support for booting off RAID 1
sets for i386. It's not quite complete yet (needs some changes to
installboot to get the /boot stuff to work correctly) but it's mostly
there. A big "Thank You" to Luke for doing this.
-
Oct. 26, 2000 - fixed up the raidctl.8 man-page a fair bit. Added a
new section on Performance Tuning.
-
Oct. 19, 2000 - fixed up the disk_busy()/disk_unbusy() calls so that
the IO statistics are now more sane. Also fixed a few problems
related to accidentally touching the labels of failed disks in certain
(very obscure) cases.
-
Sept. 26, 2000 - Simon Burge was at it again, and committed changes to
the alpha bootblocks, allowing the bootblocks to pull a kernel
directly from a RAID1 set. Again, thanks to Simon for getting this
done. (two ports down, many more to go :) )
-
Sept. 13, 2000 - I've been asked a couple of times "How stable is
RAIDframe, and is it good enough for / on RAID?" Well, my main
machine at home has /, swap, and /u2 each on their own RAID 1 set, and
the machine has been up for the past 65 days without a hitch or hiccup
(NetBSD-1.5_ALPHA on i386). And yes, this machine gets used
lots :)
-
Sept. 13, 2000 - Simon Burge just committed changes to the pmax
bootblocks, allowing the bootblocks to pull a kernel
directly from a RAID1 set. Thanks to Simon for doing all the hard
work on this one :) The pmax port is the first to have this feature.
It is hoped that the other ports can have this feature soon too.
-
Sept. 12, 2000 - Yes, I'm still working on RAIDframe :) I've spent the
last week trying to figure out why a copyback seems to be completely
locking up my system. It turns out that the copyback does not
support accessing the RAID set while the copyback is in progress.
This is not that great, and really needs to be addressed at some
point. :(
I've also fixed a bunch of locking issues in the RAIDframe code -- a
few ltsleeps() have been added, and now the kernel doesn't complain
much when running with LOCKDEBUG turned on.
-
August 23, 2000 - More press for RAIDframe! :) BSD Today has an article
about RAIDframe. Thanks to Peter Clark for writing it!
-
August 19, 2000 - Yes, I'm still doing RAIDframe stuff :) Checked in
a couple of bug fixes and have been trying to find time to get a few
other things wrapped up before NetBSD 1.5 is cut.
-
June 3, 2000 - Received an email from a RAIDframe user, pointing me at
this web-page where he looks
at some of the performance implications of various stripe sizes. I
want to do something similar (and have a paper about this in the
works) but I've been too busy with other things :(
-
May 28, 2000 - Spent the weekend fixing bugs. The most notable ones
are: 1) 'raidctl -u' no longer causes a panic if a parity re-write is
taking place (the re-write now gets aborted before the unconfigure)
and 2) the auto-configuration code is more aggressive, and will do more
to ensure that previously failed disks do not mess up the
auto-configuration process.
-
May 17, 2000 - I've had the opportunity to work on an Athlon 650 with
384MB RAM, 4 AdvanSys
ASB3940U2W-00 SCSI controllers and 24 Seagate ST150176LW 50GB
Ultra2Wide SCSI drives (6 per chain). The machine was configured with
four RAID 5 sets of 6 disks each (5 components + 1 spare), with the
disks for any particular RAID set being load balanced across all 4
controllers.
For a single RAID 5 set, the following performance was achieved with a
bit of filesystem and RAID parameter tweaking:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
100 13656 92.1 31653 18.8 4533 7.4 8853 97.8 61635 40.4 144.3 3.0
Unfortunately, I didn't get a chance to benchmark this set with a
larger test size. For the RAID 0 over 4 RAID 5's, the following
performance was observed:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
100 9939 71.0 13590 8.4 6725 11.0 8580 89.5 54228 40.6 346.7 5.7
Interestingly, performance gets better with a larger test size:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
2000 11818 84.2 21219 13.7 5858 11.3 8295 91.9 59475 44.0 120.6 3.2
The filesystems on this critter (which has 2 IDE drives set up w/ boot
partitions, / on a RAID 1, and swap on a different RAID 1) look like:
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/raid0a 2441256 973870 1345323 41% /
/dev/raid6e 778113184 72640 739134880 0% /build
Checking/fixing parity on a single RAID 5 set takes about 1.5 hours if
the parity is nearly completely correct. It can take up to 13 hours
if it has to rebuild all the parity. If a disk fails, it takes 45
minutes to get the data reconstructed onto the new disk. A newfs of
the above filesystem takes about 45 minutes. There are actually 7
different RAID sets on this machine (2 RAID 1, 4 RAID 5, and 1 RAID 0).
Special thanks to Dante Profeta for his dedicated work on making the
AdvanSys driver work so very well.
-
May 1, 2000 - Made the May edition of Daemon News with a RAIDframe
article. You can find the article here.
-
March 21, 2000 - flakey disk replaced, and now /, swap, and /u2 are
all running on (individual) RAID 1 sets.
-
March 15, 2000 - Now have my home box running with / on a RAID 1 set.
Discovered yesterday that one of the 18.2GB disks I got is flakey
(read errors!) so it'll be going back for a replacement. The up-side
to the disk having problems was that RAIDframe performed like a champ,
and the system didn't even notice the disk was 'failed'.
-
March 14, 2000 - ARG! Checked in a change that I neglected to check in
when I did the RAID_AUTOCONFIG stuff. Now / on RAID should really
work for everyone, not just for me :)
-
March 13, 2000 - Picked up an AdvanSys
ASB3940U2W-00 SCSI controller, and two 18.2GB Fujitsu U2W 7200RPM
(MAE3182) drives. The performance of this combination is pretty good:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
RAID 0 11819 87.5 24502 55.0 4676 16.5 10697 94.9 30217 89.4 89.8 5.7
RAID 1 10088 79.6 14948 32.7 4648 18.3 11051 95.3 19874 42.6 85.0 5.7
Re-syncing the mirrors happens at about 20MB/sec, so a re-sync of a
14GB set takes less than 15 minutes. I also upgraded my main box to
NetBSD -current, which means that I can play with all the new RAID
toys (and actually use the above card! :) )
Project History
In case anyone cares about how such a project might evolve, here is a brief outline of some of the
key points in the development effort.
FAQ
Ok, so some of these haven't been asked of me yet: I'm just doing a bit of
preventative maintenance... :-)
Q Where do I get it?
A RAIDframe is an integrated part of NetBSD, and is available in all
NetBSD releases since NetBSD 1.4. You'll note that it's only been
compiled on i386, pmax, sparc, sun3, hp300, and alpha boxes at this
point [May 24, 2001: This information is rather old - it's been
compiled on a lot more boxes than this.], so if you're having problems
with getting it going, PLEASE send me an email (oster@cs.usask.ca), and I'd be
more than happy to help out. (This is especially true if you are
attempting to get the RAIDframe kernel driver going on a platform I
haven't tested it on yet.)
Q Is it stable? Will I lose data? Can I trust it?
A Mostly. Not likely. Perhaps. :-) There is absolutely no
way that I can guarantee that you will not lose data from using this
software. But then, that holds true for just about anything. That
being said, my stress-testing has involved load averages of 64+ over
only 3 SCSI disks, and not a single byte has been lost in "normal
operation". I've also re-built the NetBSD world a couple of times off
a filesystem living on a RAID device, so in that sense at least it's
self-hosting. How well the driver fares under "component failure" is
less well tested, and is the subject of current testing. Update
(Feb. 14, 2000): My home machine (which had an uptime of 120+ days until
it suffered a 2-drive failure) has a RAID 5 setup on it that I use
every day for Real Stuff. I didn't even notice that the first drive
had failed :( Update (March 22, 2000): My /, swap, and /u2
partitions on my main home machine are now all on RAID 1.
Q Is NetBSD-1.3.x supported?
A No, and while a back-port should not take more than a
few hours to do, I have no intention nor inclination to do it.
Q What can I do to help?
A Play with the driver, and try to break it. If folks would
like to help identify more bugs for me, that would be great. Stories
of success, especially successful recoveries after a component dies,
would be most welcome.
Q Why are kernels with RAIDframe so huge?
A RAIDframe is not small. It typically requires at least
500K of extra space [May 24, 2000?: Gee.. this is old too... on i386, I
believe the additional space required is only about 320K these days.
Feb. 22, 2004: RAIDframe weighs in at 150K for i386].
This makes it tight to get it working on something like a Sun 3/50
(kernels must be less than 1MB in size), but if you're going to be
running this on a server, chances are an extra 500K isn't going to be
much of a problem. Doing RAID well and with as many options as
RAIDframe provides takes a lot of space. I am working on pruning out
unnecessary bits from the driver.
Q Can I use disks of different sizes?
A Your physical disks can be of different sizes, but each
component in the RAID set must be the same size. If you have, say, one
9GB SCSI disk, two 4.5GB SCSI disks, and two 10GB disks, you might do
something like the following:
RAID5#1 RAID5#2
9GB: /dev/sd0e /dev/sd0f
4.5GB: /dev/sd1e
4.5GB: /dev/sd2e
10GB: /dev/wd0e /dev/wd0f (1GB unused on /dev/wd0g)
10GB: /dev/wd1e /dev/wd1f (1GB unused on /dev/wd1g)
where each component in each RAID set is 4.5GB. You end up with two
RAID 5 sets, each with 13.5GB of usable storage.
(you could also pull /dev/wd0g and /dev/wd1g into a RAID 0 or 1).
If you wanted all of the partitions into a single RAID set, then
you might try to pull the above two sets together with a RAID 0.
For RAID5#1, you might use:
START disks
/dev/sd0e
/dev/sd1e
/dev/wd0e
/dev/wd1e
and (for RAID5#2) you might use:
START disks
/dev/wd1f
/dev/wd0f
/dev/sd2e
/dev/sd0f
Note that the drives are ordered here in an attempt to avoid having a
single IO go to two different partitions on the same disk. The
reality is you don't want to use more than one partition from any one
disk if you are concerned about performance.
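If you did want to pull the two RAID 5 sets together with a RAID 0 as
suggested above, the "disks" section of that third set's config file
would simply name the RAID devices themselves (assuming the two RAID 5
sets are raid0 and raid1, with 'e' partitions on each):
START disks
/dev/raid0e
/dev/raid1e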
Q I just added another component to my RAID set, but when I
do a newfs I haven't gained any space!
A You are probably still using the old disklabel, not one
that reflects the size of your new RAID set. Just to make sure you
get what you want, I recommend a:
dd if=/dev/zero of=/dev/rraid0d count=2
(that would be
dd if=/dev/zero of=/dev/rraid0c count=2
on non-i386 boxes) to make sure the old label is gone. Then simply
do a:
disklabel raid0 > /tmp/label
{edit /tmp/label and make it say what you want}
disklabel -R -r raid0 /tmp/label
This ensures that you have a fresh label that correctly reflects the
RAID configuration.
TODO
Here, in a "need to fix very soon" order, is a list of stuff on my
TODO list:
-
Fix a problem with auto-configured sets with missing components.
There is no way to "add" a component to the empty spot. (You can
hot-add a spare, and if you reboot, it'll show up as the (previously
missing) component, but that requires an extra reboot, which shouldn't
be needed.)
- Complain profusely and incessantly (in the nicest possible way)
when a component fails. This may be accomplished through the use of a
little user-land daemon which monitors the state of all RAID sets. (A
rough sketch of such a daemon follows this list.)
- N-way RAID 1s have problems if more than one of the components
fails. (Actually, N is limited to 2 right now :( so this is rather moot.)
- Talk more about component labels in the FAQ.
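A very rough sketch of what such a monitoring daemon might look like,
as a shell script (everything here, including the idea of grepping
'raidctl -s' output for failed components, is illustrative):
#!/bin/sh
# Hypothetical watcher: poll the set's status every five minutes and
# nag root if any component has been marked as failed.
while true; do
    if raidctl -s raid0 | grep -q failed; then
        echo "raid0 has a failed component" | mail -s "RAID trouble" root
    fi
    sleep 300
done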
Here, in no particular order, is a (very small, and very incomplete)
list of the stuff on my TODO list for this project:
- Look more closely at the Parity Logging code
- Test RAID6 (make that *fix* RAID 6, as the ECC code is apparently broken).
- Add more info to the FAQ, esp. stuff on using drives of various sizes...
- Automatic reconstruction when a component fails. (The goop to do
it is all there; it's just a matter of making it happen, and allowing
the user to specify the desired behaviour.)
- Clean up the code even more.
(Both of the above lists are now very much incomplete.)
Page last modified: October 31, 2007.
Send comments to oster@cs.usask.ca.