NetBSD and RAIDframe
(written and maintained by Greg Oster <oster@cs.usask.ca>)
RAIDframe is a
framework for rapid prototyping of RAID structures. RAIDframe was
developed by the folks at the Parallel Data Laboratory at
Carnegie Mellon University. RAIDframe, as distributed by CMU,
provides a RAID simulator for a number of different architectures, and
a user-level device driver and a kernel device driver for Digital
Unix.
NetBSD
now supports RAIDframe as a kernel-level device driver. The
only major changes from the original RAIDframe code have been the
addition of a NetBSD kernel interface to the driver, and the fixing of
a few bugs. The RAIDframe functionality is largely the same as in
the original RAIDframe 1.1 distribution from CMU,
with the addition of component labels, hot adding of components, and a
few other things (see below for details).
Some of the features of RAIDframe in NetBSD:
- Support for root filesystems on any RAID level (see the raidctl
example just below this list).
- RAIDframe handles a large number of different RAID levels and
configuration options including RAID 0, 1, 4, 5, 6, hot spares, parity
logging, and a number of other goodies. At this point, unfortunately,
only a subset of these have been extensively tested in a NetBSD
environment. Some of the options, like RAID 6 and parity logging, are
still in a highly developmental stage, and are not suitable for even
experimental use. Most RAID levels (RAID 0, 1, and 5), however, are
well-tested, and are being used in production settings.
-
Independence from lower-level devices. The current driver has been
tested with vnd, IDE, SCSI, and even other RAID devices as the
underlying components. Because the driver uses the VNODE interface to
access its components it gains a great deal of flexibility in this
area.
-
The driver does not have any restrictions on how RAID devices may be
combined. This allows, for example, a number of RAID 5 sets to be
striped using a RAID 0.
- Other functionality:
  - on-demand failing of disks (in-place, or to spares)
  - on-demand parity regeneration
  - on-demand data/parity reconstruction and copyback
  - hot-adding of spare disks
- Currently unexplored areas:
  - parity logging
  - roll-backward transaction commits
  - accounting
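As an example of the root-filesystem support mentioned above, marking
an existing RAID set as auto-configurable and eligible to contain the
root filesystem is done with raidctl (the device name here is
illustrative):
raidctl -A root raid0
On the next reboot the kernel will auto-configure the set and consider
it for use as the root device.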
All RAID levels (with the exceptions of 4 and 6) have received a lot
of testing.
Work on documentation is still in progress. Comments on current
documentation are welcome.
Recent Changes
The most recent changes to the RAIDframe driver in NetBSD, and other
tidbits about the RAIDframe development I've been doing, include:
-
October 31, 2007 - With the help of jnemeth, implemented support for
drvctl in RAIDframe.
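(As a sketch of what this enables: assuming the drvctl support covers
the usual detach operation of drvctl(8), something like
drvctl -d raid0
should now work on a configured set; raid0 is illustrative.)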
-
October 4, 2007 - Believe it or not, every once in a while something
new happens in RAIDframe. Today saw the addition of the ability to
dump kernel cores to RAID 1 sets.
-
February 5, 2005 - Improved the error handling in the case of a
read/write error that occurs during a reconstruction. We go from zero
error handling (and a likely panic if something goes amiss) to
gracefully bailing out and leaving the system in the best usable state
possible.
-
November 16, 2004 - If a read/write fails to a component, and that
failure would make the RAID set completely dead, then don't
immediately mark the component as failed. Instead, retry the IO some
number of times, and only return EIO if the IO still fails.
-
June 27, 2004 - Added a bunch of "emergency IO buffers" to be used
if/when the system runs out of kernel memory. Makes the RAID bits far
more robust in the event of memory resource shortages.
-
May 22, 2004 - Add support for the word "absent" in the "disks" section
of RAID config files. "absent" can be used as a placeholder for a
component that will eventually be added to the set. This makes
configuring RAID sets with missing (or "absent" :) ) disks much easier.
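A minimal sketch of a "disks" section using the placeholder (device
names are illustrative):
START disks
/dev/sd0e
absent
/dev/sd2e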
-
April 9, 2004 - After effectively removing all mallocs from the write
paths in RAIDframe, it is now possible to swap to RAID 5. These
changes make the RAID sets much more robust to low-memory conditions.
-
March 21, 2004 - Clobbered a really nasty bug that would occasionally
rear its head when doing a 'raidctl -f' on a RAID set when IOs were in
flight. This bug has been around for a couple of years (or more!),
but only started really biting me when I was doing some extended
testing of the recent changes.
-
March 20, 2004 - It appears that RAIDframe was removed from FreeBSD a
few days ago. It doesn't work with their GEOM framework, and it
appears that no one stepped forward to convert it.
-
March 19, 2004 - This list hasn't been updated in a while, but I've
actually been hacking on RAIDframe. Of the most interesting fixes,
the I/O code paths are now nearly malloc-free. Most of the mallocs
have been converted to using pooled structures, and the few remaining
mallocs now have a set of "emergency buffers" to use in the event of
low-memory situations.
-
January 2, 2004 - Checked in a fix for the "Failed to create a DAG"
problem. Now the system won't (shouldn't!) panic on a component
failure on a RAID 0 set, or a multi-component failure on a RAID 5
set. Getting this one fixed properly was about as nasty as I figured
it would be, which is why I ignored fixing it for so long.
-
December 31, 2003 - Fix a whole bunch more little things. A lot of
dead code has been removed, and the i386 kernel size has dropped by
14K or so.
-
December 28, 2003 - Check-in a bunch of memory allocation changes.
First kick at removing malloc() from any of the critical paths.
-
December 28, 2003 - Remove most of the "rows" stuff from the
internals. The code was only pretending to support it, and having it
removed makes things easier to read.
-
November 13, 2003 - Happy Anniversary! It's been 5 years since
RAIDframe was officially incorporated into the NetBSD sources!
-
April 12, 2003 - Fixed a bug where components that are not suitable
for use as hot spares were not being closed.
-
November 17, 2002 - More bug fixing. Fortunately the problems being
fixed don't occur with "normal operation" of the RAID set.
-
November 14, 2002 - Fixed a couple of obscure bugs. Some of them are
really just making sure the user doesn't do something stupid. Others
just reduce the risk of lossage in the event of an inopportune reboot
of a machine (e.g. after a reconstruct). Amazingly enough, after
these fixes, there was actually a net reduction in the number of items
on the todo list!
-
November 12, 2002 - Fixed a little bug that was causing nasty problems
with reconstructs. A little more testing to do, but RAIDframe and SMP
should now play nicely together. (They've been working well together
for some time, but a bug that's been there since at least the
RAIDframe 1.1 release was keeping me from declaring things "fixed".)
-
October 11, 2002 - Have 'poolified' quite a few structures. Things
are closer to working with LOCKDEBUG.
-
October 4, 2002 - RAIDframe and LOCKDEBUG don't play well together, so
we need to get that fixed ASAP. This is important for SMP.
Introduced a new rf_RaidIOThread() to take care of calling
rf_DiskIOComplete and the CompleteFunc() for each completed request.
This should result in a more responsive system, as much less work is
being done at splbio().
-
September 23, 2002 - Cleaned up a whole bunch of unused code/variables.
-
September 8, 2002 - When failing a component, close the component.
This lets us hot-swap out the component, or do other things with the
closed component. Also fixed was the issue of being able to initiate a
'reconstruct-in-place' on an already failed (and reconstructed to a
hot-spare) component.
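(For reference, a 'reconstruct-in-place' is initiated with raidctl,
with illustrative component and device names:
raidctl -R /dev/sd2e raid0
which fails /dev/sd2e and immediately reconstructs the data back onto
it in place.)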
-
August 7, 2002 - Checked in a bunch of locking fixes. These take care
of 95% of the LOCKDEBUG problems that I've seen. There are probably
more fixes needed yet, but these changes go a long way to making
things happier for SMP.
-
August 2, 2002 - Ok, so this list hasn't been kept as up-to-date as
one might have liked. But at least I've been hacking on RAID code.
Mostly doing little fixes and nuking unused bits of code.
-
June 12, 2002 - Wow... hasn't been much to report in a long time.
And still nothing more to report today.
-
January 2, 2002 - Finally getting back to more RAID stuff. RAIDframe
is now enabled by default for a number of different GENERIC kernels in
NetBSD. At the same time, a number of little-used RAID types were
turned off, shrinking the RAIDframe-related parts of the kernel down
to a mere 230K (i386).
- October 25, 2001 - Christoph Kaegi wrote a really nice little article about
setting up mirrors with NetBSD-1.5.2 and RAIDframe.
-
October 4, 2001 - disentangled the header files of RAIDframe.
Two new headers (raidframevar.h and raidframeio.h) now contain the
stuff needed by both the kernel and by raidctl (and other userland
bits). "About time."
-
August 29, 2001 - There is a nice little article
by David Kwok appearing in BSD
Today. Gotta like good press :)
-
July 16, 2001 - Turns out that unconfiguring multi-level RAID sets at
shutdown time was a lot easier to do than expected. A half-dozen
lines of real code, and multi-level RAID sets now get unconfigured
the way one might expect them to at shutdown time. Thanks to Bill
Squier for testing.
-
July 10, 2001 - Luke Mewburn added a '-G' option to raidctl. This
option generates a configuration file for the RAID set. This is most
useful for autoconfigured RAID sets where you've somehow managed to
lose the original raid0.conf file. Thanks Luke!
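For example (device and file names are illustrative):
raidctl -G raid0 > /etc/raid0.conf
The output is in the same format used by 'raidctl -c'.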
-
June 14, 2001 - Fixed a long-standing silly condition where even after
a reconstruct you would need to rebuild the parity. On all but RAID
6, if you reconstruct, then your parity bits are known to be correct
afterwards too!
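(For reference, parity is checked and, if necessary, re-written with
raidctl -P raid0
where raid0 is illustrative.)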
-
June 8, 2001 - More press on Daemon News with the story of RAIDframe
for FreeBSD.
-
June 7, 2001 - RAIDframe as a kernel driver has been ported to FreeBSD by Scott Long. More info
can be found here.
This is yet another testament to the portability of the original CMU code.
-
May 24, 2001 - Sheesh.. nothing much got done in April, and May has
been about the same. At least I've had time to look at the RAIDframe
code this past week. Now that I appear to have the kinks worked out
of my main box, maybe I can start doing some serious development again.
-
March 9, 2001 - Wow! Have I ever been slacking off on RAID stuff :-/
And to make matters worse, I haven't been getting much of anything else
done either... At least today I have some excellent performance
numbers to report:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
2000 24566 46.0 33565 23.8 6515 7.8 26492 86.3 43542 41.1 96.9 1.8
These numbers are from a 1GHz AMD Thunderbird on an ABIT KT7A-Raid
motherboard, using three IBM DLTA-307045 45GB UltraDMA/100 drives (two
drives on UDMA100, one on UDMA66, until our driver does UDMA100 on
that channel too :) ). The three drives are pulled into a RAID 5 set
with a stripe width of 32. The frag/block settings are 4096/32768.
I'd like to use smaller values for these, but a) it's *much* slower
(ok, 15MB/sec writes and 38MB/sec reads) and b) it takes forever to
newfs the filesystem. (w/ the above parameters, newfs'ing the 87GB
filesystem takes about a minute.) I played with a bunch of
parameters, but this seems to be about the best I can get, for now.
The filesystem looks like:
/dev/raid3e 89270792 4 84807248 0% 1 5631997 0% /u3
The other difficulty here is the limited number of files I can have on
this set... The 45GB set I had before had some 10 million inodes.
But getting just that many on this set means I have to chop the block
size to 16384, and that eats into my performance too much. So I guess
I just put "big files" (larger than 32K) on this filesystem.
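For anyone wanting to reproduce a setup like this, a partial sketch of
the corresponding config file (device names are illustrative, and the
other sections are omitted) might look like:
START disks
/dev/wd0e
/dev/wd1e
/dev/wd2e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5
with the filesystem created by something like:
newfs -b 32768 -f 4096 /dev/rraid3e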
-
Nov. 13, 2000 - It's the 2nd anniversary of RAIDframe being an
official part of NetBSD!
-
Nov. 13, 2000 - after I spent a while trying to figure out what was
wrong, Chuck Silvers pointed out a code ordering problem in
src/sys/uvm/uvm_swap.c which was preventing "swap on RAID" from being
unconfigured correctly. With this code now fixed, RAID sets that have
swap on them can now be unconfigured, and won't be dirty after a
reboot.
-
Nov. 2, 2000 - Matt Thomas just added support for booting off a RAID 1
set on the VAX architecture. Thanks Matt! :)
-
Oct. 30, 2000 - Luke Mewburn just added support for booting off RAID 1
sets for i386. It's not quite complete yet (needs some changes to
installboot to get the /boot stuff to work correctly) but it's mostly
there. A big "Thank You" to Luke for doing this.
-
Oct. 26, 2000 - fixed up the raidctl.8 man-page a fair bit. Added a
new section on Performance Tuning.
-
Oct. 19, 2000 - fixed up the disk_busy()/disk_unbusy() calls so that
the IO statistics are now more sane. Also fixed a few problems
related to accidentally touching the labels of failed disks in certain
(very obscure) cases.
-
Sept. 26, 2000 - Simon Burge was at it again, and committed changes to
the alpha bootblocks, allowing the bootblocks to pull a kernel
directly from a RAID1 set. Again, thanks to Simon for getting this
done. (two ports down, many more to go :) )
-
Sept. 13, 2000 - I've been asked a couple of times "How stable is
RAIDframe, and is it good enough for / on RAID?" Well, my main
machine at home has /, swap, and /u2 each on their own RAID 1 set, and
the machine has been up for the past 65 days without a hitch or hiccup
(NetBSD-1.5_ALPHA on i386). And yes, this machine gets used
lots :)
-
Sept. 13, 2000 - Simon Burge just committed changes to the pmax
bootblocks, allowing the bootblocks to pull a kernel
directly from a RAID1 set. Thanks to Simon for doing all the hard
work on this one :) The pmax port is the first to have this feature.
It is hoped that the other ports can have this feature soon too.
-
Sept. 12, 2000 - Yes, I'm still working on RAIDframe :) I've spent the
last week trying to figure out why a copyback seems to be completely
locking up my system. It turns out that the copyback does not
support accessing the RAID set while the copyback is in progress.
This is not that great, and really needs to be addressed at some
point. :(
I've also fixed a bunch of locking issues in the RAIDframe code -- a
few ltsleeps() have been added, and now the kernel doesn't complain
much when running with LOCKDEBUG turned on.
-
August 23, 2000 - More press for RAIDframe! :) BSD Today has an article
about RAIDframe. Thanks to Peter Clark for writing it!
-
August 19, 2000 - Yes, I'm still doing RAIDframe stuff :) Checked in
a couple of bug fixes and have been trying to find time to get a few
other things wrapped up before NetBSD 1.5 is cut.
-
June 3, 2000 - Received an email from a RAIDframe user, pointing me at
this web-page where he looks
at some of the performance implications of various stripe sizes. I
want to do something similar (and have a paper about this in the
works) but I've been too busy with other things :(
-
May 28, 2000 - Spent the weekend fixing bugs. The most notable ones
are: 1) 'raidctl -u' no longer causes a panic if a parity re-write is
taking place (the re-write now gets aborted before the unconfigure)
and 2) the auto-configuration code is more aggressive, and will do more
to ensure that previously failed disks do not mess up the
auto-configuration process.
-
May 17, 2000 - I've had the opportunity to work on an Athlon 650 with
384MB RAM, 4 AdvanSys
ASB3940U2W-00 SCSI controllers and 24 Seagate ST150176LW 50GB
Ultra2Wide SCSI drives (6 per chain). The machine was configured with
four RAID 5 sets of 6 disks each (5 components + 1 spare), with the
disks for any particular RAID set being load balanced across all 4
controllers.
For a single RAID 5 set, the following performance was achieved with a
bit of filesystem and RAID parameter tweaking:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
100 13656 92.1 31653 18.8 4533 7.4 8853 97.8 61635 40.4 144.3 3.0
Unfortunately, I didn't get a chance to benchmark this set with a
larger test size. For the RAID 0 over 4 RAID 5's, the following
performance was observed:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
100 9939 71.0 13590 8.4 6725 11.0 8580 89.5 54228 40.6 346.7 5.7
Interestingly, performance gets better with a larger test size:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
2000 11818 84.2 21219 13.7 5858 11.3 8295 91.9 59475 44.0 120.6 3.2
The filesystems on this critter (which has 2 IDE drives set up w/ boot
partitions, / on a RAID 1, and swap on a different RAID 1) look like:
Filesystem 1K-blocks Used Avail Capacity Mounted on
/dev/raid0a 2441256 973870 1345323 41% /
/dev/raid6e 778113184 72640 739134880 0% /build
Checking/fixing parity on a single RAID 5 set takes about 1.5 hours if
the parity is nearly completely correct. It can take up to 13 hours
if it has to rebuild all the parity. If a disk fails, it takes 45
minutes to get the data reconstructed onto the new disk. A newfs of
the above filesystem takes about 45 minutes. There are actually 7
different RAID sets on this machine (2 RAID 1, 4 RAID 5, and 1 RAID 0).
Special thanks to Dante Profeta for his dedicated work on making the
AdvanSys driver work so very well.
-
May 1, 2000 - Made the May edition of Daemon News with a RAIDframe
article. You can find the article here.
-
March 21, 2000 - flakey disk replaced, and now /, swap, and /u2 are
all running on (individual) RAID 1 sets.
-
March 15, 2000 - Now have my home box running with / on a RAID 1 set.
Discovered yesterday that one of the 18.2GB disks I got is flakey
(read errors!) so it'll be going back for a replacement. The up-side
to the disk having problems was that RAIDframe performed like a champ,
and the system didn't even notice the disk was 'failed'.
-
March 14, 2000 - ARG! Checked in a change that I neglected to check in
when I did the RAID_AUTOCONFIG stuff. Now / on RAID should really
work for everyone, not just for me :)
-
March 13, 2000 - Picked up an AdvanSys
ASB3940U2W-00 SCSI controller, and two 18.2GB Fujitsu U2W 7200RPM
(MAE3182) drives. The performance of this combination is pretty good:
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
RAID 0 11819 87.5 24502 55.0 4676 16.5 10697 94.9 30217 89.4 89.8 5.7
RAID 1 10088 79.6 14948 32.7 4648 18.3 11051 95.3 19874 42.6 85.0 5.7
Re-syncing the mirrors happens at about 20MB/sec, so a re-sync of a
14GB set takes less than 15 minutes. I also upgraded my main box to
NetBSD -current, which means that I can play with all the new RAID
toys (and actually use the above card! :) )
Project History
In case anyone cares about how such a project might evolve, here is a brief outline of some of the
key points in the development effort.
FAQ
Ok, so some of these haven't been asked of me yet: I'm just doing a bit of
preventative maintenance... :-)
Q Where do I get it?
A RAIDframe is an integrated part of NetBSD, and is available in all
NetBSD releases since NetBSD 1.4. You'll note that it's only been
compiled on i386, pmax, sparc, sun3, hp300, and alpha boxes at this
point [May 24, 2001: This information is rather old - it's been
compiled on a lot more boxes than this.], so if you're having problems
with getting it going, PLEASE send me an email (oster@cs.usask.ca), and I'd be
more than happy to help out. (This is especially true if you are
attempting to get the RAIDframe kernel driver going on a platform I
haven't tested it on yet.)
Q Is it stable? Will I lose data? Can I trust it?
A Mostly. Not likely. Perhaps. :-) There is absolutely no
way that I can guarantee that you will not lose data from using this
software. But then, that holds true for just about anything. That
being said, my stress-testing has involved load averages of 64+ over
only 3 SCSI disks, and not a single byte has been lost in "normal
operation". I've also re-built the NetBSD world a couple of times off
a filesystem living on a RAID device, so in that sense at least it's
self-hosting. How well the driver fares under "component failure" is
less well tested, and is the subject of current testing. Update
(Feb. 14, 2000): My home machine (which had an uptime of 120+ days until
it suffered a 2-drive failure) has a RAID 5 setup on it that I use
every day for Real Stuff. I didn't even notice that the first drive
had failed :( Update (March 22, 2000): My /, swap, and /u2
partitions on my main home machine are now all on RAID 1.
Q Is NetBSD-1.3.x supported?
A No, and while a back-port should not take more than a
few hours to do, I have no intention nor inclination to do it.
Q What can I do to help?
A Play with the driver, and try to break it. If folks would
like to help identify more bugs for me, that would be great. Stories
of success, especially successful recoveries after a component dies,
would be most welcome.
Q Why are kernels with RAIDframe so huge?
A RAIDframe is not small. It typically requires at least
500K of extra space [May 24, 2000?: Gee.. this is old too... on i386, I
believe the additional space required is only about 320K these days.
Feb. 22, 2004: RAIDframe weighs in at 150K for i386].
This makes it tight to get it working on something like a Sun 3/50
(kernels must be less than 1MB in size), but if you're going to be
running this on a server, chances are an extra 500K isn't going to be
much of a problem. Doing RAID well and with as many options as
RAIDframe provides takes a lot of space. I am working on pruning out
unnecessary bits from the driver.
Q Can I use disks of different sizes?
A Your physical disks can be of different sizes, but each
component in the RAID set must be the same size. If you have, say, one
9GB SCSI disk, two 4.5GB SCSI disks, and two 10GB disks, you might do
something like the following:
RAID5#1 RAID5#2
9GB: /dev/sd0e /dev/sd0f
4.5GB: /dev/sd1e
4.5GB: /dev/sd2e
10GB: /dev/wd0e /dev/wd0f (1GB unused on /dev/wd0g)
10GB: /dev/wd1e /dev/wd1f (1GB unused on /dev/wd1g)
where each component in each RAID set is 4.5GB. You end up with two
RAID 5 sets, each with 13.5GB of usable storage.
(you could also pull /dev/wd0g and /dev/wd1g into a RAID 0 or 1).
If you wanted all of the partitions into a single RAID set, then
you might try to pull the above two sets together with a RAID 0.
For RAID5#1, you might use:
START disks
/dev/sd0e
/dev/sd1e
/dev/wd0e
/dev/wd1e
and (for RAID5#2) you might use:
START disks
/dev/wd1f
/dev/wd0f
/dev/sd2e
/dev/sd0f
Note that the drives are ordered here in an attempt to avoid having a
single IO go to two different partitions on the same disk. The
reality is you don't want to use more than one partition from any one
disk if you are concerned about performance.
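If you did want to pull the two RAID 5 sets together with a RAID 0 as
suggested above, the "disks" section of that third set's config file
would simply name the RAID devices themselves (assuming the two RAID 5
sets are raid0 and raid1, with 'e' partitions on each):
START disks
/dev/raid0e
/dev/raid1e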
Q I just added another component to my RAID set, but when I
do a newfs I haven't gained any space!
A You are probably still using the old disklabel, not one
that reflects the size of your new RAID set. Just to make sure you
get what you want, I recommend a:
dd if=/dev/zero of=/dev/rraid0d count=2
(that would be
dd if=/dev/zero of=/dev/rraid0c count=2
on non-i386 boxes) to make sure the old label is gone. Then simply
do a:
disklabel raid0 > /tmp/label
{edit /tmp/label and make it say what you want}
disklabel -R -r raid0 /tmp/label
This ensures that you have a fresh label that correctly reflects the
RAID configuration.
TODO
Here, in a "need to fix very soon" order, is a list of stuff on my
TODO list:
-
Fix a problem with auto-configured sets with missing components.
There is no way to "add" a component to the empty spot. (You can
hot-add a spare, and if you reboot, it'll show up as the (previously
missing) component, but that requires an extra reboot, which shouldn't
be needed.)
- Complain profusely and incessantly (in the nicest possible way)
when a component fails. This may be accomplished through the use of a
little user-land daemon which monitors the state of all RAID sets. (A
rough sketch of such a daemon follows this list.)
- N-way RAID 1s have problems if more than one of the components
fails. (Actually, N is limited to 2 right now :( so this is rather moot.)
- Talk more about component labels in the FAQ.
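A very rough sketch of what such a monitoring daemon might look like,
as a shell script (everything here, including the idea of grepping
'raidctl -s' output for failed components, is illustrative):
#!/bin/sh
# Hypothetical watcher: poll the set's status every five minutes and
# nag root if any component has been marked as failed.
while true; do
    if raidctl -s raid0 | grep -q failed; then
        echo "raid0 has a failed component" | mail -s "RAID trouble" root
    fi
    sleep 300
done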
Here, in no particular order, is a (very small, and very incomplete)
list of the stuff on my TODO list for this project:
- Look more closely at the Parity Logging code
- Test RAID6 (make that *fix* RAID 6, as the ECC code is apparently broken).
- Add more info to the FAQ, esp. stuff on using drives of various sizes...
- Automatic reconstruction when a component fails. (The goop to do
it is all there; it's just a matter of making it happen, and allowing
the user to specify the desired behaviour.)
- Clean up the code even more.
(Both of the above lists are now very much incomplete.)
Page last modified: October 31, 2007.
Send comments to oster@cs.usask.ca.