How to install, monitor and repair Software RAID on Debian GNU/Linux.
Installing Software RAID on Debian GNU/Linux
Care must be taken when installing Software RAID 1 on Debian/woody onto the boot partition. One of the best recent guides was written by Marcus Schoppen and is available at http://wwwhomes.uni-bielefeld.de/schoppa/raid/woody-raid-howto.html. With small changes, you can follow his procedure and install Software RAID 1 onto a non-RAID system remotely.
Some comments on his guide:
In step 1, it is better to do this:
$ sfdisk -d /dev/sda > partitions.sda
$ cp -a partitions.sda partitions.sdb
$ perl -pi -e 's,/sda,/sdb,g' partitions.sdb
$ sfdisk /dev/sdb < partitions.sdb
In step 7, it is better to do this:
$ mount -v /dev/md0 /mnt        # let's start with md0
$ cd /                          # since md0=/, see note below
$ find . -xdev | cpio -pm /mnt
$ umount /mnt
for each filesystem (md0=/, md1=/var, md2=/tmp, ...). But beware: cpio and mirrordir do not work for files larger than 2 GB! You have to use cp for those.
In step 9, it is not necessary to make a boot floppy if you do:
$ cp /etc/lilo.conf /tmp        # to keep a "good" lilo.conf handy
$ vi /tmp/lilo.conf             # and put the root arg there, like this:
                                #   image=/boot/....
                                #     label=Linux
                                #     root=/dev/md0
                                #     read-only
$ raidstop /dev/md0             # otherwise you may have problems
$ raidstop /dev/md1             # stop the RAID for each mdX filesystem
$ lilo -C /tmp/lilo.conf
$ reboot                        # FIRST REBOOT, if you started from a RAID-capable kernel
The installation can then be done fully remotely. (Tested.)
In step 11, do not put the partition argument in lilo.conf. An alternative working configuration is as follows:
$ cat lilo.conf
lba32
restricted
boot=/dev/md0
root=/dev/md0
install=/boot/boot-menu.b
map=/boot/map
password=foobar
delay=20
vga=normal
raid-extra-boot="/dev/sda,/dev/sdb"
default=Linux
image=/vmlinuz
        label=Linux
        read-only
image=/vmlinuz.old
        label=LinuxOLD
        read-only
        optional
After step 11 is done, do the SECOND (and final) REBOOT. You are done.
Software RAID Runtime Monitoring
All our Linux servers run software RAID-1 disks, so be alert for failures. The command to check the status of the RAID arrays is:
$ cat /proc/mdstat
and should show "[UU]" for each volume when everything is fine:
$ cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sdb1[1] sda1[0]
      96256 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
      995904 blocks [2/2] [UU]
md2 : active raid1 sdb5[1] sda5[0]
      586240 blocks [2/2] [UU]
md3 : active raid1 sdb6[1] sda6[0]
      995904 blocks [2/2] [UU]
md4 : active raid1 sdb7[1] sda7[0]
      2931712 blocks [2/2] [UU]
md5 : active raid1 sdb8[1] sda8[0]
      5855552 blocks [2/2] [UU]
md6 : active raid1 sdb9[1] sda9[0]
      6457984 blocks [2/2] [UU]
unused devices: <none>
If this is not the case, read the section on repairing below.
Note that our machines usually run the mdadm daemon, which periodically checks the health of the RAID devices and alerts root by email if it spots something wrong.
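The email alerting is configured through mdadm's configuration file. A minimal sketch of the relevant lines, assuming Debian's mdadm package and its usual /etc/mdadm/mdadm.conf location (the address is an example; adjust to your setup):

```
# /etc/mdadm/mdadm.conf (excerpt) -- example values
DEVICE partitions          # scan all partitions for RAID members
MAILADDR root              # where the mdadm monitor sends failure alerts
```

With a MAILADDR set, the monitoring daemon mails that address when it sees events such as a failed mirror half.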
Software RAID Repairing
How do you repair a degraded RAID device? If you run:
$ cat /proc/mdstat
and you see a line containing [U_], such as:
md3 : active raid1 sdb6[1] sda6[0]
      979840 blocks [2/2] [UU]
md4 : active raid1 sda7[0]
      2931712 blocks [2/1] [U_]
then it means that md4 is running in degraded mode and that sdb7 has failed (note that it is missing from the md4 line).
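Checking for the underscore by hand gets tedious across many machines. A small sketch that pulls the names of degraded arrays out of mdstat-style output (shown here against sample text so it can be tried anywhere; on a real machine, pipe `cat /proc/mdstat` into the same filter):

```shell
# List degraded arrays: an underscore in the [..] status field means a
# missing mirror half. The here-variable holds sample /proc/mdstat text.
mdstat='md3 : active raid1 sdb6[1] sda6[0]
      979840 blocks [2/2] [UU]
md4 : active raid1 sda7[0]
      2931712 blocks [2/1] [U_]'
echo "$mdstat" | grep -B1 '_]' | awk '/^md/ { print $1 }'
# prints: md4
```

`grep -B1` keeps the "mdX : active ..." line that precedes each degraded status line, and awk extracts the device name from it.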
First, you should check whether the disk is physically okay: look into /var/log/messages and search for lines like:
$ sudo grep I/O /var/log/messages
Sep 15 02:32:06 pcwebc00 kernel: I/O error: dev 08:21, sector 139017744
Sep 15 02:32:32 pcwebc00 kernel: I/O error: dev 08:21, sector 139017752
If you see this, then the disk should be physically replaced before continuing, and the new disk repartitioned exactly like the old one (or like the disk it is going to mirror).
(Sometimes the system detects the disk as faulty itself and marks it with (F) in the /proc/mdstat output, for example:
$ sudo cat /proc/mdstat
[...]
md6 : active raid1 sdb9[1](F) sda9[0]
      12329792 blocks [2/1] [U_]
and you can double-check that /var/log/messages indeed indicates an I/O error:
Jan  4 00:16:26 pcdh90 kernel: scsi2: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 01 a8 d9 ef 00 00 50 00
Jan  4 00:16:28 pcdh90 kernel: Info fld=0x1a8da02, Current sd08:19: sense key Medium Error
Jan  4 00:16:28 pcdh90 kernel: Additional sense indicates Unrecovered read error
Jan  4 00:16:28 pcdh90 kernel: I/O error: dev 08:19, sector 16661768
in which case the disk needs physical examination, as stated above.)
If you don't see any symptoms of a disk failure, then you may rebuild the RAID device onto the same disk and the same partition.
To repair the RAID device /dev/md4 from the example above, do:
$ sudo raidhotadd /dev/md4 /dev/sdb7
or, if you use mdadm instead of raidtools:
$ sudo mdadm /dev/md4 -a /dev/sdb7

(If the failed partition is still listed as a member of the array, e.g. marked (F) in /proc/mdstat, remove it first with mdadm /dev/md4 -r /dev/sdb7.)
and watch the progress:
$ cat /proc/mdstat
md4 : active raid1 sdb7[2] sda7[0]
      2931712 blocks [2/1] [U_]
      [====>................]  recovery = 21.4% (629440/2931712) finish=1.5min speed=25177K/sec
After a while, the RAID should be repaired:
$ cat /proc/mdstat
md4 : active raid1 sdb7[1] sda7[0]
      2931712 blocks [2/2] [UU]
You are done.