Saturday, November 14, 2009

Replacing a Faulty Drive in a Software RAID under Linux

At the end of summer, one of the hard drives in my Intel D945GSEJT server failed. The drive was under warranty, so it was replaced at no cost to me, and because it was part of an encrypted RAID 1, no data was lost or leaked. Thanks to the excellent software RAID under Linux, adding the replacement cost only a few minutes of uptime. I followed a great guide on Linux software RAID management, and these are the few simple steps needed to replace the failing drive:
  1. Remove the relevant partitions from the RAID, e.g. if /dev/sdb has failed and the RAID consists of /dev/md0 and /dev/md1:
    mdadm /dev/md0 --remove /dev/sdb1
    mdadm /dev/md1 --remove /dev/sdb2
  2. Power down, replace the drive, and power up
  3. Copy the partition table from the drive still in the RAID to the new drive, being very careful to get the commands right, e.g.:
    sfdisk -d /dev/sda | sfdisk /dev/sdb
  4. Add the new drive to the RAID, e.g.:
    mdadm --add /dev/md0 /dev/sdb1
    mdadm --add /dev/md1 /dev/sdb2
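
Since mixing up the source and target drive in step 3 would be disastrous, it can help to print the whole command sequence and read it over before typing anything as root. The following is only a minimal dry-run sketch of the steps above: the function name plan_replace and the array:partition argument format (e.g. md0:1 for /dev/sdb1 in /dev/md0) are made up for illustration, and the device names are the example values from this post, not something to copy blindly.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for the steps above, runs nothing.
# plan_replace and the "array:partition" pair format are hypothetical helpers.

plan_replace() {
    good=$1   # drive still in the RAID, e.g. /dev/sda
    bad=$2    # failed drive and, later, its replacement, e.g. /dev/sdb
    shift 2   # remaining arguments: array:partition pairs, e.g. md0:1

    # Step 1: remove each partition of the failed drive from its array.
    for pair in "$@"; do
        echo "mdadm /dev/${pair%%:*} --remove ${bad}${pair##*:}"
    done

    # Step 2: manual hardware swap.
    echo "# power down, swap the drive, power up"

    # Step 3: copy the partition table from the good drive to the new one.
    echo "sfdisk -d ${good} | sfdisk ${bad}"

    # Step 4: add the new partitions back to their arrays.
    for pair in "$@"; do
        echo "mdadm --add /dev/${pair%%:*} ${bad}${pair##*:}"
    done
}

plan_replace /dev/sda /dev/sdb md0:1 md1:2
```

After the real commands have been run, the rebuild progress can be followed with cat /proc/mdstat. If mdadm refuses to remove a partition in step 1 because it has not yet been marked as faulty, marking it with mdadm's --fail option first is likely what is needed.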