Failing and Removing a Device From a RAID 1 Array in Linux

When a RAID device fails, it is necessary to remove the hard drive containing the failed device from the array and replace it with a new hard drive. With Linux software RAID this is actually fairly simple using the mdadm command.

First, let's look at an existing RAID 1 setup with a pair of RAID devices configured. The computer used in this example is emstools2b.

The fdisk -l command shows your RAID devices and the disk partitions that make them up.
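A quicker way to check the overall health of every array is the kernel's /proc/mdstat file, the same file quoted in the failure mail shown later in this document:

```shell
# One-line summary per array: [UU] means both mirror halves are
# in sync, [U_] means one member is missing or failed.
cat /proc/mdstat
```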

[root@emstools2b ~]# fdisk -l

Disk /dev/sda: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          33      265041   fd  Linux raid autodetect
/dev/sda2              34       14593   116953200   fd  Linux raid autodetect

Disk /dev/sdb: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          33      265041   fd  Linux raid autodetect
/dev/sdb2              34       14593   116953200   fd  Linux raid autodetect

Disk /dev/md1: 119.7 GB, 119759962112 bytes
2 heads, 4 sectors/track, 29238272 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md0: 271 MB, 271319040 bytes
2 heads, 4 sectors/track, 66240 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md0 doesn't contain a valid partition table

In this case there are two hard drives, /dev/sda and /dev/sdb. Two RAID 1 devices are built from them: /dev/md0, which is 271 MB and used for /boot, and /dev/md1, which is 119.7 GB and formatted as an LVM volume group holding the rest of the Linux file systems. The df -h command shows how these file systems are mounted.

[root@emstools2b ~]# df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-rootVol      1.4G  292M  1.1G  22% /
/dev/mapper/VolGroup00-varVol       1.9G  250M  1.6G  14% /var
/dev/mapper/VolGroup00-usrVol       3.8G  2.1G  1.6G  57% /usr
/dev/mapper/VolGroup00-usrlocalVol  1.9G   36M  1.8G   2% /usr/local
/dev/mapper/VolGroup00-tmpVol       4.8G  138M  4.4G   4% /tmp
/dev/mapper/VolGroup00-homeVol      961M   18M  895M   2% /home
/dev/mapper/VolGroup00-optVol        48G  180M   45G   1% /opt
/dev/md0                            251M   25M  214M  11% /boot
tmpfs                               2.0G     0  2.0G   0% /dev/shm

You can use the mdadm command to view the status of a RAID device.

[root@emstools2b ~]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Jan  7 08:58:58 2007
     Raid Level : raid1
     Array Size : 116953088 (111.54 GiB 119.76 GB)
    Device Size : 116953088 (111.54 GiB 119.76 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Jan  8 08:59:06 2008
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 8ba0b0a5:274b0bc5:253d75af:d75ac7b5
         Events : 0.8

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2

This shows the details of the device including the current status and the component devices that make up the /dev/md1 RAID array.

The mdadm command can also be used to simulate the failure of a RAID device. Let's use it to mark the /dev/sdb2 component of the /dev/md1 array as failed (-f is shorthand for --fail).

[root@emstools2b ~]# mdadm -f /dev/md1 /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1

Note that when a RAID device fails, whether marked manually like this or through a genuine hardware failure, the mdmonitor service detects the failure and sends an email to root.

This is an automatically generated mail message from mdadm
running on emstools2b.cisco.com

A Fail event had been detected on md device /dev/md1.
It could be related to component device /dev/sdb2.
Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      264960 blocks [2/2] [UU]

md1 : active raid1 sdb2[2](F) sda2[0]
      116953088 blocks [2/1] [U_]

unused devices: <none>
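To confirm that these notifications will actually reach someone, mdadm's monitor mode can generate a test alert for each array; the destination address comes from the MAILADDR line in /etc/mdadm.conf (or the -m option):

```shell
# --oneshot checks the arrays once rather than daemonizing;
# --test generates a TestMessage alert for every array found,
# delivered the same way a real Fail event would be.
mdadm --monitor --scan --oneshot --test
```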

Running mdadm -D again shows the degraded status of the array and indicates which device has failed.

[root@emstools2b ~]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Jan  7 08:58:58 2007
     Raid Level : raid1
     Array Size : 116953088 (111.54 GiB 119.76 GB)
    Device Size : 116953088 (111.54 GiB 119.76 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Jan  8 09:14:37 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 8ba0b0a5:274b0bc5:253d75af:d75ac7b5
         Events : 0.24

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       0        0        1      removed

       2       8       18        -      faulty spare   /dev/sdb2

The actions you need to take to recover are:

  1. Remove the damaged device from the array.
  2. Remove any other RAID component devices that reside on the same physical drive as the failed device.
  3. Replace the defective hard drive.
  4. Create the RAID partition on the physical hard drive.
  5. Create the new RAID devices.
  6. Add the RAID devices into the array.

See the document Configuring Software RAID 1 Arrays With Linux for details of this procedure.

Remove the failed device from the array using the mdadm command. Also remove any other RAID component devices on this physical drive that belong to other arrays.

[root@emstools2b ~]# mdadm -r /dev/md1 /dev/sdb2
mdadm: hot removed /dev/sdb2
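Following the same pattern, the other component on this drive, /dev/sdb1 in /dev/md0, would be failed and then removed before the drive is pulled (device names are from this example and would differ on your system):

```shell
# /dev/sdb1 is still active in /dev/md0, so mark it faulty first,
# then hot-remove it; a drive should only be pulled once none of
# its partitions are active array members.
mdadm -f /dev/md0 /dev/sdb1
mdadm -r /dev/md0 /dev/sdb1
```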

At this point you can remove the defective hard drive, replace it with a new one, and create the required RAID partitions. Each partition must be the same size as (or larger than) the device it replaces in the array. Then simply use the mdadm command again to add the new device into the array.
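One common way to recreate matching partitions on the replacement drive is to copy the partition table from the surviving disk with sfdisk, assuming the new drive is at least as large as the old one:

```shell
# Dump /dev/sda's partition table and write it to /dev/sdb,
# giving the replacement drive identical "Linux raid autodetect"
# partitions; verify the result with fdisk -l /dev/sdb.
sfdisk -d /dev/sda | sfdisk /dev/sdb
```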

[root@emstools2b ~]# mdadm -a /dev/md1 /dev/sdb2
mdadm: re-added /dev/sdb2

The mdadm command can be used to monitor the rebuilding progress of the array. The rebuild begins as soon as the device is added into the array; no other commands are required to cause that to happen.

[root@emstools2b ~]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Sun Jan  7 08:58:58 2007
     Raid Level : raid1
     Array Size : 116953088 (111.54 GiB 119.76 GB)
    Device Size : 116953088 (111.54 GiB 119.76 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Jan  8 09:16:06 2008
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 21% complete

           UUID : 8ba0b0a5:274b0bc5:253d75af:d75ac7b5
         Events : 0.42

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       2       8       18        1      spare rebuilding   /dev/sdb2
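The rebuild can also be followed from /proc/mdstat, which shows a progress bar and an estimated completion time; wrapping it in watch refreshes the view automatically:

```shell
# Re-display the array status every 5 seconds until interrupted;
# the recovery line disappears once the rebuild completes.
watch -n 5 cat /proc/mdstat
```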