Wednesday, November 5, 2008

Moving to Soft RAID 5 on Linux

These days I increasingly need remote access to old data (music demos to send out, old pictures, old documents, old code, etc.), and I quickly learned that my old routine of looking up the index document and inserting the right optical media no longer works well. With terabyte-class hard drives now available, I decided to move most of my data onto big drives that stay online. After a while I also realized how much time it costs to move the data around, so my solution is redundancy: if the hardware fails, I can do a quick drive-to-drive recovery instead of copying from optical media back onto a hard drive.

I have heard many fellow system admins complain about hardware RAID controllers going end-of-life, leaving them unable to access their data once the last spare controller dies. Therefore, I chose software RAID. Since performance isn't crucial for my purpose, this should be a reasonable solution.

The hard drives I use are three 1TB WD10EACS units, chosen for their price and their low-temperature, low-power design. However, I bought only one drive at first and started putting data on it before I realized that I wanted RAID. Because my Linux box has a limited number of SATA ports, I can only add two more WD10EACS. RAID 5 is my choice for its reasonable trade-off between capacity, redundancy, and performance.

Normally, one would build the RAID 5 on all three drives and then copy the data onto it. In my case, however, one of the drives already contains the data I want to keep, so I can't build a 3-disk RAID 5 from the start. After some research, I learned that it's possible to start with a RAID 5 on just two drives and add the third one later.

Linux Raid is an excellent guide for me to do so. Managing RAID and LVM with Linux also helped a lot. Thanks to those kind minds.

/dev/sda is the drive that already contains the data (in /dev/sda1)
/dev/sdb and /dev/sdc are new drives.

I decided to build one large partition instead of using the whole disk, in case I have trouble finding a replacement drive with a slightly larger capacity in the future. Using a partition that's slightly smaller than the full drive capacity saves us from depending on manufacturers producing hard drives with 100% identical capacity later on.

(As a side note: in one of the posts in Managing RAID and LVM with Linux, the author mentions a Slashdot solution that involves 10 partitions. They are perfectly right that it provides a lot of flexibility, but the steps are tedious and take too long; I can't imagine spending that much time in the future just to expand capacity. With prices of large hard drives continuing to drop, I believe I will be able to find and afford larger drives for data migration when the time comes.)
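Before partitioning, it helps to check the drive's geometry so you know the total cylinder count to stay under. A plain listing does the trick:

# fdisk -l /dev/sdb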

# fdisk /dev/sdb

Create one partition that's slightly smaller than the drive's capacity. In my case I used cylinders 1-121577, as opposed to 1-121601; that's about 1000.2GB. I also changed the partition ID to "da" (Non-FS data). For those who are not familiar with fdisk, the commands are: "n -> p -> 1 -> 1 -> 121577", then "t -> da", and then "w" to write the table and exit.
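For reference, the interactive session looks roughly like this (an approximate transcript; the cylinder numbers are from my drive and the exact prompts depend on your fdisk version):

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-121601, default 1): 1
Last cylinder or +size or +sizeM or +sizeK (1-121601, default 121601): 121577

Command (m for help): t
Selected partition 1
Hex code (type L to list codes): da
Changed system type of partition 1 to da (Non-FS data)

Command (m for help): w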

Do the same for another disk.
# fdisk /dev/sdc

Now we are ready to build a 2-disk RAID 5 across /dev/sdb1 and /dev/sdc1.

# mdadm --create --verbose /dev/md0 -l5 -n2 /dev/sdb1 /dev/sdc1
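To confirm the two-disk array came up and to watch the initial sync, a quick look at the kernel's md status works:

# cat /proc/mdstat
# mdadm --detail /dev/md0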

Now /dev/md0 is ready to be used like any other block device. However, for future flexibility, I decided to build LVM on top of the RAID 5.

# pvcreate /dev/md0
# vgcreate -s 64M lvm-tb-storage /dev/md0

(Note that I use LVM2, so there is no restriction on the number of extents in each logical volume. However, according to "man vgcreate", too many extents may slow down the tools. Therefore, for a few TB, a 64MB physical extent size is probably better than the 4MB default.)
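For a sense of scale: roughly 2TB of space at the 4MB default works out to around 500,000 extents, while 64MB extents bring that down to about 32,000.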

# vgdisplay lvm-tb-storage
# lvcreate -L931.32G -n tb lvm-tb-storage
# mkfs.ext3 -b 4096 -R stride=8 /dev/lvm-tb-storage/tb

Then we can mount the volume. Here's the line in my /etc/fstab:

/dev/lvm-tb-storage/tb /mnt/lvm-tb-storage-tb ext3 defaults 1 1
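Create the mount point and mount it (the path being the one from my fstab above):

# mkdir -p /mnt/lvm-tb-storage-tb
# mount /mnt/lvm-tb-storage-tb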

Then we can copy the data onto it. I used "rsync -a" so I can resume it if it stops for some reason.

# rsync -a old_1TB_drive_mount_point/* new_raid_mount_point
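One caveat: the trailing "/*" makes the shell skip dot files at the top level of the old mount point. Using a trailing slash on the source instead copies everything, hidden files included:

# rsync -a old_1TB_drive_mount_point/ new_raid_mount_point/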

So that the array is assembled automatically at boot, we record the RAID configuration in mdadm's config file:

# mdadm --detail --scan > /etc/mdadm.conf
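Depending on the distribution, mdadm.conf may also want a DEVICE line telling mdadm where to look for array members; a common variant (adjust to your setup) is:

# echo "DEVICE partitions" > /etc/mdadm.conf
# mdadm --detail --scan >> /etc/mdadm.conf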

Now we have a 2-disk RAID 5 holding all the data we need, and the original 1TB drive is free to be added to the array.

First, do the same partitioning for this old drive.

# fdisk /dev/sda

Then we add it to the array as a spare and grow the array to three devices (the reshape uses a backup file, which should live somewhere off the array itself).

# mdadm -a /dev/md0 /dev/sda1
# mdadm --grow /dev/md0 -n3 --backup-file=raidbackup

Although the shell prompt comes back right away, the reshape has started and continues in the background. Use this to check:

# mdadm --detail /dev/md0

I could leave a shell open and repeat the following command once in a while:

# date; mdadm --detail /dev/md0 | grep "Reshape Status"

It should show a time stamp and something like "Reshape Status : 0% complete".

However, for convenience, e.g. when managing the box remotely, I put the following command into my crontab to run hourly.

/usr/bin/perl -le 'print "---------\n" . `/sbin/mdadm --detail /dev/md0 `;' >>/root/rebuild_progress.txt 2>&1
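As a crontab entry with the hourly schedule prepended (this assumes root's crontab, since both the command and the log path need root), it looks like:

0 * * * * /usr/bin/perl -le 'print "---------\n" . `/sbin/mdadm --detail /dev/md0`;' >>/root/rebuild_progress.txt 2>&1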

From the log, the reshape progressed at roughly 1% per hour, so the whole process took about 7-8 days to complete. Of course, I could still read from and write to the RAID during this time. Or, if you don't mind the system taking a heavier load, check out this advice for how to speed up the reconstruction.
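In short, the usual knobs are the md rebuild speed limits under /proc; for example (the value here is only illustrative):

# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# echo 50000 > /proc/sys/dev/raid/speed_limit_min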

After the process is complete (in my case, after 8 days), we update mdadm.conf so the next boot picks up the latest configuration.

# mdadm --detail --scan > /etc/mdadm.conf

Now we can resize the LVM.

# pvresize /dev/md0
# vgdisplay lvm-tb-storage

This is what it looks like.


[root@localhost ~]# vgdisplay lvm-tb-storage
--- Volume group ---
VG Name lvm-tb-storage
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 4
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size 1.82 TB
PE Size 4.00 MB
Total PE 476839
Alloc PE / Size 238422 / 931.34 GB
Free PE / Size 238417 / 931.32 GB
VG UUID wqIXsb-KRZQ-eRnH-JvuP-VdHk-XJTG-DSWimc

[root@localhost ~]# lvresize -l +238450 /dev/lvm-tb-storage/tb
Extending logical volume tb to 1.82 TB
Insufficient free space: 238450 extents needed, but only 238389 available
[root@localhost ~]# lvresize -l +238389 /dev/lvm-tb-storage/tb
Extending logical volume tb to 1.82 TB
Logical volume tb successfully resized
[root@localhost ~]# vgdisplay lvm-tb-storage
--- Volume group ---
VG Name lvm-tb-storage
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 6
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 1
Max PV 0
Cur PV 1
Act PV 1
VG Size 1.82 TB
PE Size 4.00 MB
Total PE 476839
Alloc PE / Size 476839 / 1.82 TB
Free PE / Size 0 / 0
VG UUID wqIXsb-KRZQ-eRnH-JvuP-VdHk-XJTG-DSWimc


Note that I deliberately gave lvresize a bigger PE number than available so it would complain and tell me the maximum. Also note that my PE size is 4MB, an unfortunate mistake that would take too much time to fix now. As I said before, a 64MB PE size would have been the better choice.

Now we are ready to expand the file system.

Although resize2fs should work online, to be safe, we first unmount the volume

# umount /dev/mapper/lvm--tb--storage-tb

Then do a check before we modify the file system. This takes a while.

# e2fsck -t -f /dev/mapper/lvm--tb--storage-tb

And now resize the file system. (The -p option prints a progress bar. Since we already set the RAID stride when we created the file system, we don't need to pass it again with -S.)

# resize2fs -p /dev/lvm-tb-storage/tb

Now we can re-mount and enjoy the expanded RAID 5 storage.

# mount -a
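A quick check that the extra space shows up on the mount point from my fstab:

# df -h /mnt/lvm-tb-storage-tb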

As a side note, if you are getting crazy temperature readings from smartd in /var/log/messages, try editing /etc/smartd.conf and adding "-R 194" to the options. My /etc/smartd.conf has one line (see the smartmontools FAQ page for instructions):

DEVICESCAN -a -R 194