[LU-15965] Backup-Restore when one of the hard drives on the MDT fails. Created: 23/Jun/22  Updated: 05/Jul/22  Resolved: 05/Jul/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Critical
Reporter: Loan Thai Assignee: Andreas Dilger
Resolution: Done Votes: 0
Labels: None
Environment:

Dell EMC, Red Hat 7.4, IML version 4.0.3. Consists of an IML server, 1 head node, 72 compute nodes, 1 MDT, MDSs, OSSs, and OSTs.


Rank (Obsolete): 9223372036854775807

 Description   

Hello,

One of the hard drives on the MDT will fail soon.  We need to replace it.

The question is: after replacing the failed hard drive, will the metadata be rebuilt automatically, or do we have to run something to recover it?  We have IML ver 4.0.3.  Should we back up the DB before replacing the failed HD?  Please advise.

Thanks.   



 Comments   
Comment by Andreas Dilger [ 23/Jun/22 ]

Yes, definitely back up the MDT. You should do that now, before the drive fails, and probably make a second backup right before replacing the drive. Please see https://wiki.lustre.org/Backing_Up_a_Lustre_File_System for details.

Whether the MDT volume is automatically rebuilt when the drive is replaced depends on the underlying RAID hardware on your system. Lustre itself does not provide any redundancy, so there must be some kind of RAID at the storage level to protect against individual drive failures. That is not something we support; please contact your storage vendor.

Comment by Loan Thai [ 24/Jun/22 ]

Hi Andreas,

Thanks for working on my issue.  To do the device-level backup, does it require unmounting the target?  I am afraid the drive will fail after unmounting, and we also cannot unmount the FS.  Is there another method to back up the MDT without unmounting the Lustre FS?

For the device-level backup, I need to use the dd command:  dd if=/dev/{original} of=/dev/{newdev} bs=4M

We have 2 MDSs; should I run the dd command on both, or just one MDS?

The disk layout on the MDS is below.  Should I back up all of the /dev/sdX devices, or only /dev/sda?

  • /dev/sda (299 GB)  mount /boot , swap, /, etc.
  • /dev/sdb (11 GB)   mount /mnt/MGS
  • /dev/sdc (6 TB)
  • /dev/sdd (11 GB)   mount /mnt/MGS
  • /dev/sde (6 TB) 

Thank you.

Comment by Andreas Dilger [ 24/Jun/22 ]

The "dd" mechanism can be used with a mounted MDT. Though it may not be a perfect backup (some recently created/deleted files may be missed), it is much better than having no backup at all. The less activity on the filesystem, the better the backup will be.

While you may have multiple MDS nodes, I suspect you only have one MDT on a system this old. I would recommend backing up at least the MDT and the MGT. It isn't clear from your comment which device is the MDT; it should be clearly listed:

mds# mount | grep lustre
/dev/nvme0n1p1 on /mnt/myth/mgs type lustre (ro,svname=MGS,nosvc,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro,_netdev)
/dev/sda on /mnt/myth/ost0000 type lustre (ro,svname=myth-OST0000,mgsnode=192.168.20.1@tcp,osd=osd-ldiskfs,errors=remount-ro,extents,mballoc,_netdev)
/dev/sdb on /mnt/myth/ost0001 type lustre (ro,svname=myth-OST0001,mgsnode=192.168.20.1@tcp,osd=osd-ldiskfs,errors=remount-ro,extents,mballoc,_netdev)
/dev/sdc on /mnt/myth/ost0002 type lustre (ro,svname=myth-OST0002,mgsnode=192.168.20.1@tcp,osd=osd-ldiskfs,errors=remount-ro,extents,mballoc,_netdev)
/dev/sdd on /mnt/myth/ost0003 type lustre (ro,svname=myth-OST0003,mgsnode=192.168.20.1@tcp,osd=osd-ldiskfs,errors=remount-ro,_netdev)
/dev/sde on /mnt/myth/ost0004 type lustre (ro,svname=myth-OST0004,mgsnode=192.168.20.1@tcp,osd=osd-ldiskfs,errors=remount-ro,_netdev)
/dev/nvme0n1p2 on /mnt/myth/mdt0000 type lustre (ro,svname=myth-MDT0000,mgsnode=192.168.20.1@tcp,osd=osd-ldiskfs,errors=remount-ro,iopen_nopriv,user_xattr,_netdev)

In this example, the MDT is on /dev/nvme0n1p2.
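Picking the MDT device out of "mount" output can also be scripted: the service name in the mount options contains "MDT" only for the metadata target. This is a sketch, shown against sample mount output so it can be tried anywhere; the awk filter and sample lines are illustrative, not part of the ticket.

```shell
# Sample `mount -t lustre` output (two lines, abbreviated for illustration).
mount_output='/dev/nvme0n1p1 on /mnt/myth/mgs type lustre (ro,svname=MGS,nosvc,mgs)
/dev/nvme0n1p2 on /mnt/myth/mdt0000 type lustre (ro,svname=myth-MDT0000,user_xattr)'

# Print the device whose svname= option contains "MDT" (the metadata target).
printf '%s\n' "$mount_output" | awk '/svname=[^,)]*MDT/ {print $1}'
```

On a live MDS the same filter would be applied directly: mount -t lustre | awk '/svname=[^,)]*MDT/ {print $1}'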

That said, having a backup of the Linux OS disk is also very useful, because it would take a lot of time and effort to reinstall and rebuild this system from scratch (if that is even possible, given its age), and since the OS disk doesn't change very often, even a single OS backup is probably enough. The lowest-priced 4TB drive today is about $100, so backing up 300GB costs about $7 worth of storage (probably less with a larger drive), and rebuilding the OS drive would definitely take much more time and effort than $7.

Comment by Loan Thai [ 24/Jun/22 ]

Sorry for the confusion.

Our system was built by a vendor, so I don't quite understand the setup.
We only have 1 MDT (combined MDT/MGS). From the mount output, I got:

  • MDS0-0: (please see the attached)
    /dev/mapper/mpathb on /mnt/MGS type lustre (ro)
  • MDS0-1: (please see the attached)
    /dev/mapper/mpatha on /mnt/MGS type lustre (ro)

Is the MDT on /dev/mapper/mpatha or /dev/mapper/mpathb?
I will back up both mpatha and mpathb with this command on MDS0-0:

MDS0-0# dd if=/dev/mapper/mpatha of=/mnt/my_NFS_mounted bs=4M     #-- is it correct? --#

You are right about the HD for backup.  It is not a money issue but a permissions one.  Our system is standalone/isolated in a secure area, and it is very hard to get approval to add devices.  But I will take your advice seriously and work on it.

Comment by Andreas Dilger [ 24/Jun/22 ]

Sorry, it isn't clear at all that the /mnt/MGS device is the right one. Check the output of "mount | grep lustre" on both MDS servers to see which device is mounting MDT0000, as shown in my example output above.

Comment by Andreas Dilger [ 24/Jun/22 ]

PS: I assume you know which RAID device is holding the failing drive? If so, that is the device that should be backed up.

Comment by Loan Thai [ 28/Jun/22 ]

Good morning Andreas, I am sorry for the late response, as I was out on TDY yesterday.

From MDS0-0, the output of "mount | grep lustre" is:
    /dev/mapper/mpathb    /mnt/lustre-MDT0000

From MDS0-1, the output of "mount | grep lustre" is:
    /dev/mapper/mpatha    /mnt/MGS

So I will log in to MDS0-0 and back up the MDT with this command:
  dd if=/dev/mapper/mpathb of=/mnt/backupMDT0000_onNFSmounted bs=4M

And it is still good to back up the MGS too, right?  dd if=/dev/mapper/mpatha of=/mnt/backupMGS_onNFSmounted bs=4M

Thanks.

Comment by Andreas Dilger [ 28/Jun/22 ]

Yes, making a backup of both is a good idea.

Comment by Loan Thai [ 28/Jun/22 ]

Hi Andreas,

I ran the dd command to back up the MDT to my NFS but had to abort it because I don't have enough space on the NFS.
The total space of the MDT is 4.4TB; used space is 418GB.  My NFS only has about 1TB.  The dd command backs up the entire 4.4TB of the MDT.

Is there another way to back up only the used space on the MDT?  Thanks.

Comment by Andreas Dilger [ 28/Jun/22 ]

Using 'dd' is by far the fastest and most reliable way to do a backup and restore, and is my strong recommendation. Yes, it does a full backup of the entire MDT device, but it also ensures that if it needs to be restored it will be exactly the same as before, and it can nominally be done while the system is in use (though I would recommend making a second backup when you are able to stop the MDT, or immediately before the drive is replaced).

You could consider piping the "dd" output through bzip2 to try to compress it, but I don't know how much compression you will get, as this depends heavily on how full the MDT is and the lifetime of the system. This is not difficult to try, something like the following, but of course compressing and decompressing the backup will make the process take significantly longer:

dd if=/dev/mapper/mpathb bs=4M | bzip2 -9 > /mnt/backupMDT0000_onNFSmounted.img.bz2

That said, it makes sense to start such a backup now while other options are considered.
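The dd | bzip2 pipeline above can be rehearsed end-to-end on a scratch file standing in for /dev/mapper/mpathb, so the backup and restore commands are proven out before they touch the real MDT. This is a sketch with temporary, illustrative paths only:

```shell
set -e
src=$(mktemp)                                   # scratch file standing in for the MDT device
dd if=/dev/urandom of="$src" bs=1M count=4 status=none

# Backup: stream the "device" through bzip2, as in the command above.
dd if="$src" bs=1M status=none | bzip2 -9 > "$src.img.bz2"

# Restore: decompress back to an image and verify it matches the original.
bunzip2 -c "$src.img.bz2" > "$src.restored"
cmp -s "$src" "$src.restored" && result="backup verified"
echo "$result"

rm -f "$src" "$src.img.bz2" "$src.restored"
```

For the real restore, the bunzip2 output would be written to the replacement device (of=/dev/...) rather than a file.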

There are measures that could be taken to improve the compression of the dd image, but that would involve unmounting the MDT and writing a lot of zeroes to it, and this is not advisable if a drive is already near failure.

The user manual also describes how to use "tar" to do a backup/restore, and this may take less space (depending on how full the MDT is), but it will take much longer and put a lot more load on the MDT drives. It would also require reformatting the MDT before restoring the backup, and will not produce an "exact" backup and restore. It also isn't clear to me what state your configuration is in, how the MDT was initially formatted, the software versions, or your experience level, and given the secure nature of the system it would be extremely difficult to assist you with that approach. This kind of system administration task is really outside the scope of the Lustre support contract.
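The shape of the manual's file-level method can be rehearsed on a scratch directory: archive with extended attributes and sparse files preserved, extract, and compare. This is only a sketch of the tar step; a real MDT backup additionally mounts the device as ldiskfs and saves the EAs separately, as the Lustre operations manual describes, and all paths here are placeholders.

```shell
set -e
work=$(mktemp -d)
mkdir "$work/src" "$work/restore"
echo "inode metadata" > "$work/src/file1"       # stand-in for MDT contents

# Archive preserving xattrs and sparseness, as the file-level method requires.
tar -C "$work/src" --xattrs --sparse -czf "$work/mdt-backup.tgz" .

# Restore into a fresh directory (on a real MDT: a freshly reformatted device).
tar -C "$work/restore" --xattrs -xzf "$work/mdt-backup.tgz"
cmp -s "$work/src/file1" "$work/restore/file1" && tar_result="restore verified"
echo "$tar_result"

rm -rf "$work"
```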

You really need to reconsider the relatively low cost of installing a suitable backup drive, not just for the MDT, but also for the OS images. Given the lack of familiarity with the system, restoring a failed OS drive might prove very difficult and time consuming. This is very worthwhile given the risk of potentially losing all of the data in the filesystem if the MDT is lost. If there is some administration overhead to install a new drive, then consider making it a large one so that it will last a long time and can hold multiple MDT backups.

Most secure sites I've worked with have less objection to bringing in new equipment and more objection to removing equipment, but even a few-hundred-dollar 14TB drive in a USB enclosure would be sufficient for this use and could be left onsite afterward. Since the backup does linear writes, the drive could sustain full bandwidth, so it would finish a full MDT backup in about 48h (I'm assuming, based on the age of the system, USB 2.0 ~= 25MB/s).
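The ~48h figure above can be sanity-checked with shell arithmetic, assuming the full 4.4TB device image and the estimated 25MB/s USB 2.0 rate:

```shell
# 4.4 TB = 4,400,000 MB; divide by 25 MB/s, then by 3600 s/h.
echo "$(( 4400000 / 25 / 3600 )) hours"    # integer estimate of transfer time
```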

Comment by Loan Thai [ 29/Jun/22 ]

I will do the dd backup and compress the data as you suggested.  I have a couple of big drives, 10 and 14 TB.  I will figure it out and get approval to add them.  I will post an update.  Thanks for your assistance.

Comment by Loan Thai [ 05/Jul/22 ]

I followed your instructions to back up (with compression) the MDT and MGS.  I used the Dell MDSM tool to manage the MDT and replace the failed disk.  Data is rebuilding now.  Please close this ticket as resolved.  Thanks so much for your help, Andreas.

Generated at Sat Feb 10 03:22:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.