Here are my notes for a backup and restore test.
ZFS snapshots and send/receive for object backups.
This example is for a combined MGS/MDT object, but the same approach applies to an OST device-level backup. This example was run on Red Hat Enterprise Linux 6.2 and Lustre 2.4.0.
Servers and filesystems in the example
lustre2-8-25 - MGS/MDT server
lustre-meta/meta - Lustre ZFS MGS/MDT volume/filesystem on lustre2-8-25
lustre2-8-11 - OSS/OST server
lustre-ost0 - Lustre ZFS OST volume on lustre2-8-11
Backing up the object
Take a snapshot
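For example (the snapshot name "backup1" is only an illustration):
  zfs snapshot -r lustre-meta@backup1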
"-r" means do a recursive snapshot, so this will include both the volume and the filesystem.
List existing snapshots.
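For example:
  zfs list -t snapshot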
Send the snapshot and store it on a remote ZFS/Lustre server:
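A sketch, reusing the hypothetical snapshot name "backup1" and receiving into a backup dataset "meta-backup" on the remote pool (also a hypothetical name); "-u" on the receive just keeps the received datasets from being mounted on the backup host:
  zfs send -R lustre-meta@backup1 | ssh lustre2-8-11 zfs receive -u lustre-ost0/meta-backup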
Note: "-R" recursively sends the volume and filesystem and preserves all properties. It is critical to preserve the filesystem properties. If you are not using the "-R" flag, be sure to use "-p"; we will show that during recovery.
Examine on remote side
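For example, on lustre2-8-11 (with the dataset name assumed above):
  zfs list -r -t all lustre-ost0/meta-backup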
Recovery from failure.
In testing, I first corrupted the filesystem with 'dd'. You could also simply reformat it for testing.
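A sketch of the corruption step, destructive and for test systems only; the device path is hypothetical:
  # Overwrite the start of one of the pool's backing devices to simulate damage
  dd if=/dev/zero of=/dev/sdX bs=1M count=100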
Create a new ZFS Lustre volume/filesystem with the same name.
In my test case we have a RAID 10:
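A sketch of the reformat; the filesystem name, index, and device names are assumptions, arranged as striped mirrors (RAID 10). mkfs.lustre with --backfstype=zfs creates the pool and dataset in one step:
  mkfs.lustre --fsname=lustre --mgs --mdt --index=0 --backfstype=zfs \
      lustre-meta/meta mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde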
Mount with "service lustre start".
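For example (a sketch, assuming the Lustre init script and /etc/ldev.conf are configured on this node):
  service lustre start
  mount -t lustre    # verify: lustre-meta/meta should appear mounted with type "lustre"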
This makes a volume called "lustre-meta" and a filesystem called "meta".
The "mount" command should then show the lustre-meta/meta target mounted as type "lustre".
Log in to lustre2-8-11 (the remote target where you stored the snapshot) and send the filesystem back.
Now I will send back only the filesystem, not the whole volume (why send the whole volume back? Mainly convenient if you have multiple datasets).
"-p" preserves the properties, which is important for the Lustre filesystem to mount.
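A sketch, run on lustre2-8-11, reusing the hypothetical names from the backup step and receiving into a temporary dataset "meta-restore" so it can be renamed into place afterwards:
  zfs send -p lustre-ost0/meta-backup/meta@backup1 | \
      ssh lustre2-8-25 zfs receive -u lustre-meta/meta-restore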
Back on lustre2-8-25 (the failed metadata server), rename the filesystem to make the snapshot active.
Oops! That didn't work. You need to unmount the filesystem so it isn't busy.
Note: this doesn't mean stop the Lustre service; if you do, you can't access the ZFS volume.
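A sketch of the rename, with a hypothetical Lustre mountpoint and the temporary dataset name assumed above:
  umount /mnt/lustre/local/lustre-MDT0000      # unmount the Lustre target only; do not stop the service
  zfs rename lustre-meta/meta lustre-meta/meta-old
  zfs rename lustre-meta/meta-restore lustre-meta/meta
  mount -t lustre lustre-meta/meta /mnt/lustre/local/lustre-MDT0000    # mount the restored target again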
You should now be recovered.
The steps to restore a ZFS backend via the ZPL (ZFS POSIX Layer) using 'tar':
1) Create a new pool for the target if necessary, then reformat the new Lustre FS with the "--replace" parameter. For example:
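A sketch for an OST; the pool layout, dataset name, filesystem name, index, and MGS NID are all assumptions:
  zpool create lustre-ost0 mirror /dev/sdb /dev/sdc    # only if the pool does not already exist
  mkfs.lustre --fsname=lustre --ost --index=0 --replace \
      --mgsnode=lustre2-8-25@tcp --backfstype=zfs lustre-ost0/ost0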
2) Enable "canmount" property on the target FS. For example:
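For example, continuing with the hypothetical lustre-ost0/ost0 dataset from step 1:
  zfs set canmount=on lustre-ost0/ost0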
3) Mount the target as 'zfs'. For example:
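For example:
  zfs mount lustre-ost0/ost0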
4) Restore the data. For example:
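For example (the backup file path is hypothetical; this assumes a tar that can restore extended attributes, since the trusted.* xattrs carry the Lustre metadata):
  cd /lustre-ost0/ost0      # the dataset's ZPL mountpoint (default is /<pool>/<dataset>)
  tar xvpf /backup/ost0.tar --xattrs --xattrs-include="trusted.*"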
5) Remove stale OIs and index objects. For example:
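For example; the exact names of the stale OI and index objects vary by Lustre version, so treat this list as an assumption and check the manual for your release:
  cd /lustre-ost0/ost0
  rm -rf oi.* OI_* lfsck_* LFSCK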
6) Unmount the target. For example:
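For example:
  umount /lustre-ost0/ost0
  zfs set canmount=off lustre-ost0/ost0    # optional: keep the dataset from auto-mounting via ZPL again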
7) (Optional) If the restored system has a different NID than the backup system, change the NID. For details, refer to Lustre manual section 14.5. For example:
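A sketch: the MGS mountpoint, target name, and new NID are hypothetical; the MGS (here combined with the MDT) is mounted with "-o nosvc" so the configuration can be updated:
  mount -t lustre -o nosvc lustre-meta/meta /mnt/mdt
  lctl replace_nids lustre-OST0000 192.168.1.11@tcp
  umount /mnt/mdt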
8) Mount the target as "lustre". Usually, we will use the "-o abort_recov" option to skip unnecessary recovery. For example:
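For example (the mountpoint is hypothetical):
  mount -t lustre -o abort_recov lustre-ost0/ost0 /mnt/lustre/ost0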
The osd-zfs can detect the restore automatically when the target is mounted, and then triggers OI scrub to rebuild the OIs and index objects asynchronously in the background. You can check the OI scrub status. For example:
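For example (the parameter name assumes an OST called lustre-OST0000; adjust for your target and Lustre version):
  lctl get_param -n osd-zfs.lustre-OST0000.oi_scrub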
Or you can read the proc interface on the target directly:
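For example (same assumption about the target name):
  cat /proc/fs/lustre/osd-zfs/lustre-OST0000/oi_scrub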
If you want to restore the system from an ldiskfs-based backup, follow the same steps.