Details
- Type: Question/Request
- Resolution: Unresolved
- Priority: Major
Description
This is a sanity-check question. NSC sees no reason the method described below should not work, but due to the high impact a failure would have, we'd like a second opinion. We have scheduled downtime to execute it Thursday next week, 26 Jan.
To sort out the fallout of LU-8953 (out of inodes on a ZFS MDT, solved by adding more disks to the pool) we need to recreate the original pool. The reason we ran out of inodes was that when the vendor sent us hardware for the latest expansion, which was supposed to be equivalent to the last shipment, the SSDs had switched from reporting 512-byte blocks to 4k blocks. Since I had not hardcoded ashift, we ended up with 6-8 times fewer inodes, and this was missed in testing.
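For reference, the mismatch is easy to confirm; something along these lines (device names here are just examples) shows both what the drives report and which ashift the existing pool actually got:
# What sector sizes the SSDs report (example device names)
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdr /dev/sdu
# Which ashift each vdev in the existing MDT pool was created with
zdb -C lustre-mdt0 | grep ashift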
There aren't enough slots in the MDSs to solve this permanently by throwing hardware at it, so I need to move all data from the ashift=12 pools to new ashift=9 pools. Do you see any problem with just doing the following:
(The funny device names come from running LVM just to get more easily identifiable names)
Unmount the filesystem on all nodes, then run something like this for each MDT that needs fixing:
# Snapshot the MDT and copy it to a temporary pool
umount lustre-mdt0/fouo6
zfs snapshot lustre-mdt0/fouo6@copythis
zpool create -o ashift=9 lustre-mdt-tmp mirror \
    /dev/new_sdr/mdt_fouo6new_sdr \
    /dev/new_sdu/mdt_fouo6new_sdu
zfs send -R lustre-mdt0/fouo6@copythis | zfs recv lustre-mdt-tmp/fouo6tmp
# Recreate the original pool with ashift=9 and copy the data back
zpool destroy lustre-REMOVETHIS-mdt0
zpool create -o ashift=9 lustre-mdt0 \
    mirror /dev/mds9_sdm/mdt_fouo6_sdm /dev/mds9_sdn/mdt_fouo6_sdn \
    mirror /dev/mds9_sdo/mdt_fouo6_sdo /dev/mds9_sdp/mdt_fouo6_sdp
zfs send -R lustre-mdt-tmp/fouo6tmp@copythis | zfs recv lustre-mdt0/fouo6
# Remount the MDT and clean up the temporary pool
mount -t lustre lustre-mdt0/fouo6 /mnt/lustre/local/fouo6
zpool destroy lustre-mdt-tmp
The "REMOVETHIS-" inserted due to desktop copy buffer paranoia should be removed before running of course.
Issue Links
- is related to LUDOC-161 document backup/restore process for ZFS backing filesystems (Resolved)
Hi Gabriel,
The weekend's tests look good. I have some tests I will run overnight and will lock down the plans tomorrow. A couple of questions:
1. Did you have any reason to think that sending the whole pool would be better than sending the individual filesystems, other than it being easier because you also had the MGT there? Unless there is a reason not to, I will send the filesystems, purely for clarity's sake. The pools have anonymous names while the MDTs are named after the filesystems. I will be doing this for 3 pools on the same machine, so keeping the names reduces the chance of recv:ing into, or destroying, the wrong filesystem. These will be the actual sends on my end (the matching receives are sketched at the end of this comment):
zfs send -vR lustre-mdt0/fouo6@copythis | gzip > /lustre-mdt-tmpfs/mds0-fouo6.gz
zfs send -vR lustre-mdt1/rossby20@copythis | gzip > /lustre-mdt-tmpfs/mds1-rossby20.gz
zfs send -vR lustre-mdt2/smhid13@copythis | gzip > /lustre-mdt-tmpfs/mds2-smhid13.gz
2. I will not be moving the MGT from ashift=12 to ashift=9. Will this cause any problems? I know the question sounds borderline insane, but it is really the original reason I opened this ticket with you. I'm OK with sorting everything out on the ZFS level, but I'm trying to fish for half-insane things like offsets based on the number of blocks being hard-coded at MDT creation time somewhere deep in Lustre.
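For completeness, once the pools have been recreated with ashift=9, the matching receives would look roughly like this (sketch only, not run yet):
zcat /lustre-mdt-tmpfs/mds0-fouo6.gz | zfs recv lustre-mdt0/fouo6
zcat /lustre-mdt-tmpfs/mds1-rossby20.gz | zfs recv lustre-mdt1/rossby20
zcat /lustre-mdt-tmpfs/mds2-smhid13.gz | zfs recv lustre-mdt2/smhid13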