
Second opinion on MDT inode recovery requested

Details

    • Type: Question/Request
    • Resolution: Unresolved
    • Priority: Major

    Description

      This is a sanity check question. NSC sees no reason the method described below should not work, but due to the high impact a failure would have, we'd like a second opinion. We have scheduled downtime to execute it on Thursday next week, 26 Jan.

      To sort out the fallout of LU-8953 (running out of inodes on a ZFS MDT, worked around by adding more disks to the pool) we need to recreate the original pool. The reason we ran out of inodes was that when the vendor sent us hardware for the latest expansion, which was supposed to be equivalent to the last shipment, the SSDs had switched from reporting 512-byte blocks to 4k blocks. Since I had not hardcoded ashift, we ended up with 6-8 times fewer inodes, and this was missed in testing.
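
      For reference, the mismatch is easy to check for up front. Below is a minimal sketch of the pre-format checks that would have caught this, using placeholder device and pool names rather than the real ones (4096/512 = 8, which lines up with the observed 6-8 times fewer inodes):

      # What sector sizes does the drive report? 512/512 vs 512/4096 (or 4096/4096)
      # is what drives ZFS's automatic ashift choice of 9 vs 12.
      lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX
      # Pin ashift explicitly at pool creation instead of trusting the report
      zpool create -o ashift=9 examplepool mirror /dev/sdX /dev/sdY
      # Confirm what an existing pool actually ended up with
      zdb -C examplepool | grep ashift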

      There aren't enough slots in the MDSs to solve this by throwing hardware at it as a permanent solution, so I need to move all data from pools with ashift=12 to pools with ashift=9. Do you see any problem with just doing the following:

      (The funny device names come from running LVM just to get more easily identifiable names)

      Unmount the filesystem on all nodes, then run something like this for each MDT that needs fixing:

      # Snapshot the existing MDT dataset on the old (ashift=12) pool
      umount lustre-mdt0/fouo6
      zfs snapshot lustre-mdt0/fouo6@copythis

      # Copy it to a temporary pool created with the correct ashift
      zpool create -o ashift=9 lustre-mdt-tmp mirror \
      /dev/new_sdr/mdt_fouo6new_sdr \
      /dev/new_sdu/mdt_fouo6new_sdu
      zfs send -R lustre-mdt0/fouo6@copythis | zfs recv lustre-mdt-tmp/fouo6tmp

      # Recreate the original pool with ashift=9 and copy the data back
      zpool destroy lustre-REMOVETHIS-mdt0
      zpool create -o ashift=9 lustre-mdt0 \
      mirror /dev/mds9_sdm/mdt_fouo6_sdm /dev/mds9_sdn/mdt_fouo6_sdn \
      mirror /dev/mds9_sdo/mdt_fouo6_sdo /dev/mds9_sdp/mdt_fouo6_sdp
      zfs send -R lustre-mdt-tmp/fouo6tmp@copythis | zfs recv lustre-mdt0/fouo6

      # Remount the MDT and clean up the temporary pool
      mount -t lustre lustre-mdt0/fouo6 /mnt/lustre/local/fouo6
      zpool destroy lustre-mdt-tmp

      The "REMOVETHIS-" inserted due to desktop copy buffer paranoia should be removed before running of course.


          Activity

            [LU-9023] Second opinion on MDT inode recovery requested
            zino Peter Bortas added a comment -

            The ZFS oddity seems to be unrelated to the recreation of the filesystems, so I'll track that separately if needed.

            This concludes this issue from my side. Thanks for the help everyone!

            zino Peter Bortas added a comment -

            This operation was somewhat delayed by unrelated failures in one of the attached compute clusters, but completed without problems on Friday.

            I have noted one oddity with ZFS snapshots today, but nothing that affects production. I'll try to figure out that one by tomorrow and then we can close this.


            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment -

            Hi zino,
            you don't see a performance improvement because the bottleneck is in the code and not in the underlying hardware performance. On the OSTs, by contrast, we saw very different performance when not using ashift=12.
            zino Peter Bortas added a comment -

            Hi Gabriele,

            You are in time. We got a bit delayed by hardware failing elsewhere in the cluster, so the procedure has only just started. We'll know today whether I lost the filesystems or not.

            I'll make an extra backup of the whole filesystems. It only adds about an hour to the procedure, and that's worth it.

            I don't think the formatting tools really need any intelligence here; this was an operator error. But if there are no performance problems with running ashift=9 on 4k-block SSDs in the general case, it might be a good idea to default to ashift=9 there. In my tests I've not seen any performance advantage outside the error margin from using ashift=12 on SSDs on the MDS.
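
            One way to avoid depending on any formatting-tool default at all is to create the pool by hand with ashift pinned and then point mkfs.lustre at the pre-existing pool. A rough sketch, with fsname, index, MGS NID and vdevs as placeholders:

            # Create the MDT pool manually so ashift is explicit rather than autodetected
            zpool create -o ashift=9 lustre-mdt0 mirror /dev/sdX /dev/sdY
            # Format the MDT dataset on the pre-created pool
            # (fsname, index and MGS NID are placeholders)
            mkfs.lustre --mdt --backfstype=zfs --fsname=fouo6 --index=0 \
                --mgsnode=192.168.1.1@tcp lustre-mdt0/fouo6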


            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment -

            Hi zino,
            I'm back... sorry for this. I don't know if this is too late:
            1. Yes, I was sending the whole pool for that reason, but testing only the individual volume worked. Still, having a backup of the whole file system is not a bad idea... just in case.
            2. We are not expecting any performance or big capacity requirements on the MGT, so I don't see any problem with leaving it at the original ashift.

            Making Lustre decide the ashift at format time is something that maybe adilger can evaluate. Not sure if Lustre can evaluate the physical layout of the disks.
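
            For what it's worth, the information such a heuristic would need is already exposed by the block layer; a sketch of the per-vdev probe, with the device name as a placeholder:

            # Logical and physical sector sizes as the kernel reports them;
            # 512/4096 is the combination that pushed these pools to ashift=12
            blockdev --getss --getpbsz /dev/sdX
            cat /sys/block/sdX/queue/physical_block_size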

            zino Peter Bortas added a comment -

            Hi Peter,

            That's unfortunate. Of course it's technically possible to delay this to another week, but the cluster downtime is now too late to stop for this week. I will also have to mount the filesystems read-only for a few weeks, since the users will run out of inodes before the next window.

            I'd be happy with an answer to just this question: as far as Intel engineers know, is there anything in the filesystem that stores a structure that would be affected by a change in block size, i.e. that could cause problems during this data move? We'll assume for the sake of this discussion that I'll be able to flawlessly take care of the bit shuffling on disk.

            People

              Assignee: gabriele.paciucci Gabriele Paciucci (Inactive)
              Reporter: zino Peter Bortas