[LU-9023] Second opinion on MDT inode recovery requested Created: 16/Jan/17 Updated: 31/Jan/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Major |
| Reporter: | Peter Bortas | Assignee: | Gabriele Paciucci (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This is a sanity check question. NSC sees no reason the method described below should not work, but due to the high impact a failure would have we'd like a second opinion. We have scheduled downtime to execute it on Thursday next week, 26 Jan.

To sort out the fallout of [the linked issue]: there aren't enough slots in the MDSs to solve this by throwing hardware at it as a permanent solution, so I need to move all data from pools with ashift=12 to ashift=9.

Do you see any problem with just doing the following? (The funny device names come from running LVM just to get more easily identifiable names.)

Unmount the filesystem on all nodes, then run something like this for each MDT that needs fixing:

umount lustre-mdt0/fouo6
[...]

The "REMOVETHIS-" prefix, inserted due to desktop copy buffer paranoia, should of course be removed before running. |
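(A rough sketch of the kind of per-MDT move described above, for a single MDT. This is an illustration only: the snapshot name, the temporary pool name, the device paths and the mountpoint are placeholders, not the commands actually planned.)

umount lustre-mdt0/fouo6                                    # take the MDT offline
zfs snapshot lustre-mdt0/fouo6@move                         # point-in-time copy to send
zpool create -o ashift=9 lustre-mdt0-new mirror /dev/mapper/mdt0-new-a /dev/mapper/mdt0-new-b
zfs send -R lustre-mdt0/fouo6@move | zfs recv lustre-mdt0-new/fouo6
# after verifying the copy: retire the old pool and re-import the new one under the old name
zpool destroy lustre-mdt0
zpool export lustre-mdt0-new
zpool import lustre-mdt0-new lustre-mdt0
mount -t lustre lustre-mdt0/fouo6 /mnt/mdt0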
| Comments |
| Comment by Peter Bortas [ 16/Jan/17 ] |
|
That create line is incorrect. It should be just "zpool create lustre-mdt0", without the extra filesystem part. |
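(For clarity, my reading of the fix: zpool create takes only the pool name and its vdevs; the fouo6 dataset is then created by the later zfs recv rather than by zpool create. The device names below are placeholders.)

zpool create -o ashift=9 lustre-mdt0 mirror /dev/mapper/mdt0-a /dev/mapper/mdt0-b   # pool only, no dataset
# lustre-mdt0/fouo6 is created afterwards by the zfs recv step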
| Comment by Joseph Gmitter (Inactive) [ 18/Jan/17 ] |
|
Hi Zhiqi, Do you have any recommendation here? Thanks. |
| Comment by Joseph Gmitter (Inactive) [ 18/Jan/17 ] |
|
Peter, |
| Comment by Peter Jones [ 19/Jan/17 ] |
|
Just a test to check access for zino |
| Comment by Peter Bortas [ 19/Jan/17 ] |
|
Appreciated, Joseph. That doc, to my mind, confirms that we are on the right track with this procedure. I'll wait for Zhiqi to see if he has any further insight. (And thanks Peter, I can see the tickets again now.) Cheers, |
| Comment by Zhiqi Tao (Inactive) [ 19/Jan/17 ] |
|
Hi Peter B, Gabriele will try this procedure in an internal development lab and update this ticket with his experience. We understand your timing ("We have scheduled downtime to execute it Thursday next week, 26 Jan.") and should have results in a day or two. Best Regards, |
| Comment by Gabriele Paciucci (Inactive) [ 19/Jan/17 ] |
|
Hi zino, BTW I'm London-based, so we can organize a call to double-check the procedure. |
| Comment by Peter Bortas [ 19/Jan/17 ] |
|
Hi Gabriele, Sounds great. Let's check in on Monday and see how things have progressed. And we can schedule a call then if we feel it's needed. |
| Comment by Gabriele Paciucci (Inactive) [ 20/Jan/17 ] |
|
Okay, this is a first procedure that doesn't need a second pool; we save the backup in a gzip file instead. In my environment the MDT and MGT are in the same pool.

# zfs snap -r MDS@backup
# zfs list -t snapshot
NAME              USED  AVAIL  REFER  MOUNTPOINT
MDS@backup           0      -    96K  -
MDS/mdt0@backup      0      -   489M  -
MDS/mgt@backup       0      -  5.27M  -
# zfs send -R MDS@backup | gzip > /tmp/backup.gz
# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
MDS        495M   360G    96K  /MDS
MDS/mdt0   489M   360G   489M  /MDS/mdt0
MDS/mgt   5.27M   360G  5.27M  /MDS/mgt
# zpool destroy MDS
# zpool list
NAME  SIZE  ALLOC  FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
# zpool create -o ashift=9 MDS mirror /dev/sdc /dev/sde
# zpool list
NAME  SIZE  ALLOC  FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
MDS   372G    50K  372G         -    0%   0%  1.00x  ONLINE  -
# zcat /tmp/backup.gz | zfs recv -F MDS
# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
MDS        121M   360G    19K  /MDS
MDS/mdt0   118M   360G   118M  /MDS/mdt0
MDS/mgt   3.22M   360G  3.22M  /MDS/mgt
# zfs list -t snapshot
NAME              USED  AVAIL  REFER  MOUNTPOINT
MDS@backup           0      -    19K  -
MDS/mdt0@backup      0      -   118M  -
MDS/mgt@backup       0      -  3.22M  -
# mount -t lustre MDS/mgt /mnt/mgt/
# mount -t lustre MDS/mdt0 /mnt/mdt0
# df
Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/sda4      897134592   3685648  893448944    1%  /
devtmpfs        32823216         0   32823216    0%  /dev
tmpfs           32836956     39648   32797308    1%  /dev/shm
tmpfs           32836956      9444   32827512    1%  /run
tmpfs           32836956         0   32836956    0%  /sys/fs/cgroup
/dev/sda2       10471424    176380   10295044    2%  /boot
/dev/sda1        1046516      9644    1036872    1%  /boot/efi
tmpfs            6567392         0    6567392    0%  /run/user/0
MDS/mgt        374806016      3328  374800640    1%  /mnt/mgt
MDS/mdt0       374922752    120960  374799744    1%  /mnt/mdt0 |
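(One small addition to consider, my suggestion rather than part of the procedure above: between zpool destroy and zfs recv the gzip file is the only copy of the MDT, so it is cheap to sanity-check it first. This assumes the zstreamdump utility shipped with the ZFS tools is available.)

gzip -t /tmp/backup.gz                       # verify the archive itself is intact
zcat /tmp/backup.gz | zstreamdump | tail     # walk the send stream and print its summary records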
| Comment by Gabriele Paciucci (Inactive) [ 20/Jan/17 ] |
|
Do you need the same procedure using another zpool? |
| Comment by Peter Bortas [ 20/Jan/17 ] |
|
Not really. I like your method better. It does invalidate some of my testing though, so I'll run some new tests over the weekend. |
| Comment by Gabriele Paciucci (Inactive) [ 20/Jan/17 ] |
|
Okay, I'm now on hold waiting for your feedback. |
| Comment by Peter Bortas [ 23/Jan/17 ] |
|
Hi Gabriele,

The weekend's tests look good. I have some tests I will run overnight, and I'll lock down the plans tomorrow. A couple of questions:

1. Did you have any reason to prefer sending the whole pool over sending the individual filesystems, other than it being easier because you also had the MGT there? Unless there is a reason not to, I will send the individual filesystems, if only for clarity's sake. The pools have anonymous names while the MDTs are named after the filesystems, and I will be doing this for 3 pools on the same machine, so keeping the names reduces the chance of recv'ing into, or destroying, the wrong filesystem. These will be the actual sends on my end (the full per-MDT sequence is sketched below):

zfs send -vR lustre-mdt0/fouo6@copythis | gzip > /lustre-mdt-tmpfs/mds0-fouo6.gz

2. I will not be moving the MGT from ashift=12 to ashift=9. Will this cause any problems? I know the question is borderline insane, but this is really the original reason I opened this ticket with you. I'm OK with sorting out everything at the ZFS level, but I'm trying to fish for half-insane things like hard-coding offsets at MDT creation time based on the number of blocks somewhere deep in Lustre. |
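(A sketch of how the full sequence for one of the three MDTs might look with per-filesystem sends, using the snapshot and backup-file names from the send line above; the mirror devices are placeholders, not the actual vdev layout.)

umount lustre-mdt0/fouo6
zfs snapshot lustre-mdt0/fouo6@copythis
zfs send -vR lustre-mdt0/fouo6@copythis | gzip > /lustre-mdt-tmpfs/mds0-fouo6.gz
gzip -t /lustre-mdt-tmpfs/mds0-fouo6.gz            # make sure the only remaining copy is readable
zpool destroy lustre-mdt0
zpool create -o ashift=9 lustre-mdt0 mirror /dev/mapper/mdt0-a /dev/mapper/mdt0-b   # placeholder vdevs
zcat /lustre-mdt-tmpfs/mds0-fouo6.gz | zfs recv -F lustre-mdt0/fouo6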
| Comment by Peter Jones [ 23/Jan/17 ] |
|
Peter, Gabriele is unexpectedly out of the office at short notice. Can this wait until he is available again (hopefully next week)? Peter |
| Comment by Peter Bortas [ 24/Jan/17 ] |
|
Hi Peter, That's unfortunate. Of course it's technically possible to delay this to another week, but the cluster downtime is now too late to stop for this week. I would also have to mount the filesystems ro for a few weeks, since the users will run out of inodes before the next window. I'd be happy with an answer to just this question: as far as Intel engineers know, is there anything in the filesystem that stores a structure that would be affected by a change in block size, i.e. that could cause problems during this data move? We'll assume for the sake of this discussion that I'll be able to flawlessly take care of the bit shuffling on disk. |
| Comment by Gabriele Paciucci (Inactive) [ 26/Jan/17 ] |
|
Hi zino, Making Lustre decide the ashift layout at format time is something that perhaps adilger can evaluate. I'm not sure whether Lustre can evaluate the physical layout of the disks. |
| Comment by Peter Bortas [ 26/Jan/17 ] |
|
Hi Gabriele, You are in time. We got a bit delayed by hardware failing elsewhere in the cluster, so the procedure has only just started; we'll know today whether I lost the filesystems or not. I'll make an extra backup of the whole filesystems; it only adds about 1h to the procedure, and that's worth it. I don't think the formatting tools really need any intelligence here; this was an operator error. But if there are no performance problems with running ashift=9 on 4k-block SSDs in the general case, it might be a good idea to default to ashift=9 there. In my tests I've not seen any performance advantage outside the error margin from using ashift=12 on SSDs on the MDS. |
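(If it is useful for double-checking the result: the ashift actually in use on the rebuilt pools can be read back with zdb. This assumes the usual ZFS on Linux tooling; newer OpenZFS releases also expose it as a zpool property.)

zdb -C lustre-mdt0 | grep ashift     # every vdev in the rebuilt pool should now report ashift: 9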
| Comment by Gabriele Paciucci (Inactive) [ 26/Jan/17 ] |
|
Hi zino, |
| Comment by Peter Bortas [ 30/Jan/17 ] |
|
This operation was somewhat delayed by unrelated failures in one of the attached compute clusters, but completed without problems on Friday. I have noted one oddity with ZFS snapshots today, but nothing that affects production. I'll try to figure out that one by tomorrow and then we can close this. |
| Comment by Peter Bortas [ 31/Jan/17 ] |
|
The ZFS oddity seems to be unrelated to the recreation of the filesystems, so I'll track that separately if needed. This concludes this issue from my side. Thanks for the help everyone! |