Details
- Type: Question/Request
- Resolution: Unresolved
- Priority: Major
Description
This is a sanity-check question. NSC sees no reason the method described below should not work, but due to the high impact a failure would have, we'd like a second opinion. We have scheduled downtime to execute it Thursday next week, 26 Jan.
To sort out the fallout of LU-8953 (out of inodes on a ZFS MDT, solved by adding more disks to the pool) we need to recreate the original pool. The reason we ran out of inodes was that when the vendor sent us hardware for the latest expansion, which was supposed to be equivalent to the last shipment, the SSDs had switched from reporting 512-byte blocks to 4k blocks. Since I had not hardcoded ashift, we ended up with 6-8 times fewer inodes, and this was missed in testing.
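For reference, the mismatch is easy to confirm; something along these lines (device names here are just examples) shows both what the drives report and which ashift the existing pool actually got:
# What sector sizes the SSDs report (example device names)
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdr /dev/sdu
# Which ashift each vdev in the existing MDT pool was created with
zdb -C lustre-mdt0 | grep ashift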
There aren't enough slots in the MDSs to solve this permanently by throwing hardware at it, so I need to move all data from the ashift=12 pools to new ashift=9 pools. Do you see any problem with just doing the following:
(The funny device names come from running LVM just to get more easily identifiable names)
Unmount the filesystem on all nodes, then run something like this for each MDT that needs fixing:
# Snapshot the MDT and copy it to a temporary pool
umount lustre-mdt0/fouo6
zfs snapshot lustre-mdt0/fouo6@copythis
zpool create -o ashift=9 lustre-mdt-tmp mirror \
    /dev/new_sdr/mdt_fouo6new_sdr \
    /dev/new_sdu/mdt_fouo6new_sdu
zfs send -R lustre-mdt0/fouo6@copythis | zfs recv lustre-mdt-tmp/fouo6tmp
# Recreate the original pool with ashift=9 and copy the data back
zpool destroy lustre-REMOVETHIS-mdt0
zpool create -o ashift=9 lustre-mdt0 \
    mirror /dev/mds9_sdm/mdt_fouo6_sdm /dev/mds9_sdn/mdt_fouo6_sdn \
    mirror /dev/mds9_sdo/mdt_fouo6_sdo /dev/mds9_sdp/mdt_fouo6_sdp
zfs send -R lustre-mdt-tmp/fouo6tmp@copythis | zfs recv lustre-mdt0/fouo6
# Remount the MDT and clean up the temporary pool
mount -t lustre lustre-mdt0/fouo6 /mnt/lustre/local/fouo6
zpool destroy lustre-mdt-tmp
The "REMOVETHIS-" inserted due to desktop copy buffer paranoia should be removed before running of course.
Issue Links
- is related to LUDOC-161 document backup/restore process for ZFS backing filesystems (Resolved)
Hi Gabriel,
The weekend's tests look good. I have some tests I will run overnight and will lock down the plans tomorrow. A couple of questions:
1. Did you have any reason to think that sending the whole pool would be better than sending the individual filesystems, other than it being easier because you also had the MGT there? Unless there is a reason not to, I will send the filesystems, purely for clarity's sake. The pools have anonymous names while the MDTs are named after the filesystems. I will be doing this for 3 pools on the same machine, so keeping the names reduces the chance of recv:ing into, or destroying, the wrong filesystem. These will be the actual sends on my end (the matching receives are sketched at the end of this comment):
zfs send -vR lustre-mdt0/fouo6@copythis | gzip > /lustre-mdt-tmpfs/mds0-fouo6.gz
zfs send -vR lustre-mdt1/rossby20@copythis | gzip > /lustre-mdt-tmpfs/mds1-rossby20.gz
zfs send -vR lustre-mdt2/smhid13@copythis | gzip > /lustre-mdt-tmpfs/mds2-smhid13.gz
2. I will not be moving the MGT from ashift=12 to ashift=9. Will this cause any problems? I know the question sounds borderline insane, but it is really the original reason I opened this ticket with you. I'm OK with sorting everything out on the ZFS level, but I'm trying to fish for half-insane things like offsets based on the number of blocks being hard-coded at MDT creation time somewhere deep in Lustre.
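For completeness, once the pools have been recreated with ashift=9, the matching receives would look roughly like this (sketch only, not run yet):
zcat /lustre-mdt-tmpfs/mds0-fouo6.gz | zfs recv lustre-mdt0/fouo6
zcat /lustre-mdt-tmpfs/mds1-rossby20.gz | zfs recv lustre-mdt1/rossby20
zcat /lustre-mdt-tmpfs/mds2-smhid13.gz | zfs recv lustre-mdt2/smhid13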