[LU-8288] handle error due to file with "no stripe info" rewritten before lfsck is run Created: 15/Jun/16 Updated: 29/Jan/18 Resolved: 18/Jan/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nathan Dauchy (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This is a followup on the filesystem recovery efforts from an earlier ticket, where it was noted:

If you think that the layout LFSCK made wrong decision when re-generated the "nagtest.toobig.stripes" LOV EA, we need to make new patch to recover it.

More than just making a wrong decision, lfsck can actually corrupt files when it is run. The case is where the MDT loses stripe information, then the file is rewritten (or appended to?), and then lfsck is run.

In general, it would be good if lfsck could handle "conflicts" more gracefully. I understand that it may not know which object is the right one, but it should not pick among them arbitrarily, since that can result in a mixed-data file.

Additionally, at the time lfsck is run, it has information about which file an object is associated with, and that could be exposed to the user in the name of the file placed in lost+found. |
| Comments |
| Comment by Nathan Dauchy (Inactive) [ 15/Jun/16 ] |
|
Here is a test case that shows one possible scenario where lfsck has a problem. This is not exactly what happened in the original incident.

CLIENT step1:
# cd /mnt/lustre/client/lfscktest/
# ./make_lustre_test_file.sh stripedfile
setting stripe info for stripedfile
-rw-r--r-- 1 root root 3145728 Jun 15 12:37 stripedfile
# uniq -c stripedfile
49152 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.
# lfs getstripe stripedfile
stripedfile
lmm_stripe_count: 12
lmm_stripe_size: 262144
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 3
obdidx   objid    objid     group
3        491076   0x77e44   0
4        491044   0x77e24   0
5        491076   0x77e44   0
0        491012   0x77e04   0
1        491044   0x77e24   0
8        491076   0x77e44   0
11       491076   0x77e44   0
10       491076   0x77e44   0
7        491012   0x77e04   0
2        491044   0x77e24   0
13       490948   0x77dc4   0
14       491076   0x77e44   0

SERVER step2: (simulate lost attributes, from corrupt MDT and e2fsck recovery for example)
# umount /mnt/lustre/nbptest-mdt
# mount -t ldiskfs /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
# cd /mnt/lustre/nbptest-mdt/ROOT/lfscktest/
# getfattr -d -m ".*" -e hex stripedfile
# setfattr -x "trusted.link" stripedfile
# setfattr -x "trusted.lma" stripedfile
# setfattr -x "trusted.lov" stripedfile
# cd /
# umount /mnt/lustre/nbptest-mdt
# mount -t lustre /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt

CLIENT step3:
# ls -l stripedfile
-rw-r--r-- 1 root root 0 Jun 15 12:37 stripedfile
# lfs getstripe stripedfile
stripedfile has no stripe info
# ./make_lustre_test_file.sh stripedfile
file exists with stripe count of '', overwriting
-rw-r--r-- 1 root root 3145728 Jun 15 12:40 stripedfile
# uniq -c stripedfile
49152 .zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA9876543210
# lfs getstripe stripedfile
stripedfile
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx   objid    objid     group
0        491044   0x77e24   0

SERVER step4:
# lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o
Started LFSCK on the device nbptest-MDT0000: scrub layout namespace
# lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status

CLIENT step5:
# ls -l stripedfile
-rw-r--r-- 1 root root 3407872 Jun 15 12:40 stripedfile
# uniq -c stripedfile
uniq: error reading stripedfile
# lfs getstripe stripedfile
stripedfile
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 40000001
lmm_layout_gen: 2
lmm_stripe_offset: 0
obdidx   objid    objid     group
0        491045   0x77e25   0
0        0        0         0
0        0        0         0
0        491012   0x77e04   0
# md5sum stripedfile
md5sum: stripedfile: Input/output error
# cd /mnt/lustre/client/.lustre/lost+found/MDT0000/
# ls -la
total 132
drwx------+ 3 root root 126976 Jun 15 12:33 .
dr-x------+ 3 root root 4096 Jun 10 13:41 ..

SERVER step6: (lfsck workaround in place of ...)
# umount /mnt/lustre/nbptest-mdt
# mount -t ldiskfs /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
# rm -f /mnt/lustre/nbptest-mdt/oi.16.*
# umount /mnt/lustre/nbptest-mdt
# mount -t lustre /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
# lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o

CLIENT step7:
# cd /mnt/lustre/client/.lustre/lost+found/MDT0000/
# ls -l
total 5888
-r-------- 1 root root 3145728 Jun 15 12:49 [0x200002b10:0x4:0x0]-[0x2000032e0:0x4e:0x0]-0-C-0
-r-------- 1 root root 11796480 Jun 15 12:49 [0x2000032e0:0x4e:0x0]-R-0
# head -n 1 *
==> [0x200002b10:0x4:0x0]-[0x2000032e0:0x4e:0x0]-0-C-0 <==
.zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA9876543210

==> [0x2000032e0:0x4e:0x0]-R-0 <==
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.
|
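The broken layout in "CLIENT step5" can be spotted mechanically: two of the four stripe rows have an objid of 0, meaning no OST object backs them. The following is a minimal sketch, not part of the original report, that counts such rows in `lfs getstripe` text output. The function name `count_layout_holes` is hypothetical, and the parsing assumes the four-column object table format shown in the test case.

```shell
#!/bin/sh
# count_layout_holes (hypothetical helper): read `lfs getstripe` text on
# stdin and print how many rows of the object table have objid 0, i.e.
# stripes with no OST object behind them.
# Assumes the four-column "obdidx objid objid group" table from the test case.
count_layout_holes() {
    awk '
        $1 == "obdidx" { in_table = 1; next }      # header row starts the table
        in_table && NF == 4 && $2 == 0 { holes++ } # second column is the objid
        END { print holes + 0 }                    # print 0 when none were seen
    '
}

# Sample input copied from CLIENT step5.
step5_output='stripedfile
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_pattern: 40000001
lmm_layout_gen: 2
lmm_stripe_offset: 0
obdidx objid objid group
0 491045 0x77e25 0
0 0 0 0
0 0 0 0
0 491012 0x77e04 0'

printf '%s\n' "$step5_output" | count_layout_holes
```

On a client, the same function could be fed directly with `lfs getstripe stripedfile | count_layout_holes`. A non-zero result matches the symptom seen here, where reads of the file fail with an I/O error.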
| Comment by Nathan Dauchy (Inactive) [ 15/Jun/16 ] |
|
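The test file was created and rewritten with...
#!/bin/bash
# Create (or overwrite) a test file filled with one recognizable repeated line.
size=262144   # stripe size in bytes
count=12      # stripe count
if [ -z "$1" ]; then
    echo "usage: $0 <filename>"
    exit 1
fi
file=$1
if [ -e "$file" ]; then
    # Existing file: report its current stripe count, then overwrite it
    # with the reversed string so old and new data can be told apart.
    c=$(lfs getstripe "$file" | grep stripe_count | awk '{print $2}')
    echo "file exists with stripe count of '$c', overwriting"
    string=".zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA9876543210"
else
    # New file: set the striping before any data is written.
    echo "setting stripe info for $file"
    lfs setstripe -S "$size" -c "$count" "$file"
    string="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz."
fi
# Line length including the trailing newline.
sizeof=$(echo "$string" | wc -c)
# Number of lines needed to fill size * count bytes.
repeats=$(( size * count / sizeof ))
for i in $(seq 1 "$repeats"); do
    echo "$string"
done > "$file"
ls -l "$file"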
|
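Because the test script fills a file with a single repeated line, a "mixed-data" file can be detected by counting distinct lines. The sketch below is an illustration rather than part of the original test. The helper name `distinct_lines` is hypothetical, and it only makes sense for files produced by a script of this kind.

```shell
#!/bin/sh
# distinct_lines FILE (hypothetical helper): print the number of different
# lines in FILE. A file written by one run of the test script yields 1;
# more than 1 suggests data from several runs ended up in the same file.
distinct_lines() {
    # sort exits non-zero if the file cannot be read (e.g. an I/O error)
    unique=$(sort -u -- "$1") || return 1
    if [ -z "$unique" ]; then
        echo 0                      # empty file: no lines at all
        return 0
    fi
    printf '%s\n' "$unique" | wc -l | tr -d ' '
}

# Demonstration on a scratch file outside Lustre.
demo=$(mktemp)
for i in 1 2 3; do echo "0123456789"; done > "$demo"
distinct_lines "$demo"              # one distinct line
echo "9876543210" >> "$demo"
distinct_lines "$demo"              # now two distinct lines
rm -f "$demo"
```

This is the same idea as the `uniq -c` calls in the test case, reduced to a single number that is easy to check in a script.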
| Comment by nasf (Inactive) [ 16/Jun/16 ] |
|
The layout LFSCK logic is that: |
| Comment by Peter Jones [ 16/Jun/16 ] |
|
Fan Yong, could you please help with this one? Thanks, Peter |
| Comment by nasf (Inactive) [ 17/Jun/16 ] |
In fact, the key issue is that during the first-phase scanning of the layout LFSCK (orphan OST-objects are only detected in the second-phase scanning), if it finds that some LOV EA references a non-existent OST-object, it cannot tell whether the OST-object was lost or the LOV EA is corrupted. In the former case, re-creating the lost OST-object makes the system available as quickly as possible; in the latter case, correcting the LOV EA is the better choice. There are two possible solutions: 1) Postpone the layout LFSCK repair decision for the dangling-reference case until orphan OST-objects have been handled properly. That means introducing a third scanning phase, which would significantly affect the whole LFSCK framework. 2) Never re-create the lost OST-object. Andreas, what do you think? |
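The two proposed solutions can be compared as a small decision table. This is only a sketch of the reasoning in the comment above, not LFSCK code. The function name, the policy names, and the printed action names are all hypothetical, and it assumes that an orphan OST-object claiming the dangling slot is what indicates a corrupted LOV EA.

```shell
#!/bin/sh
# resolve_dangling POLICY ORPHAN_CLAIMS_SLOT (hypothetical sketch)
#   POLICY             "postpone" (solution 1) or "never-recreate" (solution 2)
#   ORPHAN_CLAIMS_SLOT "yes" if orphan handling found an OST-object that
#                      names this file and stripe index as its parent
# Prints an illustrative action name; none of these names exist in Lustre.
resolve_dangling() {
    policy=$1
    orphan_claims_slot=$2
    if [ "$orphan_claims_slot" = "yes" ]; then
        echo "fix-lov-ea"            # assumed: the LOV EA was stale or corrupted
    elif [ "$policy" = "postpone" ]; then
        echo "recreate-ost-object"   # no orphan turned up: treat object as lost
    else
        echo "leave-dangling"        # solution 2: never re-create the object
    fi
}

resolve_dangling postpone yes
resolve_dangling postpone no
resolve_dangling never-recreate no
```

The point of the sketch is the ordering: under solution 1 the re-create decision is only taken once orphan handling has finished, which is why it needs an additional scanning phase.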
| Comment by Andreas Dilger [ 17/Jun/16 ] |
|
The problem that was seen in "CLIENT step5" could be fixed with the fidea changes being implemented for PFL Phase 3a. In particular, the current fidea does not store the total number of stripes in the layout, so old stripes found on the OST (e.g. with "stripe_idx = 4" in this case) would currently be added to the file layout and the stripe count increased. With the new PFL fidea, the total stripe count is also saved with each OST object, which could be used in this case to determine whether the orphan OST objects are part of the same layout or not.

It may be that, in the common case, the use of default stripe counts means the total stripe count is also the same between multiple sets of orphan OST objects. However, I don't think that would be a problem. It would still avoid the case seen here, where stale objects with a higher stripe index are added to the recreated file with fewer stripes. If the file was recreated, then all objects should be present, so if old orphan objects have the same stripe count they will not be added to the layout and will be put into lost+found instead. If the old orphan objects have a different stripe count then they should not be added to the existing file.

I'm also a bit confused about why the "CLIENT step5" layout was not reconstructed with all 12 of the original stripes. Was lfsck still running on the other OSTs? Why were there two objects allocated on OST index 0, and why was a new object allocated on OST index 0 (objid 491045) in place of the manually recreated object (objid 491044)? |
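The stripe-count rule described above can be written down as a short check. This is a sketch of the rule as stated in the comment, not the actual LFSCK implementation. The function name, its arguments, and the printed results are hypothetical, and the "add-to-layout" outcome for a free slot with a matching stripe count is an assumption implied by the text rather than stated in it.

```shell
#!/bin/sh
# place_orphan ORPHAN_STRIPE_COUNT FILE_STRIPE_COUNT SLOT_OCCUPIED
# (hypothetical sketch)
#   ORPHAN_STRIPE_COUNT total stripe count recorded on the orphan OST object
#   FILE_STRIPE_COUNT   stripe count of the file's current layout
#   SLOT_OCCUPIED       "yes" if the layout already has an object at the
#                       orphan's stripe index
place_orphan() {
    orphan_count=$1
    file_count=$2
    slot_occupied=$3
    if [ "$orphan_count" -ne "$file_count" ]; then
        echo "lost+found"      # different layout: never merge into this file
    elif [ "$slot_occupied" = "yes" ]; then
        echo "lost+found"      # recreated file already has all its objects
    else
        echo "add-to-layout"   # assumed: same layout and the slot is free
    fi
}

# The case from this ticket: an orphan from the original 12-stripe layout
# meets the recreated 1-stripe file.
place_orphan 12 1 no
```

With this rule, the stale objects from the 12-stripe layout would not have been attached to the rewritten single-stripe file.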
| Comment by nasf (Inactive) [ 17/Jun/16 ] |
Only an OST-object that has been modified (write/setattr) after creation has a PFID EA; only then will LFSCK handle it as an orphan if no MDT-object references it. In this case, I am not sure whether all 12 of the original stripes had been modified before the MDT-object's LOV EA was removed.
That is also my concern. Currently, a given striped file has at most one OST-object on any given OST. In this case, I am afraid that the wrong OST-object was written? |
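Whether all 12 original stripes received a write can be estimated from the numbers in "CLIENT step1". The sketch below is a worked calculation, not a statement about what LFSCK actually saw. It assumes a plain round-robin striped layout (lmm_pattern 1) and a file written sequentially from offset 0, and the function name `objects_written` is hypothetical.

```shell
#!/bin/sh
# objects_written FILE_SIZE STRIPE_SIZE STRIPE_COUNT (hypothetical helper)
# Print how many OST objects receive at least one byte when FILE_SIZE bytes
# are written sequentially from offset 0 to a round-robin striped file.
objects_written() {
    file_size=$1
    stripe_size=$2
    stripe_count=$3
    # Number of stripe-sized chunks, rounding up for a partial last chunk.
    chunks=$(( (file_size + stripe_size - 1) / stripe_size ))
    if [ "$chunks" -lt "$stripe_count" ]; then
        echo "$chunks"
    else
        echo "$stripe_count"
    fi
}

# Values from CLIENT step1: 3145728 bytes, 262144-byte stripes, 12 stripes.
objects_written 3145728 262144 12
```

For the test file this gives 12 chunks spread over 12 objects, so under the stated assumptions each original object would have been written once.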
| Comment by Gerrit Updater [ 28/Jul/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21562 |
| Comment by Nathan Dauchy (Inactive) [ 08/Sep/16 ] |
|
It looks like there were many iterations on this patch, but it is ready for final review and then landing. Please confirm. Also, once the patch is finalized, we will need a backport to the 2.7 FE branch as well as master. Thanks! |
| Comment by Peter Jones [ 08/Sep/16 ] |
|
That matches my understanding, Nathan |
| Comment by Andreas Dilger [ 27/Sep/16 ] |
|
Per earlier discussion in this ticket, it would be worthwhile to backport the PFL patches to increase the MDT and OST inode size, as well as the patch to improve the fid xattr to store the total stripe count and stripe size on each OST object. That would allow LFSCK to reconstruct the layout properly, even in the case where some OST objects are totally missing. Having clients send this information with each write will ensure that this information is stored on each OST object for later use if needed. |
| Comment by Gerrit Updater [ 18/Jan/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21562/ |
| Comment by Jay Lan (Inactive) [ 18/Jan/17 ] |
|
Could you port this patch to b2_7_fe and land to b2_9_fe? Thanks! |
| Comment by Minh Diep [ 18/Jan/17 ] |
|
Landed for 2.10 |