[LU-8288] handle error due to file with "no stripe info" rewritten before lfsck is run Created: 15/Jun/16  Updated: 29/Jan/18  Resolved: 18/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: Nathan Dauchy (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4615 LFSCK 5: OST index verification durin... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This is a followup on the filesystem recovery efforts from LU-8071, in particular the comment:

If you think that the layout LFSCK made wrong decision when re-generated the
"nagtest.toobig.stripes" LOV EA, we need to make new patch to recover it. 

More than just making a wrong decision, lfsck can actually corrupt files when it is run. This happens when the MDT loses stripe information, the file is then rewritten (or appended to?), and lfsck is run afterwards.

In general, it would be good if lfsck can handle "conflicts" more gracefully. I understand that it may not know which object is the right one, but it should not pick them arbitrarily since that can result in a mixed-data file. Additionally, at the time when lfsck is run, it has information about what file an object is associated with, and that could be exposed to the user in the name of the file placed in lost+found.



 Comments   
Comment by Nathan Dauchy (Inactive) [ 15/Jun/16 ]

Here is a test case that shows one possible scenario where lfsck has a problem. This is not exactly what happened in LU-8071, but hopefully illustrates where things can go wrong in the "conflict" handling code.

CLIENT step1:

# cd /mnt/lustre/client/lfscktest/
# ./make_lustre_test_file.sh stripedfile
setting stripe info for stripedfile
-rw-r--r-- 1 root root 3145728 Jun 15 12:37 stripedfile
# uniq -c stripedfile 
  49152 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.
# lfs getstripe stripedfile
stripedfile
lmm_stripe_count:   12
lmm_stripe_size:    262144
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  3
	obdidx		 objid		 objid		 group
	     3	        491076	      0x77e44	             0
	     4	        491044	      0x77e24	             0
	     5	        491076	      0x77e44	             0
	     0	        491012	      0x77e04	             0
	     1	        491044	      0x77e24	             0
	     8	        491076	      0x77e44	             0
	    11	        491076	      0x77e44	             0
	    10	        491076	      0x77e44	             0
	     7	        491012	      0x77e04	             0
	     2	        491044	      0x77e24	             0
	    13	        490948	      0x77dc4	             0
	    14	        491076	      0x77e44	             0

SERVER step2: (simulate lost attributes, from corrupt MDT and e2fsck recovery for example)

# umount /mnt/lustre/nbptest-mdt
# mount -t ldiskfs /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
# cd /mnt/lustre/nbptest-mdt/ROOT/lfscktest/
# getfattr -d -m ".*" -e hex stripedfile
# setfattr -x "trusted.link" stripedfile
# setfattr -x "trusted.lma" stripedfile
# setfattr -x "trusted.lov" stripedfile
# cd /
# umount /mnt/lustre/nbptest-mdt
# mount -t lustre /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt

CLIENT step3:

# ls -l stripedfile 
-rw-r--r-- 1 root root 0 Jun 15 12:37 stripedfile
# lfs getstripe stripedfile
stripedfile has no stripe info
# ./make_lustre_test_file.sh stripedfile
file exists with stripe count of '', overwriting
-rw-r--r-- 1 root root 3145728 Jun 15 12:40 stripedfile
# uniq -c stripedfile
  49152 .zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA9876543210
# lfs getstripe stripedfile
stripedfile
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		 objid		 group
	     0	        491044	      0x77e24	             0

SERVER step4:

# lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o
Started LFSCK on the device nbptest-MDT0000: scrub layout namespace
# lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status

CLIENT step5:

# ls -l stripedfile 
-rw-r--r-- 1 root root 3407872 Jun 15 12:40 stripedfile
# uniq -c stripedfile
uniq: error reading stripedfile
# lfs getstripe stripedfile 
stripedfile
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_pattern:        40000001
lmm_layout_gen:     2
lmm_stripe_offset:  0
	obdidx		 objid		 objid		 group
	     0	        491045	      0x77e25	             0
	     0	             0	            0	             0
	     0	             0	            0	             0
	     0	        491012	      0x77e04	             0

# md5sum stripedfile
md5sum: stripedfile: Input/output error

# cd /mnt/lustre/client/.lustre/lost+found/MDT0000/
# ls -la
total 132
drwx------+ 3 root root 126976 Jun 15 12:33 .
dr-x------+ 3 root root   4096 Jun 10 13:41 ..
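
The 3407872-byte size reported in step 5 is consistent with plain RAID0 striping arithmetic if the object now sitting in stripe slot 3 still holds 262144 bytes from the original 12-stripe file. This is only one reading of the numbers, not a confirmed diagnosis. The helper below is a hypothetical sketch (the name raid0_end_offset is not a Lustre tool) that prints the file offset just past the last byte of one stripe object; the file size is the largest such value over all objects.

```bash
#!/bin/bash
# Hypothetical helper, not part of Lustre.
# Args: stripe_size stripe_count stripe_idx obj_size (all in bytes / counts).
# Prints the file offset just past the last byte stored in that stripe object.
raid0_end_offset() {
    local stripe_size=$1 stripe_count=$2 stripe_idx=$3 obj_size=$4
    if [ "$obj_size" -eq 0 ]; then
        echo 0
        return 0
    fi
    local last=$(( obj_size - 1 ))          # offset of last byte inside the object
    local row=$(( last / stripe_size ))     # full stripe rows before that byte
    local within=$(( last % stripe_size ))  # offset inside its stripe chunk
    echo $(( row * stripe_size * stripe_count + stripe_idx * stripe_size + within + 1 ))
}

# Original layout from step 1: 12 stripes of 256 KiB, last object holds 256 KiB.
raid0_end_offset 262144 12 11 262144    # prints 3145728
# Step 5 layout: 4 stripes of 1 MiB, ASSUMING a 256 KiB object in slot 3.
raid0_end_offset 1048576 4 3 262144     # prints 3407872
```

Under that assumption the extra 262144 bytes beyond 3 MiB come from a stale object being mapped into a layout with a different stripe size, which would also explain why the file content no longer reads back cleanly.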

SERVER step6: (lfsck workaround in place of LU-8218)

# umount /mnt/lustre/nbptest-mdt
# mount -t ldiskfs /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
# rm -f /mnt/lustre/nbptest-mdt/oi.16.*
# umount /mnt/lustre/nbptest-mdt
# mount -t lustre /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
# lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o

CLIENT step7:

# cd /mnt/lustre/client/.lustre/lost+found/MDT0000/
# ls -l
total 5888
-r-------- 1 root root  3145728 Jun 15 12:49 [0x200002b10:0x4:0x0]-[0x2000032e0:0x4e:0x0]-0-C-0
-r-------- 1 root root 11796480 Jun 15 12:49 [0x2000032e0:0x4e:0x0]-R-0

# head -n 1 *
==> [0x200002b10:0x4:0x0]-[0x2000032e0:0x4e:0x0]-0-C-0 <==
.zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA9876543210

==> [0x2000032e0:0x4e:0x0]-R-0 <==
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.
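
The lost+found names in the listing above can be split mechanically into a leading FID, a middle part, and a trailing number. The sketch below does this with bash parameter expansion only. The function name split_lost_found_name is hypothetical, and the three-way split is an assumption based on the two names shown here; it does not claim to describe every name LFSCK can generate.

```bash
#!/bin/bash
# Hypothetical helper, not part of Lustre.
# Splits a lost+found entry name into: leading FID, middle part, trailing number.
split_lost_found_name() {
    local name=$1
    local version=${name##*-}        # text after the last '-'
    local rest=${name%-*}            # everything before the last '-'
    local fid="${rest%%\]*}]"        # up to and including the first ']'
    local infix=${rest#"$fid"-}      # what remains after "<fid>-"
    printf 'fid=%s infix=%s version=%s\n' "$fid" "$infix" "$version"
}

split_lost_found_name '[0x2000032e0:0x4e:0x0]-R-0'
split_lost_found_name '[0x200002b10:0x4:0x0]-[0x2000032e0:0x4e:0x0]-0-C-0'
```

For the second name the middle part itself contains another FID, which is the kind of association between an object and a file that could be made more visible to the user.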
Comment by Nathan Dauchy (Inactive) [ 15/Jun/16 ]

The test file was created and rewritten with...

#!/bin/bash
# Create a striped test file, or overwrite an existing one with reversed content.

size=262144   # stripe size in bytes
count=12      # stripe count

if [ -z "$1" ]; then
    echo "usage: $0 <filename>" >&2
    exit 1
fi
file=$1

if [ -e "$file" ]; then
    # The count is empty when the file has no stripe info.
    c=$(lfs getstripe "$file" | awk '/stripe_count/ {print $2}')
    echo "file exists with stripe count of '$c', overwriting"
    string=".zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA9876543210"
else
    echo "setting stripe info for $file"
    lfs setstripe -S "$size" -c "$count" "$file"
    string="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz."
fi
# Bytes per output line, including the trailing newline.
sizeof=$(echo "$string" | wc -c)
repeats=$(( size * count / sizeof ))

for i in $(seq 1 "$repeats"); do
    echo "$string"
done > "$file"
ls -l "$file"
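
As a quick sanity check of the arithmetic in the script, the sketch below recomputes the number of lines written without touching Lustre. The function name line_repeats is hypothetical; it counts each line as the string length plus one newline byte, which is what the script's "wc -c" call measures.

```bash
#!/bin/bash
# Hypothetical helper: how many lines of "$3" (plus newline) fill size*count bytes.
line_repeats() {
    local size=$1 count=$2 string=$3
    local sizeof=$(( ${#string} + 1 ))   # string bytes plus trailing newline
    echo $(( size * count / sizeof ))
}

string="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz."
line_repeats 262144 12 "$string"         # prints 49152
```

The result matches the "uniq -c" count in the client transcript, and 49152 lines of 64 bytes give the 3145728-byte file size shown by "ls -l".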
Comment by nasf (Inactive) [ 16/Jun/16 ]

The layout LFSCK logic is as follows:
1) If the MDT-object's LOV EA is lost, the layout LFSCK regenerates the LOV EA from the orphan OST-objects it finds. In this case, it does not create any new OST-object to reference.
2) If the MDT-object's LOV EA is corrupted so that it contains invalid information, the layout LFSCK checks the corrupted LOV EA first (in the 1st phase scanning), before it finds the right OST-objects (in the 2nd phase scanning). At that point it does not know that the LOV EA is corrupted; instead, it assumes that the OST-object referenced by the corrupted LOV EA is lost. Depending on the LFSCK start option ("-c"), it then either creates the 'lost' OST-object or prints a warning message. Here we discuss the case where "-c" is specified, which means the layout LFSCK creates the "lost" OST-object.
2.1) If nobody has modified the newly created OST-object before the layout LFSCK finds the real orphan OST-object, the layout LFSCK drops the newly created OST-object and replaces it with the real orphan OST-object. Otherwise,
2.2) Since the newly created OST-object contains new data, we cannot drop it. To make the user aware of the conflict, the layout LFSCK generates a new file under .lustre/lost+found, named ${FID}-${infix}-${conflict_version}, that contains the old data.
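
The cases above can be restated as a small decision table. The sketch below is only a model of the behaviour described in this comment, written as a pure shell function; it is not LFSCK code, and the function name and the result strings are made up for illustration.

```bash
#!/bin/bash
# Hypothetical model of the cases listed above; not actual LFSCK code.
# Args: lov_state (lost|dangling), create_opt (on|off, the "-c" option),
#       modified (yes|no: was the recreated OST-object written before the
#       real orphan was found?).
layout_lfsck_action() {
    local lov_state=$1 create_opt=$2 modified=$3
    case "$lov_state" in
        lost)
            # Case 1: rebuild the LOV EA from orphans, create nothing new.
            echo "regenerate-lov-ea" ;;
        dangling)
            if [ "$create_opt" != "on" ]; then
                echo "warn-only"
            elif [ "$modified" = "yes" ]; then
                # Case 2.2: new data must be kept, old data goes to lost+found.
                echo "keep-new-object-old-data-to-lost+found"
            else
                # Case 2.1: the untouched recreated object can be dropped.
                echo "replace-new-object-with-orphan"
            fi ;;
        *)
            echo "unknown"
            return 1 ;;
    esac
}

layout_lfsck_action dangling on yes   # prints keep-new-object-old-data-to-lost+found
```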

Comment by Peter Jones [ 16/Jun/16 ]

Fan Yong

Could you please help with this one?

Thanks

Peter

Comment by nasf (Inactive) [ 17/Jun/16 ]

2.1) If nobody has modified the newly created OST-object before the layout LFSCK finds the real orphan OST-object, the layout LFSCK drops the newly created OST-object and replaces it with the real orphan OST-object. Otherwise,
2.2) Since the newly created OST-object contains new data, we cannot drop it. To make the user aware of the conflict, the layout LFSCK generates a new file under .lustre/lost+found, named $FID-$infix-$conflict_version, that contains the old data.

In fact, the key issue is that during the layout LFSCK 1st phase scanning (orphan OST-objects are only detected in the 2nd phase scanning), if it finds that some LOV EA references a non-existent OST-object, it cannot tell whether the OST-object was lost or the LOV EA is corrupted. In the former case, creating the lost OST-object makes the system available again as quickly as possible; in the latter case, correcting the LOV EA is the better choice. There are two possible solutions:

1) Postpone the layout LFSCK's repair decision for the dangling reference case until the orphan OST-objects have been handled properly. That means introducing a 3rd phase of scanning, which would significantly affect the whole LFSCK framework.

2) Never re-create the lost OST-object.

Andreas, what do you think?

Comment by Andreas Dilger [ 17/Jun/16 ]

The problem that was seen in "CLIENT step5" could be fixed with the fidea changes being implemented for PFL Phase 3a. In particular, the current fidea does not store the total number of stripes in the layout, so old stripes found on the OST (e.g. with "stripe_idx = 4" in this case) would currently be added to the file layout and the stripe count increased. With the new PFL fidea the total stripe count is also saved with each OST object, which could be used in this case to determine whether the orphan OST objects are part of the same layout or not.

It may be in the common case that the use of default stripe counts means the total stripe count is also the same between multiple sets of orphan OST objects. However, I don't think that would be a problem. It would avoid the case seen here where stale objects with a higher stripe index are added to the recreated file with fewer stripes. If the file was recreated, then all objects should be present, so if old orphan objects have the same stripe count they will not be added to the layout and be put into lost+found instead. If the old orphan objects have a different stripe count then they should not be added to the existing file.
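
The stripe-count check described above can be sketched as a small shell function. This is one reading of the proposal, not the PFL implementation: the function name, its arguments, and the assumption that an empty slot with a matching stripe count means the orphan may be added are all illustrative.

```bash
#!/bin/bash
# Hypothetical sketch of the proposed check; not actual LFSCK or PFL code.
# Args: layout_count (stripe count of the existing file layout),
#       orphan_count (total stripe count recorded on the orphan OST object),
#       slot (occupied|empty: state of the orphan's stripe index in the layout).
orphan_disposition() {
    local layout_count=$1 orphan_count=$2 slot=$3
    if [ "$orphan_count" -ne "$layout_count" ]; then
        # Different stripe count: the orphan belongs to some other layout.
        echo "lost+found"
    elif [ "$slot" = "occupied" ]; then
        # Same stripe count but the file already has an object in this slot.
        echo "lost+found"
    else
        echo "add-to-layout"
    fi
}

# Recreated 1-stripe file versus an orphan from the old 12-stripe layout.
orphan_disposition 1 12 empty    # prints lost+found
```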

I'm also a bit confused why the "CLIENT step5" layout was not reconstructed with all 12 of the original stripes? Was lfsck still running on the other OSTs? Why were there two objects allocated on OST index 0, and why was a new object allocated on OST index 0 (objid 491045) in place of the manually recreated object (objid 491044)?

Comment by nasf (Inactive) [ 17/Jun/16 ]

I'm also a bit confused why the "CLIENT step5" layout was not reconstructed with all 12 of the original stripes? Was lfsck still running on the other OSTs?

Only an OST-object that has been modified (write/setattr) after creation has the PFID EA; only then will the LFSCK handle it as an orphan if no MDT-object references it. In this case, I am not sure whether all of the original 12 stripes had been modified before the MDT-object's LOV EA was removed.

Why were there two objects allocated on OST index 0, and why was a new object allocated on OST index 0 (objid 491045) in place of the manually recreated

That is also my concern. Currently, a given striped file has at most one OST-object on any given OST. In this case, I am afraid that the wrong OST-object was written?

Comment by Gerrit Updater [ 28/Jul/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21562
Subject: LU-8288 lfsck: handle dangling LOV EA reference
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 61dc2ac65258fceb30bf0549e76b8ff7eace2d29

Comment by Nathan Dauchy (Inactive) [ 08/Sep/16 ]

It looks like there were many iterations on this patch, but it is ready for final review and then landing. Please confirm.

Also, once the patch is finalized, we will need a backport to the 2.7 FE branch as well as master. Thanks!

Comment by Peter Jones [ 08/Sep/16 ]

That matches my understanding Nathan

Comment by Andreas Dilger [ 27/Sep/16 ]

Per earlier discussion in this ticket, it would be worthwhile to backport the PFL patches to increase the MDT and OST inode size, as well as the patch to improve the fid xattr to store the total stripe count and stripe size on each OST object. That would allow LFSCK to reconstruct the layout properly, even in the case where some OST objects are totally missing. Having clients send this information with each write will ensure that this information is stored on each OST object for later use if needed.

Comment by Gerrit Updater [ 18/Jan/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21562/
Subject: LU-8288 lfsck: handle dangling LOV EA reference
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 17cc912fd5b40965d14a89a268cbf2d63b2fe21b

Comment by Jay Lan (Inactive) [ 18/Jan/17 ]

Could you port this patch to b2_7_fe and land to b2_9_fe? Thanks!

Comment by Minh Diep [ 18/Jan/17 ]

Landed for 2.10
