[LU-8218] lfsck not able to recover files lost from MDT Created: 30/May/16  Updated: 14/Jun/18  Resolved: 22/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Nathan Dauchy (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File LU-8218_lfsck_lost_files.tgz     File lfsck.log    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

My understanding is that lfsck in lustre-2.7 should be able to handle lost file information on the MDT, as long as the objects are still on the OSTs. However, a simple test to simulate this is not recovering the files. Shouldn't it at least be able to put them into lost+found? Or am I misunderstanding the capabilities of lfsck? Or is the following test case invalid in some way?

On the client, just create some test files...

# cd /mnt/lustre/client/lfscktest
# echo foo > foo
# mkdir bar
# echo baz > bar/baz

# lfs getstripe foo bar/baz
foo
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  9
    obdidx         objid         objid         group
         9            460962          0x708a2                 0

bar/baz
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  12
    obdidx         objid         objid         group
        12            460866          0x70842                 0

# sync

On the MDS, simulate the MDT losing the information, such as could happen through restoring from a slightly outdated MDT backup...

# umount /mnt/lustre/nbptest-mdt
# mount -t ldiskfs /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
# cd /mnt/lustre/nbptest-mdt/ROOT

# ls -ld lfscktest lfscktest/*
drwxr-xr-x+ 3 root root 4096 May 30 08:15 lfscktest
drwxr-xr-x+ 2 root root 4096 May 30 08:15 lfscktest/bar
-rw-r--r--  1 root root    0 May 30 08:14 lfscktest/foo

# rm -rf lfscktest/*

# cd
# umount /mnt/lustre/nbptest-mdt
# mount -t lustre /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt

Now check the filesystem...

# lctl clear
# lctl debug_daemon start /var/log/lfsck.debug
# lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o
Started LFSCK on the device nbptest-MDT0000: scrub layout namespace

# lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status
status: init
status: completed

# lctl debug_daemon stop
# lctl debug_file /var/log/lfsck.debug | egrep -v " (NRS|RPC) " > /var/log/lfsck.log

And look back on the client...

# cd /mnt/lustre/client/         

# ls -la lfscktest/
total 8
drwxr-xr-x+ 2 root root 4096 May 30 08:22 .
drwxr-xr-x+ 9 root root 4096 May 30 08:14 ..

# ls -la .lustre/lost+found/MDT0000
total 8
drwx------+ 3 root root 4096 May 27 10:44 .
dr-x------+ 3 root root 4096 May 27 09:01 ..

Notice that there is no sign of the files being restored anywhere. Nor do I find any mention of the object IDs in the lfsck.log file.

Note that running lfsck_start with the "-t layout" option did not change the behaviour either.
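
To double-check that the object IDs really are absent from the log, a small script can pull the OST index and object ID for each file out of the "lfs getstripe" output above and search the debug log for either the decimal or the hex form. This is a minimal sketch: the function names are made up for illustration, and it assumes the plain-text getstripe layout shown in this description and file names without spaces.

```python
def parse_getstripe(text):
    """Map each file name to its list of (obdidx, objid) pairs."""
    stripes = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1 and not fields[0].endswith(":"):
            # A line holding a single word is treated as a file name.
            current = fields[0]
            stripes[current] = []
        elif (current is not None and len(fields) == 4
              and fields[0].isdigit() and fields[1].isdigit()):
            # Data row: obdidx, objid (decimal), objid (hex), group.
            stripes[current].append((int(fields[0]), int(fields[1])))
    return stripes


def lines_mentioning(log_text, objid):
    """Return log lines containing the object ID in decimal or hex.

    Plain substring matching, so longer numbers that happen to contain
    the ID will also match; good enough for a first look.
    """
    needles = (str(objid), format(objid, "#x"))
    return [l for l in log_text.splitlines() if any(n in l for n in needles)]


sample = """foo
lmm_stripe_count:   1
lmm_stripe_offset:  9
    obdidx         objid         objid         group
         9            460962          0x708a2                 0
"""
print(parse_getstripe(sample))
```

Feeding the contents of /var/log/lfsck.log to lines_mentioning() for each parsed object ID would show whether LFSCK ever touched those objects.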



 Comments   
Comment by Peter Jones [ 30/May/16 ]

Fan Yong

Could you please advise

Peter

Comment by nasf (Inactive) [ 31/May/16 ]

In theory, the layout LFSCK should be able to find orphan OST-objects and use each orphan's PFID EA to regenerate the MDT-object's LOV EA, so how orphan OST-objects are detected is important. For the layout LFSCK, an orphan OST-object is one that exists and has been modified since it was pre-created, but that nothing references. In your case, I am not sure whether the data had been written back to the OST before the layout LFSCK ran. If the dirty data was not written back in time, the related OST-object is still in the pre-created state (not modified), and the layout LFSCK will not regard it as an orphan. You can verify this by dumping the related OST-object on the OST with debugfs.

In our sanity-lfsck test, we flush dirty data back to the OST by cancelling the OST locks on the client (lctl set_param -n ldlm.namespaces.*osc*.lru_size=clear). This mechanism has been verified. So please try the following:

  1. cd /mnt/lustre/client/lfscktest
  2. echo foo > foo
  3. lctl set_param -n ldlm.namespaces.*osc*.lru_size=clear
    Then repeat the subsequent operations as you did above. Start the layout LFSCK with "lctl lfsck_start -M nbptest-MDT0000 -t layout -o -r".

Thanks!

Comment by Nathan Dauchy (Inactive) [ 31/May/16 ]

Adding "lru_size=clear" to the test process did not seem to change anything. I will try another attempt, this time with unmounting the client and fully restarting the file system targets.

Comment by Nathan Dauchy (Inactive) [ 31/May/16 ]

The following procedure did not allow the files to be recovered either...

  • write files on client
  • set_param lru_size=clear
  • unmount client
  • stop all targets
  • mount MDT as ldiskfs, and remove the files, unmount
  • start all targets
  • run lfsck as "lctl lfsck_start -M nbptest-MDT0000 -t layout -o -r"

What other debugging information can I gather, to determine where those "lost" objects are ending up and why lfsck can't recover them?

Comment by nasf (Inactive) [ 01/Jun/16 ]

After "set_param lru_size=clear", would you please dump the related OST-object's attributes and PFID EA via debugfs on the OST to check whether the object has been modified properly? Thanks!

Comment by Nathan Dauchy (Inactive) [ 01/Jun/16 ]

Is this the information you are looking for?

Client:

# cd /mnt/lustre/client/lfscktest
# echo foo > foo
# mkdir bar
# echo baz > bar/baz

# lctl get_param ldlm.namespaces.*osc*.lru_size | grep -v =0
  ldlm.namespaces.nbptest-OST0000-osc-ffff8805daad9800.lru_size=1
  ldlm.namespaces.nbptest-OST0007-osc-ffff8805daad9800.lru_size=2
  ldlm.namespaces.nbptest-OST0008-osc-ffff8805daad9800.lru_size=1
  ldlm.namespaces.nbptest-OST0009-osc-ffff8805daad9800.lru_size=1
  ldlm.namespaces.nbptest-OST000b-osc-ffff8805daad9800.lru_size=1
  ldlm.namespaces.nbptest-OST000c-osc-ffff8805daad9800.lru_size=1
# lctl set_param -n ldlm.namespaces.*osc*.lru_size=clear
# lctl get_param ldlm.namespaces.*osc*.lru_size | grep -v =0
  (nothing returned)

# getfattr -d -m ".*" -e hex foo bar/baz 
# file: foo
lustre.lov=0xd00bd10b010000000100000000000000b03a0000020000000000100001000000e20807000000000000000000000000000000000008000000
trusted.link=0xdff1ea11010000002d00000000000000000000000000000000150000000200002b100000000900000000666f6f
trusted.lma=0x0000000000000000b03a0000020000000100000000000000
trusted.lov=0xd00bd10b010000000100000000000000b03a0000020000000000100001000000e20807000000000000000000000000000000000008000000

# file: bar/baz
lustre.lov=0xd00bd10b010000000300000000000000b03a0000020000000000100001000000e2080700000000000000000000000000000000000b000000
trusted.link=0xdff1ea11010000002d00000000000000000000000000000000150000000200003ab0000000020000000062617a
trusted.lma=0x0000000000000000b03a0000020000000300000000000000
trusted.lov=0xd00bd10b010000000300000000000000b03a0000020000000000100001000000e2080700000000000000000000000000000000000b000000

service320 /mnt/lustre/client/lfscktest # 

# lfs getstripe foo bar/baz
foo
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  8
	obdidx		 objid		 objid		 group
	     8	        461026	      0x708e2	             0

bar/baz
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  11
	obdidx		 objid		 objid		 group
	    11	        461026	      0x708e2	             0
# echo $(( 461026 % 32 ))         
2

# debugfs /dev/mapper/nbptest-ost8
debugfs 1.42.13.wc4 (28-Nov-2015)
debugfs:  cd O
debugfs:  cd 0
debugfs:  cd d2
debugfs:  stat 461026
Inode: 487   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 2904170364    Version: 0x0000000c:00000005
User:     0   Group:     0   Size: 4
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x574efbee:00000000 -- Wed Jun  1 08:14:54 2016
 atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
 mtime: 0x574efbee:00000000 -- Wed Jun  1 08:14:54 2016
crtime: 0x574d9b4f:3f4d531c -- Tue May 31 07:10:23 2016
Size of extra inode fields: 28
Extended attributes stored in inode body: 
invalid EA entry in inode
EXTENTS:
(0):152064
debugfs:  dump 461026 /tmp/obj.461026.foo
debugfs:  quit

# cat /tmp/obj.461026.foo
foo
# debugfs /dev/mapper/nbptest-ost11
debugfs 1.42.13.wc4 (28-Nov-2015)
debugfs:  cd O
debugfs:  cd 0
debugfs:  cd d2
debugfs:  stat 461026
Inode: 489   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 3312724559    Version: 0x0000000c:00000007
User:     0   Group:     0   Size: 4
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x574efbf6:00000000 -- Wed Jun  1 08:15:02 2016
 atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
 mtime: 0x574efbf6:00000000 -- Wed Jun  1 08:15:02 2016
crtime: 0x574d9b4f:58007134 -- Tue May 31 07:10:23 2016
Size of extra inode fields: 28
Extended attributes stored in inode body: 
invalid EA entry in inode
EXTENTS:
(0):128768
debugfs:  dump 461026 /tmp/obj.461026.baz
debugfs:  quit

# cat /tmp/obj.461026.baz
baz
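
The client-side trusted.lov value and the OST-side path used in the debugfs sessions above can be tied together with a short decoder. This is a sketch, not Lustre code: it assumes the little-endian lov_mds_md_v1 layout (magic 0x0BD10BD0, one 24-byte entry per stripe) and the O/<seq>/d<objid % 32>/<objid> directory layout that the "cd O; cd 0; cd d2" steps imply, checked here only for sequence 0. The function names are hypothetical, and the sample blob is packed from the values reported for foo rather than copied from the hex dump.

```python
import struct

LOV_MAGIC_V1 = 0x0BD10BD0
# magic, pattern, oi_id, oi_seq, stripe_size, stripe_count, layout_gen
HEADER = struct.Struct("<IIQQIHH")
# per-stripe entry: objid, seq, gen, ost_idx
STRIPE = struct.Struct("<QQII")


def decode_lov_v1(blob):
    """Decode a v1 LOV EA into stripe size and per-stripe object info."""
    magic, _pattern, oi_id, oi_seq, size, count, _gen = HEADER.unpack_from(blob, 0)
    if magic != LOV_MAGIC_V1:
        raise ValueError("not a v1 LOV EA: magic %#x" % magic)
    stripes = []
    for i in range(count):
        objid, seq, _, idx = STRIPE.unpack_from(blob, HEADER.size + i * STRIPE.size)
        stripes.append({"ost_idx": idx, "objid": objid, "seq": seq})
    return {"mdt_seq": oi_seq, "mdt_oid": oi_id,
            "stripe_size": size, "stripes": stripes}


def ost_object_path(objid, seq=0):
    """Path of an OST object below the ldiskfs root (assumed layout, seq 0)."""
    return "O/%d/d%d/%d" % (seq, objid % 32, objid)


# Values taken from the getstripe output for foo: OST index 8, objid 461026.
sample = (HEADER.pack(LOV_MAGIC_V1, 1, 0x1, 0x200003ab0, 1048576, 1, 0)
          + STRIPE.pack(461026, 0, 0, 8))
layout = decode_lov_v1(sample)
for s in layout["stripes"]:
    print("OST index", s["ost_idx"], "->", ost_object_path(s["objid"]))
```

To decode a real dump, strip the leading "0x" from the getfattr value and pass bytes.fromhex() of the rest to decode_lov_v1().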
Comment by nasf (Inactive) [ 02/Jun/16 ]

Yes, that is what I wanted to know. The OST-object's size has been updated, which means the dirty data has been flushed back to the OST, although the PFID EA ("trusted.fid") is not printed properly.

Please run layout LFSCK just on this system with LFSCK debug enabled, and collect the kernel debug logs on both the MDT and nbptest-ost8 and nbptest-ost11. Thanks!

Comment by Nathan Dauchy (Inactive) [ 02/Jun/16 ]

debug logs from the servers while lfsck was run.

service320 is client and where ost8 runs
service322 is MDS
service323 is where ost11 runs

Comment by nasf (Inactive) [ 03/Jun/16 ]

You removed the files on the MDT directly under ldiskfs mode but kept the OI files (oi.16.xxx), which still contain stale OI mappings for the removed MDT-objects. Because of those stale mappings, the subsequent LFSCK cannot locate the objects properly. So please remove the OI files under ldiskfs mode as well, and then run LFSCK.

Thanks!

Comment by Nathan Dauchy (Inactive) [ 03/Jun/16 ]

OK... I can test that, but what if this was a "real" case of MDT corruption where only the files were lost? Is a new feature or phase in lfsck needed to manage the stale OI mappings?

Comment by nasf (Inactive) [ 03/Jun/16 ]

From the LFSCK point of view, removing an MDT-object directly without destroying its OI mapping is indistinguishable from an MDT file-level backup/restore. When the OSD tries to locate the local object/inode via the ino# obtained from the stale OI mapping, it does not know whether the real MDT-object exists or not. A possible solution is for the OI scrub to scan in two phases: the first phase is inode-table based and scans all known objects on the device; the second phase is OI-file based and finds all stale OI mappings. Currently, it only does the first phase.
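
The two scanning phases described here can be illustrated with a toy model. The sketch below is not Lustre code: it represents the OI files as a plain dict from FID to inode number and the inode table as a dict from inode number to the FID that inode claims. Both function names and all inode numbers are invented for illustration.

```python
def scan_inode_table(oi, inodes):
    """Phase 1: walk every existing inode and make sure the OI maps its
    FID to it. Returns the FIDs whose mapping was added or corrected."""
    fixed = []
    for ino, fid in inodes.items():
        if oi.get(fid) != ino:
            oi[fid] = ino
            fixed.append(fid)
    return fixed


def scan_oi_files(oi, inodes):
    """Phase 2: walk every OI mapping and report those whose inode is
    gone or now belongs to a different FID."""
    return [fid for fid, ino in oi.items() if inodes.get(ino) != fid]


# Two MDT-objects were removed under ldiskfs, but their OI mappings remain.
oi = {
    "[0x200003ab0:0x1:0x0]": 101,
    "[0x200003ab0:0x3:0x0]": 103,
    "[0x200003ab0:0x2:0x0]": 102,
}
inodes = {102: "[0x200003ab0:0x2:0x0]"}

print(scan_inode_table(oi, inodes))  # phase 1 alone finds nothing to fix
print(scan_oi_files(oi, inodes))     # phase 2 reports the two stale mappings
```

In this model the first phase never visits the removed inodes, so it cannot notice that their mappings are stale; only a pass over the OI entries themselves exposes them.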

Comment by Nathan Dauchy (Inactive) [ 07/Jun/16 ]

Just to clarify the status of this ticket... we are on hold waiting for a new phase of scanning to be added to lfsck?

In the meantime, is there a workaround we can use as part of the MDT recovery procedure when getting such stale mappings is expected? Can we mount as ldiskfs and manually check or clean things up?

Comment by nasf (Inactive) [ 07/Jun/16 ]

The workaround for your special case: if you want to remove MDT-objects directly under "ldiskfs" mode, then please remove the OI files as well.

Comment by Gerrit Updater [ 07/Jun/16 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20659
Subject: LU-8218 osd: handle stale OI mapping for non-restore case
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 31cf77414ad4f88c28d6eb2be54b32a7ec399ab7

Comment by nasf (Inactive) [ 07/Jun/16 ]

Nathan,

The above patch may not be a perfect solution, but it should be enough to resolve your case.

Comment by Nathan Dauchy (Inactive) [ 08/Jun/16 ]

Fan Yong, thank you for the patch! I haven't had a chance to test with a new build yet, but did do a quick check of running lfsck after "rm -f oi.16.*" under ldiskfs. The lfsck then resulted in files like the following in ".lustre/lost+found/MDT0000/":

.lustre/lost+found/MDT0000/[0x200003ab0:0x1:0x0]-R-0

That is what we should expect, even with the patch, right? There is no way to determine the object's path once it is lost from the ROOT tree on the MDT?
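
For scripting against such entries, the FID can be pulled out of the name with a regular expression. A minimal sketch follows; the helper name is hypothetical, and it assumes every entry follows the "[seq:oid:ver]-<suffix>" pattern of the one example above. The meaning of the "R-0" suffix is not explained in this ticket, so the code only returns it as-is.

```python
import re

ENTRY = re.compile(r"^\[(0x[0-9a-f]+):(0x[0-9a-f]+):(0x[0-9a-f]+)\]-(.+)$")


def parse_lost_found_name(name):
    """Split a lost+found entry name into ((seq, oid, ver), suffix)."""
    m = ENTRY.match(name)
    if m is None:
        raise ValueError("unexpected lost+found entry name: %r" % name)
    seq, oid, ver = (int(g, 16) for g in m.groups()[:3])
    return (seq, oid, ver), m.group(4)


fid, suffix = parse_lost_found_name("[0x200003ab0:0x1:0x0]-R-0")
print(fid, suffix)
```

The sequence and object ID in this name appear to match the FID encoded in the trusted.lma value dumped earlier for foo, which is one way to tell which lost file an entry corresponds to.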

Comment by nasf (Inactive) [ 08/Jun/16 ]

That is what we should expect, even with the patch, right? There is no way to determine the object's path once it is lost from the ROOT tree on the MDT?

Yes, that is what we can do for now. The path information is stored as the linkEA ("trusted.link") in the MDT-object, and there is no other backup in the system. So if the MDT-object itself is lost, the LFSCK cannot know its original location and has to put it under .lustre/lost+found/.
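
To make the linkEA point concrete, here is a sketch of decoding a trusted.link value into (parent FID, name) pairs. It assumes this on-disk layout: a 24-byte little-endian header (magic 0x11EAF1DF, record count, total length, and 8 further bytes), followed by records that each hold a 2-byte big-endian record length, a 16-byte big-endian parent FID, and the name. Treat the layout and the function name as assumptions to be checked against the Lustre headers; the sample is packed from illustrative values rather than copied from the dumps in this ticket.

```python
import struct

LINK_EA_MAGIC = 0x11EAF1DF
# magic, record count, total length, two trailing 32-bit fields
HEADER = struct.Struct("<IIQII")
# parent FID stored big-endian: seq, oid, ver
FID_BE = struct.Struct(">QII")


def decode_link_ea(blob):
    """Return a list of (parent_fid, name) pairs from a linkEA blob."""
    magic, reccount, _total_len, _a, _b = HEADER.unpack_from(blob, 0)
    if magic != LINK_EA_MAGIC:
        raise ValueError("not a linkEA: magic %#x" % magic)
    links = []
    offset = HEADER.size
    for _ in range(reccount):
        (reclen,) = struct.unpack_from(">H", blob, offset)
        seq, oid, ver = FID_BE.unpack_from(blob, offset + 2)
        raw_name = blob[offset + 2 + FID_BE.size:offset + reclen]
        links.append(("[%#x:%#x:%#x]" % (seq, oid, ver),
                      raw_name.decode("utf-8", "replace")))
        offset += reclen
    return links


# Illustrative record: a file named "baz" under parent [0x200003ab0:0x2:0x0].
entry_name = b"baz"
record = (struct.pack(">H", 2 + FID_BE.size + len(entry_name))
          + FID_BE.pack(0x200003ab0, 0x2, 0) + entry_name)
sample = HEADER.pack(LINK_EA_MAGIC, 1, HEADER.size + len(record), 0, 0) + record
print(decode_link_ea(sample))
```

A real value from "getfattr -e hex" can be passed in after stripping the leading "0x" and converting with bytes.fromhex().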

Comment by Jay Lan (Inactive) [ 10/Jun/16 ]

Hi Fan Yong,

Do you intend to land http://review.whamcloud.com/20659 to master and future releases or to provide a workaround for us?

Comment by nasf (Inactive) [ 12/Jun/16 ]

The patch 20659 should be landed to master, and then be ported to the other branches.

Comment by Gerrit Updater [ 22/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20659/
Subject: LU-8218 osd: handle stale OI mapping for non-restore case
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cecde8bdb4913fd4405d425b0bf3aead03181e9d

Comment by Peter Jones [ 22/Sep/16 ]

Landed for 2.9

Comment by Mahmoud Hanafi [ 22/Sep/16 ]

Can be closed. Add nasa label.
