[LU-8218] lfsck not able to recover files lost from MDT

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • Lustre 2.7.0

    Description

      My understanding is that lfsck in lustre-2.7 should be able to handle lost file information on the MDT, as long as the objects are still on the OSTs. However, a simple test to simulate this is not recovering the files. Shouldn't it at least be able to put them into lost+found? Or am I misunderstanding the capabilities of lfsck? Or is the following test case invalid in some way?

      On the client, just create some test files...

      # cd /mnt/lustre/client/lfscktest
      # echo foo > foo
      # mkdir bar
      # echo baz > bar/baz
      
      # lfs getstripe foo bar/baz
      foo
      lmm_stripe_count:   1
      lmm_stripe_size:    1048576
      lmm_pattern:        1
      lmm_layout_gen:     0
      lmm_stripe_offset:  9
          obdidx         objid         objid         group
               9            460962          0x708a2                 0
      
      bar/baz
      lmm_stripe_count:   1
      lmm_stripe_size:    1048576
      lmm_pattern:        1
      lmm_layout_gen:     0
      lmm_stripe_offset:  12
          obdidx         objid         objid         group
              12            460866          0x70842                 0
      
      # sync
      

      On the MDS, simulate the MDT losing the information, such as could happen through restoring from a slightly outdated MDT backup...

      # umount /mnt/lustre/nbptest-mdt
      # mount -t ldiskfs /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
      # cd /mnt/lustre/nbptest-mdt/ROOT
      
      # ls -ld lfscktest lfscktest/*
      drwxr-xr-x+ 3 root root 4096 May 30 08:15 lfscktest
      drwxr-xr-x+ 2 root root 4096 May 30 08:15 lfscktest/bar
      -rw-r--r--  1 root root    0 May 30 08:14 lfscktest/foo
      
      # rm -rf lfscktest/*
      
      # cd
      # umount /mnt/lustre/nbptest-mdt
      # mount -t lustre /dev/mapper/nbptest--vg-mdttest /mnt/lustre/nbptest-mdt
      

      Now check the filesystem...

      # lctl clear
      # lctl debug_daemon start /var/log/lfsck.debug
      # lctl lfsck_start -A -M nbptest-MDT0000 -c on -C on -o
      Started LFSCK on the device nbptest-MDT0000: scrub layout namespace
      
      # lctl get_param -n osd-ldiskfs.*.oi_scrub | grep status
      status: init
      status: completed
      
      # lctl debug_daemon stop
      # lctl debug_file /var/log/lfsck.debug | egrep -v " (NRS|RPC) " > /var/log/lfsck.log
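The check above only reads the OI scrub status. A minimal sketch of reading the layout or namespace LFSCK status as well, assuming the usual `mdd.<target>.lfsck_layout` and `mdd.<target>.lfsck_namespace` parameters and their "key: value" output. The helper name `lfsck_field` and the sample input are hypothetical; the `repaired_orphan` counter name should be confirmed against the actual output on your release.

```shell
# lfsck_field NAME: read "key: value" lines on stdin and print the value
# of the first line whose key equals NAME.
lfsck_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2; exit }'
}

# Intended use on the MDS (not executed here, needs a live MDT):
#   lctl get_param -n mdd.nbptest-MDT0000.lfsck_layout | lfsck_field status
#   lctl get_param -n mdd.nbptest-MDT0000.lfsck_namespace | lfsck_field status

# Hypothetical sample input, only to show the parsing:
printf 'name: lfsck_layout\nstatus: completed\nrepaired_orphan: 0\n' |
    lfsck_field status
```

If the layout LFSCK reports `completed` with no repaired orphans, the orphan OST-objects were not picked up, which matches the empty lost+found seen on the client.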
      

      And look back on the client...

      # cd /mnt/lustre/client/         
      
      # ls -la lfscktest/
      total 8
      drwxr-xr-x+ 2 root root 4096 May 30 08:22 .
      drwxr-xr-x+ 9 root root 4096 May 30 08:14 ..
      
      # ls -la .lustre/lost+found/MDT0000
      total 8
      drwx------+ 3 root root 4096 May 27 10:44 .
      dr-x------+ 3 root root 4096 May 27 09:01 ..
      

      Notice that there is no sign of the files being restored anywhere. Nor can I find any mention of the object IDs in the lfsck.log file.

      Note that running lfsck_start with the "-t layout" option did not change the behaviour either.

        Activity

          Just to clarify the status of this ticket... we are on hold waiting for a new phase of scanning to be added to lfsck?

          In the meantime, is there a workaround we can use as part of the MDT recovery procedure when getting such stale mappings is expected? Can we mount as ldiskfs and manually check or clean things up?

          ndauchy Nathan Dauchy (Inactive) added the comment above.

          From the LFSCK point of view, removing an MDT-object directly without destroying its OI mapping is indistinguishable from an MDT file-level backup/restore. When the OSD tries to locate the local object/inode via the ino# obtained from the stale OI mapping, it cannot tell whether the real MDT-object exists. A possible solution is for the OI scrub to scan in two phases: the first phase walks the inode table to find all known objects on the device; the second phase walks the OI files to find all stale OI mappings. Currently, only the first phase is implemented.

          yong.fan nasf (Inactive) added the comment above.
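The two-phase idea in the comment above can be illustrated with a toy model. This is only a sketch of the set difference involved, not how the OI scrub is implemented: the function name `find_stale_oi`, the flat text file formats, and the inode numbers are all hypothetical. The FIDs are the ones visible in the trusted.lma values elsewhere in this ticket.

```shell
# find_stale_oi INODE_LIST OI_LIST
#   INODE_LIST: one in-use inode number per line (stand-in for the phase-1
#               inode-table scan)
#   OI_LIST:    "FID inode" pairs, one per line (stand-in for the phase-2
#               OI-file scan)
# Prints every FID whose mapping points at an inode that is not in use.
find_stale_oi() {
    awk 'FILENAME == ARGV[1] { live[$1] = 1; next }
         !($2 in live)       { print $1 }' "$1" "$2"
}

# Toy data: inode 205 was removed under ldiskfs, but its OI mapping remains.
tmp=$(mktemp -d)
printf '%s\n' 100 101 > "$tmp/inodes"
printf '%s\n' '[0x200003ab0:0x1:0x0] 100' '[0x200003ab0:0x3:0x0] 205' > "$tmp/oi"
find_stale_oi "$tmp/inodes" "$tmp/oi"
rm -rf "$tmp"
```

A phase-1 scan alone only ever visits inodes 100 and 101 in this model, so it never gets a chance to notice the mapping that points at 205.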

          OK... I can test that, but what if this was a "real" case of MDT corruption where only the files were lost? Is a new feature or phase in lfsck needed to manage the stale OI mappings?

          ndauchy Nathan Dauchy (Inactive) added the comment above.

          Because you removed the files on the MDT directly under ldiskfs mode but kept the OI files (oi.16.xxx), which still contain stale OI mappings for the removed MDT-objects, the subsequent LFSCK cannot locate objects properly. Please remove the OI files under ldiskfs mode and then run LFSCK.

          Thanks!

          yong.fan nasf (Inactive) added the comment above.
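A dry-run sketch of the suggestion above. It only prints the commands so they can be reviewed before anyone runs them. The device, mount point, and MDT name are the ones used in the description; the function name `print_oi_reset_plan` is hypothetical, and the assumption that the OI files sit at the top level of the ldiskfs mount as `oi.16.*` should be checked on the target first.

```shell
# print_oi_reset_plan DEVICE MOUNTPOINT MDTNAME
# Prints (does not execute) the sequence: remount the MDT as ldiskfs,
# remove the OI files, remount as lustre, then start the layout LFSCK.
print_oi_reset_plan() {
    dev=$1
    mnt=$2
    mdt=$3
    cat <<EOF
umount $mnt
mount -t ldiskfs $dev $mnt
rm -f $mnt/oi.16.*
umount $mnt
mount -t lustre $dev $mnt
lctl lfsck_start -M $mdt -t layout -o -r
EOF
}

print_oi_reset_plan /dev/mapper/nbptest--vg-mdttest \
    /mnt/lustre/nbptest-mdt nbptest-MDT0000
```

Printing rather than executing is deliberate: removing the OI files is destructive, and the plan is easier to sanity-check as text.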

          debug logs from the servers while lfsck was run.

          service320 is client and where ost8 runs
          service322 is MDS
          service323 is where ost11 runs

          ndauchy Nathan Dauchy (Inactive) added the comment above.

          Yes, that is what I wanted to know. The OST-object's size has been updated, which means the dirty data has been flushed back to the OST, although the PFID EA ("trusted.fid") is not printed properly.

          Please run the layout LFSCK on this system with LFSCK debug enabled, and collect the kernel debug logs on the MDT, nbptest-ost8, and nbptest-ost11. Thanks!

          yong.fan nasf (Inactive) added the comment above.

          Is this the information you are looking for?

          Client:

          # cd /mnt/lustre/client/lfscktest
          # echo foo > foo
          # mkdir bar
          # echo baz > bar/baz
          
          # lctl get_param ldlm.namespaces.*osc*.lru_size | grep -v =0
            ldlm.namespaces.nbptest-OST0000-osc-ffff8805daad9800.lru_size=1
            ldlm.namespaces.nbptest-OST0007-osc-ffff8805daad9800.lru_size=2
            ldlm.namespaces.nbptest-OST0008-osc-ffff8805daad9800.lru_size=1
            ldlm.namespaces.nbptest-OST0009-osc-ffff8805daad9800.lru_size=1
            ldlm.namespaces.nbptest-OST000b-osc-ffff8805daad9800.lru_size=1
            ldlm.namespaces.nbptest-OST000c-osc-ffff8805daad9800.lru_size=1
          # lctl set_param -n ldlm.namespaces.*osc*.lru_size=clear
          # lctl get_param ldlm.namespaces.*osc*.lru_size | grep -v =0
            (nothing returned)
          
          # getfattr -d -m ".*" -e hex foo bar/baz 
          # file: foo
          lustre.lov=0xd00bd10b010000000100000000000000b03a0000020000000000100001000000e20807000000000000000000000000000000000008000000
          trusted.link=0xdff1ea11010000002d00000000000000000000000000000000150000000200002b100000000900000000666f6f
          trusted.lma=0x0000000000000000b03a0000020000000100000000000000
          trusted.lov=0xd00bd10b010000000100000000000000b03a0000020000000000100001000000e20807000000000000000000000000000000000008000000
          
          # file: bar/baz
          lustre.lov=0xd00bd10b010000000300000000000000b03a0000020000000000100001000000e2080700000000000000000000000000000000000b000000
          trusted.link=0xdff1ea11010000002d00000000000000000000000000000000150000000200003ab0000000020000000062617a
          trusted.lma=0x0000000000000000b03a0000020000000300000000000000
          trusted.lov=0xd00bd10b010000000300000000000000b03a0000020000000000100001000000e2080700000000000000000000000000000000000b000000
          
          service320 /mnt/lustre/client/lfscktest # 
          
          # lfs getstripe foo bar/baz
          foo
          lmm_stripe_count:   1
          lmm_stripe_size:    1048576
          lmm_pattern:        1
          lmm_layout_gen:     0
          lmm_stripe_offset:  8
          	obdidx		 objid		 objid		 group
          	     8	        461026	      0x708e2	             0
          
          bar/baz
          lmm_stripe_count:   1
          lmm_stripe_size:    1048576
          lmm_pattern:        1
          lmm_layout_gen:     0
          lmm_stripe_offset:  11
          	obdidx		 objid		 objid		 group
          	    11	        461026	      0x708e2	             0
          
          # echo $(( 461026 % 32 ))         
          2
          
          # debugfs /dev/mapper/nbptest-ost8
          debugfs 1.42.13.wc4 (28-Nov-2015)
          debugfs:  cd O
          debugfs:  cd 0
          debugfs:  cd d2
          debugfs:  stat 461026
          Inode: 487   Type: regular    Mode:  0666   Flags: 0x80000
          Generation: 2904170364    Version: 0x0000000c:00000005
          User:     0   Group:     0   Size: 4
          File ACL: 0    Directory ACL: 0
          Links: 1   Blockcount: 8
          Fragment:  Address: 0    Number: 0    Size: 0
           ctime: 0x574efbee:00000000 -- Wed Jun  1 08:14:54 2016
           atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
           mtime: 0x574efbee:00000000 -- Wed Jun  1 08:14:54 2016
          crtime: 0x574d9b4f:3f4d531c -- Tue May 31 07:10:23 2016
          Size of extra inode fields: 28
          Extended attributes stored in inode body: 
          invalid EA entry in inode
          EXTENTS:
          (0):152064
          debugfs:  dump 461026 /tmp/obj.461026.foo
          debugfs:  quit
          
          # cat /tmp/obj.461026.foo
          foo
          
          # debugfs /dev/mapper/nbptest-ost11
          debugfs 1.42.13.wc4 (28-Nov-2015)
          debugfs:  cd O
          debugfs:  cd 0
          debugfs:  cd d2
          debugfs:  stat 461026
          Inode: 489   Type: regular    Mode:  0666   Flags: 0x80000
          Generation: 3312724559    Version: 0x0000000c:00000007
          User:     0   Group:     0   Size: 4
          File ACL: 0    Directory ACL: 0
          Links: 1   Blockcount: 8
          Fragment:  Address: 0    Number: 0    Size: 0
           ctime: 0x574efbf6:00000000 -- Wed Jun  1 08:15:02 2016
           atime: 0x00000000:00000000 -- Wed Dec 31 16:00:00 1969
           mtime: 0x574efbf6:00000000 -- Wed Jun  1 08:15:02 2016
          crtime: 0x574d9b4f:58007134 -- Tue May 31 07:10:23 2016
          Size of extra inode fields: 28
          Extended attributes stored in inode body: 
          invalid EA entry in inode
          EXTENTS:
          (0):128768
          debugfs:  dump 461026 /tmp/obj.461026.baz
          debugfs:  quit
          
          # cat /tmp/obj.461026.baz
          baz
          
          ndauchy Nathan Dauchy (Inactive) added the comment above.
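The trusted.lov values in the getfattr output above can be cross-checked against `lfs getstripe` by decoding them by hand. The sketch below assumes the single-stripe v1 LOV EA layout (magic, pattern, 16-byte object identity, stripe size, stripe count, layout generation, then one 24-byte stripe entry), all little-endian. The helper names `le_field` and `decode_lov` are hypothetical.

```shell
# le_field HEX OFFSET LEN: decode LEN bytes starting at byte OFFSET of HEX
# (hex digits, no 0x prefix) as a little-endian unsigned integer.
le_field() {
    hex=$1
    off=$2
    len=$3
    be=""
    i=0
    while [ "$i" -lt "$len" ]; do
        start=$(( (off + i) * 2 + 1 ))
        byte=$(printf '%s' "$hex" | cut -c "${start}-$(( start + 1 ))")
        be="${byte}${be}"        # prepend, so the last byte ends up first
        i=$(( i + 1 ))
    done
    printf '%d\n' "0x${be}"
}

# decode_lov EA: print the fields of a one-stripe v1 LOV EA.
decode_lov() {
    h=${1#0x}
    echo "stripe_size=$(le_field "$h" 24 4)"
    echo "stripe_count=$(le_field "$h" 28 2)"
    echo "objid=$(le_field "$h" 32 8)"
    echo "ost_idx=$(le_field "$h" 52 4)"
}

# trusted.lov of "foo", copied from the getfattr output above.
decode_lov 0xd00bd10b010000000100000000000000b03a0000020000000000100001000000e20807000000000000000000000000000000000008000000
# stripe_size=1048576
# stripe_count=1
# objid=461026
# ost_idx=8
```

The decoded values agree with the `lfs getstripe foo` output (objid 461026 on obdidx 8), so the MDT-side layout was intact before the files were removed.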

          After "set_param lru_size=clear", would you please dump the related OST-object's attributes and PFID EA via debugfs on the OST to check whether it was modified properly? Thanks!

          yong.fan nasf (Inactive) added the comment above.
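To find the OST-object in debugfs, the object ID has to be turned into a path. A minimal sketch, assuming the layout seen in the debugfs session in this ticket (`O/<group>/d<objid mod 32>/<objid>`); the function name `ost_object_path` is hypothetical.

```shell
# ost_object_path OBJID [GROUP]: print the ldiskfs-relative path of an
# OST object, using the objid-modulo-32 subdirectory seen in this ticket.
ost_object_path() {
    objid=$1
    group=${2:-0}
    echo "O/${group}/d$(( objid % 32 ))/${objid}"
}

ost_object_path 461026    # prints O/0/d2/461026
```

The result can be passed to debugfs in one step, for example `debugfs -R "stat O/0/d2/461026" /dev/mapper/nbptest-ost8`, instead of the separate `cd` commands.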

          The following procedure did not allow the files to be recovered either...

          • write files on client
          • set_param lru_size=clear
          • unmount client
          • stop all targets
          • mount MDT as ldiskfs, and remove the files, unmount
          • start all targets
          • run lfsck as "lctl lfsck_start -M nbptest-MDT0000 -t layout -o -r"

          What other debugging information can I gather, to determine where those "lost" objects are ending up and why lfsck can't recover them?

          ndauchy Nathan Dauchy (Inactive) added the comment above.

          Adding "lru_size=clear" to the test process did not seem to change anything. I will try another attempt, this time with unmounting the client and fully restarting the file system targets.

          ndauchy Nathan Dauchy (Inactive) added the comment above.

          In theory, the layout LFSCK should be able to find orphan OST-objects and use each orphan's PFID EA to regenerate the MDT-object's LOV EA. So how orphan OST-objects are found is important. For the layout LFSCK, an orphan OST-object is one that exists and has been modified since it was pre-created, but that nothing references. In your case, I am not sure whether the data had been written back to the OST before the layout LFSCK ran. If the dirty data had not been written back in time, the related OST-object would still be in the pre-created state, unmodified, and the layout LFSCK would not regard it as an orphan. That can be verified by dumping the related OST-object with debugfs on the OST.

          In our sanity-lfsck test, we flush dirty data back to the OST by cancelling the OST locks on the client (lctl set_param -n ldlm.namespaces.*osc*.lru_size=clear). This mechanism has been verified. So please try the following:

          1. cd /mnt/lustre/client/lfscktest
          2. echo foo > foo
          3. lctl set_param -n ldlm.namespaces.*osc*.lru_size=clear
            Then repeat the subsequent operations as you did above, and start the layout LFSCK with "lctl lfsck_start -M nbptest-MDT0000 -t layout -o -r".

          Thanks!

          yong.fan nasf (Inactive) added the comment above (edited).

          People

            yong.fan nasf (Inactive)
            ndauchy Nathan Dauchy (Inactive)