
Test failure on test suite lfsck: Failed to find fid

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • Environment: server and client: tag-2.3.64 build #1411
    • Severity: 3
    • Rank: 7749

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ff811592-a66a-11e2-90ad-52540035b04c.

      03:53:10:Memory used: 2436k/21180k (745k/1692k), time:  0.20/ 0.07/ 0.02
      03:53:10:I/O read: 10MB, write: 0MB, rate: 50.10MB/s
      03:53:10:CMD: client-19vm1.lab.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh _check_progs_installed lfsck 
      03:53:10:CMD: client-19vm1.lab.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh is_mounted /mnt/lustre 
      03:53:10:lfsck -c -l --mdsdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/mdsdb --ostdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-0 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-1 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-2 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-3 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-4 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-5 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-6 /mnt/lustre
      03:53:11:CMD: client-19vm1.lab.whamcloud.com lfsck -c -l --mdsdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/mdsdb --ostdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-0 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-1 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-2 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-3 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-4 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-5 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-6 /mnt/lustre
      03:53:11:lfsck 1.42.6.wc2 (10-Dec-2012)
      03:53:11:lfsck: ost_idx 0: pass1: check for duplicate objects
      03:53:11:lfsck: ost_idx 0: pass1 OK (12 files total)
      03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
      03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda16:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda13:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda12:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda14:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda17:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda18:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda19:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda1b:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda1a:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda1c:0x0]: DB_NOTFOUND: No matching key/data pair found
      

          Activity

            [LU-3180] Test failure on test suite lfsck: Failed to find fid

            niu Niu Yawei (Inactive) added a comment -

            Patch landed on b2_4 & master.

            niu Niu Yawei (Inactive) added a comment -

            I created LU-3838 for the two remaining issues; this ticket can be closed.

            niu Niu Yawei (Inactive) added a comment -

            I created LU-3837 for the old lfsck on DNE problem. I think we should keep this ticket open (for the "Failed to find fid" problem) but lower its priority, because the lfsck test won't fail on this error message.
            pjones Peter Jones added a comment -

            I certainly think that it makes sense to create new LU tickets for the remaining issues and consider the priority of those separately


            niu Niu Yawei (Inactive) added a comment -

            Hi, Andreas

            I don't have any idea about these two problems ("Failed to find fid" & "always return 1 on second lfsck run") so far. To fix them, I would probably need to read most of the lfsck code, which is not a small task.

            Given that it will be replaced by the new LFSCK soon, and no customer has complained about these two problems, I tend to think we should leave them as they are. Maybe we only need to fix the DNE problem you mentioned?
            adilger Andreas Dilger added a comment - - edited

            Niu, how much effort do you think it is to fix these problems? It concerns me that we need to spend time to fix the old lfsck, when the new LFSCK is going to replace it soon. Also, if this has been broken since 2.1, I don't think users could be depending on it very heavily.

            Finally, I'm also concerned that if the old lfsck is run on a DNE filesystem with multiple MDTs and/or OSTs with FID-on-OST enabled, it is going to do completely the wrong thing, possibly deleting a large number of "unused" objects that are not referenced by MDT0000.

            At a minimum, a check should be added to old lfsck to refuse to run if it finds signs of DNE (e.g. O/seq, seq > 2) on the OSTs.
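The guard suggested above could look roughly like the following. This is only a sketch in shell, not the actual lfsck code: it assumes each OST backing filesystem is mounted locally and readable, and that sequence directories sit directly under O/ and are named by sequence number. The function names has_dne_signs and refuse_if_dne are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (returns 0) if the object tree under "$1/O"
# contains a sequence directory other than 0, 1 or 2, i.e. "seq > 2".
# Assumption: directories directly under O/ are named by sequence number.
has_dne_signs() {
    local ost_root=$1
    local dir name
    for dir in "$ost_root"/O/*/; do
        # An unmatched glob stays literal, so skip anything that is not a directory.
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        case "$name" in
            0|1|2) ;;            # sequences the old lfsck is expected to handle
            *) return 0 ;;       # any other sequence is treated as a DNE / FID-on-OST sign
        esac
    done
    return 1
}

# Hypothetical wrapper: refuse to proceed if any given OST root shows DNE signs.
refuse_if_dne() {
    local ost
    for ost in "$@"; do
        if has_dne_signs "$ost"; then
            echo "refusing to run: $ost/O has a sequence directory other than 0, 1 or 2" >&2
            return 1
        fi
    done
    return 0
}
```

A caller would run something like `refuse_if_dne /mnt/ost0 /mnt/ost1 || exit 1` before building the databases; the mount points here are placeholders.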


            niu Niu Yawei (Inactive) added a comment -

            A few more problems were found while investigating this ticket:

            • ll_lov_setea() overflowed flags; I opened LU-3744 for it;
            • typo in is_empty_fs();
            • in lfsck.sh, sync should be performed when necessary to make sure data is flushed back;
            • if the first run of lfsck fixed some errors, the second run of lfsck does not return clean (0) as expected; it always returns 1 (some errors fixed) instead. The reason is not yet known (and the problem has existed from day one), but I think we should go back to the old lfsck to make it pass for now, and fix it later.
            yujian Jian Yu added a comment -

            Lustre Branch: b2_4
            Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/27/

            The latest test results showed that lfsck test still failed:
            https://maloo.whamcloud.com/test_sets/9743f75a-fd7d-11e2-9fdb-52540035b04c
            https://maloo.whamcloud.com/test_sets/16153bd0-fdaa-11e2-9fd5-52540035b04c

            It has never passed on Lustre b2_4 branch.


            niu Niu Yawei (Inactive) added a comment -

            It looks like lfsck failed to fix some missing object inodes because of "Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found". However, before the fix for LU-2571, the test would still pass because run_lfsck() always returned 0, so lfsck.sh never invoked the second run_lfsck() to verify that the problems were really fixed.

            The fix for LU-2571 corrected the script problem in run_lfsck(). We do verify now, so all lfsck tests should fail (except when running lfsck on a clean fs).

            I'll look more closely at the "Failed to find fid ..." problem; it seems to be a long-standing problem since 2.1.
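The run-then-verify flow described in this comment can be sketched as below. This is not the code from lfsck.sh: check_and_verify is a hypothetical name, and the checker command is passed in as arguments so that a stand-in can replace the real lfsck invocation. It assumes a return code of 0 means clean and 1 means some errors were fixed, as stated elsewhere in this ticket.

```shell
#!/bin/bash
# Hypothetical sketch: run a checker once; if it reports "errors fixed" (rc=1),
# run it a second time and require a clean result (rc=0).
check_and_verify() {
    local rc=0
    "$@" || rc=$?
    case $rc in
        0)
            echo "clean"
            return 0
            ;;
        1)
            # First run fixed something: a second run must come back clean.
            rc=0
            "$@" || rc=$?
            if [ "$rc" -eq 0 ]; then
                echo "fixed and verified"
                return 0
            fi
            echo "second run returned rc=$rc"
            return 1
            ;;
        *)
            echo "first run failed with rc=$rc"
            return 1
            ;;
    esac
}
```

If the wrapped command always reports 0, the second branch is never reached, which is how the old run_lfsck() behavior hid the problem. A checker that keeps returning 1 on the second run makes this sketch fail, matching the failures reported on b2_4.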
            emoly.liu Emoly Liu added a comment -

            After discussion with Niu, we found there were several problems in lfsck test, including

            1) since LU-2780 landed, is_empty_fs() should be changed;

            2) the test with the patch for 1) showed that lfsck failed on an empty fs; the output looks like:

            lfsck: pass4 finished
            lfsck: exit with 10 unfixed errors
            lfsck finished with rc=2
            removed `/tmp/mdsdb'
            removed `/tmp/mdsdb.mdshdr'
            removed `/tmp/ostdb-0'
            removed `/tmp/ostdb-1'
             lfsck : @@@@@@ FAIL: lfsck test 2 - finished with rc=2 
              Trace dump:
              = /root/master/lustre/tests/test-framework.sh:4022:error_noexit()
              = /root/master/lustre/tests/test-framework.sh:4045:error()
              = lfsck.sh:283:main()
            

            3) the meaning of FSCK_MAX_ERR is not clear. IMO, when the fs is not empty, we only do a read-only filesystem check, so does FSCK_MAX_ERR=4 (File system errors left uncorrected) mean we don't need to do a second check?
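For reference when reading the return codes in this thread, here is a small decoder for the exit status bitmask documented in the e2fsck(8) manual page. It is a sketch that assumes lfsck follows the same convention, which this ticket suggests (0 clean, 1 errors fixed, 4 errors left uncorrected) but does not confirm for every value. The function name describe_fsck_rc is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the meaning of an e2fsck-style exit status.
# The status is a bitmask, so several conditions can be set at once.
describe_fsck_rc() {
    local rc=$1 out=""
    local bit text
    if [ "$rc" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    # Each line below is "<bit> <meaning>", taken from the e2fsck(8) exit codes.
    while read -r bit text; do
        if [ $((rc & bit)) -ne 0 ]; then
            out="${out:+$out, }$text"
        fi
    done <<'EOF'
1 errors corrected
2 system should be rebooted
4 errors left uncorrected
8 operational error
16 usage or syntax error
32 cancelled by user request
128 shared library error
EOF
    echo "$out"
}
```

Under this reading, a threshold such as FSCK_MAX_ERR=4 would separate "filesystem has problems" results from operational failures (8 and above), but how the test framework actually uses the value is not stated in this ticket.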


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: maloo Maloo