[LU-3180] Test failure on test suite lfsck: Failed to find fid Created: 16/Apr/13  Updated: 12/Sep/13  Resolved: 27/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: MB
Environment:

server and client: tag-2.3.64 build #1411


Issue Links:
Related
is related to LU-3838 lfsck: Failed to find fid Resolved
is related to LU-3367 Interop 2.1.5<->2.4 failure on test s... Resolved
Severity: 3
Rank (Obsolete): 7749

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ff811592-a66a-11e2-90ad-52540035b04c.

03:53:10:Memory used: 2436k/21180k (745k/1692k), time:  0.20/ 0.07/ 0.02
03:53:10:I/O read: 10MB, write: 0MB, rate: 50.10MB/s
03:53:10:CMD: client-19vm1.lab.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh _check_progs_installed lfsck 
03:53:10:CMD: client-19vm1.lab.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh is_mounted /mnt/lustre 
03:53:10:lfsck -c -l --mdsdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/mdsdb --ostdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-0 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-1 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-2 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-3 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-4 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-5 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-6 /mnt/lustre
03:53:11:CMD: client-19vm1.lab.whamcloud.com lfsck -c -l --mdsdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/mdsdb --ostdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-0 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-1 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-2 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-3 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-4 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-5 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-6 /mnt/lustre
03:53:11:lfsck 1.42.6.wc2 (10-Dec-2012)
03:53:11:lfsck: ost_idx 0: pass1: check for duplicate objects
03:53:11:lfsck: ost_idx 0: pass1 OK (12 files total)
03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda16:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda13:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda12:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda14:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda17:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda18:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda19:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1b:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1a:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1c:0x0]: DB_NOTFOUND: No matching key/data pair found


 Comments   
Comment by Peter Jones [ 22/Apr/13 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 23/Apr/13 ]
03:53:10:Warning!  /dev/mapper/lvm--OSS-P7 is in use.
03:53:10:Warning: skipping journal recovery because doing a read-only filesystem check.

The lfsck is doing a read-only filesystem check, so FSCK_MAX_ERR should be 4 in this case, but the test failed with rc=1 at the end:

03:53:14: lfsck : @@@@@@ FAIL: lfsck test 2 - finished with rc=1 

It looks like the FSCK_MAX_ERR semantics were broken by the fix for LU-2571 (http://review.whamcloud.com/#patch,sidebyside,5139,6,lustre/tests/test-framework.sh), see run_lfsck_remote():

        [ $rc -le $FSCK_MAX_ERR ] ||
                error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"
        echo "lfsck finished with rc=$rc"

        return $rc

I think we should return 0 when ($rc -le $FSCK_MAX_ERR), rather than returning $rc.
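
A minimal sketch of the behavior proposed above, for illustration only. The helper name check_fsck_rc is hypothetical and this is not the actual test-framework.sh code; it only shows the idea of mapping any exit code at or below FSCK_MAX_ERR to success.

```shell
#!/bin/bash
# Hypothetical helper (not from test-framework.sh): treat any exit code
# at or below the FSCK_MAX_ERR threshold as success.
FSCK_MAX_ERR=${FSCK_MAX_ERR:-4}

check_fsck_rc() {
    local rc=$1

    if [ "$rc" -gt "$FSCK_MAX_ERR" ]; then
        # Above the threshold: report it and propagate the original code.
        echo "lfsck returned $rc, should be <= $FSCK_MAX_ERR" >&2
        return "$rc"
    fi

    # Within the threshold: log the real code but report success.
    echo "lfsck finished with rc=$rc"
    return 0
}
```

With the default threshold of 4, an rc of 1 would be reported as success, while an rc of 8 would still be propagated to the caller. The cost of this design is that the caller can no longer tell 0 apart from 1..4.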

Comment by Emoly Liu [ 23/Apr/13 ]

After looking into lfsck.sh, I found that we can't return 0 directly, because lfsck.sh checks the return value of run_lfsck() to decide whether a second run is needed. BTW, the .lustre directory is not empty in the current master code, which makes is_empty_fs() always return false. We should check and fix that as well.
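
A rough sketch of the kind of two-pass flow described above, under stated assumptions. The function name two_pass_check and its argument are hypothetical, and the real logic in lfsck.sh may differ; the sketch only shows why the first run's return value has to reach the caller.

```shell
#!/bin/bash
# Hypothetical two-pass driver (not the actual lfsck.sh code).
# $1 is the name of a command or function that runs the checker and
# returns its exit code.
two_pass_check() {
    local checker=$1
    local rc=0

    # First run: a zero exit code means nothing needed fixing.
    "$checker" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "clean on first run"
        return 0
    fi

    # Non-zero first run: run again to verify the fixes took effect.
    rc=0
    "$checker" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "second run returned $rc, expected 0" >&2
        return 1
    fi

    echo "clean on second run"
    return 0
}
```

If the first run always reported 0, the second branch would never execute and unfixed problems would go unnoticed.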

Comment by Niu Yawei (Inactive) [ 23/Apr/13 ]

http://review.whamcloud.com/6123

Comment by Emoly Liu [ 23/Apr/13 ]

The .lustre/fid entry I mentioned in my last comment was introduced in LU-2780: http://review.whamcloud.com/#change,5298

Comment by Emoly Liu [ 23/Apr/13 ]

After discussion with Niu, we found several problems in the lfsck test, including:

1) since LU-2780 landed, is_empty_fs() needs to be changed;

2) a test run with the patch for 1) showed that lfsck failed on an empty fs, with output like:

lfsck: pass4 finished
lfsck: exit with 10 unfixed errors
lfsck finished with rc=2
removed `/tmp/mdsdb'
removed `/tmp/mdsdb.mdshdr'
removed `/tmp/ostdb-0'
removed `/tmp/ostdb-1'
 lfsck : @@@@@@ FAIL: lfsck test 2 - finished with rc=2 
  Trace dump:
  = /root/master/lustre/tests/test-framework.sh:4022:error_noexit()
  = /root/master/lustre/tests/test-framework.sh:4045:error()
  = lfsck.sh:283:main()

3) the meaning of FSCK_MAX_ERR is not clear. IMO, when the fs is not empty, we only do a read-only filesystem check, so does FSCK_MAX_ERR=4 (File system errors left uncorrected) mean we don't need to do a second check?
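
To make the question about FSCK_MAX_ERR concrete, here is a small sketch that decodes an exit code as a bitmask. It assumes lfsck follows the exit code convention documented for fsck(8) and e2fsck(8); the ticket only confirms the meaning of the value 4. The function name decode_fsck_rc and the short labels are hypothetical.

```shell
#!/bin/bash
# Hypothetical decoder, assuming the fsck(8)/e2fsck(8) exit code bits:
#   1 errors corrected, 2 reboot suggested, 4 errors left uncorrected,
#   8 operational error, 16 usage error, 32 cancelled, 128 library error
decode_fsck_rc() {
    local rc=$1
    local out=""
    local pair bit name

    if [ "$rc" -eq 0 ]; then
        echo "no-errors"
        return 0
    fi

    for pair in 1:errors-corrected 2:reboot-suggested \
                4:errors-uncorrected 8:operational-error \
                16:usage-error 32:cancelled 128:library-error; do
        bit=${pair%%:*}
        name=${pair#*:}
        # Collect the label for every bit set in the exit code.
        if [ $((rc & bit)) -ne 0 ]; then
            out="$out $name"
        fi
    done

    # Strip the leading space added by the first match.
    echo "${out# }"
}
```

Under that assumption, a plain numeric comparison such as rc <= 4 accepts every combination of the three lowest bits, which is one reason the threshold alone does not say whether a second check is needed.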

Comment by Niu Yawei (Inactive) [ 23/Apr/13 ]

It looks like lfsck failed to fix some missing object inodes because of the "Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found" errors. However, before the fix for LU-2571, the test still passed because run_lfsck() always returned 0, so lfsck.sh never invoked the second run_lfsck() to verify whether the problems were really fixed.

The fix for LU-2571 fixed the script problem in run_lfsck(). We do verify now, so all lfsck tests should fail (except when running lfsck on a clean fs).

I'll look more closely at the "Failed to find fid ..." problem; it seems to be a long-standing problem since 2.1.

Comment by Jian Yu [ 09/Aug/13 ]

Lustre Branch: b2_4
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/27/

The latest test results showed that lfsck test still failed:
https://maloo.whamcloud.com/test_sets/9743f75a-fd7d-11e2-9fdb-52540035b04c
https://maloo.whamcloud.com/test_sets/16153bd0-fdaa-11e2-9fd5-52540035b04c

It has never passed on Lustre b2_4 branch.

Comment by Niu Yawei (Inactive) [ 13/Aug/13 ]

A few more problems were found while investigating this ticket:

  • ll_lov_setea() overflowed flags; I opened LU-3744 for it;
  • typo in is_empty_fs();
  • in lfsck.sh, sync should be performed when necessary to make sure data is flushed back;
  • if the first run of lfsck fixed some errors, the second run of lfsck does not return clean (0) as expected; it always returns 1 (some errors fixed) instead. The reason is not yet known (and it has existed from day one), but I think we should go back to the old lfsck to make it pass for now, and fix it later.
Comment by Andreas Dilger [ 23/Aug/13 ]

Niu, how much effort do you think it is to fix these problems? It concerns me that we need to spend time to fix the old lfsck, when the new LFSCK is going to replace it soon. Also, if this has been broken since 2.1, I don't think users could be depending on it very heavily.

Finally, I'm also concerned that if the old lfsck is run on a DNE filesystem with multiple MDTs and/or OSTs with FID-on-OST enabled, it is going to do completely the wrong thing, possibly deleting a large number of "unused" objects that are not referenced by MDT0000.

At a minimum, a check should be added to old lfsck to refuse to run if it finds signs of DNE (e.g. O/seq, seq > 2) on the OSTs.

Comment by Niu Yawei (Inactive) [ 26/Aug/13 ]

Hi, Andreas

I don't have any ideas about these two problems ("Failed to find fid" & "always return 1 on second lfsck run") so far. To fix them, I would probably need to read most of the lfsck code, which is not a small task.

Given that it will be replaced by the new LFSCK soon, and no customer has complained about these two problems, I tend to think we should leave them as they are. Maybe we only need to fix the DNE problem you mentioned?

Comment by Peter Jones [ 26/Aug/13 ]

I certainly think that it makes sense to create new LU tickets for the remaining issues and consider the priority of those separately

Comment by Niu Yawei (Inactive) [ 27/Aug/13 ]

I created LU-3837 for the old lfsck on DNE problem. I think we should keep this ticket open (for the "Failed to find fid" problem) but lower the priority, because the lfsck test won't fail on this error message.

Comment by Niu Yawei (Inactive) [ 27/Aug/13 ]

I created LU-3838 for the two remaining issues; this ticket can be closed.

Comment by Niu Yawei (Inactive) [ 27/Aug/13 ]

patch landed on b2_4 & master.
