[LU-3180] Test failure on test suite lfsck: Failed to find fid Created: 16/Apr/13 Updated: 12/Sep/13 Resolved: 27/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB | ||
| Environment: |
server and client: tag-2.3.64 build #1411 |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 7749 | ||||||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ff811592-a66a-11e2-90ad-52540035b04c.
03:53:10:Memory used: 2436k/21180k (745k/1692k), time: 0.20/ 0.07/ 0.02
03:53:10:I/O read: 10MB, write: 0MB, rate: 50.10MB/s
03:53:10:CMD: client-19vm1.lab.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh _check_progs_installed lfsck
03:53:10:CMD: client-19vm1.lab.whamcloud.com PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh is_mounted /mnt/lustre
03:53:10:lfsck -c -l --mdsdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/mdsdb --ostdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-0 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-1 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-2 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-3 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-4 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-5 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-6 /mnt/lustre
03:53:11:CMD: client-19vm1.lab.whamcloud.com lfsck -c -l --mdsdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/mdsdb --ostdb /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-0 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-1 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-2 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-3 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-4 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-5 /home/autotest/.autotest/shared_dir/2013-04-14/224508-70192991849440/ostdb-6 /mnt/lustre
03:53:11:lfsck 1.42.6.wc2 (10-Dec-2012)
03:53:11:lfsck: ost_idx 0: pass1: check for duplicate objects
03:53:11:lfsck: ost_idx 0: pass1 OK (12 files total)
03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda16:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda13:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda12:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda14:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda17:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda18:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda19:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1b:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1a:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1c:0x0]: DB_NOTFOUND: No matching key/data pair found |
| Comments |
| Comment by Peter Jones [ 22/Apr/13 ] |
|
Niu, could you please look into this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 23/Apr/13 ] |
03:53:10:Warning! /dev/mapper/lvm--OSS-P7 is in use.
03:53:10:Warning: skipping journal recovery because doing a read-only filesystem check.
lfsck is doing a read-only filesystem check, so FSCK_MAX_ERR should be 4 in this case, but it failed with rc=1 at the end:
03:53:14: lfsck : @@@@@@ FAIL: lfsck test 2 - finished with rc=1
It looks like the FSCK_MAX_ERR semantics were broken in the fix of
[ $rc -le $FSCK_MAX_ERR ] ||
        error "$cmd returned $rc, should be <= $FSCK_MAX_ERR"
echo "lfsck finished with rc=$rc"
return $rc
I think we should return 0 when ($rc -le $FSCK_MAX_ERR) instead of returning $rc. |
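The proposal in the comment above can be sketched roughly as follows. This is only an illustration, not the code in lfsck.sh: `check_fsck_rc` and `LAST_FSCK_RC` are hypothetical names, and the default of 4 is simply the read-only threshold quoted in this ticket. The raw code is stashed in a variable on the assumption that a caller may still want to inspect it after the pass/fail decision.

```shell
#!/bin/sh
# Assumed default: 4 is the threshold quoted in this ticket for read-only checks.
FSCK_MAX_ERR=${FSCK_MAX_ERR:-4}
# Hypothetical variable: keeps the raw lfsck return code for the caller.
LAST_FSCK_RC=0

# check_fsck_rc is an illustrative helper, not a function from lfsck.sh.
# It returns 0 when the given code is within FSCK_MAX_ERR and 1 otherwise.
check_fsck_rc() {
	LAST_FSCK_RC=$1
	if [ "$LAST_FSCK_RC" -le "$FSCK_MAX_ERR" ]; then
		echo "lfsck finished with rc=$LAST_FSCK_RC"
		return 0
	fi
	echo "lfsck returned $LAST_FSCK_RC, should be <= $FSCK_MAX_ERR" >&2
	return 1
}
```

A caller could run `check_fsck_rc "$rc" || error "..."` and then read `$LAST_FSCK_RC` if it needs the original value.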
| Comment by Emoly Liu [ 23/Apr/13 ] |
|
After looking into lfsck.sh, I find we can't return 0 directly, because lfsck.sh checks the return value of run_lfsck() to decide whether a second run is needed. BTW, the .lustre directory is not empty in the current master code, which makes is_empty_fs() always return false. We should check and fix this as well. |
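One way to keep an emptiness check from being tripped by .lustre is to ignore that entry when listing the mount point. The sketch below is not the is_empty_fs() from the test framework; `is_empty_ignoring_dot_lustre` is a hypothetical name, and it assumes a `find` that supports `-mindepth` and `-maxdepth` (GNU or BSD find).

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) when the directory contains
# nothing at its top level other than an optional .lustre entry.
is_empty_ignoring_dot_lustre() {
	dir=$1
	# Print top-level entries except .lustre and keep only the first one.
	first=$(find "$dir" -mindepth 1 -maxdepth 1 ! -name .lustre -print | head -n 1)
	[ -z "$first" ]
}
```

For example, `is_empty_ignoring_dot_lustre /mnt/lustre && echo empty` would report a freshly formatted filesystem as empty even though .lustre/fid exists.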
| Comment by Niu Yawei (Inactive) [ 23/Apr/13 ] |
| Comment by Emoly Liu [ 23/Apr/13 ] |
|
The .lustre/fid I mentioned in my last comment was introduced in |
| Comment by Emoly Liu [ 23/Apr/13 ] |
|
After discussion with Niu, we found several problems in the lfsck test, including:
1) since
2) the test with the patch for 1) showed that lfsck failed on an empty fs; the output looks like
lfsck: pass4 finished
lfsck: exit with 10 unfixed errors
lfsck finished with rc=2
removed `/tmp/mdsdb'
removed `/tmp/mdsdb.mdshdr'
removed `/tmp/ostdb-0'
removed `/tmp/ostdb-1'
lfsck : @@@@@@ FAIL: lfsck test 2 - finished with rc=2
Trace dump:
= /root/master/lustre/tests/test-framework.sh:4022:error_noexit()
= /root/master/lustre/tests/test-framework.sh:4045:error()
= lfsck.sh:283:main()
3) the meaning of FSCK_MAX_ERR is not clear. IMO, when the fs is not empty, we only do a read-only filesystem check, so FSCK_MAX_ERR=4 (File system errors left uncorrected) means we don't need to do a second check? |
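For reference when reading the rc values above, the e2fsck(8) exit status is a bitmask, and 4 is the "File system errors left uncorrected" bit quoted in point 3. The sketch below decodes such a value; `describe_fsck_rc` is a hypothetical helper, and whether lfsck's own return codes follow the e2fsck table exactly is an assumption here, not something this ticket confirms.

```shell
#!/bin/sh
# Hypothetical helper: prints one line per bit set in an e2fsck-style
# exit status, using the meanings documented in e2fsck(8).
describe_fsck_rc() {
	rc=$1
	if [ "$rc" -eq 0 ]; then
		echo "no errors"
		return 0
	fi
	[ $((rc & 1)) -ne 0 ] && echo "errors corrected"
	[ $((rc & 2)) -ne 0 ] && echo "errors corrected, system should be rebooted"
	[ $((rc & 4)) -ne 0 ] && echo "errors left uncorrected"
	[ $((rc & 8)) -ne 0 ] && echo "operational error"
	[ $((rc & 16)) -ne 0 ] && echo "usage or syntax error"
	[ $((rc & 32)) -ne 0 ] && echo "canceled by user request"
	[ $((rc & 128)) -ne 0 ] && echo "shared library error"
	return 0
}
```

Because the status is a bitmask, a comparison such as `$rc -le $FSCK_MAX_ERR` treats 1, 2 and 3 as acceptable whenever 4 is acceptable, which is one reason the intended meaning of the threshold needs to be spelled out.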
| Comment by Niu Yawei (Inactive) [ 23/Apr/13 ] |
|
It looks like lfsck failed to fix some missing object inodes because of the "Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found" error; however, before the fix of The fix of I'll look more closely at the "Failed to find fid ..." problem; it seems to have been a long-standing problem since 2.1. |
| Comment by Jian Yu [ 09/Aug/13 ] |
|
Lustre Branch: b2_4
The latest test results show that the lfsck test still fails:
It has never passed on the Lustre b2_4 branch. |
| Comment by Niu Yawei (Inactive) [ 13/Aug/13 ] |
|
A few more problems were found while investigating this ticket:
|
| Comment by Andreas Dilger [ 23/Aug/13 ] |
|
Niu, how much effort do you think it is to fix these problems? It concerns me that we need to spend time to fix the old lfsck, when the new LFSCK is going to replace it soon. Also, if this has been broken since 2.1, I don't think users could be depending on it very heavily. Finally, I'm also concerned that if the old lfsck is run on a DNE filesystem with multiple MDTs and/or OSTs with FID-on-OST enabled, it is going to do completely the wrong thing, possibly deleting a large number of "unused" objects that are not referenced by MDT0000. At a minimum, a check should be added to old lfsck to refuse to run if it finds signs of DNE (e.g. O/seq, seq > 2) on the OSTs. |
| Comment by Niu Yawei (Inactive) [ 26/Aug/13 ] |
|
Hi Andreas, I don't have any ideas about these two problems ("Fail to find FID" & "always return 1 on second lfsck run") so far. To fix them, I think I would probably need to read most of the lfsck code, which is not a small task. Given that it will be replaced by the new LFSCK soon, and no customer has complained about these two problems, I tend to think we should leave them as they are. Maybe we only need to fix the DNE problem you mentioned? |
| Comment by Peter Jones [ 26/Aug/13 ] |
|
I certainly think that it makes sense to create new LU tickets for the remaining issues and consider the priority of those separately |
| Comment by Niu Yawei (Inactive) [ 27/Aug/13 ] |
|
I created |
| Comment by Niu Yawei (Inactive) [ 27/Aug/13 ] |
|
I created |
| Comment by Niu Yawei (Inactive) [ 27/Aug/13 ] |
|
Patch landed on b2_4 & master. |