[LU-3838] lfsck: Failed to find fid Created: 27/Aug/13 Updated: 22/Jun/16 Resolved: 22/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Niu Yawei (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9935 |
| Description |
|
There are two problems in the current lfsck:

1. When lfsck is run to fix problems, it often prints the following error messages:

03:53:11:lfsck: ost_idx 0: pass1: check for duplicate objects
03:53:11:lfsck: ost_idx 0: pass1 OK (12 files total)
03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda16:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda13:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda12:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda14:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda17:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda18:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda19:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1b:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1a:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1c:0x0]: DB_NOTFOUND: No matching key/data pair found

2. After lfsck has been run to fix problems, a second run does not return 0 (filesystem is clean) as expected; it always returns 1 (some errors fixed) instead. |
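When triaging output like the pass2 messages above, it can help to pull the affected FIDs out of the log so they can be compared across runs. The following is a minimal sketch, not part of lfsck: the helper name `missing_fids` and the regular expression are assumptions made for illustration, and the pattern is based only on the message format quoted in this ticket. The three bracketed fields are read as the FID's sequence, object ID, and version.

```python
import re

# Pattern for the pass2 error lines quoted in the description.
# Assumption: every such line has the form
#   "Failed to find fid [0xSEQ:0xOID:0xVER]: DB_NOTFOUND: ..."
FID_ERROR = re.compile(
    r"Failed to find fid "
    r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]: DB_NOTFOUND"
)


def missing_fids(log_text):
    """Return (seq, oid, ver) integer tuples for every DB_NOTFOUND line."""
    return [
        tuple(int(part, 16) for part in match.groups())
        for match in FID_ERROR.finditer(log_text)
    ]


# Two lines copied from the description, plus one non-error line.
sample = """\
03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
"""

fids = missing_fids(sample)
for seq, oid, ver in fids:
    # Print each FID back in the bracketed hex form used by the log.
    print(f"[{seq:#x}:{oid:#x}:{ver:#x}]")
```

Collecting the FIDs from two consecutive runs and comparing the sets would show whether the same objects are reported each time.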
| Comments |
| Comment by Andreas Dilger [ 30/Aug/13 ] |
|
Has this problem already been discussed in some other ticket? If so, that bug should be linked to this one. I think I recall discussion of both the DB_NOTFOUND errors and the non-zero return code in some other ticket, or possibly in a patch. In any case, these seem like serious problems with lfsck. |
| Comment by Niu Yawei (Inactive) [ 05/Sep/13 ] |
|
It looks like there are quite a few defects in lfsck; I wonder whether it has ever worked. The first problem can be fixed by this patch: http://review.whamcloud.com/7563 The second problem looks less severe; it is probably caused by the 'saved orphan' created during the first lfsck run. I'll dig into it further. |
| Comment by Niu Yawei (Inactive) [ 06/Sep/13 ] |
|
The "set LOV EA" functionality was lost, which left lfsck unable to fix orphan objects: http://review.whamcloud.com/7573 |
| Comment by Niu Yawei (Inactive) [ 09/Sep/13 ] |
|
It looks like something is wrong in the clio code: when an application tries to open the files in a directory one by one, the open sometimes returns -5, and the log shows -5 being returned from lov_init_sub().

00020000:00020000:1.0:1378698109.981118:0:10675:0:(lov_object.c:184:lov_init_sub()) ....lovsub@ffff880070a79e30[0]
00020000:00020000:1.0:1378698109.982691:0:10675:0:(lov_object.c:184:lov_init_sub()) ....osc@ffff88001b8b0e78id: 0x0:42 idx: 1 gen: 0 kms_valid: 0 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
00020000:00020000:1.0:1378698109.987134:0:10675:0:(lov_object.c:184:lov_init_sub()) } header@ffff880070a79d98
00020000:00020000:1.0:1378698109.988513:0:10675:0:(lov_object.c:184:lov_init_sub()) stripe 0 is already owned.
00020000:00020000:1.0:1378698109.990800:0:10675:0:(lov_object.c:185:lov_init_sub()) header@ffff880005c70ef8[0x0, 1, [0xc8:0xe788e95c:0x0] hash]{
00020000:00020000:1.0:1378698109.992896:0:10675:0:(lov_object.c:185:lov_init_sub()) ....vvp@ffff880005c70f90(- 0 0) inode: ffff88001101eb78 200/3884509532 100644 1 1 ffff880005c70f90 [0xc8:0xe788e95c:0x0]
00020000:00020000:1.0:1378698109.997409:0:10675:0:(lov_object.c:185:lov_init_sub()) ....lov@ffff88000876fd98stripes: 1, valid, lsm{ffff88000d7fd1c0 0x0BD10BD0 1 1 0}:
00020000:00020000:1.0:1378698110.000816:0:10675:0:(lov_object.c:185:lov_init_sub()) header@ffff880070a79d98[0x0, 2, [0x100010000:0x2a:0x0] hash]{
00020000:00020000:1.0:1378698110.002922:0:10675:0:(lov_object.c:185:lov_init_sub()) ....lovsub@ffff880070a79e30[0]
00020000:00020000:1.0:1378698110.004489:0:10675:0:(lov_object.c:185:lov_init_sub()) ....osc@ffff88001b8b0e78id: 0x0:42 idx: 1 gen: 0 kms_valid: 0 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
00020000:00020000:1.0:1378698110.008323:0:10675:0:(lov_object.c:185:lov_init_sub()) } header@ffff880070a79d98
00020000:00020000:1.0:1378698110.009925:0:10675:0:(lov_object.c:185:lov_init_sub())
00020000:00020000:1.0:1378698110.011039:0:10675:0:(lov_object.c:185:lov_init_sub()) } header@ffff880005c70ef8
00020000:00020000:1.0:1378698110.012549:0:10675:0:(lov_object.c:185:lov_init_sub()) owned.
00020000:00020000:1.0:1378698110.013699:0:10675:0:(lov_object.c:186:lov_init_sub()) header@ffff880003ccdb18[0x0, 1, [0x200000400:0x6c:0x0]]
00020000:00020000:1.0:1378698110.015696:0:10675:0:(lov_object.c:186:lov_init_sub()) try to own.
00000020:00000001:1.0:1378698110.016908:0:10675:0:(lustre_fid.h:714:fid_flatten32()) Process leaving (rc=251658026 : 251658026 : effff2a)
00020000:00000001:1.0:1378698110.016910:0:10675:0:(lov_object.c:258:lov_init_raid0()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00020000:00000001:1.0:1378698110.016911:0:10675:0:(lov_object.c:749:lov_object_init()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00000020:00000001:1.0:1378698110.016912:0:10675:0:(lustre_fid.h:714:fid_flatten32()) Process leaving (rc=4194412 : 4194412 : 40006c)
I'll try to compose a reproducer later. |
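The "Process leaving" lines in the trace above print each return code three ways: as an unsigned 64-bit decimal, as a signed decimal, and in hex. As a hedged illustration of how 18446744073709551611 corresponds to -5, the sketch below reinterprets the unsigned value as a two's-complement signed integer and looks up the errno name. The helper `to_signed64` is a hypothetical name introduced here, not a Lustre function, and the errno lookup assumes a Linux-style errno table.

```python
import errno
import os


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    value &= (1 << 64) - 1  # keep only the low 64 bits
    return value - (1 << 64) if value >= (1 << 63) else value


# Value copied from the lov_init_raid0() "Process leaving" line.
rc = to_signed64(18446744073709551611)
print(rc)

# Kernel code returns negative errno values, so negate before the lookup.
print(errno.errorcode[-rc], os.strerror(-rc))
```

Small positive values, such as the fid_flatten32() results in the same trace, pass through unchanged because their top bit is clear.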
| Comment by Niu Yawei (Inactive) [ 22/Jun/16 ] |
|
Old lfsck has been replaced by new LFSCK. |