[LU-3838] lfsck: Failed to find fid Created: 27/Aug/13  Updated: 22/Jun/16  Resolved: 22/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Niu Yawei (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3180 Test failure on test suite lfsck: Fai... Resolved
Severity: 3
Rank (Obsolete): 9935

 Description   

There are two problems in current lfsck:

1. When lfsck is run to apply fixes, it often prints the following error messages:

03:53:11:lfsck: ost_idx 0: pass1: check for duplicate objects
03:53:11:lfsck: ost_idx 0: pass1 OK (12 files total)
03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda16:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda13:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda12:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda14:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda17:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda18:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda19:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1b:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1a:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda1c:0x0]: DB_NOTFOUND: No matching key/data pair found

2. After lfsck has been run to fix problems, a second run of lfsck doesn't return 0 (filesystem is clean) as expected; it always returns 1 (some errors fixed) instead.
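
The two symptoms above can be checked mechanically from captured output. The following is only a minimal sketch: the helper names (`missing_fids`, `describe_rc`, `second_run_is_clean`) are hypothetical and not part of lfsck, the regular expression assumes the exact message format quoted above, and the exit-code meanings are taken from this description (0 = filesystem is clean, 1 = some errors fixed).

```python
import re

# Assumed message format, copied from the log lines quoted in this ticket:
#   Failed to find fid [<seq>:<oid>:<ver>]: DB_NOTFOUND: ...
FID_RE = re.compile(
    r"Failed to find fid \[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]"
)

# Exit-code meanings as stated in this ticket's description.
RC_MEANING = {0: "clean", 1: "errors fixed"}


def missing_fids(log_text):
    """Return (sequence, object id, version) tuples for each 'Failed to find fid' line."""
    return [
        tuple(int(part, 16) for part in match.groups())
        for match in FID_RE.finditer(log_text)
    ]


def describe_rc(rc):
    """Map an exit code to the meaning given in the description, or 'other'."""
    return RC_MEANING.get(rc, "other")


def second_run_is_clean(second_rc):
    """Expected behaviour: the run after a fixing run reports a clean filesystem."""
    return second_rc == 0


if __name__ == "__main__":
    sample = (
        "03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects\n"
        "03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: "
        "DB_NOTFOUND: No matching key/data pair found\n"
    )
    for seq, oid, ver in missing_fids(sample):
        print(f"missing fid seq={seq:#x} oid={oid:#x} ver={ver:#x}")
    print(describe_rc(1), second_run_is_clean(1))
```

With the behaviour reported here, `second_run_is_clean` would be false for the second run, since that run still returns 1.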



 Comments   
Comment by Andreas Dilger [ 30/Aug/13 ]

Has this problem already been discussed in some other ticket? If so, that bug should be linked to this one. I think I recall discussion of both the DB_NOTFOUND error and the non-zero return code in some other ticket, or possibly in a patch.

In any case, these seem like serious problems with lfsck.

Comment by Niu Yawei (Inactive) [ 05/Sep/13 ]

It looks like there are quite a few defects in lfsck; I wonder whether it has ever worked before.

The first problem can be fixed by this patch: http://review.whamcloud.com/7563

The second problem looks less severe. It could be caused by the 'saved orphan' created during the first lfsck run; I'll dig into it further.

Comment by Niu Yawei (Inactive) [ 06/Sep/13 ]

The set LOV EA functionality was lost, which left lfsck unable to fix orphan objects: http://review.whamcloud.com/7573

Comment by Niu Yawei (Inactive) [ 09/Sep/13 ]

It looks like something is wrong in the clio code: when an application tries to open the files in a directory one by one, the open sometimes returns -5, and the log shows -5 returned from lov_init_sub().

00020000:00020000:1.0:1378698109.981118:0:10675:0:(lov_object.c:184:lov_init_sub()) ....lovsub@ffff880070a79e30[0]
00020000:00020000:1.0:1378698109.982691:0:10675:0:(lov_object.c:184:lov_init_sub()) ....osc@ffff88001b8b0e78id: 0x0:42 idx: 1 gen: 0 kms_valid: 0 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
00020000:00020000:1.0:1378698109.987134:0:10675:0:(lov_object.c:184:lov_init_sub()) } header@ffff880070a79d98
00020000:00020000:1.0:1378698109.988513:0:10675:0:(lov_object.c:184:lov_init_sub()) stripe 0 is already owned.
00020000:00020000:1.0:1378698109.990800:0:10675:0:(lov_object.c:185:lov_init_sub()) header@ffff880005c70ef8[0x0, 1, [0xc8:0xe788e95c:0x0] hash]{
00020000:00020000:1.0:1378698109.992896:0:10675:0:(lov_object.c:185:lov_init_sub()) ....vvp@ffff880005c70f90(- 0 0) inode: ffff88001101eb78 200/3884509532 100644 1 1 ffff880005c70f90 [0xc8:0xe788e95c:0x0]
00020000:00020000:1.0:1378698109.997409:0:10675:0:(lov_object.c:185:lov_init_sub()) ....lov@ffff88000876fd98stripes: 1, valid, lsm{ffff88000d7fd1c0 0x0BD10BD0 1 1 0}:
00020000:00020000:1.0:1378698110.000816:0:10675:0:(lov_object.c:185:lov_init_sub()) header@ffff880070a79d98[0x0, 2, [0x100010000:0x2a:0x0] hash]{
00020000:00020000:1.0:1378698110.002922:0:10675:0:(lov_object.c:185:lov_init_sub()) ....lovsub@ffff880070a79e30[0]
00020000:00020000:1.0:1378698110.004489:0:10675:0:(lov_object.c:185:lov_init_sub()) ....osc@ffff88001b8b0e78id: 0x0:42 idx: 1 gen: 0 kms_valid: 0 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
00020000:00020000:1.0:1378698110.008323:0:10675:0:(lov_object.c:185:lov_init_sub()) } header@ffff880070a79d98
00020000:00020000:1.0:1378698110.009925:0:10675:0:(lov_object.c:185:lov_init_sub())
00020000:00020000:1.0:1378698110.011039:0:10675:0:(lov_object.c:185:lov_init_sub()) } header@ffff880005c70ef8
00020000:00020000:1.0:1378698110.012549:0:10675:0:(lov_object.c:185:lov_init_sub()) owned.
00020000:00020000:1.0:1378698110.013699:0:10675:0:(lov_object.c:186:lov_init_sub()) header@ffff880003ccdb18[0x0, 1, [0x200000400:0x6c:0x0]]
00020000:00020000:1.0:1378698110.015696:0:10675:0:(lov_object.c:186:lov_init_sub()) try to own.
00000020:00000001:1.0:1378698110.016908:0:10675:0:(lustre_fid.h:714:fid_flatten32()) Process leaving (rc=251658026 : 251658026 : effff2a)
00020000:00000001:1.0:1378698110.016910:0:10675:0:(lov_object.c:258:lov_init_raid0()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00020000:00000001:1.0:1378698110.016911:0:10675:0:(lov_object.c:749:lov_object_init()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
00000020:00000001:1.0:1378698110.016912:0:10675:0:(lustre_fid.h:714:fid_flatten32()) Process leaving (rc=4194412 : 4194412 : 40006c)

I'll try to compose a reproducer later.
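
The "Process leaving" lines in the trace above print each rc three ways: unsigned decimal, signed decimal, and hex. As a minimal sketch, assuming a 64-bit two's-complement return value, the conversion and the errno lookup can be reproduced as follows; `to_signed64` and `errno_name` are hypothetical helper names, and the errno mapping is that of the platform running the script.

```python
import errno


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a signed two's-complement value."""
    value &= (1 << 64) - 1  # keep only the low 64 bits
    return value - (1 << 64) if value >= (1 << 63) else value


def errno_name(rc):
    """Return the symbolic errno name for a negative rc, or None if rc is not an error."""
    if rc >= 0:
        return None
    return errno.errorcode.get(-rc)


if __name__ == "__main__":
    raw = 18446744073709551611  # unsigned rc from lov_init_raid0() in the trace
    rc = to_signed64(raw)
    print(raw, rc, format(raw, "x"), errno_name(rc))
```

On Linux, -5 corresponds to EIO, so the open failure described here surfaces to the application as an I/O error.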

Comment by Niu Yawei (Inactive) [ 22/Jun/16 ]

The old lfsck has been replaced by the new LFSCK.
