Details

    • Bug
    • Resolution: Won't Fix
    • Critical
    • None
    • Lustre 2.5.0
    • None
    • 3
    • 9935

    Description

      There are two problems in the current lfsck:

      1. When lfsck is run to fix problems, it often shows the following error messages:

      03:53:11:lfsck: ost_idx 0: pass1: check for duplicate objects
      03:53:11:lfsck: ost_idx 0: pass1 OK (12 files total)
      03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
      03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda16:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda13:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda12:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda14:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda17:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda18:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda19:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda1b:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda1a:0x0]: DB_NOTFOUND: No matching key/data pair found
      03:53:11:Failed to find fid [0x2000013a1:0xda1c:0x0]: DB_NOTFOUND: No matching key/data pair found
      

      2. After lfsck has been run to fix problems, a second run of lfsck does not return 0 (filesystem is clean) as expected; it always returns 1 (some errors fixed) instead.
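For triage, the two symptoms above can be checked mechanically. The following is a minimal sketch, not part of lfsck itself: the helper names `missing_fids` and `describe_exit` are hypothetical, the sample text is copied from the log excerpt in this ticket, and only the two exit codes named in the description (0 and 1) are interpreted. Each FID is split into its three colon-separated fields (sequence, object ID, version).

```python
import re

# Matches lines such as:
#   Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: ...
FID_RE = re.compile(
    r"Failed to find fid "
    r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]: DB_NOTFOUND"
)


def missing_fids(log_text):
    """Return (seq, oid, ver) integer tuples, one per DB_NOTFOUND line."""
    return [
        tuple(int(part, 16) for part in match.groups())
        for match in FID_RE.finditer(log_text)
    ]


def describe_exit(code):
    """Map an lfsck exit code to the meaning given in this ticket.

    Only 0 and 1 are described in the ticket; anything else is left
    uninterpreted rather than guessed.
    """
    meanings = {0: "filesystem is clean", 1: "some errors fixed"}
    return meanings.get(code, "not described in this ticket")


sample = """\
03:53:11:lfsck: ost_idx 0: pass2: check for missing inode objects
03:53:11:Failed to find fid [0x2000013a1:0xda11:0x0]: DB_NOTFOUND: No matching key/data pair found
03:53:11:Failed to find fid [0x2000013a1:0xda15:0x0]: DB_NOTFOUND: No matching key/data pair found
"""

if __name__ == "__main__":
    for seq, oid, ver in missing_fids(sample):
        print(f"seq={seq:#x} oid={oid:#x} ver={ver:#x}")
    print(describe_exit(1))
```

Comparing the FID list from the first run with that from a second run would show whether the same objects are reported again, which is one way to narrow down why the second run still returns 1.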

          Activity

            [LU-3838] lfsck: Failed to find fid

            The old lfsck has been replaced by the new LFSCK.

            niu Niu Yawei (Inactive) added a comment

            It looks like something is wrong in the clio code: when an application tries to open the files in a directory one by one, the open sometimes returns -5, and the log shows -5 returned from lov_init_sub().

            00020000:00020000:1.0:1378698109.981118:0:10675:0:(lov_object.c:184:lov_init_sub()) ....lovsub@ffff880070a79e30[0]
            00020000:00020000:1.0:1378698109.982691:0:10675:0:(lov_object.c:184:lov_init_sub()) ....osc@ffff88001b8b0e78id: 0x0:42 idx: 1 gen: 0 kms_valid: 0 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
            00020000:00020000:1.0:1378698109.987134:0:10675:0:(lov_object.c:184:lov_init_sub()) } header@ffff880070a79d98
            00020000:00020000:1.0:1378698109.988513:0:10675:0:(lov_object.c:184:lov_init_sub()) stripe 0 is already owned.
            00020000:00020000:1.0:1378698109.990800:0:10675:0:(lov_object.c:185:lov_init_sub()) header@ffff880005c70ef8[0x0, 1, [0xc8:0xe788e95c:0x0] hash]{
            00020000:00020000:1.0:1378698109.992896:0:10675:0:(lov_object.c:185:lov_init_sub()) ....vvp@ffff880005c70f90(- 0 0) inode: ffff88001101eb78 200/3884509532 100644 1 1 ffff880005c70f90 [0xc8:0xe788e95c:0x0]
            00020000:00020000:1.0:1378698109.997409:0:10675:0:(lov_object.c:185:lov_init_sub()) ....lov@ffff88000876fd98stripes: 1, valid, lsm{ffff88000d7fd1c0 0x0BD10BD0 1 1 0}:
            00020000:00020000:1.0:1378698110.000816:0:10675:0:(lov_object.c:185:lov_init_sub()) header@ffff880070a79d98[0x0, 2, [0x100010000:0x2a:0x0] hash]{
            00020000:00020000:1.0:1378698110.002922:0:10675:0:(lov_object.c:185:lov_init_sub()) ....lovsub@ffff880070a79e30[0]
            00020000:00020000:1.0:1378698110.004489:0:10675:0:(lov_object.c:185:lov_init_sub()) ....osc@ffff88001b8b0e78id: 0x0:42 idx: 1 gen: 0 kms_valid: 0 kms 0 rc: 0 force_sync: 0 min_xid: 0 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
            00020000:00020000:1.0:1378698110.008323:0:10675:0:(lov_object.c:185:lov_init_sub()) } header@ffff880070a79d98
            00020000:00020000:1.0:1378698110.009925:0:10675:0:(lov_object.c:185:lov_init_sub())
            00020000:00020000:1.0:1378698110.011039:0:10675:0:(lov_object.c:185:lov_init_sub()) } header@ffff880005c70ef8
            00020000:00020000:1.0:1378698110.012549:0:10675:0:(lov_object.c:185:lov_init_sub()) owned.
            00020000:00020000:1.0:1378698110.013699:0:10675:0:(lov_object.c:186:lov_init_sub()) header@ffff880003ccdb18[0x0, 1, [0x200000400:0x6c:0x0]]
            00020000:00020000:1.0:1378698110.015696:0:10675:0:(lov_object.c:186:lov_init_sub()) try to own.
            00000020:00000001:1.0:1378698110.016908:0:10675:0:(lustre_fid.h:714:fid_flatten32()) Process leaving (rc=251658026 : 251658026 : effff2a)
            00020000:00000001:1.0:1378698110.016910:0:10675:0:(lov_object.c:258:lov_init_raid0()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00020000:00000001:1.0:1378698110.016911:0:10675:0:(lov_object.c:749:lov_object_init()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00000020:00000001:1.0:1378698110.016912:0:10675:0:(lustre_fid.h:714:fid_flatten32()) Process leaving (rc=4194412 : 4194412 : 40006c)
            

            I'll try to compose a reproducer later.
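A note on reading the "Process leaving" lines in the trace above: each rc is printed three ways (unsigned decimal, signed decimal, hex), so 18446744073709551611, -5 and fffffffffffffffb are the same 64-bit value. The sketch below reproduces that conversion; `as_signed64` is a hypothetical helper name, and mapping 5 to EIO assumes the usual Linux errno numbering.

```python
import errno


def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    value &= (1 << 64) - 1          # keep only the low 64 bits
    if value >= (1 << 63):          # top bit set means a negative number
        return value - (1 << 64)
    return value


if __name__ == "__main__":
    raw = 18446744073709551611      # rc as logged by lov_init_raid0()
    rc = as_signed64(raw)
    # errno.errorcode maps a positive errno number to its symbolic name.
    name = errno.errorcode.get(-rc, "unknown") if rc < 0 else "n/a"
    print(f"{raw} : {rc} : {raw:x} ({name})")
```

The non-negative rc values from fid_flatten32() pass through unchanged, which matches the identical first two columns on those lines.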

            niu Niu Yawei (Inactive) added a comment

            The set LOV EA functionality was lost, which left lfsck unable to fix orphan objects: http://review.whamcloud.com/7573

            niu Niu Yawei (Inactive) added a comment

            It looks like there are quite a few defects in lfsck; I wonder whether it has ever worked before.

            The first problem can be fixed by this patch: http://review.whamcloud.com/7563

            The second problem does not look so severe; it is probably caused by the 'saved orphan' created during the first lfsck run. I'll dig into it further.

            niu Niu Yawei (Inactive) added a comment

            Has this problem already been discussed in some other ticket? If so, that bug should be linked to this one. I think I recall discussion of both the DB_NOTFOUND error and the non-zero return code in some other ticket, or possibly in a patch.

            In any case, these seem like serious problems with lfsck.

            adilger Andreas Dilger added a comment

            People

              niu Niu Yawei (Inactive)
              niu Niu Yawei (Inactive)
              Votes: 0
              Watchers: 4
