Lustre / LU-2193

lvbo_init failed for resource XXX: rc -2, after recovery

Details


    Description

We are seeing a large number (>30,000) of the following errors after recovery on our OSTs. This is only observed when the OSTs are under load at the time they are restarted. I believe this is the same as issue LU-463, but I thought I'd open it as a separate issue until someone else can confirm. If these errors actually do percolate all the way back to the client and are returned to the application, then this is a serious issue for us.

      Lustre: Lustre: Build Version: orion-2_2_49_57_1-44chaos-44chaos--PRISTINE-2.6.32-220.13.1.2chaos.ch5.x86_64
      LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2
      LustreError: 5497:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
      Lustre: lcz-OST0000: Mounted
      Lustre: lcz-OST0000: Will be in recovery for at least 5:00, or until 60 clients reconnect.
      Lustre: lcz-OST0000: Recovery over after 1:12, of 60 clients 60 recovered and 0 were evicted.
      LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25345259: rc -2
      LustreError: 5663:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25352185: rc -2
      LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2 previous similar messages
      LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25353836: rc -2
      LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 168 previous similar messages
      LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25418903: rc -2
      LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 561 previous similar messages
      LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25627504: rc -2
      LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2043 previous similar messages
      LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25772020: rc -2
      LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 4424 previous similar messages
      LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 26456381: rc -2
      LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 10386 previous similar messages
      LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 28049681: rc -2
      LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 23629 previous similar messages
      

      Attachments

        Activity

          [LU-2193] lvbo_init failed for resource XXX: rc -2, after recovery
          bobijam Zhenyu Xu added a comment - - edited

          patch http://review.whamcloud.com/6266

          commit message
          LU-2193 ofd: check object exists before orphan destroy
          
MDS replay of object destroys after recovery can encounter non-existent
objects; skip them before attempting further futile actions.
          

We need to check for object existence before trying to destroy an orphan object.

          bobijam Zhenyu Xu added a comment -

the lvbo_init failure still happens in ofd_orphans_destroy().


It seems that the initial patch to do the object lookup before the destroy has slowed these messages down but not eliminated them entirely, or maybe I just don't see as many messages under my test load. It appears that the lvbo_init objects are NOT in the range of the just-cleaned-up orphan objects, which is what I would expect from the original debugging:

          Lustre: testfs-OST0000: deleting orphan objects from 0x0:28913 to 28929
          Lustre: testfs-OST0001: deleting orphan objects from 0x0:29944 to 29985
          :
          :
          Lustre: testfs-OST0001: deleting orphan objects from 0x0:55389 to 57441
          Lustre: testfs-OST0000: deleting orphan objects from 0x0:54397 to 54977
          LustreError: 6979:0:ldlm_resource_get()) testfs-OST0001: lvbo_init failed for resource 0x7e32:0x0: rc = -2
          LustreError: 6999:0:ldlm_resource_get()) testfs-OST0001: lvbo_init failed for resource 0x80c4:0x0: rc = -2
          LustreError: 6995:0:ldlm_resource_get()) testfs-OST0000: lvbo_init failed for resource 0x998d:0x0: rc = -2
          (gdb) p 0x7e32
          $5 = 32306
          (gdb) p 0x80c4
          $6 = 32964
          (gdb) p 0x998d
          $7 = 39309
          

          There are of course messages being skipped in these console logs, but it seems like a disjoint set of objects.

adilger Andreas Dilger added a comment -

Hi, Andreas.
After taking a closer look at the patches here, I realize that you're completely right. Something related to a stale handle, or to post-recovery behavior on the MDS and OSS, may be causing my issue, rather than this one.

I am sorry to have caused trouble, and thank you!

nozaki Hiroya Nozaki (Inactive) added a comment -

Sorry, the patches here will not fix this problem. My first patch was only for a bug in the master branch, and the second patch is only for improving the error message.

adilger Andreas Dilger added a comment -

Hi, Alex. Thank you for your quick reply.
It looks possible to me that the OSS issue affects the client side somehow, because both messages show the same res_name[0].

Hm, but in any case, it seems I'd better read the Lustre source code more closely.
Thank you.

nozaki Hiroya Nozaki (Inactive) added a comment -

          no, that patch is not supposed to fix this LBUG.

bzzz Alex Zhuravlev added a comment -
          nozaki Hiroya Nozaki (Inactive) added a comment - - edited

Hi, Andreas. We at Fujitsu have almost the same issue, apart from one thing, so I'm planning to port this patch (http://review.whamcloud.com/4276) to b1_8.
But before making the patch, I'd like to ask you one question.

Our customer got the message below, as described here, on an OSS.

          14761:0:(ldlm_resource.c:1011:ldlm_resource_add()) filter-fefs-OST0113_UUID: lvbo_init failed for resource 332822: rc -2
          

Then, after a while, umount hit an LBUG in ldlm_lock_cancel on a client node because of an extra refcount on ldlm_lock->l_readers of the extent lock, and the client messages showed the same res_name[0] as the OSS message.
I'm not sure, but I think both issues may come from the same root cause, so I'd like to know whether or not this patch will fix the client issue too.

          5538:0:(ldlm_lock.c:1606:ldlm_lock_cancel()) ### lock still has references ns: cx-OST0113-osc-ffff883fd57f4800 lock: ffff881dc8ddb000/0xead40fdb3d64db52 lrc: 4/1,0 mode: PR/PR res: 332822/0 rrc: 2 type: EXT [0->18446744073
          5538:0:(ldlm_lock.c:1607:ldlm_lock_cancel()) LBUG
          
          crash> bt
          PID: 5538   TASK: ffff883577ac6080  CPU: 8   COMMAND: "umount"
           #0 [ffff883db6139640] machine_kexec at ffffffff810310db
           #1 [ffff883db61396a0] crash_kexec at ffffffff810b63b2
           #2 [ffff883db6139770] panic at ffffffff814dac2f
           #3 [ffff883db61397f0] lbug_with_loc at ffffffffa04ba453
           #4 [ffff883db6139870] ldlm_cli_cancel_local at ffffffffa0697651
           #5 [ffff883db61398d0] ldlm_cli_cancel at ffffffffa0699d23
           #6 [ffff883db6139950] cleanup_resource at ffffffffa0681e83
           #7 [ffff883db61399d0] ldlm_namespace_cleanup at ffffffffa068218f
           #8 [ffff883db6139a60] __ldlm_namespace_free at ffffffffa068239b
           #9 [ffff883db6139b40] ldlm_namespace_free_prior at ffffffffa06829bc
          #10 [ffff883db6139b80] client_disconnect_export at ffffffffa0687766
          #11 [ffff883db6139c00] osc_disconnect at ffffffffa07a6df4
          #12 [ffff883db6139c60] lov_putref at ffffffffa0829f21
          #13 [ffff883db6139d30] ll_put_super at ffffffffa08bbf0d
          #14 [ffff883db6139e30] generic_shutdown_super at ffffffff81174adb
          #15 [ffff883db6139e50] kill_anon_super at ffffffff81174bc6
          #16 [ffff883db6139e70] lustre_kill_super at ffffffffa05dc588
          #17 [ffff883db6139e90] deactivate_super at ffffffff81175c40
          #18 [ffff883db6139eb0] mntput_no_expire at ffffffff8119113f
          #19 [ffff883db6139ee0] sys_umount at ffffffff8119156b
          #20 [ffff883db6139f80] tracesys at ffffffff8100b387 (via system_call)
              RIP: 00007fdce2a7b717  RSP: 00007fff1f6b5728  RFLAGS: 00000246
              RAX: ffffffffffffffda  RBX: ffffffff8100b387  RCX: ffffffffffffffff
              RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007fdce3c11a10
              RBP: 00007fdce3c119d0   R8: 00007fdce3c11a30   R9: 0000000000000000
              R10: 00007fff1f6b54b0  R11: 0000000000000246  R12: 0000000000000000
              R13: 0000000000000000  R14: 0000000000000000  R15: 00007fdce3c11a70
              ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
          

          Minor patch to improve lvbo_init() error message to print full FID: http://review.whamcloud.com/4501

adilger Andreas Dilger added a comment -

          > Is it possible to just pass the lu_env down via the lvbo_init code?

yes, and there are other places in ldlm where we want to pass an env in.

bzzz Alex Zhuravlev added a comment -
          adilger Andreas Dilger added a comment - Untested patch: http://review.whamcloud.com/4276

People

  bobijam Zhenyu Xu
  behlendorf Brian Behlendorf
  Votes: 0
  Watchers: 10