
[LU-2193] lvbo_init failed for resource XXX: rc -2, after recovery


    Description

We are seeing a large number (>30,000) of the following errors after recovery on our OSTs. This is only observed when the OSTs are under load at the time they are restarted. I believe this is the same as LU-463, but I thought I'd open it as a separate issue until someone else can confirm. If these errors actually percolate all the way back to the client and are returned to the application, then this is a serious issue for us.

      Lustre: Lustre: Build Version: orion-2_2_49_57_1-44chaos-44chaos--PRISTINE-2.6.32-220.13.1.2chaos.ch5.x86_64
      LustreError: 11-0: MGC10.1.1.211@o2ib9: Communicating with 10.1.1.211@o2ib9, operation llog_origin_handle_create failed with -2
      LustreError: 5497:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
      Lustre: lcz-OST0000: Mounted
      Lustre: lcz-OST0000: Will be in recovery for at least 5:00, or until 60 clients reconnect.
      Lustre: lcz-OST0000: Recovery over after 1:12, of 60 clients 60 recovered and 0 were evicted.
      LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25345259: rc -2
      LustreError: 5663:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25352185: rc -2
      LustreError: 5669:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2 previous similar messages
      LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25353836: rc -2
      LustreError: 5670:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 168 previous similar messages
      LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25418903: rc -2
      LustreError: 5665:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 561 previous similar messages
      LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25627504: rc -2
      LustreError: 5659:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 2043 previous similar messages
      LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 25772020: rc -2
      LustreError: 5673:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 4424 previous similar messages
      LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 26456381: rc -2
      LustreError: 5662:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 10386 previous similar messages
      LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) lvbo_init failed for resource 28049681: rc -2
      LustreError: 5672:0:(ldlm_resource.c:1096:ldlm_resource_get()) Skipped 23629 previous similar messages
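
For context, rc -2 is -ENOENT. The message is printed from ldlm_resource_get() when the namespace's lock value block (LVB) initializer cannot find the on-disk object named by the lock resource. Below is a minimal sketch of that call site, simplified from ldlm_resource.c; the field names follow the ldlm code, but this is not the verbatim upstream source:

    /* Fragment from within ldlm_resource_get(), simplified sketch. */
    if (ns->ns_lvbo != NULL && ns->ns_lvbo->lvbo_init != NULL) {
            /* lvbo_init() loads the object's attributes into the LVB;
             * it returns -ENOENT (-2) when the OST object behind the
             * resource no longer exists, e.g. after orphan cleanup. */
            rc = ns->ns_lvbo->lvbo_init(res);
            if (rc < 0)
                    CERROR("lvbo_init failed for resource %llu: rc %d\n",
                           (unsigned long long)res->lr_name.name[0], rc);
    }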
      


        Activity


adilger Andreas Dilger added a comment -

Patch landed for 2.4.0.

adilger Andreas Dilger added a comment -

The difficulty I foresee is related to regexp matching of FID and NID strings in the output messages. I don't know what kind of regexp matching expect supports, since I've only ever used it with e2fsprogs, where it just does literal output matching and a sed preprocessor script strips out all of the test-unique lines (e.g. e2fsprogs version, format time, UUID, etc.).
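
To make the e2fsprogs-style preprocessing concrete: the variable tokens (PIDs, object numbers, NIDs) can be rewritten to fixed tags before doing literal matching. The sketch below is purely illustrative and the patterns are assumptions chosen to match the logs in this ticket, not an actual Lustre test-framework facility:

    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    /* Replace every match of the extended regex `pat` in `line` with
     * the fixed tag `tag`, editing `line` in place. */
    static void normalize(char *line, size_t size, const char *pat,
                          const char *tag)
    {
            regex_t re;
            regmatch_t m;
            char buf[512];

            if (regcomp(&re, pat, REG_EXTENDED) != 0)
                    return;
            while (regexec(&re, line, 1, &m, 0) == 0) {
                    snprintf(buf, sizeof(buf), "%.*s%s%s",
                             (int)m.rm_so, line, tag, line + m.rm_eo);
                    strncpy(line, buf, size - 1);
                    line[size - 1] = '\0';
            }
            regfree(&re);
    }

    int main(void)
    {
            char line[512] = "LustreError: 5669:0:(ldlm_resource.c:1096:"
                    "ldlm_resource_get()) lvbo_init failed for "
                    "resource 25345259: rc -2";

            normalize(line, sizeof(line), "[0-9]+:[0-9]+:\\(", "PID:(");
            normalize(line, sizeof(line), "resource [0-9]+", "resource N");
            normalize(line, sizeof(line), "[0-9.]+@o2ib[0-9]*", "NID");
            /* Output is now stable across runs, safe to compare
             * against an expected-output file, expect-style. */
            printf("%s\n", line);
            return 0;
    }

A real implementation would live in the test framework scripts; the point is only that a short list of rewrite rules makes the remaining text literally comparable.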

behlendorf Brian Behlendorf added a comment -

> I'm not sure if it is practical, but I'm wondering if it would make sense to have test pass/fail also depend on "expect" output for both the stdout and console error message?

I think that's an excellent idea. I currently do something similar with the ZFS builds by automatically flagging any test that generates console messages with a warning. It has helped me immediately catch several issues that might otherwise have slipped through. For Lustre you'll probably need something more sophisticated to filter out the expected messages, or better yet to go on a crusade to silence those warnings.

adilger Andreas Dilger added a comment -

Looks like my earlier patch was close but not quite enough to fix the real issue, and I forgot to follow up after it was tested and landed.

I think a real problem here is that everyone runs the tests in Gerrit and not enough people look at the console output anymore. I was just looking through the console logs from an acceptance-small test run on my local system on Friday; there are a lot of extra messages that could be quieted (I submitted http://review.whamcloud.com/6264 to quiet some already), and issues like this one get lost in the noise. I don't think we need to read the logs for every successful test, but it makes sense to look through them on occasion.

I'm not sure if it is practical, but I'm wondering if it would make sense to have test pass/fail also depend on "expect" output for both the stdout and console error message?
bobijam Zhenyu Xu added a comment (edited) -

patch: http://review.whamcloud.com/6266

commit message:

    LU-2193 ofd: check object exists before orphan destroy
    
    MDS replay of object destroys after recovery can encounter
    non-existent objects; skip them rather than taking further
    futile actions.

We need to check that an orphan object exists before trying to destroy it.
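
The shape of that fix, as a hedged sketch (the authoritative version is the patch at http://review.whamcloud.com/6266; the helper names below follow OFD naming conventions but are assumptions here, not the literal diff):

    /* Skip replayed destroys that name objects which are already gone,
     * instead of descending into the destroy path and failing with
     * -ENOENT. Sketch only; see the review URL for the real patch. */
    static int ofd_orphan_destroy_one(const struct lu_env *env,
                                      struct ofd_device *ofd,
                                      const struct lu_fid *fid)
    {
            struct ofd_object *fo;
            int rc = 0;

            fo = ofd_object_find(env, ofd, fid);
            if (IS_ERR(fo))
                    return PTR_ERR(fo);

            if (ofd_object_exists(fo))
                    rc = ofd_object_destroy(env, fo);
            /* else: object already destroyed; nothing to do */

            ofd_object_put(env, fo);
            return rc;
    }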
bobijam Zhenyu Xu added a comment -

The lvbo_init failure still happens in ofd_orphans_destroy().

adilger Andreas Dilger added a comment -

It seems the initial patch to do the object lookup before the destroy has slowed this message down but not removed it entirely, or maybe I just don't see as many messages under my test load. It appears that the lvbo_init objects are NOT in the range of the just-cleaned-up orphan objects, which is what I would expect from the original debugging:

          Lustre: testfs-OST0000: deleting orphan objects from 0x0:28913 to 28929
          Lustre: testfs-OST0001: deleting orphan objects from 0x0:29944 to 29985
          :
          :
          Lustre: testfs-OST0001: deleting orphan objects from 0x0:55389 to 57441
          Lustre: testfs-OST0000: deleting orphan objects from 0x0:54397 to 54977
          LustreError: 6979:0:ldlm_resource_get()) testfs-OST0001: lvbo_init failed for resource 0x7e32:0x0: rc = -2
          LustreError: 6999:0:ldlm_resource_get()) testfs-OST0001: lvbo_init failed for resource 0x80c4:0x0: rc = -2
          LustreError: 6995:0:ldlm_resource_get()) testfs-OST0000: lvbo_init failed for resource 0x998d:0x0: rc = -2
          (gdb) p 0x7e32
          $5 = 32306
          (gdb) p 0x80c4
          $6 = 32964
          (gdb) p 0x998d
          $7 = 39309
          

          There are of course messages being skipped in these console logs, but it seems like a disjoint set of objects.

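To double-check the "disjoint set" observation, the failed resource IDs can be decoded and tested against the logged orphan-deletion ranges. A throwaway check, with values copied from the OST0001 lines above (some ranges were elided from the log, so this is only suggestive):

    #include <stdio.h>

    struct range { unsigned long start, end; };

    int main(void)
    {
            /* "deleting orphan objects from 0x0:START to END" (OST0001) */
            const struct range ranges[] = { { 29944, 29985 },
                                            { 55389, 57441 } };
            /* resources from the lvbo_init failures on OST0001 */
            const unsigned long ids[] = { 0x7e32, 0x80c4 };
            size_t i, j;

            for (i = 0; i < sizeof(ids) / sizeof(ids[0]); i++) {
                    int hit = 0;

                    for (j = 0; j < sizeof(ranges) / sizeof(ranges[0]); j++)
                            if (ids[i] >= ranges[j].start &&
                                ids[i] <= ranges[j].end)
                                    hit = 1;
                    printf("0x%lx (%lu) is %s the orphan ranges\n",
                           ids[i], ids[i], hit ? "inside" : "outside");
            }
            return 0;
    }

Both IDs land outside the listed ranges, consistent with the comment above.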

nozaki Hiroya Nozaki (Inactive) added a comment -

Hi, Andreas.
After taking a look at the patches here, I can see that you're completely right. My issue is probably caused by something like a stale handle or post-recovery behavior on the MDS and OSS, not by this issue.

I'm sorry for the trouble, and thank you!

adilger Andreas Dilger added a comment -

Sorry, the patches here will not fix this problem. My first patch was only for a bug in the master branch, and the second patch only improves the error message.

nozaki Hiroya Nozaki (Inactive) added a comment -

Hi, Alex. Thank you for your quick reply.
It looks possible to me that the OSS issue affects the client side somehow, since both messages show the same res_name[0].

Hm, but in any case, it seems I'd better read the Lustre source code more. Thank you.

People

  Assignee:
  bobijam Zhenyu Xu
  Reporter:
  behlendorf Brian Behlendorf
  Votes:
  0
  Watchers:
  10
