Lustre / LU-3848

Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.1.3
    • 3
    • 9966

    Description

      Hi,

      At a customer site, a compute node running Lustre crashes with the following error:

      2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) ASSERTION( me == md->md_me ) failed:
      2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) LBUG
      2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11
      2013-08-23 17:08:55
      2013-08-23 17:08:55 Call Trace:
      2013-08-23 17:08:55 [<ffffffffa04c47f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-08-23 17:08:55 [<ffffffffa04c4e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-08-23 17:08:55 [<ffffffffa053c2bd>] lnet_match_md+0x35d/0x3a0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa0541482>] lnet_parse+0xac2/0x1b80 [lnet]
      2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
      2013-08-23 17:08:55 [<ffffffffa08d177b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
      2013-08-23 17:08:55 [<ffffffffa08d2560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
      2013-08-23 17:08:55 [<ffffffffa08d2732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d2ae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
      2013-08-23 17:08:55
      2013-08-23 17:08:55 Kernel panic - not syncing: LBUG
      2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11 Tainted: G --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
      2013-08-23 17:08:55 Call Trace:
      2013-08-23 17:08:55 [<ffffffff81495fe3>] ? panic+0xa0/0x168
      2013-08-23 17:08:55 [<ffffffffa04c4e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      2013-08-23 17:08:55 [<ffffffffa053c2bd>] ? lnet_match_md+0x35d/0x3a0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
      2013-08-23 17:08:55 [<ffffffffa0541482>] ? lnet_parse+0xac2/0x1b80 [lnet]
      2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
      2013-08-23 17:08:55 [<ffffffffa08d177b>] ? kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
      2013-08-23 17:08:55 [<ffffffffa08d2560>] ? kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
      2013-08-23 17:08:55 [<ffffffffa08d2732>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d2ae9>] ? kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff8100412a>] ? child_rip+0xa/0x20
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
      

      The following messages appeared in the syslog about two hours before the failed assertion:

      2013-08-23 15:26:09 LustreError: 11-0: an error occurred while communicating with 10.200.200.66@o2ib. The obd_ping operation failed with -107
      2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection to scratch-OST0018 (at 10.200.200.66@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2013-08-23 15:26:09 LustreError: 167-0: This client was evicted by scratch-OST0018; in progress operations using this service will fail.
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7d98[2 3 0 1 1 00000000] R(1):[4, 18446744073709551615]@[0x100180000:0x2e225e:0x0] {
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c8a0: [0 ffff880f31e5f268 R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603d78: ffff881072b1f240 40120002 0x4d3a65b709264355 3 ffff881012f1ae88 size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7d98
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7078[2 2 0 1 1 00000000] R(1):[2, 3]@[0x100180000:0x2e225e:0x0] {
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c920: [0 ffff8801fca3e4d8 R(1):[0, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603368: ffff88077ea89d80 40120002 0x4d3a65b709264363 2 (null) size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7078
      2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
      2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace scratch-OST0018-osc-ffff88087a659000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
      2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880e973c7480 (3023454/0/0/0) (rc: 2)
      2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection restored to scratch-OST0018 (at 10.200.200.66@o2ib)
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff880e871c7698[2 4 0 2 0 00000000] R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0] {
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) vvp@ffff880dafcd7d30:
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lov@ffff880f31e5f268: 1
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) 0 0: ---
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try())
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880e871c7698
      2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5
      

      After investigation, no error was seen on the InfiniBand network at the time of the Lustre node crash.

      Thanks,
      Sebastien.
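      For context, the failed assertion checks the cross-link between an LNet match entry (ME) and its memory descriptor (MD). Below is a minimal userspace model of that invariant; the struct and field names mirror the LNet ones implied by the assertion text, but this is a sketch, not the actual lnet source:

      ```c
      /* Userspace model of the ME<->MD cross-link that lnet_match_md()
       * asserts on the receive path: ASSERTION( me == md->md_me ).
       * Names are assumptions based on the assertion text, not lnet code. */
      #include <assert.h>
      #include <stdio.h>

      struct lnet_md;

      struct lnet_me {                 /* match entry */
          struct lnet_md *me_md;       /* MD attached to this ME */
      };

      struct lnet_md {                 /* memory descriptor */
          struct lnet_me *md_me;       /* back-pointer to the owning ME */
      };

      /* Attaching an MD to an ME must set both links consistently. */
      static void md_attach(struct lnet_me *me, struct lnet_md *md)
      {
          me->me_md = md;
          md->md_me = me;
      }

      int main(void)
      {
          struct lnet_me me = { 0 };
          struct lnet_md md = { 0 };

          md_attach(&me, &md);

          /* The invariant the LBUG trips on when md_me is corrupted: */
          assert(me.me_md == &md && md.md_me == &me);
          printf("invariant holds\n");

          /* A stray write through shared slab memory (the suspected
           * corruption) would break the back-pointer, e.g. setting
           * md.md_me to (void *)-1, and the next incoming message
           * matched against this MD would hit the assertion. */
          return 0;
      }
      ```

      The crash reported here means the back-pointer no longer matched, i.e. the MD memory was modified behind LNet's back.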

      Attachments

        Issue Links

          Activity

            [LU-3848] Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed

            bfaccini Bruno Faccini (Inactive) added a comment -

            Follow-on work (master patch, for any other version, ...) is now tracked in LU-4330.
            pjones Peter Jones added a comment -

            Great! Thanks for letting us know, Hugo.
            hugo_meiland Hugo Meiland added a comment -

            As an update: the fix has now been installed for a couple of weeks and no more of these LBUGs have been seen, so the fix looks good.

            bfaccini Bruno Faccini (Inactive) added a comment -

            After successful auto-tests, I also ran extensive local testing of http://review.whamcloud.com/8819/ and can confirm it works as expected, with no ME/MD leaks in the slabs.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Hugo,
            Thanks again for your help on this!
            I just had a look at the recent crash-dump you uploaded in LU3848-tcn23-vmcore-vmlinux-systemmap-weak-updates.zip, and I can confirm it is the same scenario; the only small difference is that the md_me field corruption is -1 instead of -3 as before.
            hugo_meiland Hugo Meiland added a comment -

            Hi Bruno and Liang,
            @Bruno: Thanks for the patch; I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more crashes available right now, but I'm not sure whether they would be a huge help at this time.
            @Liang: Your patch has been integrated into our tree, but just after our upgrade to 2.1.6, so I'm not running it yet. It will be in the next Lustre package.
            I'll try to give you feedback on these as soon as possible.

            liang Liang Zhen (Inactive) added a comment -

            Hi Hugo, have you already applied the patch I mentioned previously? http://review.whamcloud.com/#/c/8511/ It fixes a potential memory corruption issue, although I'm not sure whether it is the cause of your issue here.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same kind of corruption.

            BTW, I have pushed a patch that should help to better identify the root cause of the ME/MD corruption in the slabs, for both LU-4330 and this ticket; it is available at http://review.whamcloud.com/8819/. It has already been successfully exposed to autotests, and you may want to give it a try. Maybe Sebastien could also validate it and try to integrate it into the Bull distro?

            Since this patch moves LNET ME and small-MD allocation out of the generic <size-128> kmem_cache, I strongly expect that you will no longer encounter the LU-3848/LU-4330 LBUGs (though you may instead see new issues in other kernel code using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.
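            The isolation idea behind that patch can be sketched in userspace: give MEs/MDs their own pool instead of the shared size-128 slab, and red-zone each object so a stray write from a neighbouring user is caught at free time. This is a simplified model of the technique only; the real patch would use the kernel's kmem_cache API, and the pool layout below is an assumption:

            ```c
            /* Userspace sketch: a dedicated, red-zoned pool for 128-byte
             * objects, modelling how isolating MEs/MDs from the generic
             * <size-128> cache helps pinpoint who corrupts them. */
            #include <assert.h>
            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>
            #include <stdlib.h>
            #include <string.h>

            #define REDZONE 0xDEADBEEFCAFEF00DULL

            struct pool_obj {
                uint64_t guard_lo;            /* red zone before payload */
                unsigned char payload[128];   /* the size class MEs/MDs used */
                uint64_t guard_hi;            /* red zone after payload */
            };

            static void *pool_alloc(void)
            {
                struct pool_obj *o = malloc(sizeof(*o));
                assert(o != NULL);
                o->guard_lo = o->guard_hi = REDZONE;
                memset(o->payload, 0, sizeof(o->payload));
                return o->payload;
            }

            /* Returns 0 if the object is intact, -1 if something
             * scribbled over its red zones. */
            static int pool_free(void *p)
            {
                struct pool_obj *o = (struct pool_obj *)((char *)p -
                                     offsetof(struct pool_obj, payload));
                int corrupt = (o->guard_lo != REDZONE ||
                               o->guard_hi != REDZONE);
                free(o);
                return corrupt ? -1 : 0;
            }

            int main(void)
            {
                void *md = pool_alloc();
                assert(pool_free(md) == 0);       /* clean object passes */

                void *md2 = pool_alloc();
                memset((char *)md2 - 8, 0xff, 8); /* simulate a stray write */
                assert(pool_free(md2) == -1);     /* corruption is detected */
                printf("red-zone check works\n");
                return 0;
            }
            ```

            With a dedicated cache, either the LBUGs stop (the corrupter was another size-128 user) or the corruption follows the MEs/MDs into their own pool, narrowing the search to LNet itself.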

            hugo_meiland Hugo Meiland added a comment -

            Bruno, Liang: this weekend I was able to obtain kernel coredumps with this error again (after a stable period of 3-4 weeks). As mentioned earlier, the Lustre version has been upgraded to 2.1.6. The uploaded file is LU3848-tcn23-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files.

            liang Liang Zhen (Inactive) added a comment -

            BTW, I just posted a patch to fix a potential memory corruption: http://review.whamcloud.com/#/c/8511/

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: sebastien.buisson Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 10
