Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4330

LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 11850

    Description

      Hi,

      More and more compute nodes of several different customer clusters are hitting an LBUG on this 'assertion failed' issue:

      2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed:
      2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) LBUG
      2013-11-21 14:06:54 Nov 21 14:06:54 Pid: 46336, comm: kiblnd_sd_00
      2013-11-21 14:06:54 compute5666 kernel
      2013-11-21 14:06:54 : LustreError: 4Call Trace:
      2013-11-21 14:06:54 6336:0:(events.c [<ffffffffa041c7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-11-21 14:06:54 :433:ptlrpc_mast [<ffffffffa041ce07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-11-21 14:06:54 er_callback()) A [<ffffffffa06a426c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
      2013-11-21 14:06:54 SSERTION( callba [<ffffffffa048ebd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
      2013-11-21 14:06:54 ck == request_ou [<ffffffffa048ecdb>] lnet_finalize+0x9b/0x2f0 [lnet]
      2013-11-21 14:06:54 t_callback || ca [<ffffffffa083d073>] kiblnd_recv+0x103/0x570 [ko2iblnd]
      2013-11-21 14:06:54 llback == reply_ [<ffffffffa04928dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
      2013-11-21 14:06:54 in_callback || c [<ffffffffa0492c06>] lnet_recv_put+0xe6/0x120 [lnet]
      2013-11-21 14:06:54 allback == clien [<ffffffffa0499c33>] lnet_parse+0x1273/0x1b80 [lnet]
      2013-11-21 14:06:54 t_bulk_callback [<ffffffff81042ca3>] ? enqueue_task+0x43/0x90
      2013-11-21 14:06:54 || callback == r [<ffffffffa083d7ab>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
      2013-11-21 14:06:54 equest_in_callba [<ffffffffa083e590>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
      2013-11-21 14:06:54 ck || callback = [<ffffffff81042a63>] ? __wake_up+0x53/0x70
      2013-11-21 14:06:54 = reply_out_call [<ffffffffa083e762>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      2013-11-21 14:06:54 back || callback [<ffffffffa083eb19>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 == server_bulk_ [<ffffffff8104a320>] ? default_wake_function+0x0/0x20
      2013-11-21 14:06:54 callback ) faile [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 d:
      2013-11-21 14:06:54 Nov 21 14:06 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-11-21 14:06:54 :54 compute5666 ke [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 rnel: LustreErro [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 r: 46336:0:(even [<ffffffff81004120>] ? child_rip+0x0/0x20
      2013-11-21 14:06:54 ts.c:433:ptlrpc_
      2013-11-21 14:06:54 master_callback()) LBUG
      2013-11-21 14:06:54 Nov 21 14:06:54 compute566Kernel panic - not syncing: LBUG
      

      For information, systems are running with kernel boot parameter 'tolerant=1' set.

      We have a crash dump that we will upload to ftp.

      Sebastien.

      Attachments

        Issue Links

          Activity

            [LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed
            hugo_meiland Hugo Meiland added a comment -

            Hi Bruno, no disfunction seen, the fix seems to solve the issue completely; thanks! Hugo

            hugo_meiland Hugo Meiland added a comment - Hi Bruno, no disfunction seen, the fix seems to solve the issue completely; thanks! Hugo

            Hello Hugo, thank's for your feedback!
            The fact that both asserts for this ticket and LU-3848 disappeared seems to confirm what I suspected from the beginning, that LNET/Lustre were not involved with the corruptions but only victims. BTW, did you notice any other sub-system (networking, ...) disfunction since ?

            On the other-hand I am now working on a more generic (not based on hard-coded 128 bytes length) patch version to address Liang's last input, and also answer to Isaac comments too. Ans also push a master version as Isaac requested.

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Hugo, thank's for your feedback! The fact that both asserts for this ticket and LU-3848 disappeared seems to confirm what I suspected from the beginning, that LNET/Lustre were not involved with the corruptions but only victims. BTW, did you notice any other sub-system (networking, ...) disfunction since ? On the other-hand I am now working on a more generic (not based on hard-coded 128 bytes length) patch version to address Liang's last input, and also answer to Isaac comments too. Ans also push a master version as Isaac requested.
            hugo_meiland Hugo Meiland added a comment -

            As an update: the fix has now been installed for a couple of weeks and no more of these LBUG's have been seen, so the fix looks good...

            hugo_meiland Hugo Meiland added a comment - As an update: the fix has now been installed for a couple of weeks and no more of these LBUG's have been seen, so the fix looks good...

            I need to indicate that after successful auto-tests, I made local extensive testing of http://review.whamcloud.com/8819/ and I can confirm it works as expected and with no MEs/MDs leak in Slabs.

            bfaccini Bruno Faccini (Inactive) added a comment - I need to indicate that after successful auto-tests, I made local extensive testing of http://review.whamcloud.com/8819/ and I can confirm it works as expected and with no MEs/MDs leak in Slabs.

            Hello Hugo,
            Thanks again for your help on this!
            I just had a look into the recent crash-dump you uploaded in LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip, and I can confirm this the same scenario, with the same/-3 corruption of MD's md_user_ptr field/pointer.

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Hugo, Thanks again for your help on this! I just had a look into the recent crash-dump you uploaded in LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip, and I can confirm this the same scenario, with the same/-3 corruption of MD's md_user_ptr field/pointer.
            hugo_meiland Hugo Meiland added a comment -

            just to keep the info with LU-3848 in sync:
            Hi Bruno and Liang;
            @Bruno: Thanks for the patch, I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more of crashes available right now, but I'm not sure if this would be a huge help at this time.
            @Liang: Your patch has been integrated to our tree, but just after our upgrade to 2.1.6, so I'm not running this patch yet. It will be in the next lustre package.
            I'll try to give you feedback on these as soon as possible.

            hugo_meiland Hugo Meiland added a comment - just to keep the info with LU-3848 in sync: Hi Bruno and Liang; @Bruno: Thanks for the patch, I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more of crashes available right now, but I'm not sure if this would be a huge help at this time. @Liang: Your patch has been integrated to our tree, but just after our upgrade to 2.1.6, so I'm not running this patch yet. It will be in the next lustre package. I'll try to give you feedback on these as soon as possible.

            Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same hind of corruption.

            BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in Slabs, for both LU-3848 and this ticket, it is available at http://review.whamcloud.com/8819/. It has already been successfully exposed to autotests, and you may want to give it a try. May be Sebastien could also validate it and try to integrate it in Bull distro?

            Since this patch extracts LNET MEs and small-MDs allocation from generic <size-128> kmem_cache, I strongly expect that you will no longer encounter LU-3848/LU-4330 LBUGs (and may be suffer new issues with others Kernel codes using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same hind of corruption. BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in Slabs, for both LU-3848 and this ticket, it is available at http://review.whamcloud.com/8819/ . It has already been successfully exposed to autotests, and you may want to give it a try. May be Sebastien could also validate it and try to integrate it in Bull distro? Since this patch extracts LNET MEs and small-MDs allocation from generic <size-128> kmem_cache, I strongly expect that you will no longer encounter LU-3848 / LU-4330 LBUGs (and may be suffer new issues with others Kernel codes using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.

            Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period os 3-4 weeks). As mentioned earlier the lustre version has been upgraded to 2.1.6. The uploaded file is LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files...

            hugo_meiland Hugo Meiland added a comment - Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period os 3-4 weeks). As mentioned earlier the lustre version has been upgraded to 2.1.6. The uploaded file is LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files...

            I'm afraid no crash-dump on this one available, I will send one on the next occurance...

            hugo_meiland Hugo Meiland added a comment - I'm afraid no crash-dump on this one available, I will send one on the next occurance...

            Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ??

            On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ?? On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.
            hugo_meiland Hugo Meiland added a comment -

            Hi all, best wishes for 2014!!

            after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem...

            1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request
            _out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal
            lback || callback == server_bulk_callback ) failed:
            1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace:
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

            hugo_meiland Hugo Meiland added a comment - Hi all, best wishes for 2014!! after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem... 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request _out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal lback || callback == server_bulk_callback ) failed: 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace: 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

            People

              bfaccini Bruno Faccini (Inactive)
              sebastien.buisson Sebastien Buisson (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: