Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4330

LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 11850

    Description

      Hi,

      More and more compute nodes of several different customer clusters are hitting an LBUG on this 'assertion failed' issue:

      2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed:
      2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) LBUG
      2013-11-21 14:06:54 Nov 21 14:06:54 Pid: 46336, comm: kiblnd_sd_00
      2013-11-21 14:06:54 compute5666 kernel
      2013-11-21 14:06:54 : LustreError: 4Call Trace:
      2013-11-21 14:06:54 6336:0:(events.c [<ffffffffa041c7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-11-21 14:06:54 :433:ptlrpc_mast [<ffffffffa041ce07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-11-21 14:06:54 er_callback()) A [<ffffffffa06a426c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
      2013-11-21 14:06:54 SSERTION( callba [<ffffffffa048ebd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
      2013-11-21 14:06:54 ck == request_ou [<ffffffffa048ecdb>] lnet_finalize+0x9b/0x2f0 [lnet]
      2013-11-21 14:06:54 t_callback || ca [<ffffffffa083d073>] kiblnd_recv+0x103/0x570 [ko2iblnd]
      2013-11-21 14:06:54 llback == reply_ [<ffffffffa04928dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
      2013-11-21 14:06:54 in_callback || c [<ffffffffa0492c06>] lnet_recv_put+0xe6/0x120 [lnet]
      2013-11-21 14:06:54 allback == clien [<ffffffffa0499c33>] lnet_parse+0x1273/0x1b80 [lnet]
      2013-11-21 14:06:54 t_bulk_callback [<ffffffff81042ca3>] ? enqueue_task+0x43/0x90
      2013-11-21 14:06:54 || callback == r [<ffffffffa083d7ab>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
      2013-11-21 14:06:54 equest_in_callba [<ffffffffa083e590>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
      2013-11-21 14:06:54 ck || callback = [<ffffffff81042a63>] ? __wake_up+0x53/0x70
      2013-11-21 14:06:54 = reply_out_call [<ffffffffa083e762>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      2013-11-21 14:06:54 back || callback [<ffffffffa083eb19>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 == server_bulk_ [<ffffffff8104a320>] ? default_wake_function+0x0/0x20
      2013-11-21 14:06:54 callback ) faile [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 d:
      2013-11-21 14:06:54 Nov 21 14:06 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-11-21 14:06:54 :54 compute5666 ke [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 rnel: LustreErro [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 r: 46336:0:(even [<ffffffff81004120>] ? child_rip+0x0/0x20
      2013-11-21 14:06:54 ts.c:433:ptlrpc_
      2013-11-21 14:06:54 master_callback()) LBUG
      2013-11-21 14:06:54 Nov 21 14:06:54 compute566Kernel panic - not syncing: LBUG
      

      For information, systems are running with kernel boot parameter 'tolerant=1' set.

      We have a crash dump that we will upload to ftp.

      Sebastien.

      Attachments

        Issue Links

          Activity

            [LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed
            hugo_meiland Hugo Meiland added a comment -

            just to keep the info with LU-3848 in sync:
            Hi Bruno and Liang;
            @Bruno: Thanks for the patch, I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more of crashes available right now, but I'm not sure if this would be a huge help at this time.
            @Liang: Your patch has been integrated to our tree, but just after our upgrade to 2.1.6, so I'm not running this patch yet. It will be in the next lustre package.
            I'll try to give you feedback on these as soon as possible.

            hugo_meiland Hugo Meiland added a comment - just to keep the info with LU-3848 in sync: Hi Bruno and Liang; @Bruno: Thanks for the patch, I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more of crashes available right now, but I'm not sure if this would be a huge help at this time. @Liang: Your patch has been integrated to our tree, but just after our upgrade to 2.1.6, so I'm not running this patch yet. It will be in the next lustre package. I'll try to give you feedback on these as soon as possible.

            Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same hind of corruption.

            BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in Slabs, for both LU-3848 and this ticket, it is available at http://review.whamcloud.com/8819/. It has already been successfully exposed to autotests, and you may want to give it a try. May be Sebastien could also validate it and try to integrate it in Bull distro?

            Since this patch extracts LNET MEs and small-MDs allocation from generic <size-128> kmem_cache, I strongly expect that you will no longer encounter LU-3848/LU-4330 LBUGs (and may be suffer new issues with others Kernel codes using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same hind of corruption. BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in Slabs, for both LU-3848 and this ticket, it is available at http://review.whamcloud.com/8819/ . It has already been successfully exposed to autotests, and you may want to give it a try. May be Sebastien could also validate it and try to integrate it in Bull distro? Since this patch extracts LNET MEs and small-MDs allocation from generic <size-128> kmem_cache, I strongly expect that you will no longer encounter LU-3848 / LU-4330 LBUGs (and may be suffer new issues with others Kernel codes using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.

            Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period os 3-4 weeks). As mentioned earlier the lustre version has been upgraded to 2.1.6. The uploaded file is LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files...

            hugo_meiland Hugo Meiland added a comment - Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period os 3-4 weeks). As mentioned earlier the lustre version has been upgraded to 2.1.6. The uploaded file is LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files...

            I'm afraid no crash-dump on this one available, I will send one on the next occurance...

            hugo_meiland Hugo Meiland added a comment - I'm afraid no crash-dump on this one available, I will send one on the next occurance...

            Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ??

            On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ?? On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.
            hugo_meiland Hugo Meiland added a comment -

            Hi all, best wishes for 2014!!

            after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem...

            1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request
            _out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal
            lback || callback == server_bulk_callback ) failed:
            1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace:
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

            hugo_meiland Hugo Meiland added a comment - Hi all, best wishes for 2014!! after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem... 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request _out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal lback || callback == server_bulk_callback ) failed: 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace: 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

            Since the same kind of corruption occured in a ME for LU-3848 and in the MD for this ticket, both being allocated in Slab/kmem-cache size-128.
            So, as per Liang's advice, I will try to develop a patch where MEs and small MDs (<128bytes) will be allocated in a specific kmem-cache. On the other hand and specifically for MDs, I may also try to use/define LNET_USE_LIB_FREELIST, this way we will use full-size and pre-allocated MDs …

            It is still a bit early to directly link this ticket to LU-3848, because even if it concern the same kind of corruption (-3), the same size-128 kmem-cache, a node of the same cluster/system, the offset in the affected Slab/structure is different.

            bfaccini Bruno Faccini (Inactive) added a comment - Since the same kind of corruption occured in a ME for LU-3848 and in the MD for this ticket, both being allocated in Slab/kmem-cache size-128. So, as per Liang's advice, I will try to develop a patch where MEs and small MDs (<128bytes) will be allocated in a specific kmem-cache. On the other hand and specifically for MDs, I may also try to use/define LNET_USE_LIB_FREELIST, this way we will use full-size and pre-allocated MDs … It is still a bit early to directly link this ticket to LU-3848 , because even if it concern the same kind of corruption (-3), the same size-128 kmem-cache, a node of the same cluster/system, the offset in the affected Slab/structure is different.

            Thanks Hugo, I got them and was able to look at the crash-dump then!
            BTW, analysis conclusion looks very similar of the one I just made for LU-3848 :

            crash> p/x ((lnet_me_t *)0xffff88053cc10dc0)->me_md->md_user_ptr
            $32 = 0xffff88087aa2c03d
            crash> 
            crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c03d
            $8 = {
              cbid_fn = 0xffa07a2b10ffff88, 
              cbid_arg = 0x87aa2c000ffffff
            }
            crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c040
            $9 = {
              cbid_fn = 0xffffffffa07a2b10, 
              cbid_arg = 0xffff88087aa2c000
            }
            crash> x/i 0xffffffffa07a2b10
               0xffffffffa07a2b10 <request_in_callback>:    push   %rbp
            crash> 
            

            which means that current MD's md_user_ptr field/pointer seems to have been decremented by 3 !!! Just like the ME's md_me pointer in crash-dump for LU-3848 that occured on an other Client of same Cluster/Customer.
            This is again in Slabs size-128, no OOM logs but VTune/Emon driver loaded.

            So, if we add LU-3848 and this ticket's symptoms, it looks like we suffer a new situation like the one described in LU-3010 !!!

            bfaccini Bruno Faccini (Inactive) added a comment - Thanks Hugo, I got them and was able to look at the crash-dump then! BTW, analysis conclusion looks very similar of the one I just made for LU-3848 : crash> p/x ((lnet_me_t *)0xffff88053cc10dc0)->me_md->md_user_ptr $32 = 0xffff88087aa2c03d crash> crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c03d $8 = { cbid_fn = 0xffa07a2b10ffff88, cbid_arg = 0x87aa2c000ffffff } crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c040 $9 = { cbid_fn = 0xffffffffa07a2b10, cbid_arg = 0xffff88087aa2c000 } crash> x/i 0xffffffffa07a2b10 0xffffffffa07a2b10 <request_in_callback>: push %rbp crash> which means that current MD's md_user_ptr field/pointer seems to have been decremented by 3 !!! Just like the ME's md_me pointer in crash-dump for LU-3848 that occured on an other Client of same Cluster/Customer. This is again in Slabs size-128, no OOM logs but VTune/Emon driver loaded. So, if we add LU-3848 and this ticket's symptoms, it looks like we suffer a new situation like the one described in LU-3010 !!!
            hugo_meiland Hugo Meiland added a comment -

            filenames are:
            LU-4330-vmcore-tcn91-LBUG
            LU-4330-vmlinux-systemmap-weak_updates.zip
            Let me know if you can find them, otherwise I'll upload again...
            Hugo

            hugo_meiland Hugo Meiland added a comment - filenames are: LU-4330 -vmcore-tcn91-LBUG LU-4330 -vmlinux-systemmap-weak_updates.zip Let me know if you can find them, otherwise I'll upload again... Hugo
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Hugo, thanks for uploading the crash-dump, but I need the exact names to be able to get them.
            Sebastien, thanks for the precision.

            bfaccini Bruno Faccini (Inactive) added a comment - - edited Hugo, thanks for uploading the crash-dump, but I need the exact names to be able to get them. Sebastien, thanks for the precision.

            Hi Bruno,

            Just a quick comment to mention that we are not running with the ipoib-locking-fix.patch, as we already run an OFED stack that integrates a more complete version of this patch, and it does not help.

            Cheers,
            Sebastien.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment - Hi Bruno, Just a quick comment to mention that we are not running with the ipoib-locking-fix.patch, as we already run an OFED stack that integrates a more complete version of this patch, and it does not help. Cheers, Sebastien.

            People

              bfaccini Bruno Faccini (Inactive)
              sebastien.buisson Sebastien Buisson (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: