Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4330

LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.1.6
    • None
    • 3
    • 11850

    Description

      Hi,

      More and more compute nodes of several different customer clusters are hitting an LBUG on this 'assertion failed' issue:

      2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed:
      2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) LBUG
      2013-11-21 14:06:54 Nov 21 14:06:54 Pid: 46336, comm: kiblnd_sd_00
      2013-11-21 14:06:54 compute5666 kernel
      2013-11-21 14:06:54 : LustreError: 4Call Trace:
      2013-11-21 14:06:54 6336:0:(events.c [<ffffffffa041c7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      2013-11-21 14:06:54 :433:ptlrpc_mast [<ffffffffa041ce07>] lbug_with_loc+0x47/0xb0 [libcfs]
      2013-11-21 14:06:54 er_callback()) A [<ffffffffa06a426c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
      2013-11-21 14:06:54 SSERTION( callba [<ffffffffa048ebd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
      2013-11-21 14:06:54 ck == request_ou [<ffffffffa048ecdb>] lnet_finalize+0x9b/0x2f0 [lnet]
      2013-11-21 14:06:54 t_callback || ca [<ffffffffa083d073>] kiblnd_recv+0x103/0x570 [ko2iblnd]
      2013-11-21 14:06:54 llback == reply_ [<ffffffffa04928dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
      2013-11-21 14:06:54 in_callback || c [<ffffffffa0492c06>] lnet_recv_put+0xe6/0x120 [lnet]
      2013-11-21 14:06:54 allback == clien [<ffffffffa0499c33>] lnet_parse+0x1273/0x1b80 [lnet]
      2013-11-21 14:06:54 t_bulk_callback [<ffffffff81042ca3>] ? enqueue_task+0x43/0x90
      2013-11-21 14:06:54 || callback == r [<ffffffffa083d7ab>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
      2013-11-21 14:06:54 equest_in_callba [<ffffffffa083e590>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
      2013-11-21 14:06:54 ck || callback = [<ffffffff81042a63>] ? __wake_up+0x53/0x70
      2013-11-21 14:06:54 = reply_out_call [<ffffffffa083e762>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
      2013-11-21 14:06:54 back || callback [<ffffffffa083eb19>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 == server_bulk_ [<ffffffff8104a320>] ? default_wake_function+0x0/0x20
      2013-11-21 14:06:54 callback ) faile [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 d:
      2013-11-21 14:06:54 Nov 21 14:06 [<ffffffff8100412a>] child_rip+0xa/0x20
      2013-11-21 14:06:54 :54 compute5666 ke [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 rnel: LustreErro [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
      2013-11-21 14:06:54 r: 46336:0:(even [<ffffffff81004120>] ? child_rip+0x0/0x20
      2013-11-21 14:06:54 ts.c:433:ptlrpc_
      2013-11-21 14:06:54 master_callback()) LBUG
      2013-11-21 14:06:54 Nov 21 14:06:54 compute566Kernel panic - not syncing: LBUG
      

      For information, systems are running with kernel boot parameter 'tolerant=1' set.

      We have a crash dump that we will upload to ftp.

      Sebastien.

      Attachments

        Issue Links

          Activity

            [LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed

            Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ??

            On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ?? On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.
            hugo_meiland Hugo Meiland added a comment -

            Hi all, best wishes for 2014!!

            after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem...

            1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request
            _out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal
            lback || callback == server_bulk_callback ) failed:
            1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace:
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
            1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

            hugo_meiland Hugo Meiland added a comment - Hi all, best wishes for 2014!! after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem... 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request _out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal lback || callback == server_bulk_callback ) failed: 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace: 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

            Since the same kind of corruption occured in a ME for LU-3848 and in the MD for this ticket, both being allocated in Slab/kmem-cache size-128.
            So, as per Liang's advice, I will try to develop a patch where MEs and small MDs (<128bytes) will be allocated in a specific kmem-cache. On the other hand and specifically for MDs, I may also try to use/define LNET_USE_LIB_FREELIST, this way we will use full-size and pre-allocated MDs …

            It is still a bit early to directly link this ticket to LU-3848, because even if it concern the same kind of corruption (-3), the same size-128 kmem-cache, a node of the same cluster/system, the offset in the affected Slab/structure is different.

            bfaccini Bruno Faccini (Inactive) added a comment - Since the same kind of corruption occured in a ME for LU-3848 and in the MD for this ticket, both being allocated in Slab/kmem-cache size-128. So, as per Liang's advice, I will try to develop a patch where MEs and small MDs (<128bytes) will be allocated in a specific kmem-cache. On the other hand and specifically for MDs, I may also try to use/define LNET_USE_LIB_FREELIST, this way we will use full-size and pre-allocated MDs … It is still a bit early to directly link this ticket to LU-3848 , because even if it concern the same kind of corruption (-3), the same size-128 kmem-cache, a node of the same cluster/system, the offset in the affected Slab/structure is different.

            Thanks Hugo, I got them and was able to look at the crash-dump then!
            BTW, analysis conclusion looks very similar of the one I just made for LU-3848 :

            crash> p/x ((lnet_me_t *)0xffff88053cc10dc0)->me_md->md_user_ptr
            $32 = 0xffff88087aa2c03d
            crash> 
            crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c03d
            $8 = {
              cbid_fn = 0xffa07a2b10ffff88, 
              cbid_arg = 0x87aa2c000ffffff
            }
            crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c040
            $9 = {
              cbid_fn = 0xffffffffa07a2b10, 
              cbid_arg = 0xffff88087aa2c000
            }
            crash> x/i 0xffffffffa07a2b10
               0xffffffffa07a2b10 <request_in_callback>:    push   %rbp
            crash> 
            

            which means that current MD's md_user_ptr field/pointer seems to have been decremented by 3 !!! Just like the ME's md_me pointer in crash-dump for LU-3848 that occured on an other Client of same Cluster/Customer.
            This is again in Slabs size-128, no OOM logs but VTune/Emon driver loaded.

            So, if we add LU-3848 and this ticket's symptoms, it looks like we suffer a new situation like the one described in LU-3010 !!!

            bfaccini Bruno Faccini (Inactive) added a comment - Thanks Hugo, I got them and was able to look at the crash-dump then! BTW, analysis conclusion looks very similar of the one I just made for LU-3848 : crash> p/x ((lnet_me_t *)0xffff88053cc10dc0)->me_md->md_user_ptr $32 = 0xffff88087aa2c03d crash> crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c03d $8 = { cbid_fn = 0xffa07a2b10ffff88, cbid_arg = 0x87aa2c000ffffff } crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c040 $9 = { cbid_fn = 0xffffffffa07a2b10, cbid_arg = 0xffff88087aa2c000 } crash> x/i 0xffffffffa07a2b10 0xffffffffa07a2b10 <request_in_callback>: push %rbp crash> which means that current MD's md_user_ptr field/pointer seems to have been decremented by 3 !!! Just like the ME's md_me pointer in crash-dump for LU-3848 that occured on an other Client of same Cluster/Customer. This is again in Slabs size-128, no OOM logs but VTune/Emon driver loaded. So, if we add LU-3848 and this ticket's symptoms, it looks like we suffer a new situation like the one described in LU-3010 !!!
            hugo_meiland Hugo Meiland added a comment -

            filenames are:
            LU-4330-vmcore-tcn91-LBUG
            LU-4330-vmlinux-systemmap-weak_updates.zip
            Let me know if you can find them, otherwise I'll upload again...
            Hugo

            hugo_meiland Hugo Meiland added a comment - filenames are: LU-4330 -vmcore-tcn91-LBUG LU-4330 -vmlinux-systemmap-weak_updates.zip Let me know if you can find them, otherwise I'll upload again... Hugo
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Hugo, thanks for uploading the crash-dump, but I need the exact names to be able to get them.
            Sebastien, thanks for the precision.

            bfaccini Bruno Faccini (Inactive) added a comment - - edited Hugo, thanks for uploading the crash-dump, but I need the exact names to be able to get them. Sebastien, thanks for the precision.

            Hi Bruno,

            Just a quick comment to mention that we are not running with the ipoib-locking-fix.patch, as we already run an OFED stack that integrates a more complete version of this patch, and it does not help.

            Cheers,
            Sebastien.

            sebastien.buisson Sebastien Buisson (Inactive) added a comment - Hi Bruno, Just a quick comment to mention that we are not running with the ipoib-locking-fix.patch, as we already run an OFED stack that integrates a more complete version of this patch, and it does not help. Cheers, Sebastien.

            just uploaded vmcore + vmlinux and Lustre modules to Intel's FTP: /uploads/LU-4330*

            hugo_meiland Hugo Meiland added a comment - just uploaded vmcore + vmlinux and Lustre modules to Intel's FTP: /uploads/ LU-4330 *

            Hello Sebastien,
            According to LU-1734/LU-3010 already describing the same issue/LBUG, it has been identified to be caused by some IB-related memory corruption.
            Thus I am wondering if the affected systems are running with ipoib-locking-fix.patch described in LU-3848 (where you recently noticed "please also note that this patch on IPoIB was removed as it was found to introduce a deadlock") ?

            bfaccini Bruno Faccini (Inactive) added a comment - Hello Sebastien, According to LU-1734 / LU-3010 already describing the same issue/LBUG, it has been identified to be caused by some IB-related memory corruption. Thus I am wondering if the affected systems are running with ipoib-locking-fix.patch described in LU-3848 (where you recently noticed "please also note that this patch on IPoIB was removed as it was found to introduce a deadlock") ?

            People

              bfaccini Bruno Faccini (Inactive)
              sebastien.buisson Sebastien Buisson (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated: