[LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.1.6
Labels:
None

Severity:
3
Rank (Obsolete):
11850

Description

Hi,

More and more compute nodes of several different customer clusters are hitting an LBUG on this 'assertion failed' issue:

2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed:
2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) LBUG
2013-11-21 14:06:54 Nov 21 14:06:54 Pid: 46336, comm: kiblnd_sd_00
2013-11-21 14:06:54 compute5666 kernel
2013-11-21 14:06:54 : LustreError: 4Call Trace:
2013-11-21 14:06:54 6336:0:(events.c [<ffffffffa041c7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-11-21 14:06:54 :433:ptlrpc_mast [<ffffffffa041ce07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-11-21 14:06:54 er_callback()) A [<ffffffffa06a426c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
2013-11-21 14:06:54 SSERTION( callba [<ffffffffa048ebd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
2013-11-21 14:06:54 ck == request_ou [<ffffffffa048ecdb>] lnet_finalize+0x9b/0x2f0 [lnet]
2013-11-21 14:06:54 t_callback || ca [<ffffffffa083d073>] kiblnd_recv+0x103/0x570 [ko2iblnd]
2013-11-21 14:06:54 llback == reply_ [<ffffffffa04928dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
2013-11-21 14:06:54 in_callback || c [<ffffffffa0492c06>] lnet_recv_put+0xe6/0x120 [lnet]
2013-11-21 14:06:54 allback == clien [<ffffffffa0499c33>] lnet_parse+0x1273/0x1b80 [lnet]
2013-11-21 14:06:54 t_bulk_callback [<ffffffff81042ca3>] ? enqueue_task+0x43/0x90
2013-11-21 14:06:54 || callback == r [<ffffffffa083d7ab>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
2013-11-21 14:06:54 equest_in_callba [<ffffffffa083e590>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
2013-11-21 14:06:54 ck || callback = [<ffffffff81042a63>] ? __wake_up+0x53/0x70
2013-11-21 14:06:54 = reply_out_call [<ffffffffa083e762>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
2013-11-21 14:06:54 back || callback [<ffffffffa083eb19>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 == server_bulk_ [<ffffffff8104a320>] ? default_wake_function+0x0/0x20
2013-11-21 14:06:54 callback ) faile [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 d:
2013-11-21 14:06:54 Nov 21 14:06 [<ffffffff8100412a>] child_rip+0xa/0x20
2013-11-21 14:06:54 :54 compute5666 ke [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 rnel: LustreErro [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 r: 46336:0:(even [<ffffffff81004120>] ? child_rip+0x0/0x20
2013-11-21 14:06:54 ts.c:433:ptlrpc_
2013-11-21 14:06:54 master_callback()) LBUG
2013-11-21 14:06:54 Nov 21 14:06:54 compute566Kernel panic - not syncing: LBUG

For information, systems are running with kernel boot parameter 'tolerant=1' set.

We have a crash dump that we will upload to ftp.

Sebastien.

Attachments

Issue Links

is related to

LU-8362 page fault: exception RIP: lnet_mt_match_md+135

Resolved

is related to

LU-3848 Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed

Resolved

Activity

[LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed

Bruno Faccini (Inactive) added a comment - 06/Jan/14 10:07 AM

Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ??

On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.

Bruno Faccini (Inactive) added a comment - 06/Jan/14 10:07 AM Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ?? On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before.

Hugo Meiland added a comment - 03/Jan/14 7:34 AM

Hi all, best wishes for 2014!!

after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem...

1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request
_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal
lback || callback == server_bulk_callback ) failed:
1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace:
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20
1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

Hugo Meiland added a comment - 03/Jan/14 7:34 AM Hi all, best wishes for 2014!! after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem... 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request _out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_cal lback || callback == server_bulk_callback ) failed: 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) LBUG 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Pid: 3259, comm: kiblnd_sd_04 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel Call Trace: 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05317f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa0531e07>] lbug_with_loc+0x47/0xb0 [libcfs] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa07a225c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3bd2>] lnet_enq_event_locked+0x62/0xd0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a3cdb>] lnet_finalize+0x9b/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b043>] kiblnd_recv+0x103/0x570 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a78dd>] lnet_ni_recv+0xad/0x2f0 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05a7c06>] lnet_recv_put+0xe6/0x120 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa05aec33>] lnet_parse+0x1273/0x1b80 [lnet] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093b77b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81042cf3>] ? __wake_up+0x53/0x70 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c732>] kiblnd_complete+0x62/0xe0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093cae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff81048df0>] ? default_wake_function+0x0/0x20 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685619 2014 Jan 2 19:00:19 tcn19 kern warning kernel [<ffffffff8100412a>] child_rip+0xa/0x20 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffffa093c7b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 1388685620 2014 Jan 2 19:00:20 tcn19 kern warning kernel [<ffffffff81004120>] ? child_rip+0x0/0x20

Bruno Faccini (Inactive) added a comment - 07/Dec/13 4:06 PM

Since the same kind of corruption occured in a ME for ~~LU-3848~~ and in the MD for this ticket, both being allocated in Slab/kmem-cache size-128.
So, as per Liang's advice, I will try to develop a patch where MEs and small MDs (<128bytes) will be allocated in a specific kmem-cache. On the other hand and specifically for MDs, I may also try to use/define LNET_USE_LIB_FREELIST, this way we will use full-size and pre-allocated MDs …

It is still a bit early to directly link this ticket to ~~LU-3848~~, because even if it concern the same kind of corruption (-3), the same size-128 kmem-cache, a node of the same cluster/system, the offset in the affected Slab/structure is different.

Bruno Faccini (Inactive) added a comment - 07/Dec/13 4:06 PM Since the same kind of corruption occured in a ME for LU-3848 and in the MD for this ticket, both being allocated in Slab/kmem-cache size-128. So, as per Liang's advice, I will try to develop a patch where MEs and small MDs (<128bytes) will be allocated in a specific kmem-cache. On the other hand and specifically for MDs, I may also try to use/define LNET_USE_LIB_FREELIST, this way we will use full-size and pre-allocated MDs … It is still a bit early to directly link this ticket to LU-3848 , because even if it concern the same kind of corruption (-3), the same size-128 kmem-cache, a node of the same cluster/system, the offset in the affected Slab/structure is different.

Bruno Faccini (Inactive) added a comment - 06/Dec/13 5:01 PM

Thanks Hugo, I got them and was able to look at the crash-dump then!
BTW, analysis conclusion looks very similar of the one I just made for ~~LU-3848~~ :

crash> p/x ((lnet_me_t *)0xffff88053cc10dc0)->me_md->md_user_ptr
$32 = 0xffff88087aa2c03d
crash> 
crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c03d
$8 = {
  cbid_fn = 0xffa07a2b10ffff88, 
  cbid_arg = 0x87aa2c000ffffff
}
crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c040
$9 = {
  cbid_fn = 0xffffffffa07a2b10, 
  cbid_arg = 0xffff88087aa2c000
}
crash> x/i 0xffffffffa07a2b10
   0xffffffffa07a2b10 <request_in_callback>:    push   %rbp
crash>

which means that current MD's md_user_ptr field/pointer seems to have been decremented by 3 !!! Just like the ME's md_me pointer in crash-dump for ~~LU-3848~~ that occured on an other Client of same Cluster/Customer.
This is again in Slabs size-128, no OOM logs but VTune/Emon driver loaded.

So, if we add ~~LU-3848~~ and this ticket's symptoms, it looks like we suffer a new situation like the one described in ~~LU-3010~~ !!!

Bruno Faccini (Inactive) added a comment - 06/Dec/13 5:01 PM Thanks Hugo, I got them and was able to look at the crash-dump then! BTW, analysis conclusion looks very similar of the one I just made for LU-3848 : crash> p/x ((lnet_me_t *)0xffff88053cc10dc0)->me_md->md_user_ptr $32 = 0xffff88087aa2c03d crash> crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c03d $8 = { cbid_fn = 0xffa07a2b10ffff88, cbid_arg = 0x87aa2c000ffffff } crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c040 $9 = { cbid_fn = 0xffffffffa07a2b10, cbid_arg = 0xffff88087aa2c000 } crash> x/i 0xffffffffa07a2b10 0xffffffffa07a2b10 <request_in_callback>: push %rbp crash> which means that current MD's md_user_ptr field/pointer seems to have been decremented by 3 !!! Just like the ME's md_me pointer in crash-dump for LU-3848 that occured on an other Client of same Cluster/Customer. This is again in Slabs size-128, no OOM logs but VTune/Emon driver loaded. So, if we add LU-3848 and this ticket's symptoms, it looks like we suffer a new situation like the one described in LU-3010 !!!

Hugo Meiland added a comment - 06/Dec/13 8:32 AM

filenames are:
LU-4330-vmcore-tcn91-LBUG
LU-4330-vmlinux-systemmap-weak_updates.zip
Let me know if you can find them, otherwise I'll upload again...
Hugo

Hugo Meiland added a comment - 06/Dec/13 8:32 AM filenames are: LU-4330 -vmcore-tcn91-LBUG LU-4330 -vmlinux-systemmap-weak_updates.zip Let me know if you can find them, otherwise I'll upload again... Hugo

Bruno Faccini (Inactive) added a comment - 05/Dec/13 2:55 PM - edited

Hugo, thanks for uploading the crash-dump, but I need the exact names to be able to get them.
Sebastien, thanks for the precision.

Bruno Faccini (Inactive) added a comment - 05/Dec/13 2:55 PM - edited Hugo, thanks for uploading the crash-dump, but I need the exact names to be able to get them. Sebastien, thanks for the precision.

Sebastien Buisson (Inactive) added a comment - 05/Dec/13 8:21 AM

Hi Bruno,

Just a quick comment to mention that we are not running with the ipoib-locking-fix.patch, as we already run an OFED stack that integrates a more complete version of this patch, and it does not help.

Cheers,
Sebastien.

Sebastien Buisson (Inactive) added a comment - 05/Dec/13 8:21 AM Hi Bruno, Just a quick comment to mention that we are not running with the ipoib-locking-fix.patch, as we already run an OFED stack that integrates a more complete version of this patch, and it does not help. Cheers, Sebastien.

Hugo Meiland added a comment - 03/Dec/13 12:02 PM

just uploaded vmcore + vmlinux and Lustre modules to Intel's FTP: /uploads/LU-4330*

Hugo Meiland added a comment - 03/Dec/13 12:02 PM just uploaded vmcore + vmlinux and Lustre modules to Intel's FTP: /uploads/ LU-4330 *

Bruno Faccini (Inactive) added a comment - 02/Dec/13 2:51 PM

Hello Sebastien,
According to ~~LU-1734~~/~~LU-3010~~ already describing the same issue/LBUG, it has been identified to be caused by some IB-related memory corruption.
Thus I am wondering if the affected systems are running with ipoib-locking-fix.patch described in ~~LU-3848~~ (where you recently noticed "please also note that this patch on IPoIB was removed as it was found to introduce a deadlock") ?

Bruno Faccini (Inactive) added a comment - 02/Dec/13 2:51 PM Hello Sebastien, According to LU-1734 / LU-3010 already describing the same issue/LBUG, it has been identified to be caused by some IB-related memory corruption. Thus I am wondering if the affected systems are running with ipoib-locking-fix.patch described in LU-3848 (where you recently noticed "please also note that this patch on IPoIB was removed as it was found to introduce a deadlock") ?

People

Assignee:: Bruno Faccini (Inactive)

Reporter:: Sebastien Buisson (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 13 Start watching this issue

Dates

Created:: 02/Dec/13 2:40 PM

Updated:: 14/Jun/18 9:41 PM