Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
Lustre 2.1.6
-
None
-
3
-
11850
Description
Hi,
More and more compute nodes of several different customer clusters are hitting an LBUG on this 'assertion failed' issue:
2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed: 2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) LBUG 2013-11-21 14:06:54 Nov 21 14:06:54 Pid: 46336, comm: kiblnd_sd_00 2013-11-21 14:06:54 compute5666 kernel 2013-11-21 14:06:54 : LustreError: 4Call Trace: 2013-11-21 14:06:54 6336:0:(events.c [<ffffffffa041c7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2013-11-21 14:06:54 :433:ptlrpc_mast [<ffffffffa041ce07>] lbug_with_loc+0x47/0xb0 [libcfs] 2013-11-21 14:06:54 er_callback()) A [<ffffffffa06a426c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc] 2013-11-21 14:06:54 SSERTION( callba [<ffffffffa048ebd2>] lnet_enq_event_locked+0x62/0xd0 [lnet] 2013-11-21 14:06:54 ck == request_ou [<ffffffffa048ecdb>] lnet_finalize+0x9b/0x2f0 [lnet] 2013-11-21 14:06:54 t_callback || ca [<ffffffffa083d073>] kiblnd_recv+0x103/0x570 [ko2iblnd] 2013-11-21 14:06:54 llback == reply_ [<ffffffffa04928dd>] lnet_ni_recv+0xad/0x2f0 [lnet] 2013-11-21 14:06:54 in_callback || c [<ffffffffa0492c06>] lnet_recv_put+0xe6/0x120 [lnet] 2013-11-21 14:06:54 allback == clien [<ffffffffa0499c33>] lnet_parse+0x1273/0x1b80 [lnet] 2013-11-21 14:06:54 t_bulk_callback [<ffffffff81042ca3>] ? enqueue_task+0x43/0x90 2013-11-21 14:06:54 || callback == r [<ffffffffa083d7ab>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 2013-11-21 14:06:54 equest_in_callba [<ffffffffa083e590>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 2013-11-21 14:06:54 ck || callback = [<ffffffff81042a63>] ? __wake_up+0x53/0x70 2013-11-21 14:06:54 = reply_out_call [<ffffffffa083e762>] kiblnd_complete+0x62/0xe0 [ko2iblnd] 2013-11-21 14:06:54 back || callback [<ffffffffa083eb19>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 == server_bulk_ [<ffffffff8104a320>] ? default_wake_function+0x0/0x20 2013-11-21 14:06:54 callback ) faile [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 d: 2013-11-21 14:06:54 Nov 21 14:06 [<ffffffff8100412a>] child_rip+0xa/0x20 2013-11-21 14:06:54 :54 compute5666 ke [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 rnel: LustreErro [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 r: 46336:0:(even [<ffffffff81004120>] ? child_rip+0x0/0x20 2013-11-21 14:06:54 ts.c:433:ptlrpc_ 2013-11-21 14:06:54 master_callback()) LBUG 2013-11-21 14:06:54 Nov 21 14:06:54 compute566Kernel panic - not syncing: LBUG
For information, systems are running with kernel boot parameter 'tolerant=1' set.
We have a crash dump that we will upload to ftp.
Sebastien.
just to keep the info with
LU-3848in sync:Hi Bruno and Liang;
@Bruno: Thanks for the patch, I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more of crashes available right now, but I'm not sure if this would be a huge help at this time.
@Liang: Your patch has been integrated to our tree, but just after our upgrade to 2.1.6, so I'm not running this patch yet. It will be in the next lustre package.
I'll try to give you feedback on these as soon as possible.