[LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed Created: 02/Dec/13 Updated: 14/Jun/18 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 11850 | ||||||||||||
| Description |
|
Hi, More and more compute nodes of several different customer clusters are hitting an LBUG on this 'assertion failed' issue: 2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed: 2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) LBUG 2013-11-21 14:06:54 Nov 21 14:06:54 Pid: 46336, comm: kiblnd_sd_00 2013-11-21 14:06:54 compute5666 kernel 2013-11-21 14:06:54 : LustreError: 4Call Trace: 2013-11-21 14:06:54 6336:0:(events.c [<ffffffffa041c7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2013-11-21 14:06:54 :433:ptlrpc_mast [<ffffffffa041ce07>] lbug_with_loc+0x47/0xb0 [libcfs] 2013-11-21 14:06:54 er_callback()) A [<ffffffffa06a426c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc] 2013-11-21 14:06:54 SSERTION( callba [<ffffffffa048ebd2>] lnet_enq_event_locked+0x62/0xd0 [lnet] 2013-11-21 14:06:54 ck == request_ou [<ffffffffa048ecdb>] lnet_finalize+0x9b/0x2f0 [lnet] 2013-11-21 14:06:54 t_callback || ca [<ffffffffa083d073>] kiblnd_recv+0x103/0x570 [ko2iblnd] 2013-11-21 14:06:54 llback == reply_ [<ffffffffa04928dd>] lnet_ni_recv+0xad/0x2f0 [lnet] 2013-11-21 14:06:54 in_callback || c [<ffffffffa0492c06>] lnet_recv_put+0xe6/0x120 [lnet] 2013-11-21 14:06:54 allback == clien [<ffffffffa0499c33>] lnet_parse+0x1273/0x1b80 [lnet] 2013-11-21 14:06:54 t_bulk_callback [<ffffffff81042ca3>] ? enqueue_task+0x43/0x90 2013-11-21 14:06:54 || callback == r [<ffffffffa083d7ab>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd] 2013-11-21 14:06:54 equest_in_callba [<ffffffffa083e590>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd] 2013-11-21 14:06:54 ck || callback = [<ffffffff81042a63>] ? __wake_up+0x53/0x70 2013-11-21 14:06:54 = reply_out_call [<ffffffffa083e762>] kiblnd_complete+0x62/0xe0 [ko2iblnd] 2013-11-21 14:06:54 back || callback [<ffffffffa083eb19>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 == server_bulk_ [<ffffffff8104a320>] ? default_wake_function+0x0/0x20 2013-11-21 14:06:54 callback ) faile [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 d: 2013-11-21 14:06:54 Nov 21 14:06 [<ffffffff8100412a>] child_rip+0xa/0x20 2013-11-21 14:06:54 :54 compute5666 ke [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 rnel: LustreErro [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd] 2013-11-21 14:06:54 r: 46336:0:(even [<ffffffff81004120>] ? child_rip+0x0/0x20 2013-11-21 14:06:54 ts.c:433:ptlrpc_ 2013-11-21 14:06:54 master_callback()) LBUG 2013-11-21 14:06:54 Nov 21 14:06:54 compute566Kernel panic - not syncing: LBUG For information, systems are running with kernel boot parameter 'tolerant=1' set. We have a crash dump that we will upload to ftp. Sebastien. |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ] |
|
Hello Sebastien, |
| Comment by Hugo Meiland [ 03/Dec/13 ] |
|
just uploaded vmcore + vmlinux and Lustre modules to Intel's FTP: /uploads/LU-4330* |
| Comment by Sebastien Buisson (Inactive) [ 05/Dec/13 ] |
|
Hi Bruno, Just a quick comment to mention that we are not running with the ipoib-locking-fix.patch, as we already run an OFED stack that integrates a more complete version of this patch, and it does not help. Cheers, |
| Comment by Bruno Faccini (Inactive) [ 05/Dec/13 ] |
|
Hugo, thanks for uploading the crash-dump, but I need the exact names to be able to get them. |
| Comment by Hugo Meiland [ 06/Dec/13 ] |
|
filenames are: |
| Comment by Bruno Faccini (Inactive) [ 06/Dec/13 ] |
|
Thanks Hugo, I got them and was able to look at the crash-dump then! crash> p/x ((lnet_me_t *)0xffff88053cc10dc0)->me_md->md_user_ptr
$32 = 0xffff88087aa2c03d
crash>
crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c03d
$8 = {
cbid_fn = 0xffa07a2b10ffff88,
cbid_arg = 0x87aa2c000ffffff
}
crash> p/x *(struct ptlrpc_cb_id *)0xffff88087aa2c040
$9 = {
cbid_fn = 0xffffffffa07a2b10,
cbid_arg = 0xffff88087aa2c000
}
crash> x/i 0xffffffffa07a2b10
0xffffffffa07a2b10 <request_in_callback>: push %rbp
crash>
which means that current MD's md_user_ptr field/pointer seems to have been decremented by 3 !!! Just like the ME's md_me pointer in crash-dump for So, if we add |
| Comment by Bruno Faccini (Inactive) [ 07/Dec/13 ] |
|
Since the same kind of corruption occured in a ME for It is still a bit early to directly link this ticket to |
| Comment by Hugo Meiland [ 03/Jan/14 ] |
|
Hi all, best wishes for 2014!! after upgrading the clients to 2.1.6, this crash has again occured, this hopefully helps in finding the underlying problem... 1388685619 2014 Jan 2 19:00:19 tcn19 kern emerg kernel LustreError: 3259:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request |
| Comment by Bruno Faccini (Inactive) [ 06/Jan/14 ] |
|
Hello Hugo, Can you also upload this new crash-dump (and required vmlinux/lustre-modules additionnal stuff) with 2.1.6, so I can double-check if we still suffer the same kind of corruption ?? On the other hand and if still same issue as very strongly expected, I will try to come up with a patch to implement the work-around (move allocation of MEs and small MDs out of <size-128> SLABs kmem-cache) described before. |
| Comment by Hugo Meiland [ 07/Jan/14 ] |
|
I'm afraid no crash-dump on this one available, I will send one on the next occurance... |
| Comment by Hugo Meiland [ 27/Jan/14 ] |
|
Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period os 3-4 weeks). As mentioned earlier the lustre version has been upgraded to 2.1.6. The uploaded file is LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files... |
| Comment by Bruno Faccini (Inactive) [ 28/Jan/14 ] |
|
Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same hind of corruption. BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in Slabs, for both Since this patch extracts LNET MEs and small-MDs allocation from generic <size-128> kmem_cache, I strongly expect that you will no longer encounter |
| Comment by Hugo Meiland [ 29/Jan/14 ] |
|
just to keep the info with |
| Comment by Bruno Faccini (Inactive) [ 30/Jan/14 ] |
|
Hello Hugo, |
| Comment by Bruno Faccini (Inactive) [ 09/Feb/14 ] |
|
I need to indicate that after successful auto-tests, I made local extensive testing of http://review.whamcloud.com/8819/ and I can confirm it works as expected and with no MEs/MDs leak in Slabs. |
| Comment by Hugo Meiland [ 05/Mar/14 ] |
|
As an update: the fix has now been installed for a couple of weeks and no more of these LBUG's have been seen, so the fix looks good... |
| Comment by Bruno Faccini (Inactive) [ 07/Mar/14 ] |
|
Hello Hugo, thank's for your feedback! On the other-hand I am now working on a more generic (not based on hard-coded 128 bytes length) patch version to address Liang's last input, and also answer to Isaac comments too. Ans also push a master version as Isaac requested. |
| Comment by Hugo Meiland [ 19/Mar/14 ] |
|
Hi Bruno, no disfunction seen, the fix seems to solve the issue completely; thanks! Hugo |
| Comment by Gerrit Updater [ 23/Feb/16 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/18586 |
| Comment by Andreas Dilger [ 24/Feb/16 ] |
|
The scary thing is that there would continue to be random memory corruptions in the size-128 slab, but they will just be corrupting some other part of memory. If this problem can be found in a relatively short amount of testing time, then there are debugging patches available that could be applied to the kernel to make all kmalloc() calls actually map to vmalloc() internally and have vmalloc() always use a new memory address, and then when the memory is freed the page is unmapped and the address never used again. If another thread is incorrectly accessing an unmapped address (use after free) it will fault and then the source of the corruption may be found. Unfortunately, this impacts the performance and can only be used for debugging and not in production. Patches are available in https://bugzilla.lustre.org/show_bug.cgi?id=22471 but the would likely need to be updated for newer kernels. They can definitely help find memory corruption problems that are otherwise very difficult to find. |
| Comment by Andreas Dilger [ 24/Feb/16 ] |
|
Bruno, if you update those patches for newer kernels, please submit the new patches into lustre/kernel_patches/patches so they are available for use in the future, since bugzilla may disappear at some point. |
| Comment by James A Simmons [ 31/Mar/16 ] |
|
I think I might know what the problem is. A recent patch for ko2iblnd in the upstream kernel landed that exposed a serious memory corruption. |
| Comment by Bruno Faccini (Inactive) [ 13/Apr/16 ] |
|
James, |
| Comment by Gerrit Updater [ 31/May/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18586/ |
| Comment by Peter Jones [ 31/May/16 ] |
|
Landed for 2.9 |
| Comment by James A Simmons [ 10/Jun/16 ] |
|
Attempted to merge the patch for LU-4330 upstream and it was rejected since according to Greg it will not fix the problem. See email http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2016-June/090787.html Bruno can you reproduce this problem? Have you tried debugging with slub? |
| Comment by Bruno Faccini (Inactive) [ 05/Jul/16 ] |
|
James, sorry to be late to comment/answer you. |
| Comment by Peter Jones [ 05/Aug/16 ] |
|
Regardless of what upstream think of this fix, I think that we no longer need to track it for 2.9 so I will remove the fix version |