[LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.1.6
Labels:
None

Severity:
3
Rank (Obsolete):
11850

Description

Hi,

More and more compute nodes of several different customer clusters are hitting an LBUG on this 'assertion failed' issue:

2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback || callback == server_bulk_callback ) failed:
2013-11-21 14:06:54 LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) LBUG
2013-11-21 14:06:54 Nov 21 14:06:54 Pid: 46336, comm: kiblnd_sd_00
2013-11-21 14:06:54 compute5666 kernel
2013-11-21 14:06:54 : LustreError: 4Call Trace:
2013-11-21 14:06:54 6336:0:(events.c [<ffffffffa041c7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-11-21 14:06:54 :433:ptlrpc_mast [<ffffffffa041ce07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-11-21 14:06:54 er_callback()) A [<ffffffffa06a426c>] ptlrpc_master_callback+0xcc/0xd0 [ptlrpc]
2013-11-21 14:06:54 SSERTION( callba [<ffffffffa048ebd2>] lnet_enq_event_locked+0x62/0xd0 [lnet]
2013-11-21 14:06:54 ck == request_ou [<ffffffffa048ecdb>] lnet_finalize+0x9b/0x2f0 [lnet]
2013-11-21 14:06:54 t_callback || ca [<ffffffffa083d073>] kiblnd_recv+0x103/0x570 [ko2iblnd]
2013-11-21 14:06:54 llback == reply_ [<ffffffffa04928dd>] lnet_ni_recv+0xad/0x2f0 [lnet]
2013-11-21 14:06:54 in_callback || c [<ffffffffa0492c06>] lnet_recv_put+0xe6/0x120 [lnet]
2013-11-21 14:06:54 allback == clien [<ffffffffa0499c33>] lnet_parse+0x1273/0x1b80 [lnet]
2013-11-21 14:06:54 t_bulk_callback [<ffffffff81042ca3>] ? enqueue_task+0x43/0x90
2013-11-21 14:06:54 || callback == r [<ffffffffa083d7ab>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
2013-11-21 14:06:54 equest_in_callba [<ffffffffa083e590>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
2013-11-21 14:06:54 ck || callback = [<ffffffff81042a63>] ? __wake_up+0x53/0x70
2013-11-21 14:06:54 = reply_out_call [<ffffffffa083e762>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
2013-11-21 14:06:54 back || callback [<ffffffffa083eb19>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 == server_bulk_ [<ffffffff8104a320>] ? default_wake_function+0x0/0x20
2013-11-21 14:06:54 callback ) faile [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 d:
2013-11-21 14:06:54 Nov 21 14:06 [<ffffffff8100412a>] child_rip+0xa/0x20
2013-11-21 14:06:54 :54 compute5666 ke [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 rnel: LustreErro [<ffffffffa083e7e0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-11-21 14:06:54 r: 46336:0:(even [<ffffffff81004120>] ? child_rip+0x0/0x20
2013-11-21 14:06:54 ts.c:433:ptlrpc_
2013-11-21 14:06:54 master_callback()) LBUG
2013-11-21 14:06:54 Nov 21 14:06:54 compute566Kernel panic - not syncing: LBUG

For information, systems are running with kernel boot parameter 'tolerant=1' set.

We have a crash dump that we will upload to ftp.

Sebastien.

Attachments

Issue Links

is related to

LU-8362 page fault: exception RIP: lnet_mt_match_md+135

Resolved

is related to

LU-3848 Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed

Resolved

Activity

[LU-4330] LustreError: 46336:0:(events.c:433:ptlrpc_master_callback()) ASSERTION( callback == request_out_callback || callback == reply_in_callback || callback == client_bulk_callback || callback == request_in_callback || callback == reply_out_callback ... ) failed

Andreas Dilger added a comment - 24/Feb/16 7:23 PM

Bruno, if you update those patches for newer kernels, please submit the new patches into lustre/kernel_patches/patches so they are available for use in the future, since bugzilla may disappear at some point.

Andreas Dilger added a comment - 24/Feb/16 7:23 PM Bruno, if you update those patches for newer kernels, please submit the new patches into lustre/kernel_patches/patches so they are available for use in the future, since bugzilla may disappear at some point.

Andreas Dilger added a comment - 24/Feb/16 6:31 PM - edited

The scary thing is that there would continue to be random memory corruptions in the size-128 slab, but they will just be corrupting some other part of memory.

If this problem can be found in a relatively short amount of testing time, then there are debugging patches available that could be applied to the kernel to make all kmalloc() calls actually map to vmalloc() internally and have vmalloc() always use a new memory address, and then when the memory is freed the page is unmapped and the address never used again. If another thread is incorrectly accessing an unmapped address (use after free) it will fault and then the source of the corruption may be found. Unfortunately, this impacts the performance and can only be used for debugging and not in production.

Patches are available in https://bugzilla.lustre.org/show_bug.cgi?id=22471 but the would likely need to be updated for newer kernels. They can definitely help find memory corruption problems that are otherwise very difficult to find.

Andreas Dilger added a comment - 24/Feb/16 6:31 PM - edited The scary thing is that there would continue to be random memory corruptions in the size-128 slab, but they will just be corrupting some other part of memory. If this problem can be found in a relatively short amount of testing time, then there are debugging patches available that could be applied to the kernel to make all kmalloc() calls actually map to vmalloc() internally and have vmalloc() always use a new memory address, and then when the memory is freed the page is unmapped and the address never used again. If another thread is incorrectly accessing an unmapped address (use after free) it will fault and then the source of the corruption may be found. Unfortunately, this impacts the performance and can only be used for debugging and not in production. Patches are available in https://bugzilla.lustre.org/show_bug.cgi?id=22471 but the would likely need to be updated for newer kernels. They can definitely help find memory corruption problems that are otherwise very difficult to find.

Gerrit Updater added a comment - 23/Feb/16 7:39 PM

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/18586
Subject: LU-4330 lnet: Allocate MEs and small MDs in own kmem_caches
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 11e4b34d75505476f363c1cf4400755a7f30f766

Gerrit Updater added a comment - 23/Feb/16 7:39 PM Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/18586 Subject: LU-4330 lnet: Allocate MEs and small MDs in own kmem_caches Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 11e4b34d75505476f363c1cf4400755a7f30f766

Hugo Meiland added a comment - 19/Mar/14 1:57 PM

Hi Bruno, no disfunction seen, the fix seems to solve the issue completely; thanks! Hugo

Hugo Meiland added a comment - 19/Mar/14 1:57 PM Hi Bruno, no disfunction seen, the fix seems to solve the issue completely; thanks! Hugo

Bruno Faccini (Inactive) added a comment - 07/Mar/14 2:09 PM

Hello Hugo, thank's for your feedback!
The fact that both asserts for this ticket and ~~LU-3848~~ disappeared seems to confirm what I suspected from the beginning, that LNET/Lustre were not involved with the corruptions but only victims. BTW, did you notice any other sub-system (networking, ...) disfunction since ?

On the other-hand I am now working on a more generic (not based on hard-coded 128 bytes length) patch version to address Liang's last input, and also answer to Isaac comments too. Ans also push a master version as Isaac requested.

Bruno Faccini (Inactive) added a comment - 07/Mar/14 2:09 PM Hello Hugo, thank's for your feedback! The fact that both asserts for this ticket and LU-3848 disappeared seems to confirm what I suspected from the beginning, that LNET/Lustre were not involved with the corruptions but only victims. BTW, did you notice any other sub-system (networking, ...) disfunction since ? On the other-hand I am now working on a more generic (not based on hard-coded 128 bytes length) patch version to address Liang's last input, and also answer to Isaac comments too. Ans also push a master version as Isaac requested.

Hugo Meiland added a comment - 05/Mar/14 3:01 PM

As an update: the fix has now been installed for a couple of weeks and no more of these LBUG's have been seen, so the fix looks good...

Hugo Meiland added a comment - 05/Mar/14 3:01 PM As an update: the fix has now been installed for a couple of weeks and no more of these LBUG's have been seen, so the fix looks good...

Bruno Faccini (Inactive) added a comment - 09/Feb/14 5:15 PM

I need to indicate that after successful auto-tests, I made local extensive testing of http://review.whamcloud.com/8819/ and I can confirm it works as expected and with no MEs/MDs leak in Slabs.

Bruno Faccini (Inactive) added a comment - 09/Feb/14 5:15 PM I need to indicate that after successful auto-tests, I made local extensive testing of http://review.whamcloud.com/8819/ and I can confirm it works as expected and with no MEs/MDs leak in Slabs.

Bruno Faccini (Inactive) added a comment - 30/Jan/14 9:48 AM

Hello Hugo,
Thanks again for your help on this!
I just had a look into the recent crash-dump you uploaded in LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip, and I can confirm this the same scenario, with the same/-3 corruption of MD's md_user_ptr field/pointer.

Bruno Faccini (Inactive) added a comment - 30/Jan/14 9:48 AM Hello Hugo, Thanks again for your help on this! I just had a look into the recent crash-dump you uploaded in LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip, and I can confirm this the same scenario, with the same/-3 corruption of MD's md_user_ptr field/pointer.

Hugo Meiland added a comment - 29/Jan/14 3:12 PM

just to keep the info with ~~LU-3848~~ in sync:
Hi Bruno and Liang;
@Bruno: Thanks for the patch, I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more of crashes available right now, but I'm not sure if this would be a huge help at this time.
@Liang: Your patch has been integrated to our tree, but just after our upgrade to 2.1.6, so I'm not running this patch yet. It will be in the next lustre package.
I'll try to give you feedback on these as soon as possible.

Hugo Meiland added a comment - 29/Jan/14 3:12 PM just to keep the info with LU-3848 in sync: Hi Bruno and Liang; @Bruno: Thanks for the patch, I've asked Sebastien to prepare it for me to test at the client site. I do have plenty more of crashes available right now, but I'm not sure if this would be a huge help at this time. @Liang: Your patch has been integrated to our tree, but just after our upgrade to 2.1.6, so I'm not running this patch yet. It will be in the next lustre package. I'll try to give you feedback on these as soon as possible.

Bruno Faccini (Inactive) added a comment - 28/Jan/14 9:31 AM

Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same hind of corruption.

BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in Slabs, for both ~~LU-3848~~ and this ticket, it is available at http://review.whamcloud.com/8819/. It has already been successfully exposed to autotests, and you may want to give it a try. May be Sebastien could also validate it and try to integrate it in Bull distro?

Since this patch extracts LNET MEs and small-MDs allocation from generic <size-128> kmem_cache, I strongly expect that you will no longer encounter ~~LU-3848~~/LU-4330 LBUGs (and may be suffer new issues with others Kernel codes using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.

Bruno Faccini (Inactive) added a comment - 28/Jan/14 9:31 AM Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same hind of corruption. BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in Slabs, for both LU-3848 and this ticket, it is available at http://review.whamcloud.com/8819/ . It has already been successfully exposed to autotests, and you may want to give it a try. May be Sebastien could also validate it and try to integrate it in Bull distro? Since this patch extracts LNET MEs and small-MDs allocation from generic <size-128> kmem_cache, I strongly expect that you will no longer encounter LU-3848 / LU-4330 LBUGs (and may be suffer new issues with others Kernel codes using <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.

Hugo Meiland added a comment - 27/Jan/14 12:27 PM

Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period os 3-4 weeks). As mentioned earlier the lustre version has been upgraded to 2.1.6. The uploaded file is LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files...

Hugo Meiland added a comment - 27/Jan/14 12:27 PM Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period os 3-4 weeks). As mentioned earlier the lustre version has been upgraded to 2.1.6. The uploaded file is LU4330-tcn82-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files...

People

Assignee:: Bruno Faccini (Inactive)

Reporter:: Sebastien Buisson (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 13 Start watching this issue

Dates

Created:: 02/Dec/13 2:40 PM

Updated:: 14/Jun/18 9:41 PM