[LU-3848] Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed Created: 28/Aug/13 Updated: 24/Feb/16 Resolved: 05/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn1 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9966 |
| Description |
|
Hi, At a customer site, a compute node running Lustre crashes with the following error:
2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) ASSERTION( me == md->md_me ) failed:
2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) LBUG
2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11
2013-08-23 17:08:55
2013-08-23 17:08:55 Call Trace:
2013-08-23 17:08:55 [<ffffffffa04c47f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-08-23 17:08:55 [<ffffffffa04c4e07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-08-23 17:08:55 [<ffffffffa053c2bd>] lnet_match_md+0x35d/0x3a0 [lnet]
2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
2013-08-23 17:08:55 [<ffffffffa0541482>] lnet_parse+0xac2/0x1b80 [lnet]
2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
2013-08-23 17:08:55 [<ffffffffa08d177b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
2013-08-23 17:08:55 [<ffffffffa08d2560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
2013-08-23 17:08:55 [<ffffffffa08d2732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d2ae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8100412a>] child_rip+0xa/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
2013-08-23 17:08:55
2013-08-23 17:08:55 Kernel panic - not syncing: LBUG
2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11 Tainted: G --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
2013-08-23 17:08:55 Call Trace:
2013-08-23 17:08:55 [<ffffffff81495fe3>] ? panic+0xa0/0x168
2013-08-23 17:08:55 [<ffffffffa04c4e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2013-08-23 17:08:55 [<ffffffffa053c2bd>] ? lnet_match_md+0x35d/0x3a0 [lnet]
2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
2013-08-23 17:08:55 [<ffffffffa0541482>] ? lnet_parse+0xac2/0x1b80 [lnet]
2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
2013-08-23 17:08:55 [<ffffffffa08d177b>] ? kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
2013-08-23 17:08:55 [<ffffffffa08d2560>] ? kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
2013-08-23 17:08:55 [<ffffffffa08d2732>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d2ae9>] ? kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8100412a>] ? child_rip+0xa/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
The messages visible in the syslog 2 hours before the failed assertion are:
2013-08-23 15:26:09 LustreError: 11-0: an error occurred while communicating with 10.200.200.66@o2ib. The obd_ping operation failed with -107
2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection to scratch-OST0018 (at 10.200.200.66@o2ib) was lost; in progress operations using this service will wait for recovery to complete
2013-08-23 15:26:09 LustreError: 167-0: This client was evicted by scratch-OST0018; in progress operations using this service will fail.
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7d98[2 3 0 1 1 00000000] R(1):[4, 18446744073709551615]@[0x100180000:0x2e225e:0x0] {
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c8a0: [0 ffff880f31e5f268 R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603d78: ffff881072b1f240 40120002 0x4d3a65b709264355 3 ffff881012f1ae88 size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7d98
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7078[2 2 0 1 1 00000000] R(1):[2, 3]@[0x100180000:0x2e225e:0x0] {
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c920: [0 ffff8801fca3e4d8 R(1):[0, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603368: ffff88077ea89d80 40120002 0x4d3a65b709264363 2 (null) size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7078
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace scratch-OST0018-osc-ffff88087a659000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880e973c7480 (3023454/0/0/0) (rc: 2)
2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection restored to scratch-OST0018 (at 10.200.200.66@o2ib)
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff880e871c7698[2 4 0 2 0 00000000] R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0] {
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) vvp@ffff880dafcd7d30:
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lov@ffff880f31e5f268: 1
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) 0 0: ---
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try())
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880e871c7698
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5
After investigation, no error was seen on the InfiniBand network at the time of the Lustre node crash. Thanks, |
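For reference, the failed assertion checks the back-pointer between a match entry (ME) and the memory descriptor (MD) attached to it. The following is only an illustrative C sketch of that invariant, using the lnet_me_t/lnet_libmd_t member names visible in the crash output; it is not the actual lib-move.c code and the helper name is made up.
/* Illustrative sketch of the invariant behind
 * "ASSERTION( me == md->md_me ) failed" in lnet_match_md();
 * simplified, not the real Lustre source. */
static void lnet_check_me_md_backlink(lnet_me_t *me)
{
        lnet_libmd_t *md = me->me_md;

        if (md != NULL)
                /* an MD attached to an ME must point back to that same
                 * ME; the LBUG means this pointer pair was found to be
                 * inconsistent, i.e. corrupted in memory */
                LASSERT(me == md->md_me);
}
In other words, the LBUG only reports that the ME/MD pair is inconsistent; it does not by itself say which side was overwritten or by whom.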
| Comments |
| Comment by Bruno Faccini (Inactive) [ 28/Aug/13 ] |
|
Hello Seb! |
| Comment by Liang Zhen (Inactive) [ 28/Aug/13 ] |
|
this could be an instance of bz11130, which has existed for ages (I believe there are a few more instances of it in BZ, but I'm not authorized to access those tickets). I suspect it's a nasty memory corruption that is unrelated to LNet. |
| Comment by Kit Westneat (Inactive) [ 28/Aug/13 ] |
|
FYI we were getting this in |
| Comment by Peter Jones [ 28/Aug/13 ] |
|
Kit, thanks for the tip. Could you please share the details of the RH version you experienced the bug in, and which version contained the fix (or the RH ticket number if you know it)? Thanks. Bob, could you please oversee this issue? Peter |
| Comment by Kit Westneat (Inactive) [ 28/Aug/13 ] |
|
We were running 2.6.32-279.19.1.el6.x86_64.rpm. I think it is fixed in the RH 6.4 line; unfortunately I am not sure what the bz or errata was. My understanding is that no 6.3 kernel release has the fix. |
| Comment by Bob Glossman (Inactive) [ 28/Aug/13 ] |
|
Kit, thanks for those details. Sebastien, since the probable bug is in the upstream kernel, you will need to move to 6.4 in order to get the fix. We shifted Lustre support from 6.3 kernels to 6.4 kernels soon after the 2.1.5 Lustre version. You will need a Lustre build from b2_1 that matches the specific kernel version you move to, or build your own from source. The most recent version we support on the b2_1 branch is RHEL/CentOS 6.4, 2.6.32-358.11.1.el6 |
| Comment by Alexandre Louvet [ 29/Aug/13 ] |
|
Kit, Bob, Does this RDMA corruption have any relation to the patch lustre/kernel_patches/patches/ipoib-locking-fix.patch that was in |
| Comment by Kit Westneat (Inactive) [ 29/Aug/13 ] |
|
Hi Alexandre, I believe it was that patch. The customer is now running 2.6.32-279.32.1.el6, which is not public, but the changelog for it only has that patch. Strangely though, according to the RHEL errata the bug wasn't introduced until 279.22.. Perhaps there was another memory corruption bug that the patch fixed? In any case, this patch helped us with the LBUG in kiblnd.
|
| Comment by Bob Glossman (Inactive) [ 29/Aug/13 ] |
|
Going back over our change history, the patch being referred to was lustre/kernel_patches/patches/ipoib-locking-fix.patch. It was added to our kernel patch series for application to the 6.4 kernel (2.6.32-358.2.1.el6) in http://review.whamcloud.com/#/c/5952/3 and was deleted in http://review.whamcloud.com/#/c/6615/3 during our support upgrade to the later 6.4 version (2.6.32-358.11.1.el6), when that change landed in the upstream kernel. While it should work in principle, we have never applied or tested that patch on a 6.3 kernel ourselves. Or you can try to use the 2.6.32-279.32.1.el6 kernel that Kit says has the fix. Either of those options will probably work, but neither has been proven or tested here. |
| Comment by Sebastien Buisson (Inactive) [ 12/Sep/13 ] |
|
Hi, The patch lustre/kernel_patches/patches/ipoib-locking-fix.patch has been installed on-site, and so far it improves the situation very well: Lustre is much more stable now. Thanks, |
| Comment by Sebastien Buisson (Inactive) [ 07/Nov/13 ] |
|
Hi, Unfortunately I think we mixed up different issues: the problem initially described here was hit at the SARA customer site, whereas my last comment was only true for the MeteoFrance customer (please also note that this IPoIB patch was removed, as it was found to introduce a deadlock). So, back to active, as this issue has been hit 11 times during October. Thanks, |
| Comment by Bob Glossman (Inactive) [ 13/Nov/13 ] |
|
I very strongly suspect you are hitting this problem because you have removed the IPoIB patch. It was directly cribbed from upstream Linux and addressed precisely this sort of memory corruption in IB. If it causes a deadlock for you, the only remedy I can suggest is to move to a later kernel version that already incorporates the equivalent fix. As the problem is in upstream Linux, not in Lustre, there is not a lot we can do about it. |
| Comment by Sebastien Buisson (Inactive) [ 14/Nov/13 ] |
|
From |
| Comment by Kit Westneat (Inactive) [ 14/Nov/13 ] |
|
Hi Sebastien, The issue in (bz 928817) FWIW I don't think NOAA has seen any client deadlocks, but we did run into an OSS o2iblnd deadlock ( HTH, |
| Comment by Bob Glossman (Inactive) [ 14/Nov/13 ] |
|
Kit, Thanks for that reference. It was mentioned in the commit header of http://review.whamcloud.com/5952 that our ipoib patch came from upstream commit fa16ebed31f336e41970f3f0ea9e8279f6be2d27. I would have had a hard time tracking down a better pointer. |
| Comment by Sebastien Buisson (Inactive) [ 29/Nov/13 ] |
|
Hi, This issue is not fixed by the 'ipoib' patch. More precisely, we already run an OFED stack that integrates a more complete version of this patch, and it does not help. In order to help the investigation, we have uploaded a crash dump to Whamcloud's ftp, at ftp://ftp.whamcloud.com/uploads/vmcore-tcn127-LBUG . Sebastien. |
| Comment by Bruno Faccini (Inactive) [ 30/Nov/13 ] |
|
Hello Sebastien, |
| Comment by Hugo Meiland [ 02/Dec/13 ] |
|
added zip file to ftp with vmlinuz, systemmap and mlx4-ib.ko: /uploads/LU3848-vmlinuz-systemmap-mlx4_ib_module.zip |
| Comment by Hugo Meiland [ 02/Dec/13 ] |
|
new zip uploaded, containing vmlinux, systemmap and weak-updates |
| Comment by Peter Jones [ 02/Dec/13 ] |
|
Bruno, could you please take over assisting with this ticket? Thanks, Peter |
| Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ] |
|
Hugo, can you indicate the name of the 2nd file? I cannot list files in the upload area! |
| Comment by Bruno Faccini (Inactive) [ 03/Dec/13 ] |
|
Hmm, based on how the 1st file name was constructed, I think I was able to guess it; isn't it |
| Comment by Hugo Meiland [ 03/Dec/13 ] |
|
ack, this is the one... please let me know if you need additional info! |
| Comment by Bruno Faccini (Inactive) [ 05/Dec/13 ] |
|
Thanks Hugo! |
| Comment by Bruno Faccini (Inactive) [ 05/Dec/13 ] |
|
Argh, this is weird: it looks like the me->me_md pointer of the 1st ME linked on the_lnet.ln_portals[15]->ptl_mlist (there are 1536 on the linked list), as parsed in lnet_match_md(), is 3 bytes lower (!!) than the correct MD address ... :
crash> bt
PID: 3265 TASK: ffff881079e3c7b0 CPU: 0 COMMAND: "kiblnd_sd_19"
#0 [ffff88107aed7a48] machine_kexec at ffffffff8102902b
#1 [ffff88107aed7aa8] crash_kexec at ffffffff810a5292
#2 [ffff88107aed7b78] panic at ffffffff81495fea
#3 [ffff88107aed7bf8] lbug_with_loc at ffffffffa04fce5b [libcfs]
#4 [ffff88107aed7c18] lnet_match_md at ffffffffa05742bd [lnet]
#5 [ffff88107aed7cc8] lnet_parse at ffffffffa0579482 [lnet]
#6 [ffff88107aed7db8] kiblnd_handle_rx at ffffffffa090977b [ko2iblnd]
#7 [ffff88107aed7e08] kiblnd_rx_complete at ffffffffa090a560 [ko2iblnd]
#8 [ffff88107aed7e58] kiblnd_complete at ffffffffa090a732 [ko2iblnd]
#9 [ffff88107aed7e68] kiblnd_scheduler at ffffffffa090aae9 [ko2iblnd]
#10 [ffff88107aed7f48] kernel_thread at ffffffff8100412a
crash>
crash> p/x the_lnet.ln_portals[15]
$6 = {
ptl_mhash = 0x0,
ptl_mlist = {
next = 0xffff88083674e3c0,
prev = 0xffff880866eaebc0
},
ptl_msgq = {
next = 0xffff88107ce0d3d8,
prev = 0xffff88107ce0d3d8
},
ptl_ml_version = 0x35c7,
ptl_msgq_version = 0x1,
ptl_options = 0x5 <<<<<<<<----- LNET_PTL_LAZY | LNET_PTL_MATCH_WILDCARD
}
crash> p/x *(lnet_me_t *)0xffff88083674e3c0
$13 = {
me_list = {
next = 0xffff88032837b3c0,
prev = 0xffff88107ce0d3c8
},
me_lh = {
lh_hash_chain = {
next = 0xffff880865e2b450,
prev = 0xffff880413b34850
},
lh_cookie = 0x45d7862
},
me_match_id = {
nid = 0xffffffffffffffff,
pid = 0xffffffff
},
me_portal = 0xf,
me_match_bits = 0x0,
me_ignore_bits = 0xffffffffffffffff,
me_unlink = 0x1,
me_md = 0xffff88034c96543d <<<<<<------ VERY BAD!!
}
crash>
crash> p/x ((lnet_me_t *)0xffff88083674e3c0)->me_md
$18 = 0xffff88034c96543d
crash> p/x ((lnet_me_t *)0xffff88083674e3c0)->me_md->md_me
$19 = 0x83674e3c0000000
crash>
crash> p/x ((struct lnet_libmd *)0xffff88034c96543d)->md_me
$20 = 0x83674e3c0000000
crash> p/x ((struct lnet_libmd *)0xffff88034c965440)->md_me
$21 = 0xffff88083674e3c0
crash>
But how can this happen??!! The only place where the MD/me->me_md pointer is set is in LNetMDAttach(), and that happens right after the MD has been allocated from the slabs. |
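To illustrate the last point above (a rough sketch only, not the actual LNetMDAttach() code path; the helper name is made up): the two back-pointers are always assigned as a pair when an MD is attached to an ME, so a me_md value that is 3 bytes short of a valid MD address cannot be produced by this path and must come from an external overwrite of the slab object.
/* Rough sketch of the ME/MD cross-linking performed when an MD is
 * attached via LNetMDAttach(); cookie handling, refcounts and locking
 * are omitted and the helper name is illustrative. */
static void lnet_link_md_to_me_sketch(lnet_me_t *me, lnet_libmd_t *md)
{
        /* both pointers are set together, right after the MD has been
         * allocated, so neither side can legitimately end up
         * misaligned by 3 bytes */
        md->md_me = me;
        me->me_md = md;
}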
| Comment by Hugo Meiland [ 06/Dec/13 ] |
|
no useful info on drivers loading during the crash; for completeness, the syslog has been uploaded as |
| Comment by Liang Zhen (Inactive) [ 07/Dec/13 ] |
|
Because this kind of memory corruption can come from different modules, how about creating dedicated slab caches for both lnet_me_t and lnet_libmd_t (LU-4330)? If these are indeed defects in LNet, we should still be able to reproduce the issue; otherwise we should see different symptoms, which might help to find out the real reason. |
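For illustration, the dedicated-cache idea suggested above could look roughly like the sketch below (approach only, not the actual LU-4330 patch; the cache name and helper names are made up). Moving ME allocations out of the shared <size-128> kmem_cache means any cross-module scribble lands in a slab that is separately visible in /proc/slabinfo and much easier to attribute.
#include <linux/errno.h>
#include <linux/slab.h>

static struct kmem_cache *lnet_me_cachep;

static int lnet_me_cache_init(void)
{
        /* dedicated, named cache instead of the generic size-128 one */
        lnet_me_cachep = kmem_cache_create("lnet_me", sizeof(lnet_me_t),
                                           0, SLAB_HWCACHE_ALIGN, NULL);
        return lnet_me_cachep != NULL ? 0 : -ENOMEM;
}

static lnet_me_t *lnet_me_alloc_sketch(void)
{
        return kmem_cache_alloc(lnet_me_cachep, GFP_NOFS);
}

static void lnet_me_free_sketch(lnet_me_t *me)
{
        kmem_cache_free(lnet_me_cachep, me);
}
The same idea applies to lnet_libmd_t; Bruno's later patch (http://review.whamcloud.com/8819/) takes this general approach, extracting MEs and small MDs from the generic <size-128> kmem_cache.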
| Comment by Bruno Faccini (Inactive) [ 07/Dec/13 ] |
|
Hugo: thanks for the log, and yes I agree that no specific activity/msg can be found in the syslog at crash/LBUG time. Liang: thanks for your comment. It is still a bit early to directly link this ticket to LU-4330, because even though it concerns the same kind of corruption (-3), the same size-128 kmem-cache, and a node of the same cluster/system, the offset within the affected slab/structure is different. |
| Comment by Liang Zhen (Inactive) [ 10/Dec/13 ] |
|
btw, I just posted a patch to fix potential memory corruption: http://review.whamcloud.com/#/c/8511/ |
| Comment by Hugo Meiland [ 27/Jan/14 ] |
|
Bruno, Liang: This weekend I was able to obtain kernel coredumps again with this error (after a stable period of 3-4 weeks). As mentioned earlier, the Lustre version has been upgraded to 2.1.6. The uploaded file is LU3848-tcn23-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files... |
| Comment by Bruno Faccini (Inactive) [ 28/Jan/14 ] |
|
Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same kind of corruption. BTW, I have pushed a patch that should help to better identify the root cause of the MEs/MDs corruption in the slabs, for both LU-4330 and this ticket; it is available at http://review.whamcloud.com/8819/. It has already been successfully exposed to autotests, and you may want to give it a try. Maybe Sebastien could also validate it and try to integrate it in the Bull distro? Since this patch extracts LNET ME and small-MD allocations from the generic <size-128> kmem_cache, I strongly expect that you will no longer encounter |
| Comment by Liang Zhen (Inactive) [ 28/Jan/14 ] |
|
Hi Hugo, have you already applied the patch I mentioned previously (http://review.whamcloud.com/#/c/8511/)? It does fix a potential memory corruption issue, although I am not sure whether it is the reason for your issue here. |
| Comment by Hugo Meiland [ 29/Jan/14 ] |
|
Hi Bruno and Liang; |
| Comment by Bruno Faccini (Inactive) [ 30/Jan/14 ] |
|
Hello Hugo, |
| Comment by Bruno Faccini (Inactive) [ 09/Feb/14 ] |
|
I need to indicate that, after successful auto-tests, I did extensive local testing of http://review.whamcloud.com/8819/ and I can confirm it works as expected, with no ME/MD leaks in the slabs. |
| Comment by Hugo Meiland [ 05/Mar/14 ] |
|
As an update: the fix has now been installed for a couple of weeks and no more of these LBUGs have been seen, so the fix looks good... |
| Comment by Peter Jones [ 05/Mar/14 ] |
|
Great! Thanks for letting us know Hugo |
| Comment by Bruno Faccini (Inactive) [ 24/Feb/16 ] |
|
Follow on (master patch, for any other version, ...) to be tracked in LU-4330 now. |