[LU-3848] Compute node crashes due to error in Lustre : ASSERTION( me == md->md_me ) failed Created: 28/Aug/13  Updated: 24/Feb/16  Resolved: 05/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: mn1

Issue Links:
Related
is related to LU-4330 LustreError: 46336:0:(events.c:433:pt... Reopened
Severity: 3
Rank (Obsolete): 9966

 Description   

Hi,

At a customer site, a compute node running Lustre crashes with the following error:

2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) ASSERTION( me == md->md_me ) failed:
2013-08-23 17:08:55 LustreError: 3049:0:(lib-move.c:185:lnet_match_md()) LBUG
2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11
2013-08-23 17:08:55
2013-08-23 17:08:55 Call Trace:
2013-08-23 17:08:55 [<ffffffffa04c47f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-08-23 17:08:55 [<ffffffffa04c4e07>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-08-23 17:08:55 [<ffffffffa053c2bd>] lnet_match_md+0x35d/0x3a0 [lnet]
2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
2013-08-23 17:08:55 [<ffffffffa0541482>] lnet_parse+0xac2/0x1b80 [lnet]
2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
2013-08-23 17:08:55 [<ffffffffa08d177b>] kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
2013-08-23 17:08:55 [<ffffffffa08d2560>] kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
2013-08-23 17:08:55 [<ffffffffa08d2732>] kiblnd_complete+0x62/0xe0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d2ae9>] kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8100412a>] child_rip+0xa/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20
2013-08-23 17:08:55
2013-08-23 17:08:55 Kernel panic - not syncing: LBUG
2013-08-23 17:08:55 Pid: 3049, comm: kiblnd_sd_11 Tainted: G --------------- H 2.6.32-279.5.2.bl6.Bull.36.x86_64 #1
2013-08-23 17:08:55 Call Trace:
2013-08-23 17:08:55 [<ffffffff81495fe3>] ? panic+0xa0/0x168
2013-08-23 17:08:55 [<ffffffffa04c4e5b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2013-08-23 17:08:55 [<ffffffffa053c2bd>] ? lnet_match_md+0x35d/0x3a0 [lnet]
2013-08-23 17:08:55 [<ffffffffa053a8dd>] ? lnet_ni_recv+0xad/0x2f0 [lnet]
2013-08-23 17:08:55 [<ffffffffa0541482>] ? lnet_parse+0xac2/0x1b80 [lnet]
2013-08-23 17:08:55 [<ffffffff81042f33>] ? enqueue_task+0x43/0x90
2013-08-23 17:08:55 [<ffffffffa08d177b>] ? kiblnd_handle_rx+0x2cb/0x680 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8103d349>] ? __wake_up_common+0x59/0x90
2013-08-23 17:08:55 [<ffffffffa08d2560>] ? kiblnd_rx_complete+0x2d0/0x440 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81042cf3>] ? __wake_up+0x53/0x70
2013-08-23 17:08:55 [<ffffffffa08d2732>] ? kiblnd_complete+0x62/0xe0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d2ae9>] ? kiblnd_scheduler+0x339/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81048df0>] ? default_wake_function+0x0/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff8100412a>] ? child_rip+0xa/0x20
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffffa08d27b0>] ? kiblnd_scheduler+0x0/0x7a0 [ko2iblnd]
2013-08-23 17:08:55 [<ffffffff81004120>] ? child_rip+0x0/0x20

The messages visible in the syslog 2 hours before the failed assertion are:

2013-08-23 15:26:09 LustreError: 11-0: an error occurred while communicating with 10.200.200.66@o2ib. The obd_ping operation failed with -107
2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection to scratch-OST0018 (at 10.200.200.66@o2ib) was lost; in progress operations using this service will wait for recovery to complete
2013-08-23 15:26:09 LustreError: 167-0: This client was evicted by scratch-OST0018; in progress operations using this service will fail.
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7d98[2 3 0 1 1 00000000] R(1):[4, 18446744073709551615]@[0x100180000:0x2e225e:0x0] {
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c8a0: [0 ffff880f31e5f268 R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603d78: ffff881072b1f240 40120002 0x4d3a65b709264355 3 ffff881012f1ae88 size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7d98
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lock@ffff880e871c7078[2 2 0 1 1 00000000] R(1):[2, 3]@[0x100180000:0x2e225e:0x0] {
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) lovsub@ffff880c2c27c920: [0 ffff8801fca3e4d8 R(1):[0, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0]]
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) osc@ffff88104e603368: ffff88077ea89d80 40120002 0x4d3a65b709264363 2 (null) size: 26325 mtime: 1377264244 atime: 1377264247 ctime: 1377264244 blocks: 56
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) } lock@ffff880e871c7078
2013-08-23 15:26:09 LustreError: 24191:0:(osc_lock.c:816:osc_ldlm_completion_ast()) dlmlock returned -5
2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:749:ldlm_resource_complain()) Namespace scratch-OST0018-osc-ffff88087a659000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
2013-08-23 15:26:09 LustreError: 24191:0:(ldlm_resource.c:755:ldlm_resource_complain()) Resource: ffff880e973c7480 (3023454/0/0/0) (rc: 2)
2013-08-23 15:26:09 Lustre: scratch-OST0018-osc-ffff88087a659000: Connection restored to scratch-OST0018 (at 10.200.200.66@o2ib)
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lock@ffff880e871c7698[2 4 0 2 0 00000000] R(1):[4, 18446744073709551615]@[0x2006bc5f0:0x1ac23:0x0] {
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) vvp@ffff880dafcd7d30:
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) lov@ffff880f31e5f268: 1
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) 0 0: ---
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try())
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) } lock@ffff880e871c7698
2013-08-23 15:26:09 LustreError: 24071:0:(cl_lock.c:1413:cl_unuse_try()) unuse return -5

After investigation, no error was seen on the InfiniBand network at the time of the Lustre node crash.

Thanks,
Sebastien.



 Comments   
Comment by Bruno Faccini (Inactive) [ 28/Aug/13 ]

Hello Seb!
Is there a crash dump available? Is this a one-off occurrence?

Comment by Liang Zhen (Inactive) [ 28/Aug/13 ]

This could be an instance of bz11130, which has existed for ages (I believe there are a few more instances of it in BZ, but I am not authorized to access those tickets). I suspect it is a nasty memory corruption that is unrelated to LNet.

Comment by Kit Westneat (Inactive) [ 28/Aug/13 ]

FYI, we were getting this in LU-3010. It ended up being an RDMA bug in that version of Red Hat.

Comment by Peter Jones [ 28/Aug/13 ]

Kit

Thanks for the tip. Could you please share the RH version in which you experienced the bug and the version that contained the fix (or the RH ticket number, if you know it)?

Thanks

Bob

Could you please oversee this issue?

Peter

Comment by Kit Westneat (Inactive) [ 28/Aug/13 ]

We were running 2.6.32-279.19.1.el6.x86_64.rpm. I think it is fixed in the RH 6.4 line; unfortunately, I am not sure which bz or erratum it was. My understanding is that no 6.3 kernel release has the fix.

Comment by Bob Glossman (Inactive) [ 28/Aug/13 ]

Kit, thanks for those details.

Sebastien, since the probable bug is in the upstream kernel, you will need to move to 6.4 in order to get the fix. We shifted Lustre support from 6.3 kernels to 6.4 kernels soon after the Lustre 2.1.5 release. You will need a Lustre build from b2_1 that matches the specific kernel version you move to, or you can build your own from source. The most recent version we support on the b2_1 branch is RHEL/CentOS 6.4, 2.6.32-358.11.1.el6.

Comment by Alexandre Louvet [ 29/Aug/13 ]

Kit, Bob,

Does this RDMA corruption have any relation to the patch lustre/kernel_patches/patches/ipoib-locking-fix.patch that was in LU-2846, commit 0f1d64724480bdfa268d18153272c3484c90e578?

Comment by Kit Westneat (Inactive) [ 29/Aug/13 ]

Hi Alexandre,

I believe it was that patch. The customer is now running 2.6.32-279.32.1.el6, which is not public, but its changelog contains only that patch. Strangely though, according to the RHEL errata the bug was not introduced until 279.22. Perhaps there was another memory corruption bug that the patch fixed? In any case, this patch helped us with the LBUG in kiblnd.

Kit

Comment by Bob Glossman (Inactive) [ 29/Aug/13 ]

Going back over our change history, the patch being referred to was

lustre/kernel_patches/patches/ipoib-locking-fix.patch

It was added to our kernel patch series for application to the 6.4 kernel (2.6.32-358.2.1.el6) in http://review.whamcloud.com/#/c/5952/3

It was deleted in http://review.whamcloud.com/#/c/6615/3 during our support upgrade to the later 6.4 version (2.6.32-358.11.1.el6) when that change landed in the upstream kernel.

While it should work in principle, we have never applied or tested that patch on a 6.3 kernel ourselves. Alternatively, you can try the 2.6.32-279.32.1.el6 kernel that Kit says has the fix. Either of those options will probably work, but neither has been proofed or tested here.

Comment by Sebastien Buisson (Inactive) [ 12/Sep/13 ]

Hi,

The patch lustre/kernel_patches/patches/ipoib-locking-fix.patch has been installed on-site, and so far it has improved the situation considerably: Lustre is much more stable now.

Thanks,
Sebastien.

Comment by Sebastien Buisson (Inactive) [ 07/Nov/13 ]

Hi,

Unfortunately, I think we mixed up different issues: the problem initially described here was hit at the SARA customer site, whereas my last comment only applied to the MeteoFrance customer (please also note that this IPoIB patch was removed, as it was found to introduce a deadlock).

So, back to active: this issue was hit 11 times during October.

Thanks,
Sebastien.

Comment by Bob Glossman (Inactive) [ 13/Nov/13 ]

I very strongly suspect you are hitting this problem because you have removed the IPoIB patch. It was cribbed directly from upstream Linux and addressed precisely this sort of memory corruption in IB. If it causes a deadlock for you, the only remedy I can suggest is moving to a later kernel version that already incorporates the equivalent fix. Since the problem is in upstream Linux, not in Lustre, there is not a lot we can do about it.

Comment by Sebastien Buisson (Inactive) [ 14/Nov/13 ]

From LU-3010, it seems that a modification in the RHEL 6.4 kernels obsoletes the ipoib-locking-fix patch.
Can we have a pointer to this modification?

Comment by Kit Westneat (Inactive) [ 14/Nov/13 ]

Hi Sebastien,

The issue in LU-3010 was resolved by the ipoib patch, which RHEL included in the 6.4 kernel:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.4_Technical_Notes/kernel.html#RHSA-2013-0911

(bz 928817)

FWIW I don't think NOAA has seen any client deadlocks, but we did run into an OSS o2iblnd deadlock (LU-3596).

HTH,
Kit

Comment by Bob Glossman (Inactive) [ 14/Nov/13 ]

Kit,

Thanks for that reference. It was mentioned in the commit header of http://review.whamcloud.com/5952 that our ipoib patch came from upstream commit fa16ebed31f336e41970f3f0ea9e8279f6be2d27. I would have had a hard time tracking down a better pointer.

Comment by Sebastien Buisson (Inactive) [ 29/Nov/13 ]

Hi,

This issue is not fixed by the 'ipoib' patch. In fact, we already run an OFED stack that integrates a more complete version of this patch, and it does not help.

In order to help the investigation, we have uploaded a crash dump to Whamcloud's FTP, at ftp://ftp.whamcloud.com/uploads/vmcore-tcn127-LBUG .
Please tell us if you need anything else.

Sebastien.

Comment by Bruno Faccini (Inactive) [ 30/Nov/13 ]

Hello Sebastien,
It seems we are back to square one on this, and your upload answers my very first question!
BTW, since this appears to be the only vmcore file (the header shows a 2.6.32-279.5.2.bl6.Bull.36.x86_64 kernel), can you also provide the associated vmlinux (or kernel-debuginfo-* RPMs) and the Lustre modules?
Also, does this still occur when running 2.1.3?
Thanks again and in advance.

Comment by Hugo Meiland [ 02/Dec/13 ]

Added a zip file to the FTP with vmlinuz, System.map and mlx4_ib.ko: /uploads/LU3848-vmlinuz-systemmap-mlx4_ib_module.zip

Comment by Hugo Meiland [ 02/Dec/13 ]

A new zip has been uploaded, containing vmlinux, System.map and the weak-updates modules.

Comment by Peter Jones [ 02/Dec/13 ]

Bruno

Could you please take over assisting with this ticket?

Thanks

Peter

Comment by Bruno Faccini (Inactive) [ 02/Dec/13 ]

Hugo, can you give me the name of the second file? I cannot list files in the upload area.

Comment by Bruno Faccini (Inactive) [ 03/Dec/13 ]

Hmm, based on the naming of the first file I think I was able to guess it: isn't it LU-3848-vmlinux-systemmap-weak_updates.zip?

Comment by Hugo Meiland [ 03/Dec/13 ]

Ack, this is the one. Please let me know if you need additional info!
Hugo

Comment by Bruno Faccini (Inactive) [ 05/Dec/13 ]

Thanks Hugo!
In the kernel log of the crash dump you uploaded, I see some OOM traces and also the VTune driver being loaded. Since there are no timestamps, can you check in the syslog (or upload it too?) whether this occurs around the time of the LBUG?
Continuing crash-dump analysis.

Comment by Bruno Faccini (Inactive) [ 05/Dec/13 ]

Argh, this is weird: it looks like the me->me_md pointer of the first ME linked on the_lnet.ln_portals[15]->ptl_mlist (there are 1536 entries on the linked list) being parsed in lnet_match_md() is 3 bytes lower (!!) than the correct MD address:

crash> bt
PID: 3265   TASK: ffff881079e3c7b0  CPU: 0   COMMAND: "kiblnd_sd_19"
 #0 [ffff88107aed7a48] machine_kexec at ffffffff8102902b
 #1 [ffff88107aed7aa8] crash_kexec at ffffffff810a5292
 #2 [ffff88107aed7b78] panic at ffffffff81495fea
 #3 [ffff88107aed7bf8] lbug_with_loc at ffffffffa04fce5b [libcfs]
 #4 [ffff88107aed7c18] lnet_match_md at ffffffffa05742bd [lnet]
 #5 [ffff88107aed7cc8] lnet_parse at ffffffffa0579482 [lnet]
 #6 [ffff88107aed7db8] kiblnd_handle_rx at ffffffffa090977b [ko2iblnd]
 #7 [ffff88107aed7e08] kiblnd_rx_complete at ffffffffa090a560 [ko2iblnd]
 #8 [ffff88107aed7e58] kiblnd_complete at ffffffffa090a732 [ko2iblnd]
 #9 [ffff88107aed7e68] kiblnd_scheduler at ffffffffa090aae9 [ko2iblnd]
#10 [ffff88107aed7f48] kernel_thread at ffffffff8100412a
crash> 

crash> p/x the_lnet.ln_portals[15]
$6 = {
  ptl_mhash = 0x0, 
  ptl_mlist = {
    next = 0xffff88083674e3c0, 
    prev = 0xffff880866eaebc0
  }, 
  ptl_msgq = {
    next = 0xffff88107ce0d3d8, 
    prev = 0xffff88107ce0d3d8
  }, 
  ptl_ml_version = 0x35c7, 
  ptl_msgq_version = 0x1, 
  ptl_options = 0x5   <<<<<<<<----- LNET_PTL_LAZY | LNET_PTL_MATCH_WILDCARD
}
crash> p/x *(lnet_me_t *)0xffff88083674e3c0
$13 = {
  me_list = {
    next = 0xffff88032837b3c0, 
    prev = 0xffff88107ce0d3c8
  }, 
  me_lh = {
    lh_hash_chain = {
      next = 0xffff880865e2b450, 
      prev = 0xffff880413b34850
    }, 
    lh_cookie = 0x45d7862
  }, 
  me_match_id = {
    nid = 0xffffffffffffffff, 
    pid = 0xffffffff
  }, 
  me_portal = 0xf, 
  me_match_bits = 0x0, 
  me_ignore_bits = 0xffffffffffffffff, 
  me_unlink = 0x1, 
  me_md = 0xffff88034c96543d   <<<<<<------ VERY BAD!!
}
crash> 
crash> p/x ((lnet_me_t *)0xffff88083674e3c0)->me_md
$18 = 0xffff88034c96543d
crash> p/x ((lnet_me_t *)0xffff88083674e3c0)->me_md->md_me
$19 = 0x83674e3c0000000
crash> 
crash> p/x ((struct lnet_libmd *)0xffff88034c96543d)->md_me
$20 = 0x83674e3c0000000
crash> p/x ((struct lnet_libmd *)0xffff88034c965440)->md_me
$21 = 0xffff88083674e3c0
crash> 

But how can this happen?! The only place where me->me_md is set is in LNetMDAttach(), and that is only after the MD has been allocated from the slabs.
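
For context, the failed assertion checks the ME/MD cross-link that LNetMDAttach() establishes: the ME's me_md must point at an MD whose md_me points back at that same ME. Below is a minimal, self-contained sketch of that invariant and of the 3-byte pointer skew seen in this dump (0xffff88034c965440 - 0xffff88034c96543d = 3); the structure and function names are simplified stand-ins, not the actual LNet source.

/*
 * Simplified illustration of the cross-link invariant behind
 * ASSERTION( me == md->md_me ); the real code lives in LNet's
 * lib-md.c / lib-move.c and uses lnet_me_t / lnet_libmd_t.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct libmd;                      /* MD: memory descriptor            */
struct me {                        /* ME: match entry on a portal      */
        struct libmd *me_md;       /* forward link, set at attach time */
};
struct libmd {
        struct me *md_me;          /* back link, set at attach time    */
};

/* Roughly what LNetMDAttach() does: pair the MD with its ME. */
static void md_attach(struct me *me, struct libmd *md)
{
        me->me_md = md;
        md->md_me = me;
}

/* Roughly what lnet_match_md() asserts while walking the ME list. */
static void match_md(struct me *me)
{
        struct libmd *md = me->me_md;
        assert(me == md->md_me);   /* the LASSERT that fired here */
}

int main(void)
{
        struct me me;
        struct libmd md;

        md_attach(&me, &md);
        match_md(&me);             /* consistent links: no assertion */

        /* In the crash dump the stored pointer is 3 bytes low, so the
         * back link is read from the wrong place and the LASSERT fires. */
        me.me_md = (struct libmd *)((uintptr_t)&md - 3);
        printf("corrupted me_md is %zu bytes below the real MD\n",
               (size_t)((uintptr_t)&md - (uintptr_t)me.me_md));
        return 0;
}

The point of the sketch is only that nothing in the normal attach path can produce such a skewed pointer; a stray write into the slab object from elsewhere can.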

Comment by Hugo Meiland [ 06/Dec/13 ]

No useful info on driver loading during the crash; for completeness, the syslog has been uploaded as LU-3848-syslog.zip.

Comment by Liang Zhen (Inactive) [ 07/Dec/13 ]

Because this kind of memory corruption can come from different modules, how about creating dedicated slab caches for both lnet_me_t and lnet_libmd_t (LU-4330)? If these are indeed defects in LNet, we can still reproduce them; otherwise we should see different symptoms, which might help to find the real cause.

Comment by Bruno Faccini (Inactive) [ 07/Dec/13 ]

Hugo: thanks for the log, and yes, I agree that no specific activity or messages can be found in the syslog at crash/LBUG time.

Liang: thanks for your comment.
The same kind of corruption occurred in an MD for LU-4330 and in an ME for this ticket, both of which are allocated from the generic size-128 slab/kmem-cache. So, as per your advice, I will try to develop a patch where MEs and small MDs (<128 bytes) are allocated from a dedicated kmem-cache (a rough sketch of that approach follows below). Specifically for MDs, I may also try to use/define LNET_USE_LIB_FREELIST, so that we use full-size, pre-allocated MDs.

It is still a bit early to link this ticket directly to LU-4330: even though it involves the same kind of corruption (-3), the same size-128 kmem-cache and a node of the same cluster/system, the offset into the affected slab/structure is different.
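
To illustrate the dedicated-cache idea in general terms, here is a sketch built only on the standard kernel slab API; it is not the actual patch (http://review.whamcloud.com/8819/), and the cache name and structure layout are hypothetical.

/*
 * Hypothetical example: give MEs their own kmem_cache instead of the
 * generic size-128 cache, so a stray write from another size-128 user
 * hits (and is attributed to) a clearly identifiable slab.
 */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/types.h>
#include <linux/errno.h>

struct demo_me {                     /* stand-in for lnet_me_t */
        struct list_head me_list;
        u64              me_match_bits;
        u64              me_ignore_bits;
        void            *me_md;
};

static struct kmem_cache *demo_me_cache;

/* Helpers the LNet code would call instead of its generic allocation
 * macros (which end up in the shared kmalloc size-128 cache). */
static inline struct demo_me *demo_me_alloc(void)
{
        return kmem_cache_zalloc(demo_me_cache, GFP_NOFS);
}

static inline void demo_me_free(struct demo_me *me)
{
        kmem_cache_free(demo_me_cache, me);
}

static int __init demo_init(void)
{
        /* The dedicated cache shows up by name in /proc/slabinfo. */
        demo_me_cache = kmem_cache_create("demo_lnet_me",
                                          sizeof(struct demo_me), 0,
                                          SLAB_HWCACHE_ALIGN, NULL);
        return demo_me_cache ? 0 : -ENOMEM;
}

static void __exit demo_exit(void)
{
        kmem_cache_destroy(demo_me_cache);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

With a dedicated cache, any corruption of ME objects lands in a slab that only LNet uses, which narrows the list of suspects and makes a recurrence much easier to diagnose.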

Comment by Liang Zhen (Inactive) [ 10/Dec/13 ]

btw, I just posted a patch to fix potential memory corruption: http://review.whamcloud.com/#/c/8511/

Comment by Hugo Meiland [ 27/Jan/14 ]

Bruno, Liang: this weekend I was able to obtain kernel core dumps with this error again (after a stable period of 3-4 weeks). As mentioned earlier, the Lustre version has been upgraded to 2.1.6. The uploaded file is LU3848-tcn23-vmcore-vmlinux-systemmap-weak-updates.zip; please let me know if you need additional files...

Comment by Bruno Faccini (Inactive) [ 28/Jan/14 ]

Hello Hugo, I will review the new crash-dump you uploaded to verify we still encounter the same kind of corruption.

BTW, I have pushed a patch that should help to better identify the root cause of the ME/MD corruption in the slabs, for both LU-4330 and this ticket; it is available at http://review.whamcloud.com/8819/. It has already been successfully exposed to autotests, and you may want to give it a try. Maybe Sebastien could also validate it and try to integrate it into the Bull distro?

Since this patch moves the allocation of LNET MEs and small MDs out of the generic <size-128> kmem_cache, I strongly expect that you will no longer encounter the LU-3848/LU-4330 LBUGs (and may instead see new issues in other kernel code that uses <size-128> …). But if not, and you are able to reproduce the LBUGs, then I will need to investigate further.

Comment by Liang Zhen (Inactive) [ 28/Jan/14 ]

Hi Hugo, have you already applied the patch I mentioned previously (http://review.whamcloud.com/#/c/8511/)? It does fix a potential memory corruption issue, although I am not sure whether it is the cause of your issue here.

Comment by Hugo Meiland [ 29/Jan/14 ]

Hi Bruno and Liang;
@Bruno: Thanks for the patch; I have asked Sebastien to prepare it for me to test at the client site. I do have plenty more crash dumps available right now, but I am not sure whether they would be a huge help at this time.
@Liang: Your patch has been integrated into our tree, but only just after our upgrade to 2.1.6, so I am not running it yet. It will be in the next Lustre package.
I'll try to give you feedback on these as soon as possible.

Comment by Bruno Faccini (Inactive) [ 30/Jan/14 ]

Hello Hugo,
Thanks again for your help on this!
I just had a look at the recent crash dump you uploaded in LU3848-tcn23-vmcore-vmlinux-systemmap-weak-updates.zip, and I can confirm it is the same scenario; the only small difference is that the me_md field corruption offset is -1 instead of the -3 seen before.

Comment by Bruno Faccini (Inactive) [ 09/Feb/14 ]

Following up: after the successful auto-tests, I did extensive local testing of http://review.whamcloud.com/8819/ and can confirm that it works as expected, with no ME/MD leaks in the slabs.

Comment by Hugo Meiland [ 05/Mar/14 ]

As an update: the fix has now been installed for a couple of weeks and no more of these LBUGs have been seen, so the fix looks good.

Comment by Peter Jones [ 05/Mar/14 ]

Great! Thanks for letting us know, Hugo.

Comment by Bruno Faccini (Inactive) [ 24/Feb/16 ]

Follow-on work (master patch, for any other version, ...) is to be tracked in LU-4330 now.
