[LU-3010] client crashes on RHEL6 with Lustre 1.8.8 Created: 21/Mar/13 Updated: 06/Nov/13 Resolved: 06/Nov/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7331 |
| Description |
|
After upgrading the clients at NOAA to RHEL6 and Lustre 1.8.8, we're running into kernel panics. Here is the analysis from Red Hat: "I have checked the provided vmcore and below is my diagnosis. Please go through the analysis shared below and contact the Lustre team if there are any questions." KERNEL: vmlinux
[2] It seems there was some resource constraint on the PID in question, 2889. [3] I see a few segfaults and LustreErrors prior to the crash: crash> log PROLOGUE-CHKNODE Wed Mar 6 18:55:45 UTC 2013 Job 16422368: chk_node.pl Pid: 2889, comm: kiblnd_sd_04 Not tainted 2.6.32-279.5.2.el6.x86_64 #1 SGI.COM |
| Comments |
| Comment by Kit Westneat (Inactive) [ 21/Mar/13 ] |
|
Here are the messages: Before the crash there appears to be a network problem: Here is the vmcore: Let me know if there is any more information I can provide. |
| Comment by Oleg Drokin [ 21/Mar/13 ] |
|
Is this from our RPMs or did you build it yourself? If you built it, we also need the kernel-debuginfo RPM and the lustre-modules RPM to make the crashdump useful. |
| Comment by Oleg Drokin [ 21/Mar/13 ] |
|
Also the assertion is the same as old bugzilla 14238 that we landed a patch for quite a while ago (patch by Isaac http://git.whamcloud.com/gitweb?p=fs/lustre-dev.git;a=commitdiff;h=85f59695534ddd167fa491c091ed64b1504cdaf7 ) |
| Comment by Doug Oucharek (Inactive) [ 22/Mar/13 ] |
|
I'm seeing two things: 1. The log "Timed out RDMA..." indicates that we have not seen any activity from 10.175.31.242@o2ib4 for 65 seconds, so we are giving up on it and closing the connection. 2. There is only one assert in lnet_match_md(), and that fires when we see something unexpected in our MD queue. I am assuming this happened as a result of the closed connection. But that is no reason to assert (if something can feasibly happen, we should not be asserting). I checked the latest code tree and there seem to be some changes to how locks work in this area of the code, so a race condition may be possible here. I need to check with a couple of other devs more familiar with this area of the code to get their opinion. |
| Comment by Liang Zhen (Inactive) [ 22/Mar/13 ] |
|
First, the assertion is reasonable here: no matter what happened, @me has to equal @me->me_md->md_me unless me->me_md is NULL; that is how we implemented it. Also, the locking changes are only in 2.3 or later, and this bug is on 1.8.8, so at the very least it's not a new race condition; unfortunately, I didn't see any race even when I made those lock changes. |
| Comment by Oleg Drokin [ 22/Mar/13 ] |
|
Liang, there's a crashdump available, so we can check all the values there. |
| Comment by Kit Westneat (Inactive) [ 22/Mar/13 ] |
|
Hi Oleg, We built them ourselves. Here's the kernel-debug: Here's the lustre-modules: We haven't been able to find the lustre-modules debuginfo yet. Is that also needed? If so, could we potentially try to rebuild identical modules in order to regenerate the debug info? Thanks, |
| Comment by Oleg Drokin [ 22/Mar/13 ] |
|
Thanks. As far as the Lustre debug info goes, we don't really strip debug symbols in 1.8 (or in 2.x prior to 2.4), so just the lustre-modules RPM is enough; the debug symbols are embedded in the modules themselves. |
| Comment by Isaac Huang (Inactive) [ 22/Mar/13 ] |
|
I believe it's a bug in the generic LNet layer, i.e. under lnet/lnet. It has happened over different LNDs in the past, and shouldn't be LND-specific. My patch in Bug 14238 wasn't a real fix - it just closed a couple of cases where corruption or a dangling pointer could happen. The root cause was never found, due to a lack of debug information. But this time we have a good dump. |
| Comment by Kit Westneat (Inactive) [ 22/Mar/13 ] |
|
Oleg, yes, it is an unmodified kernel. FWIW, we are also seeing this at another customer site, but only their RHEL6 frontends are affected. In that case it appears to be coincident with OOM errors, so maybe it's related to memory pressure? Thanks |
| Comment by Kit Westneat (Inactive) [ 26/Mar/13 ] |
|
I was wondering how the analysis was going. Is there anything we can do to help? |
| Comment by Isaac Huang (Inactive) [ 27/Mar/13 ] |
|
Hi Kit, The kernel-debug-2.6.32-279.5.2.el6.x86_64.rpm you pointed to was a kernel with all sorts of run-time debugging options enabled; it's usually not used on production systems. Also, when I used the vmlinux from the corresponding debuginfo RPM, crash wouldn't even start. I then tried the debuginfo RPM for the normal kernel (i.e., without debugging options) of the same version, and crash started with a warning about a kernel version inconsistency between the vmlinux and the dumpfile. Can you please verify the client kernel version and point me to where the vmlinux is? With the vmlinux from kernel-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm, I got a panic that's different from the one reported. Instead of "(lib-move.c:184:lnet_match_md()) ASSERTION(me == md->md_me) failed" in process kiblnd_sd_04, I got: |
| Comment by Kit Westneat (Inactive) [ 27/Mar/13 ] |
|
Hi Isaac, It looks like the RHEL and CentOS kernel RPMs are slightly different, and they were running the RHEL one. I am getting the RHEL debuginfo now and will update the ticket once I get it uploaded. If you happen to have an RHN account, that should work too. |
| Comment by Kit Westneat (Inactive) [ 28/Mar/13 ] |
|
Here are the kerneldebug rpms: Thanks, |
| Comment by Isaac Huang (Inactive) [ 28/Mar/13 ] |
|
Strange - with the RHEL debuginfo, I got the same version warning and then crash failed with other errors. I'll continue to debug with the CentOS debuginfo, which seemed to work despite the version warning. Also, it seems that the original report was based on another crash dump, because both the DATE and UPTIME in this dump differ. It would give me more data if that dump is also available. Although this dump had a different failure, it's similar in many ways. |
| Comment by Kit Westneat (Inactive) [ 28/Mar/13 ] |
|
Oh doh, I uploaded the wrong debuginfo, 279.19.1 vs 279.5.2. I am uploading the correct one now and will send you the link. |
| Comment by Kit Westneat (Inactive) [ 28/Mar/13 ] |
|
correct kerneldebug rpms: |
| Comment by Isaac Huang (Inactive) [ 28/Mar/13 ] |
|
Thanks - this one worked perfectly, no warning at all. |
| Comment by Isaac Huang (Inactive) [ 29/Mar/13 ] |
|
I've been digging through the crash dump, and it appears that the kernel slab allocation state got corrupted somehow. The assertion failure on the callback pointer in ptlrpc_master_callback() was an indication of memory corruption, because the 'callback' pointer in the MD object is NEVER changed by the code after initialization. It looks almost impossible to be the result of racing code - no code changes that 'callback' pointer at all. Use-after-free can also be ruled out: the MD object is bigger than a page, and the crash dump contained only the initial part of the object, because the rest of the object resided in a free page that was not included in the partial dump - if it were use-after-free, both pages should have been excluded from the dump. Moreover, the initial part of the MD object contained a correct reference counter. So I was led to believe that the SLAB allocation state got screwed up and the same chunk of memory was returned to multiple callers, and thus the 'callback' pointer got messed up by someone else unknowingly. In the past, we've seen similar bugs where the root cause was never found, and the solution was to disable some experimental Lustre feature that looked suspicious. In fact, the culprit might not be Lustre at all - any kernel code is technically capable of such screw-ups. My suggestion is to: 2. Doug, I think it'd make sense to have some Lustre folks double-check the errors from the file system level. I was glancing over the code, and some of it didn't look good, e.g.: igrab() can return NULL yet the code didn't check for that. Maybe ll_file_join() wasn't even compiled; it was just something that raised my eyebrows. |
| Comment by Kit Westneat (Inactive) [ 01/Apr/13 ] |
|
Hi Isaac, Thanks for the analysis. I will try to get the debug kernel installed, but it might take a little bit of time. Would the full vmcore be of any use? Also this seems to appear fairly consistently with RHEL6 clients. Do you think it would be possible to look at differences in the kernel functions that the Lustre client calls? Perhaps there are too many, but if at all possible, it would be good to keep looking for the cause. |
| Comment by Oleg Drokin [ 01/Apr/13 ] |
|
Isaac: ll_file_join is (was) a known problematic area that was removed in later versions. |
| Comment by Isaac Huang (Inactive) [ 01/Apr/13 ] |
|
Hi Kit, a vmcore with free pages included would be a bit more helpful. Also, it wouldn't require kdump to go through the data structures tracking free pages, so in case those structures also got corrupted, kdump would still be able to create the dump - kdump may hang if asked to exclude free pages while those data structures are corrupted. At this point, I think kernel-debug is the best way to go - we'd have to be extremely lucky to find something useful by going through that much code almost blindly, like trying to walk out of a dark rain forest with little guidance. The callback pointer corruption was a good indication that somebody else stepped on our toes, because our code doesn't change it at all, and I've already double-checked all the pointer arithmetic in lnet to make sure it wasn't us shooting our own feet. Oleg, thanks for the comment on file join. I saw errors like ...: ... before the assertion happened. Does it ring any alarm bells for you? |
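For reference, whether free pages are kept in the vmcore is controlled by makedumpfile's dump level in the kdump configuration. A sketch of the relevant /etc/kdump.conf line on RHEL 6 follows; the exact default and compression flags vary per site, so treat this as illustrative only.

```shell
# /etc/kdump.conf fragment (illustrative, RHEL 6 style)
#
# makedumpfile's -d value is a bitmask of page types to EXCLUDE:
#   1 = zero pages, 2 = non-private cache, 4 = private cache,
#   8 = user pages, 16 = free pages
#
# A common default is -d 31, which excludes all five types.
# Dropping the 16 bit (-d 15) keeps free pages in the dump,
# as requested in the comment above.
core_collector makedumpfile -c -d 15
```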
| Comment by Oleg Drokin [ 01/Apr/13 ] |
|
Isaac, this error is a fairly common race. |
| Comment by Kit Westneat (Inactive) [ 09/Apr/13 ] |
|
Isaac, I've gotten a lot more stack traces from 1.8.9 clients. Some of them are only in the IB functions, not Lustre code, which I find interesting. Are you aware of any memory corruption bugs in the RHEL6 line of RDMA modules? I spent some time looking but couldn't find anything definitive. There is: BZ#873949 There is also a crash in Lustre code that doesn't seem to be related to IB: So it's very confusing. Here are all the stack traces, if you would like to take a look at them: Our next step, I think, is to put the debug kernel on some of the clients and see what happens. It's been impossible to find any pattern in what triggers the crashes, so it will be only by random chance that we reproduce it. Let me know if you see anything in the stack traces, or if you would like to see a vmcore. Do you know if there are already any prebuilt client RPMs built against the kernel-debug RPM that we could use? Thanks. |
| Comment by Isaac Huang (Inactive) [ 10/Apr/13 ] |
|
Hi, I've looked at the crashes and they all appear to be memory corruptions: bad pointer dereferences, inconsistent data structures caught by assertions, and even a BUG in mm/slab.c. They seem to have been triggered by memory pressure and the increased chance of races across the 24 CPUs. The crashes in IB had nothing to do with Lustre - the o2iblnd never uses the IB functions involved in those crashes. In the past, there have been a few memory corruption issues fixed in OFED, like the one you pointed out. But so far no clue has pointed at OFED - anything in kernel space could be the culprit. I still believe that kernel-debug would shed some light and move us closer to the root cause. Unfortunately, we don't build RPMs for kernel-debug. Another option would be to upgrade some clients to RHEL 6.4, which includes the fix for the IB memory corruption you mentioned, but that's likely more work than trying the kernel-debug of 6.3. |
| Comment by Andreas Dilger [ 06/Nov/13 ] |
|
Per Isaac's comments, this appears to be some form of memory corruption, possibly related to the IB code in the RHEL kernel. |