[LU-1663] MDS threads hang for over 725s, causing failover Created: 23/Jul/12 Updated: 04/Oct/13 Resolved: 04/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn8, patch | ||
| Environment: |
Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443 |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 4055 |
| Description |
|
At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It runs OK for a couple of days, and then whichever MDS is currently active locks up and fails over to the other one. lfs1, however, is not affected, even though it runs an identical setup as far as hardware goes. We have the stack traces that get logged, but not the lustre-logs, as they have been on tmpfs; we've changed the debug_file location, so hopefully we'll get the next batch. I'll put a sampling of the interesting call traces here and attach the rest.
Here is the root cause of the failover: the health_check times out and prints NOT HEALTHY, which causes HA to fail over:
This one makes it look like it might be quota related:
This one looks a little like |
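For context, a minimal sketch (not the site's actual HA agent) of the kind of poll that drives this failover, assuming the standard Lustre 1.8 /proc/fs/lustre/health_check interface, which reads back "healthy" normally and "NOT HEALTHY" once a service thread has been stuck past the watchdog limit:

#include <stdio.h>
#include <string.h>

/* Returns 1 if the node reports healthy, 0 otherwise.  The HA layer
 * fails the MDS over when this check reports unhealthy (or times out,
 * as in the traces referenced above). */
static int lustre_is_healthy(void)
{
        char buf[128] = "";
        FILE *f = fopen("/proc/fs/lustre/health_check", "r");

        if (!f)
                return 0;               /* treat an unreadable file as unhealthy */
        if (!fgets(buf, sizeof(buf), f))
                buf[0] = '\0';
        fclose(f);
        return strstr(buf, "NOT HEALTHY") == NULL;
}

int main(void)
{
        printf("%s\n", lustre_is_healthy() ? "healthy" : "NOT HEALTHY");
        return 0;
}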
| Comments |
| Comment by Kit Westneat (Inactive) [ 23/Jul/12 ] |
|
grepped out call traces |
| Comment by Kit Westneat (Inactive) [ 23/Jul/12 ] |
|
mds1 log |
| Comment by Kit Westneat (Inactive) [ 23/Jul/12 ] |
|
mds2 log |
| Comment by Peter Jones [ 24/Jul/12 ] |
|
Oleg is already looking into this. |
| Comment by Oleg Drokin [ 24/Jul/12 ] |
|
Kit, in your logs you only show the MDS side. |
| Comment by Kit Westneat (Inactive) [ 24/Jul/12 ] |
|
Oleg, I will try to get this from the customer for the next failover event. Are there any additional logs from the MDS that might be useful? |
| Comment by Oleg Drokin [ 24/Jul/12 ] |
|
Well, getting a matching dlmtrace-enabled log from the MDS at the same time would be helpful too. |
| Comment by Kit Westneat (Inactive) [ 30/Jul/12 ] |
|
logs from the client |
| Comment by Kit Westneat (Inactive) [ 30/Jul/12 ] |
|
MDS logs |
| Comment by Kit Westneat (Inactive) [ 30/Jul/12 ] |
|
These are logs from the last time the failover happened, before the debugging got turned on. It looks like many of the clients are dumping stacks on close operations:
/var/log/messages-20120722:Jul 21 12:22:25 r32i3n11 kernel: [<ffffffffa06dbf25>] ll_close_inode_openhandle+0x175/0x6a0 [lustre]
/var/log/messages-20120722:Jul 21 12:21:21 r32i3n5 kernel: [<ffffffffa06dbf25>] ll_close_inode_openhandle+0x175/0x6a0 [lustre]
etc. |
| Comment by Kit Westneat (Inactive) [ 15/Aug/12 ] |
|
It happened again, but we were unable to get the debug logs. I've modified the failover script to kernel panic the MDS, so hopefully we'll get a core dump. Is there anything useful in the logs I uploaded before? I can upload the logs we did get from this last event, but it looks like more of the same. |
| Comment by Kit Westneat (Inactive) [ 22/Aug/12 ] |
|
NOAA has had to turn off the dlmtrace debug, as it causes too big a performance hit. I'm still trying to get an lctl dk from one of the events, though. Can I get an opinion on the logs that are uploaded? Anything interesting in them at all? |
| Comment by Kit Westneat (Inactive) [ 22/Aug/12 ] |
|
Also, this bug looks similar in terms of symptoms, though it seems to be more directly related to unlinks than to closes: |
| Comment by Oleg Drokin [ 22/Aug/12 ] |
|
The last set of logs looks to be pretty different from the first set. |
| Comment by Peter Jones [ 27/Aug/12 ] |
|
Kit, would it be possible to collect a crashdump the next time that this occurs? Thanks, Peter |
| Comment by Kit Westneat (Inactive) [ 12/Sep/12 ] |
|
We've been trying to get stack traces and crashdumps, but for one reason or another it hasn't worked. We did find a lustre-log recently that may or may not be related; I will upload it anyway. We also recorded some information after the last failover, but without the stack trace or crashdump, I'm afraid it's useless. I will upload it anyway; maybe you all can suggest some other information to get. |
| Comment by Kit Westneat (Inactive) [ 12/Sep/12 ] |
|
lctl dk from certain clients, logs from MDS (cut off by premature forced kernel panic) |
| Comment by Kit Westneat (Inactive) [ 12/Sep/12 ] |
|
lustre log from 9/10 |
| Comment by Kit Westneat (Inactive) [ 13/Sep/12 ] |
|
The customer is getting anxious about this issue. We are testing our kdump configuration now to ensure that we have a crashdump, but can you please review the logs and see if there is any more information we should gather? The next hang needs to be the last. Thanks, Kit. |
| Comment by Kit Westneat (Inactive) [ 25/Feb/13 ] |
|
kern.log from most recent failover (with sysrq-t) |
| Comment by Kit Westneat (Inactive) [ 25/Feb/13 ] |
|
Hi, we were finally able to get some good debugging data from the system during a failover. Here is the sysrq-t from before it went down. I also managed to save off all the lustre-logs; there are a ton, as you can see in the kern.log. Let me know if you want to see them. I noticed that there was a slab reclaim going on. Wasn't there another issue with slab reclaims taking a long time? |
| Comment by Kit Westneat (Inactive) [ 01/Mar/13 ] |
|
Any update? Thanks, |
| Comment by Oleg Drokin [ 04/Mar/13 ] |
|
Hm, I do not see any obvious culprits in the traces. Also, I see an umount in the logs, but I assume that happens after the system has been wedged for some time and your scripts just attempt it? |
| Comment by Kit Westneat (Inactive) [ 04/Mar/13 ] |
|
Hi Oleg, that's correct: the system tried to umount once the OSTs went unhealthy, but failed. The lustre-log looks pretty slim; hopefully you can find something I missed. I will try to get a crashdump. Unfortunately, all the ones we've been able to get so far have been incomplete. I'll try to come up with a way to get a full crashdump. Are there any other logs/debug information we can get that won't have too much impact on the running system? Thanks, |
| Comment by Oleg Drokin [ 06/Mar/13 ] |
|
Getting a full crashdump from a wedged transaction commit like this would be the most useful, I suspect. |
| Comment by Oleg Drokin [ 04/Apr/13 ] |
|
I suspect what you are seeing is a generic jbd2 transaction wraparound problem. |
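(For context: jbd2 transaction IDs are 32-bit counters that eventually wrap, so the kernel compares them with signed-difference helpers such as tid_gt()/tid_geq() rather than plain < or >. The wraparound bugs fixed upstream involve code paths that wait on a specific tid without handling the wrap, so after enough transactions a waiter can block forever. Below is a minimal standalone sketch of the wrap-safe comparison idiom; the helper mirrors the one in include/linux/jbd2.h, while the demo harness around it is purely illustrative.)

#include <stdio.h>

typedef unsigned int tid_t;

/* Wrap-safe "x is newer than y": casting the unsigned difference to a
 * signed int keeps the ordering correct even after tid_t wraps 2^32. */
static int tid_gt(tid_t x, tid_t y)
{
        int difference = (int)(x - y);
        return difference > 0;
}

int main(void)
{
        tid_t committed = 0xfffffff0u;  /* tid just before the wrap */
        tid_t target    = 0x00000010u;  /* tid just after the wrap  */

        /* A naive comparison says the newer tid is "older"... */
        printf("naive : target > committed -> %d\n", target > committed);
        /* ...while the signed-difference idiom gets it right. */
        printf("tid_gt: tid_gt(target, committed) -> %d\n",
               tid_gt(target, committed));
        return 0;
}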
| Comment by Kit Westneat (Inactive) [ 04/Apr/13 ] |
|
I don't see that bug fixed in any RHEL5 kernel. Since Red Hat no longer breaks out their patches, it seems like it would be difficult to figure out what they did to fix it. I will look at the SRPMs for the jbd stuff to see if I can figure it out. Have you seen the transaction wraparound problem in Lustre before? Are there any workarounds? It seems like we can't be the first to hit it. |
| Comment by Oleg Drokin [ 04/Apr/13 ] |
|
I have a high-load test system for Lustre regression testing, and I am hitting two sorts of jbd issues somewhat frequently: RH bz735768 and some jbd thread lockups that I think are related. Some of them are somewhat similar to what you see, I think. I was doing some more investigation into that and found that they are apparently known problems if you have high transaction rates (which I certainly do). I suspect you can cherry-pick the patches from the mainline Linux kernel. |
| Comment by Kit Westneat (Inactive) [ 04/Apr/13 ] |
|
It looks like the patch RH took to solve bz735768 is this one: Oh, I just realized that the debian-kernel patch references the two commits already, doh. Are there any plans to carry the debian-kernel patch until RH backports it? How risky do you think it would be to backport them to RHEL5? |
| Comment by Kit Westneat (Inactive) [ 15/Apr/13 ] |
|
We finally got a full vmcore. It's 12GB. Is there a good place to put it? Alternatively, I could run commands on it. What would be the best way to prove that it's the tid wraparound issue? Thanks. |
| Comment by Kit Westneat (Inactive) [ 15/Apr/13 ] |
|
Backtrace: Dennis pointed out that it looks like |
| Comment by Oleg Drokin [ 18/Apr/13 ] |
|
Yes, this latest trace is a different issue, similar to LU-3071, but in your trace there is no Lustre in the picture at all. I checked in a RHEL5 kernel I have around and the bug is there in fs/mbcache.c:

struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache)
{
        struct mb_cache_entry *ce;

        atomic_inc(&cache->c_entry_count);
        ce = kmem_cache_alloc(cache->c_entry_cache, GFP_KERNEL);   <===== BUG

In RHEL6 I see they actually pass gfp flags into this function, to be used by the alloc, to avoid this; you might be able to hunt down the patch that fixes that, I guess. |
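(For reference, this is the shape of the upstream fix alluded to here, and which the git.kernel.org/OpenVZ links later in the ticket point at: mb_cache_entry_alloc() gains a gfp_t argument so that filesystem callers can pass GFP_NOFS and keep direct reclaim from re-entering the filesystem. A simplified sketch, not the verbatim patch; entry-field initialization is elided.)

struct mb_cache_entry *
mb_cache_entry_alloc(struct mb_cache *cache, gfp_t gfp_flags)
{
        struct mb_cache_entry *ce;

        /* The allocation now honours the caller's gfp mask; with
         * GFP_NOFS, reclaim will not recurse back into the filesystem. */
        ce = kmem_cache_alloc(cache->c_entry_cache, gfp_flags);
        if (ce) {
                atomic_inc(&cache->c_entry_count);
                /* ...remaining entry initialization as before... */
        }
        return ce;
}

/* Caller side (e.g. the ext3/ext4 xattr cache code) then passes:   */
/*     ce = mb_cache_entry_alloc(ext3_xattr_cache, GFP_NOFS);       */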
| Comment by Oleg Drokin [ 18/Apr/13 ] |
|
As for the vmcore, can you put it somewhere where we can get it? We would also need the kernel-debuginfo you used, at the very least. The Lustre modules might be useful too, just in case. |
| Comment by Kit Westneat (Inactive) [ 23/Apr/13 ] |
|
Does this patch look correct? http://git.openvz.org/?p=linux-2.6.32-openvz;a=commitdiff;h=335e92e8a515420bd47a6b0f01cb9a206c0ed6e4 We are just using the default 1.8.9 kernel: Dennis is uploading the coredump now and will update the ticket when it is finished. |
| Comment by Kit Westneat (Inactive) [ 24/Apr/13 ] |
|
I pushed the patch we are using for the transaction wraparound, in case it would be useful: |
| Comment by Oleg Drokin [ 25/Apr/13 ] |
|
OK, I got the vmcore, but the vmlinux from OpenVZ does not match it. In any case, I looked inside the core and it looks like it is for this latest crash, where we already know what happened and which is a different bug altogether. What I am looking for is a vmcore from a situation where the jbd2 commit thread locks up. |
| Comment by Kit Westneat (Inactive) [ 01/May/13 ] |
|
The OpenVZ link was just supposed to be a link to the patch; the kernel we are running is a stock CentOS kernel. The backtrace and the core come from the same crash. There is another core, but it was placed on the wrong disk and we need to wait for a downtime to get it (next Tuesday). If it is a different backtrace, I'll update the ticket then. As for the fix for the GFP_NOFS bug, does that patch work? I found the patch on git.kernel.org as well: Should I backport that to b1_8/RHEL5? Thanks. |
| Comment by Oleg Drokin [ 03/May/13 ] |
|
Yes, this patch looks like it would do the right thing. |
| Comment by Shuichi Ihara (Inactive) [ 27/Sep/13 ] |
|
Hi, would you please review this patch sooner? We are hitting multiple server crashes due to this issue. |
| Comment by Peter Jones [ 27/Sep/13 ] |
|
OK, Ihara. The patch is at http://review.whamcloud.com/#/c/6147/ |
| Comment by Shuichi Ihara (Inactive) [ 27/Sep/13 ] |
|
Hi Peter, yes, but we are waiting for inspection. Once code review is done, we will apply this to the server kernel. |
| Comment by Peter Jones [ 27/Sep/13 ] |
|
Ihara, sorry that my comment was not clear enough. I understand that you wish to have reviews on the patch; I was acknowledging that and then separately adding the reference to the patch. It is required to include such a link in the JIRA ticket to cross-reference between JIRA and Gerrit, and this step had previously been overlooked. Peter |
| Comment by Kit Westneat (Inactive) [ 27/Sep/13 ] |
|
I also just submitted the xattr patch that was referenced. We are already carrying this patch at NOAA and it seems to improve stability. |
| Comment by Peter Jones [ 04/Oct/13 ] |
|
RH patches are OK to use in production. |