[LU-1663] MDS threads hang for over 725s, causing fail over - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 1.8.7
Labels:
- mn8
- patch
Environment:
Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443
Centos 5.5

Severity:
2
Rank (Obsolete):
4055

Description

At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It seems to run ok for a couple days and then whichever MDS is currently running will lockup and failover to the other one. lfs1, however, is not affected, though it runs an identical set up as far as hardware goes.

We have the stack traces that get logged, but not the lustre-logs, as they have been on tmpfs. We've changed the debug_file location, so hopefully we'll get the next batch. I'll put a sampling of the interesting call traces, and attach the rest.

Here is the root cause of the failover. The health_check times out and prints NOT HEALTHY, which causes ha to failover:
Jul 17 17:23:30 lfs-mds-2-2 kernel: LustreError: 16021:0:(service.c:2124:ptlrpc_service_health_check()) mds: unhealthy - request has been waiting 725s

This one makes it look like it might be quota related:
Jul 17 17:14:04 lfs-mds-2-2 kernel: Call Trace:
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff887f9220>] :lnet:LNetPut+0x730/0x840
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff800649fb>] __down+0xc3/0xd8
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff88a29490>] :lquota:dqacq_handler+0x0/0xc20
...

This one looks a little like ~~LU-1395~~ or ~~LU-1269~~:
Jul 4 17:58:29 lfs-mds-2-2 kernel: Call Trace:
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888ceb51>] ldlm_resource_add_lock+0xb1/0x180 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e2a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8006388b>] schedule_timeout+0x8a/0xad
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8009987d>] process_timeout+0x0/0x5
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e4555>] ldlm_completion_ast+0x4d5/0x880 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c9709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c4b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e30bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff88c611a7>] enqueue_ordered_locks+0x387/0x4d0 [mds]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e09a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
...

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

09sep.tar.bz2
8.40 MB
12/Sep/12 11:19 AM
call_traces
392 kB
23/Jul/12 8:31 PM
kern.log.2013-02-23.gz
104 kB
25/Feb/13 10:03 AM
kern.log-20120721
191 kB
30/Jul/12 10:27 AM
ll-1181-decoded.txt.gz
0.2 kB
04/Mar/13 1:41 PM
log1.bz2
438 kB
30/Jul/12 10:20 AM
lustre-log.txt.bz2
4.55 MB
12/Sep/12 11:22 AM
mds1.log
8.37 MB
23/Jul/12 8:38 PM
mds2.log
3.45 MB
23/Jul/12 8:39 PM

Issue Links

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

Activity

[LU-1663] MDS threads hang for over 725s, causing fail over

Kit Westneat (Inactive) added a comment - 15/Apr/13 7:38 PM

Backtrace:
PID: 23917 TASK: ffff810abe18c7e0 CPU: 2 COMMAND: "vgck"
#0 [ffff810b35bd75a0] crash_kexec at ffffffff800b1192
#1 [ffff810b35bd7660] __die at ffffffff80065137
#2 [ffff810b35bd76a0] die at ffffffff8006c735
#3 [ffff810b35bd76d0] do_invalid_op at ffffffff8006ccf5
#4 [ffff810b35bd7790] error_exit at ffffffff8005ddf9
[exception RIP: jbd2_journal_start+58]
RIP: ffffffff8b32ae31 RSP: ffff810b35bd7848 RFLAGS: 00010206
RAX: ffff810c1aef19c0 RBX: ffff81061476dd38 RCX: ffff810910fdcc00
RDX: ffffffffffffffe2 RSI: 0000000000000012 RDI: ffff81061476dd38
RBP: ffff810919769800 R8: ffff81000005b600 R9: ffff8104cd371a80
R10: 0000000000000000 R11: ffffffff8b36cb60 R12: 0000000000000012
R13: ffff810b35bd78c8 R14: 0000000000000080 R15: 0000000000000100
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff810b35bd7860] ldiskfs_dquot_drop at ffffffff8b36fdbe
#6 [ffff810b35bd7880] clear_inode at ffffffff800234e7
#7 [ffff810b35bd7890] dispose_list at ffffffff800352d6
#8 [ffff810b35bd78c0] shrink_icache_memory at ffffffff8002dcfb
#9 [ffff810b35bd7900] shrink_slab at ffffffff8003f7cd
#10 [ffff810b35bd7940] zone_reclaim at ffffffff800cf9ae
#11 [ffff810b35bd79f0] get_page_from_freelist at ffffffff8000a8a7
#12 [ffff810b35bd7a60] __alloc_pages at ffffffff8000f48a
#13 [ffff810b35bd7ad0] cache_grow at ffffffff80017a52
#14 [ffff810b35bd7b20] cache_alloc_refill at ffffffff8005c3ee
#15 [ffff810b35bd7b60] kmem_cache_alloc at ffffffff8000ac96
#16 [ffff810b35bd7b80] mb_cache_entry_alloc at ffffffff801063df
#17 [ffff810b35bd7ba0] ext3_xattr_cache_insert at ffffffff8805bb1e
#18 [ffff810b35bd7bd0] ext3_xattr_get at ffffffff8805cd3a
#19 [ffff810b35bd7c50] ext3_get_acl at ffffffff8805d4e2
#20 [ffff810b35bd7ca0] ext3_init_acl at ffffffff8805d8de
#21 [ffff810b35bd7ce0] ext3_new_inode at ffffffff8804e21f
#22 [ffff810b35bd7db0] ext3_create at ffffffff88054b6b
#23 [ffff810b35bd7e00] vfs_create at ffffffff8003aa1b
#24 [ffff810b35bd7e30] open_namei at ffffffff8001b749
#25 [ffff810b35bd7ea0] do_filp_open at ffffffff80027ba1
#26 [ffff810b35bd7f50] do_sys_open at ffffffff80019fa5
#27 [ffff810b35bd7f80] tracesys at ffffffff8005d29e (via system_call)
RIP: 0000003d598c5b40 RSP: 00007fff8f6ae3b8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8005d29e RCX: ffffffffffffffff
RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007fff8f6ae490
RBP: 00007fff8f6ae4e0 R8: 0000000000000004 R9: 0000000000000000
R10: 0000000000000241 R11: 0000000000000246 R12: 000000000e3a5e90
R13: 0000000000484f87 R14: 0000000000000004 R15: 000000000e3a5e90
ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b

Dennis pointed out that it looks like ~~LU-3071~~. Reading that bug, it looks like it should affect MDTs and 1.8 as well, is that correct? The assertion hit is at the same place:
J_ASSERT(handle->h_transaction->t_journal == journal);

Kit Westneat (Inactive) added a comment - 15/Apr/13 7:38 PM Backtrace: PID: 23917 TASK: ffff810abe18c7e0 CPU: 2 COMMAND: "vgck" #0 [ffff810b35bd75a0] crash_kexec at ffffffff800b1192 #1 [ffff810b35bd7660] __die at ffffffff80065137 #2 [ffff810b35bd76a0] die at ffffffff8006c735 #3 [ffff810b35bd76d0] do_invalid_op at ffffffff8006ccf5 #4 [ffff810b35bd7790] error_exit at ffffffff8005ddf9 [exception RIP: jbd2_journal_start+58] RIP: ffffffff8b32ae31 RSP: ffff810b35bd7848 RFLAGS: 00010206 RAX: ffff810c1aef19c0 RBX: ffff81061476dd38 RCX: ffff810910fdcc00 RDX: ffffffffffffffe2 RSI: 0000000000000012 RDI: ffff81061476dd38 RBP: ffff810919769800 R8: ffff81000005b600 R9: ffff8104cd371a80 R10: 0000000000000000 R11: ffffffff8b36cb60 R12: 0000000000000012 R13: ffff810b35bd78c8 R14: 0000000000000080 R15: 0000000000000100 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffff810b35bd7860] ldiskfs_dquot_drop at ffffffff8b36fdbe #6 [ffff810b35bd7880] clear_inode at ffffffff800234e7 #7 [ffff810b35bd7890] dispose_list at ffffffff800352d6 #8 [ffff810b35bd78c0] shrink_icache_memory at ffffffff8002dcfb #9 [ffff810b35bd7900] shrink_slab at ffffffff8003f7cd #10 [ffff810b35bd7940] zone_reclaim at ffffffff800cf9ae #11 [ffff810b35bd79f0] get_page_from_freelist at ffffffff8000a8a7 #12 [ffff810b35bd7a60] __alloc_pages at ffffffff8000f48a #13 [ffff810b35bd7ad0] cache_grow at ffffffff80017a52 #14 [ffff810b35bd7b20] cache_alloc_refill at ffffffff8005c3ee #15 [ffff810b35bd7b60] kmem_cache_alloc at ffffffff8000ac96 #16 [ffff810b35bd7b80] mb_cache_entry_alloc at ffffffff801063df #17 [ffff810b35bd7ba0] ext3_xattr_cache_insert at ffffffff8805bb1e #18 [ffff810b35bd7bd0] ext3_xattr_get at ffffffff8805cd3a #19 [ffff810b35bd7c50] ext3_get_acl at ffffffff8805d4e2 #20 [ffff810b35bd7ca0] ext3_init_acl at ffffffff8805d8de #21 [ffff810b35bd7ce0] ext3_new_inode at ffffffff8804e21f #22 [ffff810b35bd7db0] ext3_create at ffffffff88054b6b #23 [ffff810b35bd7e00] vfs_create at ffffffff8003aa1b #24 [ffff810b35bd7e30] open_namei at ffffffff8001b749 #25 [ffff810b35bd7ea0] do_filp_open at ffffffff80027ba1 #26 [ffff810b35bd7f50] do_sys_open at ffffffff80019fa5 #27 [ffff810b35bd7f80] tracesys at ffffffff8005d29e (via system_call) RIP: 0000003d598c5b40 RSP: 00007fff8f6ae3b8 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: ffffffff8005d29e RCX: ffffffffffffffff RDX: 00000000000001b6 RSI: 0000000000000241 RDI: 00007fff8f6ae490 RBP: 00007fff8f6ae4e0 R8: 0000000000000004 R9: 0000000000000000 R10: 0000000000000241 R11: 0000000000000246 R12: 000000000e3a5e90 R13: 0000000000484f87 R14: 0000000000000004 R15: 000000000e3a5e90 ORIG_RAX: 0000000000000002 CS: 0033 SS: 002b Dennis pointed out that it looks like LU-3071 . Reading that bug, it looks like it should affect MDTs and 1.8 as well, is that correct? The assertion hit is at the same place: J_ASSERT(handle->h_transaction->t_journal == journal);

Kit Westneat (Inactive) added a comment - 15/Apr/13 6:54 PM

We finally got a full vmcore. It's 12GB.. Is there a good place to put it? Alternatively I could run commands on it. What would be the best way to prove that it's the tid wraparound issue?

Thanks.

Kit Westneat (Inactive) added a comment - 15/Apr/13 6:54 PM We finally got a full vmcore. It's 12GB.. Is there a good place to put it? Alternatively I could run commands on it. What would be the best way to prove that it's the tid wraparound issue? Thanks.

Kit Westneat (Inactive) added a comment - 04/Apr/13 5:16 PM

It looks like the patch RH took to solve bz735768 is this one:
https://github.com/torvalds/linux/commit/deeeaf13b291420fe4a4a52606b9fc9128387340

Oh I just realized that the debian-kernel patch references the two commits already, doh.

Are there any plans to carry the debian-kernel patch until RH backports it? How risky do you think it is to backport them to RHEL5?

Kit Westneat (Inactive) added a comment - 04/Apr/13 5:16 PM It looks like the patch RH took to solve bz735768 is this one: https://github.com/torvalds/linux/commit/deeeaf13b291420fe4a4a52606b9fc9128387340 Oh I just realized that the debian-kernel patch references the two commits already, doh. Are there any plans to carry the debian-kernel patch until RH backports it? How risky do you think it is to backport them to RHEL5?

Oleg Drokin added a comment - 04/Apr/13 4:19 PM

I have a high-load test system for lustre regression testing and I am hitting two sorts of jbd issues somewhat frequently, the rh bz735768 and then some jbd thread lockups that I think are related. Some of them are somewhat similar to what you see, I think. So I was just doing some more investigations into that and uncovered that they are apparently known problems if you have high transaction rates (which I certainly do).

I suspect you can cherry-pick the patches from mainline linux kernel.

Oleg Drokin added a comment - 04/Apr/13 4:19 PM I have a high-load test system for lustre regression testing and I am hitting two sorts of jbd issues somewhat frequently, the rh bz735768 and then some jbd thread lockups that I think are related. Some of them are somewhat similar to what you see, I think. So I was just doing some more investigations into that and uncovered that they are apparently known problems if you have high transaction rates (which I certainly do). I suspect you can cherry-pick the patches from mainline linux kernel.

Kit Westneat (Inactive) added a comment - 04/Apr/13 4:02 PM

I don't see that bug fixed in any RHEL5 kernel. Since Redhat no longer breaks out their patches, it seems like it would be difficult to figure out what they did to fix it. I will look at the SRPMS for the jbd stuff to see if I can figure it out. Have you seen the transaction wraparound problem in Lustre before? Are there any workarounds? It seems like we can't be the first to hit it.

Kit Westneat (Inactive) added a comment - 04/Apr/13 4:02 PM I don't see that bug fixed in any RHEL5 kernel. Since Redhat no longer breaks out their patches, it seems like it would be difficult to figure out what they did to fix it. I will look at the SRPMS for the jbd stuff to see if I can figure it out. Have you seen the transaction wraparound problem in Lustre before? Are there any workarounds? It seems like we can't be the first to hit it.

Oleg Drokin added a comment - 04/Apr/13 3:35 PM

I suspect what you are seeing is a generic jbd2 transaction wraparound problem.
There were several rounds at RedHat to fix this, previous round here (already part of rhel6.4 kernel, no idea if they backported to rhel5 too): https://bugzilla.redhat.com/show_bug.cgi?id=735768 and then the latest one seems to be unfolding here: http://lists.debian.org/debian-kernel/2013/03/msg00313.html though I don't know when it will make it to rhel kernels either.

Oleg Drokin added a comment - 04/Apr/13 3:35 PM I suspect what you are seeing is a generic jbd2 transaction wraparound problem. There were several rounds at RedHat to fix this, previous round here (already part of rhel6.4 kernel, no idea if they backported to rhel5 too): https://bugzilla.redhat.com/show_bug.cgi?id=735768 and then the latest one seems to be unfolding here: http://lists.debian.org/debian-kernel/2013/03/msg00313.html though I don't know when it will make it to rhel kernels either.

Oleg Drokin added a comment - 06/Mar/13 7:01 PM

Getting the full crashdump from a wedged transaction commit like this should be most useful I suspect.

Oleg Drokin added a comment - 06/Mar/13 7:01 PM Getting the full crashdump from a wedged transaction commit like this should be most useful I suspect.

Kit Westneat (Inactive) added a comment - 04/Mar/13 1:24 PM

Hi Oleg,

That's correct, the system tried to umount once the OSTs went unhealthy, but failed.

The lustre-log looks pretty slim, hopefully you can find something I missed.

I will try to get a crashdump. Unfortunately all the ones we've been able to get so far have been incomplete. I'll try to come up with a way to get a full crashdump. Are there any other logs/debug information we can get that won't have too much impact on the running system?

Thanks,
Kit

Kit Westneat (Inactive) added a comment - 04/Mar/13 1:24 PM Hi Oleg, That's correct, the system tried to umount once the OSTs went unhealthy, but failed. The lustre-log looks pretty slim, hopefully you can find something I missed. I will try to get a crashdump. Unfortunately all the ones we've been able to get so far have been incomplete. I'll try to come up with a way to get a full crashdump. Are there any other logs/debug information we can get that won't have too much impact on the running system? Thanks, Kit

Oleg Drokin added a comment - 04/Mar/13 12:38 PM

Hm, I do not see any obvious culprits in the traces.
A bit strange how jbd thread is the first to complain it's hung and in the middle of transaction commit with no report of any sleep, yet in D state (I wish you had a crashdump so that it could be checked more closely).
Anyway the first log that should hold most of the data is /tmp/lustre-log.1361661252.1181 I guess? (the biggest one around that timestamp is the one we need) and I guess we need it to at least try to see what was going on at the time.

Also I see an umount in the logs, but I assume that's after the system was wedged for some time and your scripts just attempt it?

Oleg Drokin added a comment - 04/Mar/13 12:38 PM Hm, I do not see any obvious culprits in the traces. A bit strange how jbd thread is the first to complain it's hung and in the middle of transaction commit with no report of any sleep, yet in D state (I wish you had a crashdump so that it could be checked more closely). Anyway the first log that should hold most of the data is /tmp/lustre-log.1361661252.1181 I guess? (the biggest one around that timestamp is the one we need) and I guess we need it to at least try to see what was going on at the time. Also I see an umount in the logs, but I assume that's after the system was wedged for some time and your scripts just attempt it?

Kit Westneat (Inactive) added a comment - 01/Mar/13 10:22 AM

Any update?

Thanks,
Kit

Kit Westneat (Inactive) added a comment - 01/Mar/13 10:22 AM Any update? Thanks, Kit

Kit Westneat (Inactive) added a comment - 25/Feb/13 10:04 AM

Hi we were finally able to get some good debugging data from the system during a failover. Here is the sysrq-t from before it went down. I also managed to save off all the lustre-logs, there are a ton as you can see in the kern.log. Let me know if you want to see them.

I noticed that there was a slab reclaim going on.. Wasn't there another issue with slab reclaims taking a long time?

Kit Westneat (Inactive) added a comment - 25/Feb/13 10:04 AM Hi we were finally able to get some good debugging data from the system during a failover. Here is the sysrq-t from before it went down. I also managed to save off all the lustre-logs, there are a ton as you can see in the kern.log. Let me know if you want to see them. I noticed that there was a slab reclaim going on.. Wasn't there another issue with slab reclaims taking a long time?

People

Assignee:: Oleg Drokin

Reporter:: Kit Westneat (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 23/Jul/12 8:31 PM

Updated:: 04/Oct/13 3:33 PM

Resolved:: 04/Oct/13 3:33 PM