[LU-1663] MDS threads hang for over 725s, causing fail over - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 1.8.7
Labels:
- mn8
- patch
Environment:
Lustre 1.8.6.80, jenkins-g9d9d86f-PRISTINE-2.6.18-238.12.1.el5_lustre.gd70e443
Centos 5.5

Severity:
2
Rank (Obsolete):
4055

Description

At NOAA, there are two filesystems that were installed at the same time, lfs1 and lfs2. Recently lfs2 has been having MDS lockups, which cause a failover to the second MDS. It seems to run ok for a couple days and then whichever MDS is currently running will lockup and failover to the other one. lfs1, however, is not affected, though it runs an identical set up as far as hardware goes.

We have the stack traces that get logged, but not the lustre-logs, as they have been on tmpfs. We've changed the debug_file location, so hopefully we'll get the next batch. I'll put a sampling of the interesting call traces, and attach the rest.

Here is the root cause of the failover. The health_check times out and prints NOT HEALTHY, which causes ha to failover:
Jul 17 17:23:30 lfs-mds-2-2 kernel: LustreError: 16021:0:(service.c:2124:ptlrpc_service_health_check()) mds: unhealthy - request has been waiting 725s

This one makes it look like it might be quota related:
Jul 17 17:14:04 lfs-mds-2-2 kernel: Call Trace:
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff887f9220>] :lnet:LNetPut+0x730/0x840
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff800649fb>] __down+0xc3/0xd8
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
Jul 17 17:14:04 lfs-mds-2-2 kernel: [<ffffffff88a29490>] :lquota:dqacq_handler+0x0/0xc20
...

This one looks a little like ~~LU-1395~~ or ~~LU-1269~~:
Jul 4 17:58:29 lfs-mds-2-2 kernel: Call Trace:
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888ceb51>] ldlm_resource_add_lock+0xb1/0x180 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e2a00>] ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8006388b>] schedule_timeout+0x8a/0xad
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8009987d>] process_timeout+0x0/0x5
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e4555>] ldlm_completion_ast+0x4d5/0x880 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c9709>] ldlm_lock_enqueue+0x9d9/0xb20 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff8008e421>] default_wake_function+0x0/0xe
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888c4b6a>] ldlm_lock_addref_internal_nolock+0x3a/0x90 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e30bb>] ldlm_cli_enqueue_local+0x46b/0x520 [ptlrpc]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff88c611a7>] enqueue_ordered_locks+0x387/0x4d0 [mds]
Jul 4 17:58:29 lfs-mds-2-2 kernel: [<ffffffff888e09a0>] ldlm_blocking_ast+0x0/0x2a0 [ptlrpc]
...

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

09sep.tar.bz2
12/Sep/12 11:19 AM
8.40 MB
Kit Westneat
call_traces
23/Jul/12 8:31 PM
392 kB
Kit Westneat
kern.log.2013-02-23.gz
25/Feb/13 10:03 AM
104 kB
Kit Westneat
kern.log-20120721
30/Jul/12 10:27 AM
191 kB
Kit Westneat
ll-1181-decoded.txt.gz
04/Mar/13 1:41 PM
0.2 kB
Kit Westneat
log1.bz2
30/Jul/12 10:20 AM
438 kB
Kit Westneat
lustre-log.txt.bz2
12/Sep/12 11:22 AM
4.55 MB
Kit Westneat
mds1.log
23/Jul/12 8:38 PM
8.37 MB
Kit Westneat
mds2.log
23/Jul/12 8:39 PM
3.45 MB
Kit Westneat

Issue Links

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

Activity

People

Assignee:: Oleg Drokin

Reporter:: Kit Westneat (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 23/Jul/12 8:31 PM

Updated:: 04/Oct/13 3:33 PM

Resolved:: 04/Oct/13 3:33 PM