Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
    • Environment: x86_64, CentOS 7.4, OPA, ZFS 0.7.9, Lustre 2.10.4
    • Severity: 3

    Description

      Hi,

      Our main filesystem has hung approximately 10 times since April.

      The symptoms are stuck-thread messages on the MDS (it has happened on both of our MDSes), and soon afterwards clients can't stat any files in the filesystem.

      We have 4 filesystems that share the same MDS hardware, and only the large DNE filesystem with 3 MDTs has hung, so we think this is probably a DNE issue.

      There are no logs indicating an MDS hardware problem. Despite that, we've swapped out a lot of hardware and tested the MDS components as well as we can, and it hasn't helped at all. 'zpool iostat 2' shows tiny amounts of I/O still going to the MDTs while the filesystem is hung, so e.g. the hardware RAID isn't just silently stopping with no logs.

      The MDTs are ZFS mirrors on top of hardware RAID1 (so 4-way replicated).

      To un-hang the filesystem, we fail over or reboot the MDS. Every time, we can't umount one of the 3 MDTs in the /dagg(fred) filesystem, and we have to forcibly power off the MDS (which led to LU-10877 etc.). Recovery these days goes OK, and then the filesystem works again. It seems random which of the three MDTs hangs on umount. We've lfsck'd the namespace about 5 times.
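
      (For reference, a minimal sketch of that recovery sequence; the mount points, pool/dataset name and BMC details below are placeholders, not our actual configuration:)

        # try to cleanly unmount the MDTs on the stuck MDS; one of them usually hangs here
        umount /lustre/fred-MDT0000
        umount /lustre/fred-MDT0001
        umount /lustre/fred-MDT0002

        # if an umount hangs, forcibly power-cycle the MDS via its BMC
        ipmitool -H mds-bmc -U admin -P password chassis power cycle

        # mount the MDTs on the failover MDS (ZFS backend: pool/dataset), then watch recovery
        mount -t lustre mdt0pool/mdt0 /lustre/fred-MDT0000
        lctl get_param mdt.*.recovery_status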

      Here are the first stack traces from each of the 2 separate hangs today, which are pretty typical. Usually it's an mdt_rdpg* thread that hangs, and those stack traces all look very similar, but occasionally it's an mdt* thread, with some variations on a theme in those.

      /var/log/messages:Jun 11 11:06:59 warble2 kernel: LNet: Service thread pid 116276 was inactive for 200.63s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: Pid: 116276, comm: mdt_rdpg00_024
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: #012Call Trace:
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff8109eb8b>] ? recalc_sigpending+0x1b/0x50
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff816b40e9>] schedule+0x29/0x70
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0e16f10>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0df882b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc164f5e9>] lod_trans_stop+0x259/0x340 [lod]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0df8fae>] ? top_trans_start+0x27e/0x940 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc16ed1ea>] mdd_trans_stop+0x2a/0x46 [mdd]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc16e2bcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d81d47>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc15ce186>] mdt_mfd_close+0x1a6/0x610 [mdt]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc15d3901>] mdt_close_internal+0x121/0x220 [mdt]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc15d3c20>] mdt_close+0x220/0x780 [mdt]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0de52ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d8de2b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d8a458>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d91572>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffffc0d90ae0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: 
      /var/log/messages:Jun 11 11:06:59 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1528679219.116276
      
      Jun 11 14:17:28 warble2 kernel: LNet: Service thread pid 22072 was inactive for 200.35s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 11 14:17:28 warble2 kernel: Pid: 22072, comm: mdt00_001
      Jun 11 14:17:28 warble2 kernel: #012Call Trace:
      Jun 11 14:17:28 warble2 kernel: [<ffffffff8109eb8b>] ? recalc_sigpending+0x1b/0x50
      Jun 11 14:17:28 warble2 kernel: [<ffffffff816b40e9>] schedule+0x29/0x70
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0dd3f10>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810c7c70>] ? default_wake_function+0x0/0x20
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0db582b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc10f35e9>] lod_trans_stop+0x259/0x340 [lod]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0db5fae>] ? top_trans_start+0x27e/0x940 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc11911ea>] mdd_trans_stop+0x2a/0x46 [mdd]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc117eb31>] mdd_rename+0x4d1/0x14a0 [mdd]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc103b07a>] mdt_reint_rename_internal.isra.36+0x166a/0x20c0 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc103d3fb>] mdt_reint_rename_or_migrate.isra.39+0x19b/0x860 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d0eff0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d09440>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc103daf3>] mdt_reint_rename+0x13/0x20 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc1041b33>] mdt_reint_rec+0x83/0x210 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc102337b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc102ef07>] mdt_reint+0x67/0x140 [mdt]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0da22ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d4ae2b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d47458>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810c7c82>] ? default_wake_function+0x12/0x20
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810bdc4b>] ? __wake_up_common+0x5b/0x90
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d4e572>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffffc0d4dae0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810b4031>] kthread+0xd1/0xe0
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      Jun 11 14:17:28 warble2 kernel: [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
      Jun 11 14:17:28 warble2 kernel: [<ffffffff810b3f60>] ? kthread+0x0/0xe0
      Jun 11 14:17:28 warble2 kernel: 
      Jun 11 14:17:28 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1528690648.22072
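
      (The /tmp/lustre-log.* files referenced above are binary Lustre debug dumps; they can be converted to text with lctl for inspection, e.g.:)

        # decode a binary Lustre debug dump into readable text
        lctl debug_file /tmp/lustre-log.1528690648.22072 /tmp/lustre-log.1528690648.22072.txt
        less /tmp/lustre-log.1528690648.22072.txt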
      

      I've attached logs of all the stack traces, collected with:

      zgrep warble /var/log/messages-20180[3456]* /var/log/messages | egrep -v 'crmd|pengine' | grep -i 'stack trace' -B20 -A40 > ~/lustre-bugs/warble-stack-traces.txt
      

      Clients at the moment are a mix of CentOS 7.4 with 2.10.3/2.10.4, and CentOS 7.5 with 2.10.4.
      Servers are CentOS 7.4 with 2.10.4 and ZFS 0.7.9, and boot with nopti.

      We have inherited (default) directory striping on the main filesystem, so each new directory is created on a random one of the three MDTs.
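
      (For reference, this kind of inherited default directory layout can be inspected and set with lfs roughly as below; the mount point is only an example, and the exact options accepted for automatic MDT selection vary between Lustre versions:)

        # show the default (inherited) directory layout applied to new subdirectories
        lfs getdirstripe -D /mnt/dagg

        # example: default to single-stripe directories on an automatically chosen MDT
        lfs setdirstripe -D -c 1 -i -1 /mnt/dagg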

      We have no idea what I/O pattern triggers the lockups, sorry.

      cheers,
      robin

      Attachments

        1. lustre-log.1529816295.52919
          49.77 MB
        2. lustre-log.1533268164.11346.gz
          6.07 MB
        3. messages.warble
          83 kB
        4. warble-stack-traces.txt
          958 kB

          Activity

            [LU-11082] stuck threads on MDS
            pjones Peter Jones added a comment -

            Good news. So let's close out this ticket then

            scadmin SC Admin added a comment -

            The chgrp/chmod script finished successfully. It fixed up about 10M files and a lot of dirs over 2.5 days, with no crash on the MDS.

            As the hung threads in the previous crash looked different from those at the start of this ticket, and this most recent sweep finished without issue, I would err on the side of saying this issue has been fixed.

            thanks!

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi Peter,

            Yes, that looks like it, thanks.
            I've included the b2_10 version of the stacktrace patch in our MDS builds and kicked off a re-run of the chmod/chgrp script.

            cheers,
            robin

            pjones Peter Jones added a comment -

            Is the stack trace issue LU-11062 perhaps?

            scadmin SC Admin added a comment -

            Hi,

            After 2 days of running, the find/chgrp/chmod sweep through all files almost completed, but not quite.
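
            (The sweep script itself isn't attached to this ticket; conceptually it is a find-driven pass of roughly the shape below, with placeholder mount point, group and mode - every match generates a metadata-modifying RPC on the MDS:)

              # hypothetical sketch of a find-driven chgrp/chmod sweep
              find /mnt/dagg -xdev -group oldgroup -print0 | xargs -0 -r chgrp newgroup
              find /mnt/dagg -xdev -type d ! -perm -2000 -print0 | xargs -0 -r chmod g+s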

            Once again the MDTs hung, but it looked different from the previous hangs: this time all 3 MDTs wouldn't umount, whereas previously it's just been a (random) one of them. The resolution was the same though - we had to power-cycle the MDS.

            I'll attach all warble-related syslog (warble1 is where all 3 MDTs are mounted at the moment).
            I'll attach the first lustre-log.1533268164.11346.gz.

            Stack traces were not printed anywhere because:

            Aug  3 13:49:24 warble1 kernel: LNet: 17382:0:(linux-debug.c:185:libcfs_call_trace()) can't show stack: kernel doesn't export show_task
            

            Is this a new change?
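
            (One fallback when libcfs can't dump an individual thread's stack like this - assuming console/syslog access and sysrq enabled - is to have the kernel dump all task backtraces:)

              # dump all task states/backtraces to the kernel log
              echo 1 > /proc/sys/kernel/sysrq
              echo t > /proc/sysrq-trigger
              dmesg | tail -n 500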

            We're running 3.10.0-862.9.1.el7.x86_64 (CentOS 7.5) everywhere now. Server Lustre is 2.10.4 plus:

              lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
              lu10988-lfsck2-gerrit32522-21d33c11.patch
              lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
              lu11107-xattr-gerrit32753-c96a8f08.patch
              lu11111-lfsck-gerrit32796-693fe452.patch
              lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
            

            Because the symptoms are different, if I had to guess I'd say the patch here has helped a lot, and perhaps fixed the issue, and maybe we're now running into something new, but without stack traces I really have no idea.

            cheers,
            robin

            scadmin SC Admin added a comment -

            Hi Peter,

            No issues so far.

            However, I haven't yet run the chmod/chgrp sweep script, which was maybe a trigger for this issue - partly because (TBH) I've been enjoying the stability, but also because I haven't had any free time.
            I'll run it this week and report back.

            cheers,
            robin

            pjones Peter Jones added a comment -

            Robin

            How long will this patch need to be running before you would conclude that it has improved the situation?

            Peter

            scadmin SC Admin added a comment -

            Thanks. We have that on our MDSes now.

            cheers,
            robin

            laisiyao Lai Siyao added a comment - The backport for 2.10 of https://review.whamcloud.com/#/c/32589/ is on https://review.whamcloud.com/32853 .
            scadmin SC Admin added a comment -

            Hi,

            How's this coming along?
            There are thousands of lines of code different between b2_10 and master across these 4 files, so I'm not confident about backporting this myself to try it out.

            cheers,
            robin


            People

              Assignee: laisiyao Lai Siyao
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 6
