Loading...

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.12.0
Affects Version/s: Lustre 2.10.4
Labels:
None
Environment:
x86_64, zfs, 3 MDTs, all on 1 MDS, or across 2 MDS's, 2.10.4 + many patches.

Severity:
2
Rank (Obsolete):
9223372036854775807

Description

Hi,

we've had 2 more MDT hangs that have similar symptoms to ~~LU-11082~~. file operations on clients hang, usually gradually from more and more clients. in every case MDS failover restores functionality, one of the 3 MDTs won't umount, and the MDS has to be stonith'd. only our large DNE filesystem is affected (dagg) and not home,images,apps, so likely DNE issue.

the first hang was yesterday. the 3 MDTs were distributed across the 2 MDS's. dagg-mdt2 would not umount in the MDS failovers that were needed to restore operation. warble1,2 logs attached.
in the hang today all MDTs were on warble1 and dagg-mdt0 wouldn't umount. logs attached.

previous to these 2 hangs, we haven't had similar MDS issues for 3 weeks. we can't see an obvious change in user behaviour that triggers this.

the hangs yesterday started with

Aug 29 14:19:37 warble2 kernel: LNet: Service thread pid 51461 was inactive for 200.46s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Aug 29 14:19:37 warble2 kernel: Pid: 51461, comm: mdt_rdpg01_013 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
Aug 29 14:19:37 warble2 kernel: Call Trace:
Aug 29 14:19:37 warble2 kernel: [<ffffffffc0fb7037>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f9890b>] top_trans_stop+0x42b/0x930 [ptlrpc]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc18655f9>] lod_trans_stop+0x259/0x340 [lod]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc1ac623a>] mdd_trans_stop+0x2a/0x46 [mdd]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc1abbbcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc1a1b206>] mdt_mfd_close+0x1a6/0x610 [mdt]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc1a20981>] mdt_close_internal+0x121/0x220 [mdt]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc1a20ca0>] mdt_close+0x220/0x780 [mdt]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f8538a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f2de4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Aug 29 14:19:37 warble2 kernel: [<ffffffffc0f31592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Aug 29 14:19:37 warble2 kernel: [<ffffffff9f8bb621>] kthread+0xd1/0xe0
Aug 29 14:19:37 warble2 kernel: [<ffffffff9ff205dd>] ret_from_fork_nospec_begin+0x7/0x21
Aug 29 14:19:37 warble2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Aug 29 14:19:37 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1535516377.51461

and the hangs today started with

Aug 30 12:25:32 warble1 kernel: Lustre: dagg-MDT0000-osp-MDT0002: Connection to dagg-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Aug 30 12:25:32 warble1 kernel: LustreError: 212573:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1535595632, 300s ago), entering recovery for dagg-MDT0000_UUID@192.168.44.21@o2ib44 ns: dagg-MDT0000-osp-MDT0002 lock: ffff93869b37ea00/0xddb792bb3962e7e2 lrc: 4/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 13 type: IBT flags: 0x1000001000000 nid: local remote: 0xddb792bb3962e7e9 expref: -99 pid: 212573 timeout: 0 lvb_type: 0
Aug 30 12:25:32 warble1 kernel: Lustre: dagg-MDT0000: Received new LWP connection from 0@lo, removing former export from same NID
Aug 30 12:25:32 warble1 kernel: Lustre: Skipped 1 previous similar message
Aug 30 12:25:32 warble1 kernel: Lustre: dagg-MDT0000: Connection restored to 192.168.44.21@o2ib44 (at 0@lo)
Aug 30 12:25:32 warble1 kernel: Lustre: Skipped 5 previous similar messages
Aug 30 12:25:32 warble1 kernel: LustreError: 212573:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
Aug 30 12:25:34 warble1 kernel: LustreError: 132075:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1535595633, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dagg-MDT0000_UUID lock: ffff931f1afa8a00/0xddb792bb39655687 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 20 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 132075 timeout: 0 lvb_type: 0
Aug 30 12:25:34 warble1 kernel: LustreError: dumping log to /tmp/lustre-log.1535595934.132075

I've attached the first lustre-logs and syslog for both days.
there's nothing extra in console logs that I can see.
MDS's are warble, OSS's are arkle/umlaut, clients are john,farnarkle,bryan.

on Lustre servers we have a bunch of patches over standard 2.10.4 applied. it's pretty much 2.10.5. to be precise, it's Lustre 2.10.4 plus these
lu10683-lu11093-checksum-overquota-gerrit32788-1fb85e7e.patch
lu10988-lfsck2-gerrit32522-21d33c11.patch
lu11074-mdc-xattr-gerrit32739-dea1cde9.patch
lu11107-xattr-gerrit32753-c96a8f08.patch
lu11111-lfsck-gerrit32796-693fe452.patch
lu11082-lu11103-stuckMdtThreads-gerrit32853-3dc08caa.diff
lu11062-stacktrace-gerrit32972-7232c445.patch

on clients we have 2.10.4 plus
lu11074-gerrit32739_a1ae6014.diff

cheers,
robin

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

messages-20180829-grep-warble.txt.gz
62 kB
30/Aug/18 6:35 AM
messages-20180830-grep-v-slurm.etc.txt.gz
108 kB
30/Aug/18 6:35 AM
messages-20180905.grep-v_slurm.etc.txt.gz
12 kB
05/Sep/18 6:14 AM
messages-grep-v-slurm.etc.20180903.txt.gz
441 kB
03/Sep/18 2:57 PM
warble1-20180829-lustre-log.1535516379.149622.gz
6.19 MB
30/Aug/18 4:56 AM
warble1-20180830-lustre-log.1535595934.132075.gz
6.05 MB
30/Aug/18 4:58 AM
warble1-lustre-log.1535964912.429651.gz
5.90 MB
03/Sep/18 2:56 PM
warble2-20180829-lustre-log.1535516377.51461.gz
6.12 MB
30/Aug/18 4:58 AM
warble2-lustre-log.1535968315.274573.gz
4.96 MB
03/Sep/18 2:57 PM
warble2-lustre-log.1535978027.36891.gz
4.25 MB
03/Sep/18 2:58 PM

hung threads on MDT and MDT won't umount

Details

Description

Attachments

Attachments

Activity

People

Dates