Details
Type: Bug
Resolution: Duplicate
Priority: Blocker
Affects Version: Lustre 2.10.5
Environment: DDN Exascaler
Description
We have been stuck in MDT recovery on our Exascaler host for about two hours. We've been going back and forth with it, but it will not finish recovering. The following errors are seen:
Apr 1 16:20:55 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Denying connection for new client f3980ca0-c3a8-4c99-749a-9f58c2850ad2(at 10.242.104.147@tcp), waiting for 4041 known clients (4012 recovered, 29 in progress, and 0 evicted) to recover in 0:36
Apr 1 16:20:55 scratchlfsmds01 kernel: Lustre: Skipped 1 previous similar message
Apr 1 16:21:00 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 222
Apr 1 16:21:00 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 226
Apr 1 16:21:03 scratchlfsmds01 kernel: LustreError: 19352:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff9dbb52bcb440 x1629643355468768/t0(0) o601->scrlfs-MDT0000-lwp-OST0004_UUID@10.31.164.250@o2ib:735/0 lens 336/0 e 0 to 0 dl 1554150115 ref 1 fl Interpret:/0/ffffffff rc 0/-1
Apr 1 16:21:03 scratchlfsmds01 kernel: LustreError: 19352:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 47 previous similar messages
Apr 1 16:21:10 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 22b
Apr 1 16:21:18 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 231
Apr 1 16:21:18 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 233
Apr 1 16:21:21 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 235
Apr 1 16:21:21 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 238
Apr 1 16:21:31 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 23a
Apr 1 16:21:31 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 23d
Apr 1 16:21:36 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 240
Apr 1 16:21:41 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 244
Apr 1 16:21:51 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 249
Apr 1 16:21:55 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 24e
Apr 1 16:21:55 scratchlfsmds01 corosync[13445]: [TOTEM ] Marking ringid 0 interface 10.31.72.11 FAULTY
Apr 1 16:21:56 scratchlfsmds01 corosync[13445]: [TOTEM ] Automatically recovered ring 0
Apr 1 16:22:01 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 253
Apr 1 16:22:01 scratchlfsmds01 systemd[1]: Starting Cleanup of Temporary Directories...
Apr 1 16:22:01 scratchlfsmds01 systemd[1]: Started Cleanup of Temporary Directories.
Apr 1 16:22:11 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 259
Apr 1 16:22:10 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Denying connection for new client f3980ca0-c3a8-4c99-749a-9f58c2850ad2(at 10.242.104.147@tcp), waiting for 4041 known clients (4012 recovered, 29 in progress, and 0 evicted) to recover in 21188504:41
Apr 1 16:22:10 scratchlfsmds01 kernel: Lustre: Skipped 2 previous similar messages
Apr 1 16:22:13 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 25d
Apr 1 16:22:13 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 25f
Apr 1 16:22:13 scratchlfsmds01 kernel: LustreError: 19370:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1554149835, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff9dd3daa01000/0x6bc5271fecb20ca6 lrc: 3/0,1 mode: --/EX res: [0x73666c726373:0x2:0x0].0x0 rrc: 2037 type: PLN flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 19370 timeout: 0 lvb_type: 0
Apr 1 16:22:13 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.31.163.202@o2ib, removing former export from same NID
Apr 1 16:22:13 scratchlfsmds01 kernel: Lustre: MGS: Connection restored to 988e49b9-7fc3-aa82-6ab5-3ad6ec520922 (at 10.31.163.202@o2ib)
Apr 1 16:22:13 scratchlfsmds01 kernel: Lustre: Skipped 396 previous similar messages
Apr 1 16:22:13 scratchlfsmds01 kernel: LustreError: dumping log to /tmp/lustre-log.1554150135.19370
Apr 1 16:22:14 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.31.161.45@o2ib, removing former export from same NID
Apr 1 16:22:14 scratchlfsmds01 kernel: Lustre: Skipped 56 previous similar messages
Apr 1 16:22:15 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.31.160.124@o2ib, removing former export from same NID
Apr 1 16:22:15 scratchlfsmds01 kernel: Lustre: Skipped 165 previous similar messages
Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 166-1: MGC10.31.164.248@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 19118:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1554149838, 300s ago), entering recovery for MGS@10.31.164.248@o2ib ns: MGC10.31.164.248@o2ib lock: ffff9dd3e6932c00/0x6bc5271fecb2391c lrc: 4/1,0 mode: --/CR res: [0x73666c726373:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x6bc5271fecb23923 expref: -99 pid: 19118 timeout: 0 lvb_type: 0
Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.31.164.248@o2ib: namespace resource [0x73666c726373:0x2:0x0].0x0 (ffff9dd3e6b4e900) refcount nonzero (2) after lock cleanup; forcing cleanup.
Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x73666c726373:0x2:0x0].0x0 (ffff9dd3e6b4e900) refcount = 3
Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.31.164.248@o2ib lock: ffff9dd3e6932c00/0x6bc5271fecb2391c lrc: 4/1,0 mode: --/CR res: [0x73666c726373:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0x6bc5271fecb23923 expref: -99 pid: 19118 timeout: 0 lvb_type: 0
Apr 1 16:22:17 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.242.128.221@tcp, removing former export from same NID
Apr 1 16:22:17 scratchlfsmds01 kernel: Lustre: Skipped 327 previous similar messages
Apr 1 16:22:21 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 262
It then gives a bogus time remaining for recovery, which works out to roughly 40 years (21188504 minutes ≈ 40.3 years):
Apr 1 16:24:23 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Denying connection for new client f3980ca0-c3a8-4c99-749a-9f58c2850ad2(at 10.242.104.147@tcp), waiting for 4041 known clients (4012 recovered, 29 in progress, and 0 evicted) to recover in 21188502:27
That is clearly impossible. We eventually see this:
Apr 1 16:35:09 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Recovery already passed deadline 13:38, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
Which, as you can imagine, is not helpful. It seems we have 29 clients that will not recover and that Lustre will not evict. We tried forcing an abort of the recovery via lctl, but that did not work. Right now it is just sitting and waiting.
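For reference, what we tried was along these lines (target name taken from the log messages above; the exact invocation we used may have differed slightly):

# show recovery state and the count of outstanding clients on the MDT
lctl get_param mdt.scrlfs-MDT0000.recovery_status
# attempt to abort the stuck recovery on the MDT device
lctl --device scrlfs-MDT0000 abort_recovery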
This is our main scratch filesystem; with it down, our cluster is dead in the water. So we need help ASAP. Thanks in advance.
Paul Edmon