LU-12144

Lustre Stuck on Recovery


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.5
    • Component/s: None
    • Environment: DDN Exascaler
    • Severity: 4

    Description

      We have been stuck in MDT recovery on our Exascaler host for about two hours. We've been wrestling with it back and forth, but it will not finish its recovery. The following errors are seen (a note on reading the recovery state follows the excerpt):

      Apr 1 16:20:55 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Denying connection for new client f3980ca0-c3a8-4c99-749a-9f58c2850ad2(at 10.242.104.147@tcp), waiting for 4041 known clients (4012 recovered, 29 in progress, and 0 evicted) to recover in 0:36
      Apr 1 16:20:55 scratchlfsmds01 kernel: Lustre: Skipped 1 previous similar message
      Apr 1 16:21:00 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 222
      Apr 1 16:21:00 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 226
      Apr 1 16:21:03 scratchlfsmds01 kernel: LustreError: 19352:0:(tgt_handler.c:509:tgt_filter_recovery_request()) @@@ not permitted during recovery req@ffff9dbb52bcb440 x1629643355468768/t0(0) o601->scrlfs-MDT0000-lwp-OST0004_UUID@10.31.164.250@o2ib:735/0 lens 336/0 e 0 to 0 dl 1554150115 ref 1 fl Interpret:/0/ffffffff rc 0/-1
      Apr 1 16:21:03 scratchlfsmds01 kernel: LustreError: 19352:0:(tgt_handler.c:509:tgt_filter_recovery_request()) Skipped 47 previous similar messages
      Apr 1 16:21:10 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 22b
      Apr 1 16:21:18 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 231
      Apr 1 16:21:18 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 233
      Apr 1 16:21:21 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 235
      Apr 1 16:21:21 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 238
      Apr 1 16:21:31 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 23a
      Apr 1 16:21:31 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 23d
      Apr 1 16:21:36 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 240
      Apr 1 16:21:41 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 244
      Apr 1 16:21:51 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 249
      Apr 1 16:21:55 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 24e
      Apr 1 16:21:55 scratchlfsmds01 corosync[13445]: [TOTEM ] Marking ringid 0 interface 10.31.72.11 FAULTY
      Apr 1 16:21:56 scratchlfsmds01 corosync[13445]: [TOTEM ] Automatically recovered ring 0
      Apr 1 16:22:01 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 253
      Apr 1 16:22:01 scratchlfsmds01 systemd[1]: Starting Cleanup of Temporary Directories...
      Apr 1 16:22:01 scratchlfsmds01 systemd[1]: Started Cleanup of Temporary Directories.
      Apr 1 16:22:11 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 259
      Apr 1 16:22:10 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Denying connection for new client f3980ca0-c3a8-4c99-749a-9f58c2850ad2(at 10.242.104.147@tcp), waiting for 4041 known clients (4012 recovered, 29 in progress, and 0 evicted) to recover in 21188504:41
      Apr 1 16:22:10 scratchlfsmds01 kernel: Lustre: Skipped 2 previous similar messages
      Apr 1 16:22:13 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 25d
      Apr 1 16:22:13 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 25f
      Apr 1 16:22:13 scratchlfsmds01 kernel: LustreError: 19370:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1554149835, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff9dd3daa01000/0x6bc5271fecb20ca6 lrc: 3/0,1 mode: --/EX res: [0x73666c726373:0x2:0x0].0x0 rrc: 2037 type: PLN flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 19370 timeout: 0 lvb_type: 0
      Apr 1 16:22:13 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.31.163.202@o2ib, removing former export from same NID
      Apr 1 16:22:13 scratchlfsmds01 kernel: Lustre: MGS: Connection restored to 988e49b9-7fc3-aa82-6ab5-3ad6ec520922 (at 10.31.163.202@o2ib)
      Apr 1 16:22:13 scratchlfsmds01 kernel: Lustre: Skipped 396 previous similar messages
      Apr 1 16:22:13 scratchlfsmds01 kernel: LustreError: dumping log to /tmp/lustre-log.1554150135.19370
      Apr 1 16:22:14 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.31.161.45@o2ib, removing former export from same NID
      Apr 1 16:22:14 scratchlfsmds01 kernel: Lustre: Skipped 56 previous similar messages
      Apr 1 16:22:15 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.31.160.124@o2ib, removing former export from same NID
      Apr 1 16:22:15 scratchlfsmds01 kernel: Lustre: Skipped 165 previous similar messages
      Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 166-1: MGC10.31.164.248@o2ib: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 19118:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1554149838, 300s ago), entering recovery for MGS@10.31.164.248@o2ib ns: MGC10.31.164.248@o2ib lock: ffff9dd3e6932c00/0x6bc5271fecb2391c lrc: 4/1,0 mode: --/CR res: [0x73666c726373:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x6bc5271fecb23923 expref: -99 pid: 19118 timeout: 0 lvb_type: 0
      Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.31.164.248@o2ib: namespace resource [0x73666c726373:0x2:0x0].0x0 (ffff9dd3e6b4e900) refcount nonzero (2) after lock cleanup; forcing cleanup.
      Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1682:ldlm_resource_dump()) — Resource: [0x73666c726373:0x2:0x0].0x0 (ffff9dd3e6b4e900) refcount = 3
      Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      Apr 1 16:22:16 scratchlfsmds01 kernel: LustreError: 23294:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.31.164.248@o2ib lock: ffff9dd3e6932c00/0x6bc5271fecb2391c lrc: 4/1,0 mode: --/CR res: [0x73666c726373:0x2:0x0].0x0 rrc: 4 type: PLN flags: 0x1106400000000 nid: local remote: 0x6bc5271fecb23923 expref: -99 pid: 19118 timeout: 0 lvb_type: 0
      Apr 1 16:22:17 scratchlfsmds01 kernel: Lustre: MGS: Received new LWP connection from 10.242.128.221@tcp, removing former export from same NID
      Apr 1 16:22:17 scratchlfsmds01 kernel: Lustre: Skipped 327 previous similar messages
      Apr 1 16:22:21 scratchlfsmds01 corosync[13445]: [TOTEM ] Retransmit List: 262
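
      For reference, the recovery state quoted in those messages can be read directly on the MDS. This is only a sketch, not exact output; the target name scrlfs-MDT0000 is taken from the log lines above:

      # On the MDS node: show recovery status for the MDT, including the status,
      # time remaining, and connected/completed/evicted client counts.
      lctl get_param mdt.scrlfs-MDT0000.recovery_status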

      It then gives a bogus remaining time for recovery, on the order of 40 years:

      Apr 1 16:24:23 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Denying connection for new client f3980ca0-c3a8-4c99-749a-9f58c2850ad2(at 10.242.104.147@tcp), waiting for 4041 known clients (4012 recovered, 29 in progress, and 0 evicted) to recover in 21188502:27
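
      If the remaining-time field is minutes:seconds, as the "0:36" value in the earlier message suggests, then 21,188,504 minutes ÷ 525,600 minutes per year ≈ 40.3 years.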

      Which is clearly impossible.  We eventually see this:

      Apr 1 16:35:09 scratchlfsmds01 kernel: Lustre: scrlfs-MDT0000: Recovery already passed deadline 13:38, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.

      Which, as you can imagine, is not helpful. It seems that we have 29 clients that will not recover and that Lustre will not evict. We tried forcing an abort of the recovery via lctl, but that did not work; right now it is just sitting and waiting.
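
      The abort attempt was along these lines (a sketch; the exact device name is assumed from the MDT target shown in the logs):

      # On the MDS node: abort recovery on the stuck MDT so the remaining
      # unrecovered clients are evicted and the target finishes starting up.
      lctl --device scrlfs-MDT0000 abort_recovery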

      This filesystem is our main scratch filesystem; with it down, our cluster is dead in the water. We need help ASAP. Thanks in advance.

      Paul Edmon

Attachments

Activity

People

    Assignee: WC Triage (wc-triage)
    Reporter: Paul Edmon (pedmon) (Inactive)
    Votes: 0
    Watchers: 5

Dates

    Created:
    Updated:
    Resolved: