Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9049

DNE MDT Never completes recovery

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.10.0
    • Fix Version/s: Lustre 2.10.0
    • Labels:
    • Environment:
      Soak test cluster, version=lustre: 2.9.51_45_g3b3eeeb - tip of master
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      Failover of lola-11 to lola-10

      2017-01-24 16:44:56,220:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
      

      Immediately prior to the failover mount, lola-10 reports hung processes.

      Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1485304863/real 1485304863]  req@ffff8804131dd680 x1557446666855008/t0(0) o38->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1485304880 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Jan 24 16:41:03 lola-10 kernel: Lustre: 4554:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Jan 24 16:41:35 lola-10 kernel: LNet: Service thread pid 6159 was inactive for 226.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jan 24 16:41:35 lola-10 kernel: Pid: 6159, comm: mdt00_005
      Jan 24 16:41:35 lola-10 kernel:
      Jan 24 16:41:35 lola-10 kernel: Call Trace:
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff8105e4e3>] ? __wake_up+0x53/0x70
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a33bb2>] top_trans_stop+0x782/0xbd0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11b4aac>] lod_trans_stop+0x2bc/0x330 [lod]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1259f31>] mdd_trans_stop+0x21/0xc6 [mdd]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1246e63>] mdd_create+0x1373/0x17e0 [mdd]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1105c44>] ? mdt_version_save+0x84/0x1a0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa110fcfc>] mdt_reint_create+0xbdc/0xfe0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa11013bc>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09bb2bb>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff81299b7a>] ? strlcpy+0x4a/0x60
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1102a2a>] ? old_init_ucred_common+0xda/0x2b0 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa1104f5d>] mdt_reint_rec+0x5d/0x200 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10efd9b>] mdt_reint_internal+0x62b/0xa50 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa10f066b>] mdt_reint+0x6b/0x120 [mdt]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa0a1f17c>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09ca66b>] ptlrpc_server_handle_request+0x2eb/0xbd0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa067c84a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09c56e9>] ? ptlrpc_wait_event+0xa9/0x2e0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff81059ca9>] ? __wake_up_common+0x59/0x90
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09cba11>] ptlrpc_main+0xac1/0x18d0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffffa09caf50>] ? ptlrpc_main+0x0/0x18d0 [ptlrpc]
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Jan 24 16:41:35 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Jan 24 16:41:35 lola-10 kernel:
      Jan 24 16:41:35 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1485304895.6159
      

      Mount occurs

      Jan 24 16:46:22 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      ...
      Jan 24 16:46:27 lola-10 kernel: Lustre: soaked-MDT0003: Will be in recovery for at least 2:30, or until 22 clients reconnect
      ...
      Jan 24 16:47:44 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      Jan 24 16:47:44 lola-10 kernel:  req@ffff88041171b980 x1557448608035648/t21476913294(0) o36->57c15fb3-4c15-952f-3686-35f4c9caa941@192.168.1.126@o2ib100:-1/-1 lens 632/568 e 19 to 0 dl 1485305269 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      Jan 24 16:47:47 lola-10 kernel:  req@ffff880737e13cc0 x1557448490734944/t21476914413(0) o36->1e7aaa7d-f417-ebd3-eff8-7787d5dac449@192.168.1.123@o2ib100:-1/-1 lens 616/3128 e 19 to 0 dl 1485305272 ref 2 fl Interpret:/0/0 rc 0/0
      Jan 24 16:47:47 lola-10 kernel: Lustre: 6025:0:(service.c:1331:ptlrpc_at_send_early_reply()) Skipped 22 previous similar messages
      Jan 24 16:47:50 lola-10 kernel: Lustre: soaked-MDT0002: Client ffcd9b78-a534-1d1b-1494-f25c5b00edf1 (at 192.168.1.131@o2ib100) reconnecting
      

      Recovery appears to stall

      Jan 24 16:50:18 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
      Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
      Jan 24 16:50:19 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) Skipped 2 previous similar messages
      Jan 24 16:50:21 lola-10 kernel: Lustre: 7291:0:(ldlm_lib.c:1787:extend_recovery_timer()) soaked-MDT0003: extended recovery timer reaching hard limit: 900, extend: 1
      

      System is still spinning at the present time.

        Attachments

        1. lu-9049.tar.gz
          4.47 MB
        2. lustre-log.1485304895.6159.txt.gz
          9.70 MB
        3. stack-dump-lola-10.txt
          1.22 MB
        4. trace
          1.03 MB

          Issue Links

            Activity

              People

              • Assignee:
                laisiyao Lai Siyao
                Reporter:
                cliffw Cliff White (Inactive)
              • Votes:
                1 Vote for this issue
                Watchers:
                13 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: