
[LU-7531] MDT recovery stalled if resources are failed back immediately

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Environment: lola
      build: 2.7.63-28-g5fda01f, 5fda01f3002e7e742a206ce149652c6b78356828 + patches
    • Severity: 3

    Description

      The error occurred during soak testing of build '20151201.1' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151201.1). DNE is enabled. The MDSes are set up in an active-active HA failover configuration.

      The MDT recovery process stalls on the primary node if the recovery process on the secondary node is interrupted by failing the resources back immediately. This affects all running and new jobs that use the remote MDTs.

      Sequence of events (an illustrative command sketch follows the list):

      • 2015-12-09 04:35:10 - Failover of MDTs owned by lola-9 --> lola-8
      • 2015-12-09 04:43:52 - MDTs mounted successfully on the secondary (lola-8)
      • 2015-12-09 04:44:13 - Stopped the recovery process (incomplete at that time, see soak.log) and initiated failback
      • 2015-12-09 04:44:25 - mds_failover (failback) completed
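
      For reference, the sketch below shows roughly what such a failover/failback amounts to when done by hand. It is an illustration only: the soak framework drives these steps automatically, the exact commands it issues are not part of this ticket, and it is an assumption that the same /dev/mapper device names are visible on the failover partner (lola-8). The device-to-mountpoint mapping is taken from the mount output further down.

      # Illustrative manual equivalent of the automated MDT failover/failback (assumptions noted above).
      # Fail the MDTs owned by lola-9 over to the secondary (lola-8):
      ssh lola-9 'umount /mnt/soaked-mdt2 /mnt/soaked-mdt3'
      ssh lola-8 'mount -t lustre /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt2'
      ssh lola-8 'mount -t lustre /dev/mapper/360080e50002ffd8200000251520130a4p1 /mnt/soaked-mdt3'

      # Immediate failback, i.e. the step that interrupts the recovery running on lola-8:
      ssh lola-8 'umount /mnt/soaked-mdt2 /mnt/soaked-mdt3'
      ssh lola-9 'mount -t lustre /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt2'
      ssh lola-9 'mount -t lustre /dev/mapper/360080e50002ffd8200000251520130a4p1 /mnt/soaked-mdt3'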

      Double-checked that the MDTs are active and mounted on the primary (lola-9):

      [root@lola-16 lola]# ssh lola-9 'lctl dl | grep " mdt "' | less -i
        4 UP mdt soaked-MDT0003 soaked-MDT0003_UUID 67
       32 UP mdt soaked-MDT0002 soaked-MDT0002_UUID 63
      
      [root@lola-16 lola]# ssh lola-9 'mount | grep lustre' | less -i
      /dev/mapper/360080e50002ffd8200000251520130a4p1 on /mnt/soaked-mdt3 type lustre (rw,user_xattr)
      /dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt2 type lustre (rw,user_xattr)
      

      Recovery still ongoing after ~ 50 mins:

      [root@lola-9 ~]# date
      Wed Dec  9 05:30:04 PST 2015
      [root@lola-9 ~]# lctl get_param mdt.*.recovery_status
      mdt.soaked-MDT0002.recovery_status=
      status: RECOVERING
      recovery_start: 1449667442
      time_remaining: 0
      connected_clients: 16/16
      req_replay_clients: 5
      lock_repay_clients: 5
      completed_clients: 11
      evicted_clients: 0
      replayed_requests: 0
      queued_requests: 4
      next_transno: 1090929750241
      mdt.soaked-MDT0003.recovery_status=
      status: RECOVERING
      recovery_start: 1449667442
      time_remaining: 0
      connected_clients: 16/16
      req_replay_clients: 5
      lock_repay_clients: 5
      completed_clients: 11
      evicted_clients: 0
      replayed_requests: 0
      queued_requests: 4
      next_transno: 1047980457114
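
      As a convenience, a minimal polling loop for watching the recovery state from the admin node is sketched below. It is an illustration only and assumes passwordless ssh to the MDS nodes, as already used above:

      # Illustrative monitor: poll recovery_status once a minute for up to an hour.
      for i in $(seq 1 60); do
          echo "=== $(date) ==="
          for node in lola-8 lola-9; do
              ssh "$node" "lctl get_param mdt.*.recovery_status" 2>/dev/null \
                  | grep -E 'recovery_status=|^status:' | sed "s/^/$node: /"
          done
          sleep 60
      done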
      

      Attached: the messages file, the console log file of the MDS node (lola-8), a debug log file created manually at 04:55, and the soak.log file.

      Attachments

        Activity

          [LU-7531] MDT recovery stalled if resources are failed back immediately

          Landed for 2.8

          jgmitter Joseph Gmitter (Inactive) added a comment

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17539/
          Subject: LU-7531 osp: allow few requests during recovery
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 2fe2d1e82005746180309d9b79057a418a729e54

          gerrit Gerrit Updater added a comment
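
          To inspect the landed change locally, one possible approach is sketched below. It is a suggestion only: the repository URL and the refs/changes layout follow standard Whamcloud/Gerrit conventions rather than anything stated in this ticket, and <N> stands for the patch set number, which is not listed above.

          # Illustrative: fetch Gerrit change 17539 and verify the landed commit.
          git clone git://git.whamcloud.com/fs/lustre-release.git
          cd lustre-release
          git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/39/17539/<N>
          git log -1 FETCH_HEAD   # expect commit 2fe2d1e82005746180309d9b79057a418a729e54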

          The error happens again, even with the recovery time set to 1 h. Six MDTs have now been stalled for over an hour:

          ----------------
          lola-9
          ----------------
          mdt.soaked-MDT0002.recovery_status=
          status: RECOVERING
          recovery_start: 1449724636
          time_remaining: 0
          connected_clients: 9/16
          req_replay_clients: 1
          lock_repay_clients: 1
          completed_clients: 8
          evicted_clients: 7
          replayed_requests: 10
          queued_requests: 0
          next_transno: 1090929751042
          mdt.soaked-MDT0003.recovery_status=
          status: RECOVERING
          recovery_start: 1449724648
          time_remaining: 0
          connected_clients: 9/16
          req_replay_clients: 1
          lock_repay_clients: 1
          completed_clients: 8
          evicted_clients: 7
          replayed_requests: 0
          queued_requests: 0
          next_transno: 1047980457124
          ----------------
          lola-10
          ----------------
          mdt.soaked-MDT0004.recovery_status=
          status: RECOVERING
          recovery_start: 1449735425
          time_remaining: 0
          connected_clients: 14/14
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 14
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 1129576470250
          mdt.soaked-MDT0005.recovery_status=
          status: RECOVERING
          recovery_start: 1449735422
          time_remaining: 0
          connected_clients: 14/14
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 14
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 1138166481892
          ----------------
          lola-11
          ----------------
          mdt.soaked-MDT0006.recovery_status=
          status: RECOVERING
          recovery_start: 1449729112
          time_remaining: 0
          connected_clients: 14/14
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 14
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 1116691658301
          mdt.soaked-MDT0007.recovery_status=
          status: RECOVERING
          recovery_start: 1449729069
          time_remaining: 0
          connected_clients: 14/14
          req_replay_clients: 0
          lock_repay_clients: 0
          completed_clients: 14
          evicted_clients: 0
          replayed_requests: 0
          queued_requests: 0
          next_transno: 1125281595927
          

          I attached the messages file containing the stack traces of the affected nodes, as requested above.

          heckes Frank Heckes (Inactive) added a comment
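
          One way to cross-check the configured recovery window against the stall observed above is sketched below. This is a suggestion only, and it assumes the build exposes the recovery_time_soft/recovery_time_hard tunables under mdt.* (on some setups they are configured via mount options instead):

          # Illustrative: show the configured recovery window and the per-MDT recovery state.
          for node in lola-9 lola-10 lola-11; do
              echo "=== $node ==="
              ssh "$node" "lctl get_param mdt.*.recovery_time_soft mdt.*.recovery_time_hard"
              ssh "$node" "lctl get_param mdt.*.recovery_status" | grep -E '=|^status:|^recovery_start:'
          done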

          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17539
          Subject: LU-7531 osp: allow enqueue during recovery
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 240fc3cf7188c4291085862225cf469ac3d29850

          gerrit Gerrit Updater added a comment
          di.wang Di Wang added a comment -

          I saw a similar issue in soak testing and captured the following trace:

          tgt_recov     S 0000000000000013     0 13622      2 0x00000080
           ffff8803f5fb57e0 0000000000000046 0000000000000000 ffff8803f5fb57a4
           ffffffff00000000 ffff88043fe84400 000003162493ac87 0000000000000286
           ffff8803f5fb5780 ffffffff8108742c ffff8803f5fad068 ffff8803f5fb5fd8
          Call Trace:
           [<ffffffff8108742c>] ? lock_timer_base+0x3c/0x70
           [<ffffffff8152b222>] schedule_timeout+0x192/0x2e0
           [<ffffffff81087540>] ? process_timeout+0x0/0x10
           [<ffffffffa0bad779>] ptlrpc_set_wait+0x319/0xa20 [ptlrpc]
           [<ffffffffa0967344>] ? lprocfs_stats_alloc_one+0x344/0x360 [obdclass]
           [<ffffffffa0ba2c20>] ? ptlrpc_interrupted_set+0x0/0x110 [ptlrpc]
           [<ffffffff81064c00>] ? default_wake_function+0x0/0x20
           [<ffffffffa0bb9b15>] ? lustre_msg_set_jobid+0xf5/0x130 [ptlrpc]
           [<ffffffffa0badf01>] ptlrpc_queue_wait+0x81/0x220 [ptlrpc]
           [<ffffffffa0b85f6e>] ldlm_cli_enqueue+0x37e/0x870 [ptlrpc]
           [<ffffffffa0b8b050>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
           [<ffffffffa122b390>] ? mdt_remote_blocking_ast+0x0/0x210 [mdt]
           [<ffffffffa140f615>] osp_md_object_lock+0x185/0x240 [osp]
           [<ffffffffa131c557>] lod_object_lock+0x147/0x860 [lod]
           [<ffffffffa0855109>] ? cfs_hash_bd_add_locked+0x29/0x90 [libcfs]
           [<ffffffffa099161f>] ? lu_object_find_try+0x1df/0x260 [obdclass]
           [<ffffffffa139e90b>] mdd_object_lock+0x3b/0xd0 [mdd]
           [<ffffffffa1238bba>] mdt_remote_object_lock+0x14a/0x310 [mdt]
           [<ffffffffa1238ea9>] mdt_object_lock_internal+0x129/0x2d0 [mdt]
           [<ffffffffa1239111>] mdt_object_lock+0x11/0x20 [mdt]
           [<ffffffffa1250474>] mdt_reint_unlink+0x204/0xff0 [mdt]
           [<ffffffffa1247a2d>] mdt_reint_rec+0x5d/0x200 [mdt]
           [<ffffffffa123381b>] mdt_reint_internal+0x62b/0xb80 [mdt]
           [<ffffffffa123420b>] mdt_reint+0x6b/0x120 [mdt]
           [<ffffffffa0c2350c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
           [<ffffffffa0c22c20>] ? tgt_request_handle+0x0/0x1470 [ptlrpc]
           [<ffffffffa0b741eb>] handle_recovery_req+0x16b/0x290 [ptlrpc]
           [<ffffffffa0b7b17c>] target_recovery_thread+0x10dc/0x2550 [ptlrpc]
           [<ffffffff81064c12>] ? default_wake_function+0x12/0x20
           [<ffffffffa0b7a0a0>] ? target_recovery_thread+0x0/0x2550 [ptlrpc]
           [<ffffffff8109e78e>] kthread+0x9e/0xc0
           [<ffffffff8100c28a>] child_rip+0xa/0x20
           [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
           [<ffffffff8100c280>] ? child_rip+0x0/0x20
          

          Ah, we should allow the lock enqueue during recovery; I will cook a patch.

          di.wang Di Wang added a comment -

          I just checked the debug log, and it looks like the recovery process is stuck during update recovery, though I cannot figure out from the current log why it is stuck.

          Frank, could you please re-run the test and get me the thread stack traces when the endless recovery happens? Unfortunately, lola-9 has already been restarted. Thanks.
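
          A common way to capture such traces is sketched below; this is a suggestion only and assumes console/syslog access on the stuck MDS and that the magic sysrq 't' trigger is permitted:

          # Illustrative: dump all kernel thread stacks plus the Lustre debug buffer on the stuck MDS.
          ssh lola-9 'echo t > /proc/sysrq-trigger'      # task stacks go to dmesg / /var/log/messages
          ssh lola-9 'dmesg > /tmp/lola-9-task-stacks.txt'
          ssh lola-9 'lctl dk > /tmp/lola-9-lustre-debug.log'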


          Assigning to Di, who reported that he will look into it shortly to assess.

          jgmitter Joseph Gmitter (Inactive) added a comment

          People

            Assignee: di.wang Di Wang
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 6
