LU-7531

MDT recovery stalled if resources are failed back immediately


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Environment: lola
      build: 2.7.63-28-g5fda01f, 5fda01f3002e7e742a206ce149652c6b78356828 + patches
    • Severity: 3

    Description

      The error occurred during soak testing of build '20151201.1' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151201.1). DNE is enabled. The MDSes are set up in an active-active HA failover configuration.

      The MDT recovery process stalls on the primary node if the recovery process on the secondary node is interrupted by failing back the
      resources immediately. This affects all running and new jobs that use the remote MDTs.

      Sequence of events:

      • 2015-12-09 04:35:10 - Failover of the MDTs owned by lola-9 --> lola-8
      • 2015-12-09 04:43:52 - MDTs mounted successfully on secondary (lola-8)
      • 2015-12-09 04:44:13 - Stopped recovery process (incomplete at that time, see soak.log) and initiated failback
      • 2015-12-09 04:44:25 - mds_failover (failback) completed
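
      To make "immediately" concrete, the gaps between the events above can be computed from the timestamps. This is a minimal sketch, assuming GNU date (for the -d option); the helper name elapsed_seconds is hypothetical and not part of the soak test framework.

```shell
#!/bin/sh
# Hypothetical helper: print the number of seconds between two timestamps.
# Assumes GNU date; both timestamps are parsed as UTC so that DST changes
# cannot distort the difference.
elapsed_seconds() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo "$((end - start))"
}

# Time between the MDTs being mounted on lola-8 and the failback being initiated.
elapsed_seconds "2015-12-09 04:43:52" "2015-12-09 04:44:13"
```

      For the timestamps listed in this ticket, the failback was initiated 21 seconds after the MDTs were mounted on the secondary node.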

      Double-checked that the MDTs are active and mounted:

      [root@lola-16 lola]# ssh lola-9 'lctl dl | grep " mdt "' | less -i
        4 UP mdt soaked-MDT0003 soaked-MDT0003_UUID 67
       32 UP mdt soaked-MDT0002 soaked-MDT0002_UUID 63
      
      [root@lola-16 lola]# ssh lola-9 'mount | grep lustre' | less -i
      /dev/mapper/360080e50002ffd8200000251520130a4p1 on /mnt/soaked-mdt3 type lustre (rw,user_xattr)
      /dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt2 type lustre (rw,user_xattr)
      

      Recovery was still ongoing after ~50 minutes:

      [root@lola-9 ~]# date
      Wed Dec  9 05:30:04 PST 2015
      [root@lola-9 ~]# lctl get_param mdt.*.recovery_status
      mdt.soaked-MDT0002.recovery_status=
      status: RECOVERING
      recovery_start: 1449667442
      time_remaining: 0
      connected_clients: 16/16
      req_replay_clients: 5
      lock_repay_clients: 5
      completed_clients: 11
      evicted_clients: 0
      replayed_requests: 0
      queued_requests: 4
      next_transno: 1090929750241
      mdt.soaked-MDT0003.recovery_status=
      status: RECOVERING
      recovery_start: 1449667442
      time_remaining: 0
      connected_clients: 16/16
      req_replay_clients: 5
      lock_repay_clients: 5
      completed_clients: 11
      evicted_clients: 0
      replayed_requests: 0
      queued_requests: 4
      next_transno: 1047980457114
      
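
      The condition visible in the output above (status RECOVERING while time_remaining is 0) can be checked mechanically. The following is a rough sketch only, assuming the recovery_status layout shown in this ticket; the function name stalled_mdts is hypothetical and not part of Lustre, and the check is a heuristic, since a target may briefly report time_remaining: 0 during a healthy recovery.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): reads the output of
# `lctl get_param mdt.*.recovery_status` on stdin and prints every target
# that reports status RECOVERING together with time_remaining 0.
stalled_mdts() {
    awk '
        # A line such as "mdt.soaked-MDT0002.recovery_status=" starts a new target.
        /^[ \t]*mdt\..*\.recovery_status=/ {
            sub(/^[ \t]*mdt\./, "")
            sub(/\.recovery_status=.*$/, "")
            target = $0
            status = ""
            next
        }
        # Remember the status of the current target.
        $1 == "status:" { status = $2 }
        # Flag the target if it is recovering with no time left.
        $1 == "time_remaining:" {
            if (status == "RECOVERING" && $2 == 0) print target
        }
    '
}

# Possible usage on an MDS (assumes lctl is in PATH):
#   lctl get_param mdt.*.recovery_status | stalled_mdts
```

      Applied to the output above, this would list both soaked-MDT0002 and soaked-MDT0003.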

      Attached are the messages file, the console log file of the MDT (lola-8), a debug log file created manually at 04:55, and the soak.log file.

          People

            Assignee: di.wang Di Wang
            Reporter: heckes Frank Heckes (Inactive)