  1. Lustre
  2. LU-8428

Very long recovery times for MDTs after MDS restart



    Description

      The problem occurred during soak testing of build '20160713' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713).
      OSTs are configured with ZFS, MDTs with ldiskfs.
      The environment consists of 4 MDSes with 1 MDT each and 6 OSSes with 4 OSTs each.
      DNE is enabled.
      MDS and OSS nodes are configured in active-active HA configuration.

      Very long recovery times occurred after two MDS nodes were restarted:
      lola-9 : 89:38 min recovery time
      lola-10 : 193:55 min recovery time

      An OSS failover/failback that the framework injected while the MDTs of the restarted MDS nodes were still in recovery finished successfully in 48 seconds.

      Sequence of events:

      • 2016-07-19 22:57:52 - lola-10 mds_restart
      • 2016-07-19 23:07:21 - lola-10 , recovery started
      • 2016-07-20 00:41:55 - lola-9 mds_restart
      • 2016-07-20 00:51:55 - lola-9 , recovery started
      • 2016-07-20 00:53:07 - lola-3 oss_failover
      • 2016-07-20 00:58:55 - lola-3 recovery started
      • 2016-07-20 00:59:43 - lola-3 kernel: Lustre: soaked-OST0007: Recovery over after 0:48, of 23 clients 23 recovered and 0 were evicted.
      • 2016-07-20 02:21:35 - lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 89:38, of 22 clients 21 recovered and 1 was evicted.
      • 2016-07-20 02:21:17 - lola-10 kernel: Lustre: soaked-MDT0002: Recovery over after 193:55, of 22 clients 20 recovered and 2 were evicted.
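      The durations in the timeline above can be cross-checked against the kernel messages. The following is a minimal sketch, not part of the original report: it assumes the timestamp layout and the "Recovery over after M:SS" message format exactly as they appear in the list above, and the helper names (wall_clock_seconds, reported_seconds) are hypothetical.

```python
import re
from datetime import datetime

# Timestamp layout used in the sequence-of-events list (assumption: local time, no TZ).
FMT = "%Y-%m-%d %H:%M:%S"

# Matches the "Recovery over" kernel message as quoted in this ticket.
RECOVERY_RE = re.compile(
    r"(?P<target>\S+): Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<total>\d+) clients (?P<ok>\d+) recovered and "
    r"(?P<evicted>\d+) (?:was|were) evicted"
)


def wall_clock_seconds(start: str, end: str) -> int:
    """Seconds between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def reported_seconds(line: str) -> int:
    """Recovery duration in seconds as reported by the kernel message."""
    m = RECOVERY_RE.search(line)
    if m is None:
        raise ValueError("no 'Recovery over' message in line")
    return int(m["min"]) * 60 + int(m["sec"])


# (node, recovery started, recovery over, kernel message) taken from the list above.
events = [
    ("lola-10", "2016-07-19 23:07:21", "2016-07-20 02:21:17",
     "Lustre: soaked-MDT0002: Recovery over after 193:55, "
     "of 22 clients 20 recovered and 2 were evicted."),
    ("lola-9", "2016-07-20 00:51:55", "2016-07-20 02:21:35",
     "Lustre: soaked-MDT0001: Recovery over after 89:38, "
     "of 22 clients 21 recovered and 1 was evicted."),
    ("lola-3", "2016-07-20 00:58:55", "2016-07-20 00:59:43",
     "Lustre: soaked-OST0007: Recovery over after 0:48, "
     "of 23 clients 23 recovered and 0 were evicted."),
]

for node, start, end, message in events:
    wall = wall_clock_seconds(start, end)
    reported = reported_seconds(message)
    print(f"{node}: wall clock {wall // 60}:{wall % 60:02d}, "
          f"reported {reported // 60}:{reported % 60:02d}, "
          f"difference {wall - reported}s")
```

      For the three entries listed, the timeline and the kernel-reported durations agree to within two seconds.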

      Clients were in the following state during recovery:

      [root@lola-16 ~]# pdsh -g clients 'lctl get_param *.*.state | grep -A 2 -E "MDT0001|MDT0002" ' | dshbak -c | less -i
      ----------------  
      lola-13
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff88028add7800.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff88028add7800.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------
      lola-16
      ----------------
      mdc.soaked-MDT0001-mdc-ffff880415a56400.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff880415a56400.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------
      lola-19
      ----------------
      mdc.soaked-MDT0001-mdc-ffff88021b0fa800.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff88021b0fa800.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------
      lola-20
      ----------------
      mdc.soaked-MDT0001-mdc-ffff880c9b13f400.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff880c9b13f400.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------  
      lola-21
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff881033e32400.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff881033e32400.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------
      lola-22
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff88082e455000.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff88082e455000.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------  
      lola-23
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff8808e6af4400.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff8808e6af4400.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------  
      lola-24
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff88102e910800.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff88102e910800.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------  
      lola-25
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff88102f38a400.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff88102f38a400.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------  
      lola-26
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff880f7857fc00.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff880f7857fc00.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------  
      lola-27
      ----------------  
      mdc.soaked-MDT0001-mdc-ffff88081a0eec00.state=
      current_state: REPLAY_WAIT
      state_history:
      --
      mdc.soaked-MDT0002-mdc-ffff88081a0eec00.state=
      current_state: REPLAY_WAIT
      state_history:
      ----------------  
      lola-29
      ----------------  

      Attached files:
      Console and messages logs of MDS nodes lola-[9,10]
      debug kernel logs of MDS nodes lola-[9,10] and single Lustre client node lola-19
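
      The client import states listed earlier in this description can be tallied per MDT instead of being read through by eye. The sketch below is an illustration only: it assumes the text layout of the pdsh/dshbak output shown in this ticket, the function name tally_states is hypothetical, and the sample input is a two-client excerpt copied from that output. Each mdc parameter name is counted once, so a listing pasted twice does not inflate the totals.

```python
import re
from collections import Counter, defaultdict

# Parameter line as printed by `lctl get_param *.*.state` in this ticket.
TARGET_RE = re.compile(r"^mdc\.(?P<target>.+)-mdc-[0-9a-f]+\.state=$")
STATE_RE = re.compile(r"^current_state:\s*(?P<state>\w+)$")


def tally_states(text: str) -> dict:
    """Count current_state values per MDT, counting each mdc device once."""
    counts = defaultdict(Counter)
    seen = set()      # full parameter names already counted
    target = None     # MDT whose current_state line is expected next
    for raw in text.splitlines():
        line = raw.strip()
        m = TARGET_RE.match(line)
        if m:
            # Skip devices seen before (for example, a duplicated paste).
            target = None if line in seen else m["target"]
            seen.add(line)
            continue
        m = STATE_RE.match(line)
        if m and target is not None:
            counts[target][m["state"]] += 1
            target = None
    return {t: dict(c) for t, c in counts.items()}


# Excerpt of the output above (clients lola-13 and lola-16).
sample = """\
----------------
lola-13
----------------
mdc.soaked-MDT0001-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff88028add7800.state=
current_state: REPLAY_WAIT
state_history:
----------------
lola-16
----------------
mdc.soaked-MDT0001-mdc-ffff880415a56400.state=
current_state: REPLAY_WAIT
state_history:
--
mdc.soaked-MDT0002-mdc-ffff880415a56400.state=
current_state: REPLAY_WAIT
state_history:
"""

print(tally_states(sample))
```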

            People

              yong.fan nasf (Inactive)
              heckes Frank Heckes (Inactive)