  1. Lustre
  2. LU-7848

Recovery process on MDS stalled


Details


    Description

      The error occurred during soak testing of build '20160302' (b2_8 RC4) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test set-up configuration, see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)

      The following effects can be observed:

      • After a restart or failover, recovery takes 0.5 - 3 hours to complete on all MDSes (this seems to be correlated with the uptime of the MDS).
      • Sometimes only 1 MDT finishes recovery.
      • Often the recovery never completes.
      • This is true for all MDSes.
      • A high proportion of clients is evicted, leading to a large number of job crashes (up to ~25%).
      • Interestingly, the recovery of secondary MDTs takes only a couple of minutes and always completes on the failover partner node.

      The failover and restart events for MDS node lola-11 are listed below. The same 'structure' can be found for the other nodes:
      Recovery for secondary MDTs on lola-11

      mds_failover     : 2016-03-03 10:24:12,345 - 2016-03-03 10:32:12,647    lola-10
      Mar  3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 2:14, of 16 clients 0 recovered and 16 were evicted.
      Mar  3 10:32:06 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:20, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-03 18:11:42,958 - 2016-03-03 18:18:17,112    lola-10
      Mar  3 18:18:03 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
      Mar  3 18:18:10 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:08, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-03 22:04:51,554 - 2016-03-03 22:12:03,652    lola-10
      Mar  3 22:11:50 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:36, of 16 clients 0 recovered and 16 were evicted.
      Mar  3 22:11:57 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:22, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-04 00:11:27,161 - 2016-03-04 00:18:36,686    lola-10
      Mar  4 00:18:23 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:23, of 5 clients 0 recovered and 5 were evicted.
      Mar  4 00:18:30 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 6 clients 0 recovered and 6 were evicted.
      
      mds_failover     : 2016-03-04 01:51:11,775 - 2016-03-04 01:58:40,927    lola-10
      Mar  4 01:58:27 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:41, of 16 clients 0 recovered and 16 were evicted.
      Mar  4 01:58:34 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-04 02:54:18,928 - 2016-03-04 03:01:00,519    lola-10
      Mar  4 03:00:47 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:05, of 16 clients 0 recovered and 16 were evicted.
      Mar  4 03:00:54 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:09, of 16 clients 0 recovered and 16 were evicted.
      

      ------------------
      Recovery for primary MDTs on lola-11

      mds_failover     : 2016-03-03 09:36:44,457 - 2016-03-03 09:43:43,316    lola-11
      Mar  3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
      Mar  3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.
      
      mds_failover     : 2016-03-03 13:06:05,210 - 2016-03-03 13:13:33,003    lola-11
      Mar  3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
      Mar  3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
      
      mds_restart      : 2016-03-03 13:26:05,005 - 2016-03-03 13:32:48,359    lola-11
      Mar  3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
      Mar  3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
      
      mds_restart      : 2016-03-03 20:14:23,309 - 2016-03-03 20:24:56,044    lola-11
      Mar  3 20:37:51 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 12:50, of 16 clients 16 recovered and 0 were evicted.
       ---> MDT0007 never recovered
      
      mds_failover     : 2016-03-03 22:15:27,654 - 2016-03-03 22:23:34,982    lola-11
      Mar  4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
      Mar  4 01:03:05 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 159:30, of 16 clients 14 recovered and 2 were evicted.
      
      mds_failover     : 2016-03-04 05:10:37,638 - 2016-03-04 05:17:48,193    lola-11
       ---> MDT0006 never recovered
       ---> MDT0007 never recovered
      
      mds_failover     : 2016-03-04 05:35:12,194 - 2016-03-04 05:41:56,320    lola-11
       ---> MDT0006 never recovered
       ---> MDT0007 never recovered
      
      mds_restart      : 2016-03-04 06:53:30,098 - 2016-03-04 07:03:06,783    lola-11
       ---> MDT0006 never recovered
       ---> MDT0007 never recovered
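
      The recovery durations and eviction counts in the listings above can be extracted mechanically for comparison across nodes. The following is only a minimal sketch, not part of the soak test tooling: the names `RECOVERY_RE` and `parse_recovery_line` are hypothetical, and the duration field is assumed to be minutes:seconds, which is consistent with the timestamps above (failover end 22:23:34 plus 159:29 gives roughly 01:03:03).

```python
import re

# Hypothetical helper: matches kernel messages of the form quoted in this ticket.
RECOVERY_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered and "
    r"(?P<evicted>\d+) were evicted\."
)


def parse_recovery_line(line):
    """Return a dict describing one 'Recovery over' message, or None if no match."""
    m = RECOVERY_RE.search(line)
    if m is None:
        return None
    total = int(m["total"])
    evicted = int(m["evicted"])
    return {
        "target": m["target"],
        # Assumption: the duration is printed as minutes:seconds.
        "duration_s": int(m["min"]) * 60 + int(m["sec"]),
        "total": total,
        "recovered": int(m["recovered"]),
        "evicted": evicted,
        "evicted_fraction": evicted / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = (
        "Mar  4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over "
        "after 159:29, of 16 clients 14 recovered and 2 were evicted."
    )
    print(parse_recovery_line(sample))
```

      Events marked "never recovered" produce no such message, so they would have to be detected separately, for example by the absence of a matching line after the event window.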
      

      Attached are the message, console and debug log files (with mask '1') of all MDS nodes (lola[8-11]).

      The same situation once ended with the oom-killer starting (see LU-7836).

      Attachments

        1. console-lola-10.log.bz2
          506 kB
        2. console-lola-11.log.bz2
          563 kB
        3. console-lola-8.log.bz2
          723 kB
        4. console-lola-9.log.bz2
          650 kB
        5. messages-lola-10.log.bz2
          370 kB
        6. messages-lola-11.log.bz2
          291 kB
        7. messages-lola-8.log.bz2
          324 kB
        8. messages-lola-9.log.bz2
          367 kB
        9. lola-10-lustre-log-20160304-0751.bz2
          2.14 MB
        10. lola-8-lustre-log-20160304-0751.bz2
          2.58 MB
        11. lola-9-lustre-log-20160304-0751.bz2
          2.13 MB
        12. lola-11-lustre-log-20160304-0751.bz2
          1.44 MB
        13. recovery-times-20160317
          18 kB
        14. lustre-log-20160318-0240.bz2
          1.02 MB
        15. console-lola-10-log-20160407.bz2
          53 kB
        16. console-lola-11-log-20160407.bz2
          84 kB
        17. console-lola-8-log-20160407.bz2
          139 kB
        18. messages-lola-10.log-20160414.bz2
          84 kB
        19. messages-lola-11.log-20160414.bz2
          90 kB
        20. messages-lola-8.log-20160414.bz2
          111 kB
        21. lola-10_lustre-log.20160414-0312.bz2
          3.18 MB
        22. lola-11_lustre-log.20160414-0312.bz2
          2.22 MB
        23. lola-8_lustre-log.20160414-0312.bz2
          3.66 MB

            People

              di.wang Di Wang
              heckes Frank Heckes (Inactive)