Details


    Description

      Error occurred during soak testing of build '20160302' (b2_8 RC4) (see also: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in an active-active HA failover configuration. (For the test setup configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)

      The following effects can be observed:

      • After a restart or failover it takes 0.5 - 3 hours for recovery to complete on all MDSes (this seems to be correlated with the uptime of the MDS)
      • Sometimes only 1 MDT finishes recovery
      • Often the recovery never completes
      • This is true for all MDSes
      • A high rate of clients is evicted, leading to a large number of job crashes (up to ~25%)
      • Interestingly, the recovery of secondary MDTs takes only a couple of minutes and always completes on the failover partner node.

      Here are the failover and restart events for MDS node lola-11. The same pattern can be found for the other nodes:
      Recovery for secondary MDTs on lola-11

      mds_failover     : 2016-03-03 10:24:12,345 - 2016-03-03 10:32:12,647    lola-10
      Mar  3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 2:14, of 16 clients 0 recovered and 16 were evicted.
      Mar  3 10:32:06 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:20, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-03 18:11:42,958 - 2016-03-03 18:18:17,112    lola-10
      Mar  3 18:18:03 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:03, of 16 clients 0 recovered and 16 were evicted.
      Mar  3 18:18:10 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:08, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-03 22:04:51,554 - 2016-03-03 22:12:03,652    lola-10
      Mar  3 22:11:50 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:36, of 16 clients 0 recovered and 16 were evicted.
      Mar  3 22:11:57 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:22, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-04 00:11:27,161 - 2016-03-04 00:18:36,686    lola-10
      Mar  4 00:18:23 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:23, of 5 clients 0 recovered and 5 were evicted.
      Mar  4 00:18:30 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 6 clients 0 recovered and 6 were evicted.
      
      mds_failover     : 2016-03-04 01:51:11,775 - 2016-03-04 01:58:40,927    lola-10
      Mar  4 01:58:27 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:41, of 16 clients 0 recovered and 16 were evicted.
      Mar  4 01:58:34 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:23, of 16 clients 0 recovered and 16 were evicted.
      
      mds_failover     : 2016-03-04 02:54:18,928 - 2016-03-04 03:01:00,519    lola-10
      Mar  4 03:00:47 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over after 1:05, of 16 clients 0 recovered and 16 were evicted.
      Mar  4 03:00:54 lola-11 kernel: Lustre: soaked-MDT0004: Recovery over after 0:09, of 16 clients 0 recovered and 16 were evicted.
      

      ------------------
      Recovery for primary MDTs on lola-11

      mds_failover     : 2016-03-03 09:36:44,457 - 2016-03-03 09:43:43,316    lola-11
      Mar  3 09:50:42 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 6:59, of 16 clients 16 recovered and 0 were evicted.
      Mar  3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.
      
      mds_failover     : 2016-03-03 13:06:05,210 - 2016-03-03 13:13:33,003    lola-11
      Mar  3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
      Mar  3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
      
      mds_restart      : 2016-03-03 13:26:05,005 - 2016-03-03 13:32:48,359    lola-11
      Mar  3 14:13:46 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 40:56, of 16 clients 16 recovered and 0 were evicted.
      Mar  3 14:13:50 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 41:50, of 16 clients 16 recovered and 0 were evicted.
      
      mds_restart      : 2016-03-03 20:14:23,309 - 2016-03-03 20:24:56,044    lola-11
      Mar  3 20:37:51 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 12:50, of 16 clients 16 recovered and 0 were evicted.
       ---> MDT0007 never recovered
      
      mds_failover     : 2016-03-03 22:15:27,654 - 2016-03-03 22:23:34,982    lola-11
      Mar  4 01:03:03 lola-11 kernel: Lustre: soaked-MDT0007: Recovery over after 159:29, of 16 clients 14 recovered and 2 were evicted.
      Mar  4 01:03:05 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 159:30, of 16 clients 14 recovered and 2 were evicted.
      
      mds_failover     : 2016-03-04 05:10:37,638 - 2016-03-04 05:17:48,193    lola-11
       ---> MDT0006 never recovered
       ---> MDT0007 never recovered
      
      mds_failover     : 2016-03-04 05:35:12,194 - 2016-03-04 05:41:56,320    lola-11
       ---> MDT0006 never recovered
       ---> MDT0007 never recovered
      
      mds_restart      : 2016-03-04 06:53:30,098 - 2016-03-04 07:03:06,783    lola-11
       ---> MDT0006 never recovered
       ---> MDT0007 never recovered
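The "Recovery over" lines above follow a fixed format, so per-MDT recovery durations and eviction counts can be tabulated with a short script. This is a sketch in Python; the regex reflects only the message format seen in these logs:

```python
import re

# Parse Lustre "Recovery over" syslog lines (format as seen in the logs
# above) into per-MDT recovery duration and eviction counts.
LINE_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<clients>\d+) clients (?P<rec>\d+) recovered and (?P<evi>\d+) were evicted"
)

def parse_recovery(line):
    """Return (target, duration_seconds, clients, recovered, evicted), or None."""
    m = LINE_RE.search(line)
    if not m:
        return None
    dur = int(m.group("min")) * 60 + int(m.group("sec"))
    return (m.group("target"), dur, int(m.group("clients")),
            int(m.group("rec")), int(m.group("evi")))

line = ("Mar  3 10:31:58 lola-11 kernel: Lustre: soaked-MDT0005: Recovery over "
        "after 2:14, of 16 clients 0 recovered and 16 were evicted.")
print(parse_recovery(line))  # ('soaked-MDT0005', 134, 16, 0, 16)
```

Run over the console logs, this makes the asymmetry visible at a glance: secondary MDTs recover in tens of seconds with all clients evicted, while primary MDTs take minutes to hours or never finish.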
      

      Attached are the message, console and debug log files (with debug mask '1') of all MDS nodes (lola[8-11]).

      The same situation once ended with the start of the oom-killer (see LU-7836).

      Attachments

        1. console-lola-10.log.bz2
          506 kB
        2. console-lola-10-log-20160407.bz2
          53 kB
        3. console-lola-11.log.bz2
          563 kB
        4. console-lola-11-log-20160407.bz2
          84 kB
        5. console-lola-8.log.bz2
          723 kB
        6. console-lola-8-log-20160407.bz2
          139 kB
        7. console-lola-9.log.bz2
          650 kB
        8. lola-10_lustre-log.20160414-0312.bz2
          3.18 MB
        9. lola-10-lustre-log-20160304-0751.bz2
          2.14 MB
        10. lola-11_lustre-log.20160414-0312.bz2
          2.22 MB
        11. lola-11-lustre-log-20160304-0751.bz2
          1.44 MB
        12. lola-8_lustre-log.20160414-0312.bz2
          3.66 MB
        13. lola-8-lustre-log-20160304-0751.bz2
          2.58 MB
        14. lola-9-lustre-log-20160304-0751.bz2
          2.13 MB
        15. lustre-log-20160318-0240.bz2
          1.02 MB
        16. messages-lola-10.log.bz2
          370 kB
        17. messages-lola-10.log-20160414.bz2
          84 kB
        18. messages-lola-11.log.bz2
          291 kB
        19. messages-lola-11.log-20160414.bz2
          90 kB
        20. messages-lola-8.log.bz2
          324 kB
        21. messages-lola-8.log-20160414.bz2
          111 kB
        22. messages-lola-9.log.bz2
          367 kB
        23. recovery-times-20160317
          18 kB

        Issue Links

          Activity

            [LU-7848] Recovery process on MDS stalled
            pjones Peter Jones added a comment -

            Landed for 2.9


            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18800/
            Subject: LU-7848 target: Do not fail MDT-MDT connection
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a37b7315ad15a1012c22277a0f7c7b7c9a989b59

            gerrit Gerrit Updater added a comment -

            Frank can confirm, but I believe the stuck recovery happened on the first mds_failover after we restarted. The restart was to recover from the lola-10 crash. So, not related.

            cliffw Cliff White (Inactive) added a comment -
            di.wang Di Wang added a comment -

            You can ignore these console messages for now; this is a known issue, see LU-7732.

            soaked-MDT0003: waking for gap in transno, VBR is OFF (skip: 12890084916, ql: 5, comp: 15, conn: 20, next: 12890084917, next_update 12890084974 last_committed: 12890084882)
            ...
            

            Hmm, I did not see the console message from lola-9. Also, it seems lola-10 crashed around 04-14 6am because of LU-8022. When did this recovery get stuck? Thanks. Just wondering if this is related?


            The error also appears during the soak test of build '20160413' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160413). The test configuration is the same as specified above.
            Please let me know if this should be handled in a new ticket!

            After MDS restart and failover of 3 MDTs on lola-[8,10,11], the recovery process is stuck.

            Sequence of events:

            • 2016-04-13 16:51:12,334:fsmgmt.fsmgmt:INFO mds_restart lola-11 just completed
              Also shows many messages (INTL-156?)
              Apr 13 16:51:51 lola-11 kernel: LustreError: 5331:0:(ldlm_lib.c:1900:check_for_next_transno()) soaked-MDT0003: waking for gap in transno, VBR is OFF (skip: 12890084916, ql: 5, comp: 15, conn: 20, next: 12890084917, next_update 12890084974 last_committed: 12890084882)
              ...
              ...
              
            • 2016-04-13 23:43:59,801:fsmgmt.fsmgmt:INFO mds_restart lola-10 just completed
            • 2016-04-14 00:41:58,831:fsmgmt.fsmgmt:INFO mds_restart lola-8 just completed

            Attached files: message, console and debug_kernel files of lola-[8,10,11]. Debug files contain
            information gathered with default debug mask.

            heckes Frank Heckes (Inactive) added a comment -

            The same failure recurred on Mar 31. It seems to be quite regular, but takes a day or so to reproduce.
            Same sequence of events as above; errors from lola-9:

            Lustre: Skipped 1164 previous similar messages
            Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib10, removing former export from same NID
            Lustre: Skipped 287 previous similar messages
            LustreError: 4406:0:(ldlm_lib.c:2773:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff88040c491850 x1530264857542592/t0(399435290195) o1000->soaked-MDT0001-mdtlov_UUID@0@lo:409/0 lens 344/0 e 0 to 0 dl 1459462974 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            LustreError: 4406:0:(ldlm_lib.c:2773:target_queue_recovery_request()) Skipped 263 previous similar messages
            LustreError: 4304:0:(client.c:2874:ptlrpc_replay_interpret()) @@@ request replay timed out.
              req@ffff8807008016c0 x1530264857542592/t399435290195(399435290195) o1000->soaked-MDT0000-osp-MDT0001@0@lo:24/4 lens 344/4288 e 1146 to 1 dl 1459462988 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
            LustreError: 4304:0:(client.c:2874:ptlrpc_replay_interpret()) Skipped 87 previous similar messages
            Mar 31 15:25:01 lola-9 TIME: Time stamp for console
            Lustre: soaked-MDT0003-osp-MDT0000: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib10) was lost; in progress operations using this service will wait for recovery to complete
            Lustre: soaked-MDT0000: Recovery already passed deadline 477:37, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
            Lustre: Skipped 875 previous similar messages
            
            cliffw Cliff White (Inactive) added a comment -

            Soak hit the OOM killer again after ~48 hours of running.

            2016-03-25 23:35:03,097:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
            2016-03-25 23:37:36,548:fsmgmt.fsmgmt:INFO     ... soaked-MDT0003 mounted successfully on lola-10
            2016-03-25 23:37:36,548:fsmgmt.fsmgmt:INFO     ... soaked-MDT0003 failed over
            2016-03-25 23:37:36,549:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
            

            Errors reported to syslog during the recovery:

            Mar 26 00:00:54 lola-10 kernel: LustreError: 4923:0:(ldlm_lib.c:2773:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff8803efc440c0 x1529699890967980/t0(270583579075) o36->df764ab7-38a0-d041-b3b3-6cbdb5966d59@192.168.1.131@o2ib100:70/0 lens 768/0 e 0 to 0 dl 1458975660 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            Mar 26 00:00:54 lola-10 kernel: LustreError: 4923:0:(ldlm_lib.c:2773:target_queue_recovery_request()) Skipped 320 previous similar messages
            
            Mar 26 00:02:21 lola-10 kernel: LustreError: 3413:0:(import.c:675:ptlrpc_connect_import()) already connecting
            Mar 26 00:02:21 lola-10 kernel: LustreError: 3413:0:(import.c:675:ptlrpc_connect_import()) Skipped 3 previous similar messages
            
            Mar 27 03:15:21 lola-10 kernel: Lustre: soaked-MDT0003: Recovery already passed deadline 1650:08, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
            
            

            Due to the weekend, the recovery was hung for quite a long time; the last console message is dated Mar 27 03:19:15.
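The deadline overrun "1650:08" in the message above appears to be minutes:seconds, which works out to roughly 27.5 hours — consistent with the mount at 23:37 on Mar 25 and the last console message at 03:19 on Mar 27:

```python
# Convert Lustre's minutes:seconds overrun figure to hours
# (assuming the min:sec reading of the "passed deadline" message).
def overrun_hours(mmss: str) -> float:
    minutes, seconds = (int(x) for x in mmss.split(":"))
    return (minutes * 60 + seconds) / 3600.0

print(round(overrun_hours("1650:08"), 1))  # 27.5
```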

            cliffw Cliff White (Inactive) added a comment -

            As of Friday morning, we have had 12 MDS failovers without any issues and 11 MDS restarts. Average times are low.
            A few examples:

            lola-11.log:Mar 24 15:14:37 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 1:11, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 24 17:12:16 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 0:55, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 24 18:53:29 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 1:09, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 24 20:37:20 lola-11 kernel: Lustre: soaked-MDT0002: Recovery over after 0:48, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 24 23:50:03 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 0:55, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 25 01:05:09 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:09, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 25 02:05:57 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:11, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 25 07:16:49 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 1:05, of 12 clients 12 recovered and 0 were evicted.
            lola-11.log:Mar 25 08:37:15 lola-11 kernel: Lustre: soaked-MDT0003: Recovery over after 0:20, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 24 08:56:21 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 5:00, of 12 clients 3 recovered and 9 were evicted.
            lola-8.log:Mar 24 14:14:03 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:25, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 24 18:07:07 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:24, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 24 19:16:10 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:06, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 25 00:49:16 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:24, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 25 02:17:26 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:28, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 25 03:03:24 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:18, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 25 06:09:20 lola-8 kernel: Lustre: soaked-MDT0001: Recovery over after 0:45, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 25 08:00:11 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 1:13, of 12 clients 12 recovered and 0 were evicted.
            lola-8.log:Mar 25 09:23:57 lola-8 kernel: Lustre: soaked-MDT0000: Recovery over after 0:28, of 12 clients 12 recovered and 0 were evicted.
            lola-9.log:Mar 24 08:56:21 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 5:00, of 12 clients 3 recovered and 9 were evicted.
            lola-9.log:Mar 24 10:58:12 lola-9 kernel: Lustre: soaked-MDT0001: Recovery over after 0:48, of 12 clients 12 recovered and 0 were evicted.
            
            cliffw Cliff White (Inactive) added a comment -
            di.wang Di Wang added a comment -

            In the newest run, MDT failover seems stuck again. Here is what happened.

            1. lola-9 is restarted, and MDT0001 fails over to lola-8.
            2. Then MDT0001 is re-mounted and tries to connect to MDT0002 on lola-10, but always gets EALREADY (rc = -114).

            LustreError: 11-0: soaked-MDT0002-osp-MDT0001: operation mds_connect to node 192.168.1.110@o2ib10 failed: rc = -114
            LustreError: 11-0: soaked-MDT0002-osp-MDT0001: operation mds_connect to node 192.168.1.110@o2ib10 failed: rc = -114
            

            3. Meanwhile, MDT0002 (lola-10) denies the new connection because of:

            Lustre: soaked-MDT0002: Client soaked-MDT0001-mdtlov_UUID seen on new nid 192.168.1.108@o2ib10 when existing nid 192.168.1.109@o2ib10 is already connected
            

            Since an MDT always uses the same UUID to connect to other MDTs, I think we need to remove the old export when the same export reconnects from a different NID.
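The proposed behaviour can be sketched as a toy model. This is illustrative only, not the actual Lustre patch (which landed as change 18800, "Do not fail MDT-MDT connection"): an export table keyed by client UUID, where a reconnect from a new NID evicts the stale export instead of being refused with EALREADY.

```python
class ExportTable:
    """Toy model of a target's export table keyed by client UUID.

    Illustrates the suggested fix: when the same UUID reconnects from a
    different NID (an MDT that failed over to a new node), drop the stale
    export rather than rejecting the connection with -114 (EALREADY).
    """

    def __init__(self):
        self.exports = {}  # uuid -> nid

    def connect(self, uuid, nid):
        old_nid = self.exports.get(uuid)
        if old_nid is not None and old_nid != nid:
            # Old behaviour: return -EALREADY here, stalling DNE recovery.
            # Proposed behaviour: evict the stale export and accept.
            del self.exports[uuid]
        self.exports[uuid] = nid
        return 0  # connection accepted

table = ExportTable()
table.connect("soaked-MDT0001-mdtlov_UUID", "192.168.1.109@o2ib10")
rc = table.connect("soaked-MDT0001-mdtlov_UUID", "192.168.1.108@o2ib10")
print(rc, table.exports["soaked-MDT0001-mdtlov_UUID"])
# 0 192.168.1.108@o2ib10
```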


            Same for umount :

            [root@lola-8 ~]# ps aux | grep umount
            root      21524  0.0  0.0 105184   776 pts/0    D+   08:45   0:00 umount /mnt/soaked-mdt0
            [root@lola-8 ~]# cat /proc/21524/wchan 
            target_stop_recovery_thread[root@lola-8 ~]# 
            
            heckes Frank Heckes (Inactive) added a comment -

            People

              di.wang Di Wang
              heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 7
