Lustre / LU-12990

MDS failed to mount during failover

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Severity: 3

    Description

      Soak triggered mds_failover testing. According to soak.log, MDT0003 from the failing MDS (soak-11) should have been mounted on its failover partner, soak-10, but it was not.

      soak.log

      2019-11-19 01:22:13,931:fsmgmt.fsmgmt:INFO     trying to connect to soak-11 ...
      2019-11-19 01:22:20,107:fsmgmt.fsmgmt:INFO     trying to connect to soak-11 ...
      2019-11-19 01:22:25,285:fsmgmt.fsmgmt:INFO     trying to connect to soak-11 ...
      2019-11-19 01:22:26,296:fsmgmt.fsmgmt:INFO     soak-11 is up!!!
      2019-11-19 01:22:37,308:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0003 ...
      2019-11-19 01:22:37,308:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on soak-10 ...
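
For triage, the mount attempt can be pulled out of soak.log mechanically. The following is a minimal sketch that assumes only the line layout visible in the excerpt above; `find_mount_attempts`, `LINE_RE`, and `MOUNT_RE` are hypothetical names and are not part of the soak framework. The wording of a successful-mount message is not shown in this report, so the sketch extracts attempts only.

```python
import re
from datetime import datetime

# Line layout assumed from the excerpt:
# "<date> <time>,<ms>:<logger>:<LEVEL>     <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}):"
    r"(?P<logger>[^:]+):(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)
# Message layout assumed from the excerpt: "Mounting <target> on <node> ..."
MOUNT_RE = re.compile(r"^Mounting (?P<target>\S+) on (?P<node>\S+) \.\.\.$")


def find_mount_attempts(lines):
    """Return (timestamp, target, node) for each 'Mounting X on Y ...' line."""
    attempts = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a soak.log-style line
        mount = MOUNT_RE.match(m.group("msg"))
        if mount:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            attempts.append((ts, mount.group("target"), mount.group("node")))
    return attempts
```

Feeding the excerpt above through `find_mount_attempts` gives a single attempt for soaked-MDT0003 on soak-10, whose timestamp can then be lined up against the soak-10 console log.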
      

      Here is the console log on soak-10 around that time:

      [17741.278456] device-mapper: multipath: Failing path 8:128.
      [17746.279544] device-mapper: multipath: Reinstating path 8:128.
      [17746.286032] device-mapper: multipath: Failing path 8:128.
      [17747.871994] LNetError: 6527:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
      [17747.883281] LNetError: 6527:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 192.168.1.111@o2ib (10): c: 7, oc: 0, rc: 8
      [17747.897005] LNetError: 6533:0:(peer.c:3724:lnet_peer_ni_add_to_recoveryq_locked()) lpni 192.168.1.111@o2ib added to recovery queue. Health = 900
      [17747.911953] LNetError: 20538:0:(lib-msg.c:481:lnet_handle_local_failure()) ni 192.168.1.110@o2ib added to recovery queue. Health = 900
      [17747.925462] LNetError: 20538:0:(lib-msg.c:481:lnet_handle_local_failure()) Skipped 5 previous similar messages
      [17747.937096] Lustre: 6550:0:(client.c:2219:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1574126428/real 1574126433]  req@ffff899fea070000 x1650580018692480/t0(0) o41->soaked-MDT0003-osp-MDT0002@192.168.1.111@o2ib:24/4 lens 224/368 e 0 to 1 dl 1574126435 ref 1 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
      [17747.970231] Lustre: 6550:0:(client.c:2219:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [17747.980974] Lustre: soaked-MDT0003-osp-MDT0002: Connection to soaked-MDT0003 (at 192.168.1.111@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [17751.292982] device-mapper: multipath: Reinstating path 8:128.
      [17751.299695] device-mapper: multipath: Failing path 8:128.
      [17756.300695] device-mapper: multipath: Reinstating path 8:128.
      [17756.307377] device-mapper: multipath: Failing path 8:128.
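
The console excerpt shows the same multipath path being failed and reinstated repeatedly. A minimal sketch for counting those events per path follows; it assumes the dmesg line format seen above, and `summarize_path_flaps` and `EVENT_RE` are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# "[17741.278456] device-mapper: multipath: Failing path 8:128."
EVENT_RE = re.compile(
    r"^\[\s*(?P<t>\d+\.\d+)\]\s+device-mapper: multipath: "
    r"(?P<action>Failing|Reinstating) path (?P<dev>\d+:\d+)\.$"
)


def summarize_path_flaps(lines):
    """Count 'Failing' and 'Reinstating' events per major:minor device."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m:
            counts[(m.group("dev"), m.group("action"))] += 1
    return counts
```

Run over the excerpt above, this counts four "Failing" and three "Reinstating" events for path 8:128, roughly one cycle every five seconds. Whether the flapping is related to the failed mount is not established in this report.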
      

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Sarah Liu (sarah)