Lustre / LU-6852

MDS is evicted during 24-hour failover test.

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      During a 24-hour failover test, I found that one MDT was evicted because a blocking AST timed out.

      Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1221 DURATION=86400 PERIOD=600
      Lustre: DEBUG MARKER: Wait mds7 recovery complete before doing next failover...
      Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
      Lustre: DEBUG MARKER: Starting failover on mds7
      Lustre: Failing over lustre-MDT0006
      Lustre: 11310:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0]  req@ffff880806d34080 x1506706376470312/t0(0) o104->lustre-MDT0006@192.168.2.126@o2ib:15/16 lens 296/224 e 0 to 1 dl 0 ref 1 fl Rpc:EX/0/ffffffff rc -19/-1
      Lustre: 11310:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      LustreError: 11310:0:(ldlm_lockd.c:668:ldlm_handle_ast_error()) ### client (nid 192.168.2.126@o2ib) failed to reply to blocking AST (req status 0 rc -19), evict it ns: mdt-lustre-MDT0006_UUID lock: ffff88080794e700/0xfe1fd5ccdec21d1a lrc: 4/0,0 mode: PW/PW res: [0x380000405:0x15e0:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x60000000000020 nid: 192.168.2.126@o2ib remote: 0x95ff49805a265b90 expref: 1607 pid: 11310 timeout: 4297597422 lvb_type: 0
      LustreError: 138-a: lustre-MDT0006: A client on nid 192.168.2.126@o2ib was evicted due to a lock blocking callback time out: rc -19
      Lustre: lustre-MDT0006: Not available for connect from 192.168.2.126@o2ib (stopping)
      LustreError: 3551:0:(client.c:1142:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880fe21853c0 x1506706376470568/t0(0) o1000->lustre-MDT0000-osp-MDT0006@192.168.2.125@o2ib:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      
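The eviction summary line in the log above can be picked out mechanically, and its return code decoded. The following is a minimal sketch, assuming the message text has exactly the form shown in this log; `EVICT_RE` and `parse_eviction` are hypothetical names, not part of any Lustre tool. Kernel return codes are negated errno values, so `rc -19` corresponds to errno 19.

```python
import errno
import re

# Pattern for the "138-a" eviction summary line, as it appears in the log above.
EVICT_RE = re.compile(
    r"LustreError: 138-a: (?P<target>\S+): A client on nid (?P<nid>\S+) "
    r"was evicted due to a lock blocking callback time out: rc (?P<rc>-?\d+)"
)


def parse_eviction(line):
    """Return (target, nid, errno name) for an eviction line, else None."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    rc = int(m.group("rc"))
    # Negate the kernel return code to get the errno; unknown codes stay numeric.
    name = errno.errorcode.get(-rc, str(rc))
    return m.group("target"), m.group("nid"), name


line = (
    "LustreError: 138-a: lustre-MDT0006: A client on nid 192.168.2.126@o2ib "
    "was evicted due to a lock blocking callback time out: rc -19"
)
print(parse_eviction(line))
```

On Linux, errno 19 is `ENODEV` ("no such device"), which fits a target that is in the process of stopping.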

      The cause is that MDT6 was restarting, so it could not respond to the blocking AST request in time, which caused it to be evicted. This again raises the question of whether we should evict an MDT during recovery. In any case, I will prepare a patch and merge it into http://review.whamcloud.com/#/c/13224/
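The open question here, whether a peer that is itself an MDT in the middle of a restart should be evicted when a blocking AST fails, can be illustrated with a toy decision function. This is only a sketch of one possible policy, not the Lustre code and not the patch under review in change 13224. `Peer`, `is_mdt`, `restarting`, and `should_evict` are hypothetical names, and treating `ENODEV` as "the peer is going away" is an assumption drawn from the `rc -19` in the log.

```python
import errno
from dataclasses import dataclass


@dataclass
class Peer:
    nid: str
    is_mdt: bool      # hypothetical flag: peer is another MDT, not a regular client
    restarting: bool  # hypothetical flag: peer is known to be failing over or recovering


def should_evict(peer, rc):
    """Toy policy: decide whether a failed blocking AST should evict the peer."""
    if rc == 0:
        return False  # the AST was answered, so there is nothing to do
    # Assumed exemption: a restarting MDT cannot answer in time, so a
    # "no such device" failure is not treated as grounds for eviction.
    if peer.is_mdt and peer.restarting and rc == -errno.ENODEV:
        return False
    return True


# The nid evicted in the log, modelled here as a restarting MDT.
mdt_peer = Peer("192.168.2.126@o2ib", is_mdt=True, restarting=True)
# A made-up regular client, for contrast.
client_peer = Peer("example-client@o2ib", is_mdt=False, restarting=False)

print(should_evict(mdt_peer, -errno.ENODEV))
print(should_evict(client_peer, -errno.ENODEV))
```

Under this policy a regular client that fails to answer is still evicted, while a restarting MDT is left to complete recovery. Whether that exemption is safe depends on how the lock is resolved after the restart, which this sketch does not model.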

            People

              Assignee: Di Wang (di.wang)
              Reporter: Di Wang (di.wang)
              Votes: 0
              Watchers: 4
