LU-6057: replay-dual test_9 failed - post-failover df: 1

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.7.0
    • Environment: OpenSFS cluster running lustre-master tag 2.6.91 build # 2771, with two MDSs with one MDT each, three OSSs with two OSTs each, and three clients.
    • Severity: 3

    Description

      While running the LFSCK Phase 3 test plan, replay-dual test 9 failed with the following error from the fail() routine:

      c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
      c12: stat: cannot read file system information for `/lustre/scratch': Input/output error
      

      replay-dual test 10 failed with the same error message:

      c13: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      c11: stat: cannot read file system information for `/lustre/scratch': Input/output error
      pdsh@c13: c11: ssh exited with exit code 1
      c13: stat: cannot read file system information for `/lustre/scratch': Input/output error
      

      The test results are at https://testing.hpdd.intel.com/test_sets/78dc0abe-861b-11e4-ac52-5254006e85c2
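
      For reference, the failing "post-failover df" check comes down to two steps that can be reproduced by hand from a client: wait for the client's MDC import to the failed-over MDT to return to FULL (the "mds_server_uuid in FULL state" messages above), then stat the file system; the error text matches coreutils stat -f. A minimal sketch, assuming the mount point /lustre/scratch and target scratch-MDT0000 taken from the logs above (the exact commands the test framework runs may differ):

      # show the MDC import state for the recovered MDT (expect state: FULL)
      lctl get_param mdc.scratch-MDT0000-mdc-*.import | grep -E 'target|state'
      # the post-failover "df" check; this is the stat that returned EIO above
      stat -f /lustre/scratch
      df /lustre/scratch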

      It’s not clear from the logs what caused this error. For test 9, the client that could not stat the file system, c12, has the following in dmesg right before the test fails (error -5 is -EIO, which matches the Input/output error reported by stat):

      00800000:00020000:5.0:1418766109.528501:0:25671:0:(lmv_obd.c:1477:lmv_statfs()) can't stat MDS #0 (scratch-MDT0000-mdc-ffff8808028cbc00), error -5
      

      On the primary MDS, MDS0, recovery appears to be having issues:

      Lustre: *** cfs_fail_loc=119, val=2147483648***
      LustreError: 12646:0:(ldlm_lib.c:2384:target_send_reply_msg()) @@@ dropping reply  req@ffff880d0ee74c80 x1487677070285728/t128849018882(128849018882) o36->558cba8f-7f43-4143-5d8a-c7adfced85eb@192.168.2.112@o2ib:308/0 lens 488/448 e 0 to 0 dl 1418766108 ref 1 fl Complete:/4/0 rc 0/0
      Lustre: scratch-MDT0000: recovery is timed out, evict stale exports
      Lustre: scratch-MDT0000: disconnecting 1 stale clients
      Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
      Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Lustre: 12646:0:(ldlm_lib.c:1773:target_recovery_overseer()) Skipped 2 previous similar messages
      Lustre: 12646:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted:  req@ffff880275011380 x1487683234832604/t0(128849018884) o36->d08d2f7b-4c89-7208-ad20-237f0ed0a102@192.168.2.113@o2ib:294/0 lens 488/0 e 6 to 0 dl 1418766094 ref 1 fl Complete:/4/ffffffff rc 0/-1
      Lustre: 12646:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
      Lustre: 12646:0:(ldlm_lib.c:2060:target_recovery_thread()) too long recovery - read logs
      Lustre: scratch-MDT0000: Recovery over after 3:01, of 7 clients 1 recovered and 6 were evicted.
      LustreError: dumping log to /tmp/lustre-log.1418766079.12646
      Lustre: Skipped 3 previous similar messages
      Lustre: DEBUG MARKER: mdc.scratch-MDT0000-mdc-*.mds_server_uuid in FULL state after 181 sec
      Lustre: DEBUG MARKER: replay-dual test_9: @@@@@@ FAIL: post-failover df: 1
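
      The MDS log shows recovery aborted by the hard timeout, with only 1 of 7 clients recovered and 6 evicted, which would account for the EIO on an evicted client. As a sketch of how to confirm how recovery ended on the MDT (run on the MDS; parameter names assume a standard Lustre 2.x setup, target name taken from the logs above):

      # recovery outcome for this MDT: status, recovery duration,
      # completed_clients and evicted_clients counters
      lctl get_param mdt.scratch-MDT0000.recovery_status
      # the soft/hard recovery timeouts that were in effect during the failover
      lctl get_param mdt.scratch-MDT0000.recovery_time_soft mdt.scratch-MDT0000.recovery_time_hard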
      

      People

        Assignee: yong.fan nasf (Inactive)
        Reporter: jamesanunez James Nunez (Inactive)