[LU-6919] replay-single test_70b: "Cannot send after transport endpoint shutdown" running dbench

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Severity: 3

    Description

      The test was run for 50 iterations, of which 4 failed.

      fre0107: 1 6274 0.00 MB/sec execute 143 sec latency 198693.612 ms
      fre0107: [6274] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11182 (Cannot send after transport endpoint shutdown)
      fre0107: [6277] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11183 (Cannot send after transport endpoint shutdown)
      fre0107: (6278) ERROR: handle 11183 was not found
      fre0107: Child failed with status 1
      fre0107: status script Total(sec) E(xcluded) S(low)
      fre0107: ------------------------------------------------------------------------------------
      fre0107:
      fre0107: touch: missing file operand
      fre0107: Try `touch --help' for more information.
      pdsh@fre0107: fre0107: ssh exited with exit code 1
      fre0108: [6481] unlink ./clients/client0/~dmtmp/WORDPRO/BENCHS1.LWP failed (Cannot send after transport endpoint shutdown) - expected NT_STATUS_OK
      fre0108: ERROR: child 0 failed at line 6481
      fre0108: Child failed with status 1
      fre0108: status script Total(sec) E(xcluded) S(low)
      fre0108: ------------------------------------------------------------------------------------
      fre0108:
      fre0108: touch: missing file operand
      fre0108: Try `touch --help' for more information.
      pdsh@fre0107: fre0108: ssh exited with exit code 1
      fre0108: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 182 sec
      fre0107: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 182 sec
      fre0108: dbench: no process killed
      fre0107: dbench: no process killed
      pdsh@fre0107: fre0108: ssh exited with exit code 1
      pdsh@fre0107: fre0107: ssh exited with exit code 1
      replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of fre0107,fre0108!
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:4732:error_noexit()
      = /usr/lib64/lustre/tests/replay-single.sh:2080:test_70b()
      = /usr/lib64/lustre/tests/test-framework.sh:5010:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5047:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:4864:run_test()
      = /usr/lib64/lustre/tests/replay-single.sh:2101:main()
      Dumping lctl log to /tmp/test_logs/1437990212/replay-single.test_70b.*.1437990422.log
      fre0106: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

      fre0108: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

      fre0105: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

      fre0107: dbench: no process killed
      fre0108: dbench: no process killed
      pdsh@fre0107: fre0107: ssh exited with exit code 1
      pdsh@fre0107: fre0108: ssh exited with exit code 1
      replay-single test_70b: @@@@@@ FAIL: rundbench load on fre0107,fre0108 failed!
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:4732:error_noexit()
      = /usr/lib64/lustre/tests/test-framework.sh:4763:error()
      = /usr/lib64/lustre/tests/replay-single.sh:2099:test_70b()
      = /usr/lib64/lustre/tests/test-framework.sh:5010:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5047:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:4864:run_test()
      = /usr/lib64/lustre/tests/replay-single.sh:2101:main()
      Dumping lctl log to /tmp/test_logs/1437990212/replay-single.test_70b.*.1437990424.log
      FAIL 70b (208s)
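      For triage, the dbench failures in a log like the one above can be tallied per node and operation. The following is a minimal sketch, not part of the Lustre test framework: the function name `summarize_failures` and the regular expression are assumptions made for illustration, and the sample lines are copied from the output above. The error string is the text Linux reports for ESHUTDOWN.

```python
import re
from collections import Counter

# Error text reported by dbench in the log above (Linux ESHUTDOWN message).
ESHUTDOWN_TEXT = "Cannot send after transport endpoint shutdown"

# Hypothetical pattern for lines such as:
#   fre0107: [6274] open ./clients/.../BENCHS.LWP failed for handle 11182 (...)
LINE_RE = re.compile(
    r"^\s*(?P<node>[\w.-]+): \[(?P<line>\d+)\] (?P<op>\w+) (?P<path>\S+) failed"
)


def summarize_failures(log_text):
    """Count ESHUTDOWN failures per (node, operation) pair."""
    counts = Counter()
    for raw in log_text.splitlines():
        if ESHUTDOWN_TEXT not in raw:
            continue  # skip lines that do not carry the error of interest
        match = LINE_RE.match(raw)
        if match:
            counts[(match.group("node"), match.group("op"))] += 1
    return counts


# Sample lines taken verbatim from the test output in this ticket.
sample = """\
fre0107: [6274] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11182 (Cannot send after transport endpoint shutdown)
fre0107: [6277] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11183 (Cannot send after transport endpoint shutdown)
fre0108: [6481] unlink ./clients/client0/~dmtmp/WORDPRO/BENCHS1.LWP failed (Cannot send after transport endpoint shutdown) - expected NT_STATUS_OK
fre0107: Child failed with status 1
"""

for (node, op), n in sorted(summarize_failures(sample).items()):
    print(f"{node} {op} {n}")
```

      Run against the full console output, a tally like this shows whether both clients hit the error and on which operations, which may help when comparing failing iterations.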

      Attachments

        1. 70b__2.lctl.tgz (1.39 MB)
        2. LU-6919-client1.txt (115 kB)
        3. LU-6919-client2.txt (120 kB)
        4. LU-6919-MGS.txt (147 kB)
        5. LU-6919-OST1.txt (161 kB)

          Activity

            aditya.pandit@seagate.com Aditya Pandit (Inactive) added a comment - console output of all the machines.

            adilger Andreas Dilger added a comment - Link to LU-6844 because of similar failure, but it may have a different cause.

            adilger Andreas Dilger added a comment - Aditya, do you have the console logs from this test? It looks like the client has been evicted for some reason. Also, it would be useful for you to comment about which role each of the fre0105-0107 nodes is playing (client, MDS, OSS) so that we don't have to guess what is happening.

            People

              Assignee: wc-triage WC Triage
              Reporter: aditya.pandit@seagate.com Aditya Pandit (Inactive)
              Votes: 0
              Watchers: 6
