[LU-6919] replay-single test_70b: "Cannot send after transport endpoint shutdown" running dbench Created: 28/Jul/15  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Aditya Pandit (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File 70b__2.lctl.tgz     Text File LU-6919-MGS.txt     Text File LU-6919-OST1.txt     Text File LU-6919-client1.txt     Text File LU-6919-client2.txt    
Issue Links:
Duplicate
Related
is related to LU-6844 replay-single test 70b failure: 'rund... Resolved
is related to LU-6935 replay-single test_70b FAIL: import i... Resolved
is related to LU-7265 replay-single test_70b timeout: NULL ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The test was run for 50 iterations and failed in 4 of them.

fre0107: 1 6274 0.00 MB/sec execute 143 sec latency 198693.612 ms
fre0107: [6274] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11182 (Cannot send after transport endpoint shutdown)
fre0107: [6277] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11183 (Cannot send after transport endpoint shutdown)
fre0107: (6278) ERROR: handle 11183 was not found
fre0107: Child failed with status 1
fre0107: status script Total(sec) E(xcluded) S(low)
fre0107: ------------------------------------------------------------------------------------
fre0107:
fre0107: touch: missing file operand
fre0107: Try `touch --help' for more information.
pdsh@fre0107: fre0107: ssh exited with exit code 1
fre0108: [6481] unlink ./clients/client0/~dmtmp/WORDPRO/BENCHS1.LWP failed (Cannot send after transport endpoint shutdown) - expected NT_STATUS_OK
fre0108: ERROR: child 0 failed at line 6481
fre0108: Child failed with status 1
fre0108: status script Total(sec) E(xcluded) S(low)
fre0108: ------------------------------------------------------------------------------------
fre0108:
fre0108: touch: missing file operand
fre0108: Try `touch --help' for more information.
pdsh@fre0107: fre0108: ssh exited with exit code 1
fre0108: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 182 sec
fre0107: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 182 sec
fre0108: dbench: no process killed
fre0107: dbench: no process killed
pdsh@fre0107: fre0108: ssh exited with exit code 1
pdsh@fre0107: fre0107: ssh exited with exit code 1
replay-single test_70b: @@@@@@ FAIL: dbench stopped on some of fre0107,fre0108!
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4732:error_noexit()
= /usr/lib64/lustre/tests/replay-single.sh:2080:test_70b()
= /usr/lib64/lustre/tests/test-framework.sh:5010:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5047:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4864:run_test()
= /usr/lib64/lustre/tests/replay-single.sh:2101:main()
Dumping lctl log to /tmp/test_logs/1437990212/replay-single.test_70b.*.1437990422.log
fre0106: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

fre0108: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

fre0105: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

fre0107: dbench: no process killed
fre0108: dbench: no process killed
pdsh@fre0107: fre0107: ssh exited with exit code 1
pdsh@fre0107: fre0108: ssh exited with exit code 1
replay-single test_70b: @@@@@@ FAIL: rundbench load on fre0107,fre0108 failed!
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4732:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4763:error()
= /usr/lib64/lustre/tests/replay-single.sh:2099:test_70b()
= /usr/lib64/lustre/tests/test-framework.sh:5010:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5047:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4864:run_test()
= /usr/lib64/lustre/tests/replay-single.sh:2101:main()
Dumping lctl log to /tmp/test_logs/1437990212/replay-single.test_70b.*.1437990424.log
FAIL 70b (208s)
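
The failing operations in the log above can be grouped per client with a small script. This is only a sketch for triage, not part of the Lustre test framework: the function name `summarize_failures`, the regular expression, and the `SAMPLE` constant are illustrative assumptions, and the pattern is written against the dbench lines quoted in this ticket only. On Linux, "Cannot send after transport endpoint shutdown" is the standard error text for ESHUTDOWN.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the dbench output quoted in this ticket:
#   <host>: [<loadfile line>] <op> <path> failed ... (<error text>) ...
LINE_RE = re.compile(
    r"^(?P<host>\S+): \[(?P<line>\d+)\] (?P<op>\w+) (?P<path>\S+) failed"
    r".*\((?P<err>[^)]*)\)"
)

# A few lines copied from the description above, used as sample input.
SAMPLE = """\
fre0107: [6274] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11182 (Cannot send after transport endpoint shutdown)
fre0107: [6277] open ./clients/client0/~dmtmp/WORDPRO/BENCHS.LWP failed for handle 11183 (Cannot send after transport endpoint shutdown)
fre0107: (6278) ERROR: handle 11183 was not found
fre0108: [6481] unlink ./clients/client0/~dmtmp/WORDPRO/BENCHS1.LWP failed (Cannot send after transport endpoint shutdown) - expected NT_STATUS_OK
"""


def summarize_failures(log_text):
    """Return {host: [(op, path, error text), ...]} for failed dbench operations."""
    failures = defaultdict(list)
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match:
            failures[match.group("host")].append(
                (match.group("op"), match.group("path"), match.group("err"))
            )
    return dict(failures)


if __name__ == "__main__":
    for host, items in sorted(summarize_failures(SAMPLE).items()):
        for op, path, err in items:
            print(f"{host}: {op} {path}: {err}")
```

Lines that do not match the assumed shape, such as the "(6278) ERROR" line, are skipped rather than guessed at.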



 Comments   
Comment by Andreas Dilger [ 06/Aug/15 ]

Aditya, do you have the console logs from this test? It looks like the client has been evicted for some reason.

Also, it would be useful for you to comment about which role each of the fre0105-0107 nodes is playing (client, MDS, OSS) so that we don't have to guess what is happening.

Comment by Andreas Dilger [ 06/Aug/15 ]

Linked to LU-6844 because the failure is similar, though it may have a different cause.

Comment by Aditya Pandit (Inactive) [ 07/Aug/15 ]

Attached the console output of all the machines.
