[LU-13538] conf-sanity test_48: network issues cause test-suite timeout Created: 08/May/20  Updated: 07/Jun/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for Chris Horn <hornc@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/55fb5861-ae41-4fa8-988f-39e94caae705

test_48 failed with the following error:

Timeout occurred after 261 mins, last suite running was conf-sanity

The client was unable to connect to MDT0000; every mds_connect attempt to 10.9.4.51@tcp failed with rc = -11 (EAGAIN):

[12253.240233] Lustre: DEBUG MARKER: == conf-sanity test 48: too many acls on file ======================================================== 00:54:49 (1588899289)
[12268.159764] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[12268.168910] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-70vm11@tcp:/lustre /mnt/lustre
[12363.361555] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[12513.594687] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[12663.827908] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[12814.061377] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[12964.295004] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[13114.528746] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[13264.762686] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[13414.995666] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[13565.229414] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[13715.462556] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[14015.929722] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[14015.931840] LustreError: Skipped 1 previous similar message
[14616.864428] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[14616.866551] LustreError: Skipped 3 previous similar messages
[15217.797969] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[15217.800127] LustreError: Skipped 3 previous similar messages
[15818.731170] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
[15818.733314] LustreError: Skipped 3 previous similar messages
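
For triage, the client log above can be summarized with a short script. The following is a minimal sketch, not part of the test framework: the function names (parse_connect_failures, retry_gaps, errno_name) are hypothetical, the regular expression assumes the console line format shown in this ticket, and the errno table is hard-coded with the Linux values for the two codes that appear in these logs.

```python
import re

# Linux errno values for the return codes seen in this ticket's logs.
# Kernel return codes are negated, so rc = -11 corresponds to errno 11.
LINUX_ERRNO = {11: "EAGAIN", 104: "ECONNRESET"}

# Assumed line format, taken from the console output in this ticket:
# [12363.361555] LustreError: 11-0: <obd>: operation mds_connect to node <nid> failed: rc = -11
CONNECT_FAIL = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError:.*operation (?P<op>\w+) "
    r"to node (?P<nid>\S+) failed: rc = (?P<rc>-?\d+)"
)


def parse_connect_failures(lines):
    """Return (timestamp, operation, nid, rc) for each failed-operation line."""
    failures = []
    for line in lines:
        m = CONNECT_FAIL.match(line)
        if m:
            failures.append((float(m["ts"]), m["op"], m["nid"], int(m["rc"])))
    return failures


def retry_gaps(failures):
    """Seconds between consecutive logged failures, rounded to 0.1 s."""
    stamps = [f[0] for f in failures]
    return [round(b - a, 1) for a, b in zip(stamps, stamps[1:])]


def errno_name(rc):
    """Map a negated kernel return code to its symbolic name, if known."""
    return LINUX_ERRNO.get(abs(rc), "unknown")


if __name__ == "__main__":
    sample = [
        "[12363.361555] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: "
        "operation mds_connect to node 10.9.4.51@tcp failed: rc = -11",
        "[12513.594687] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: "
        "operation mds_connect to node 10.9.4.51@tcp failed: rc = -11",
    ]
    failures = parse_connect_failures(sample)
    for ts, op, nid, rc in failures:
        print(ts, op, nid, rc, errno_name(rc))
    print("gaps (s):", retry_gaps(failures))
```

Run against the full excerpt, the gaps make the retry cadence visible. Note that the "Skipped N previous similar messages" lines mean later gaps reflect console rate limiting, not necessarily a change in the retry interval.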

All of the servers show LNet network errors:
OST 1, OST 2, OST 3, OST 4, OST 5, OST 6, OST 7, OST 8 (trevis-70vm10)

[12109.584743] LNetError: 120-3: Refusing connection from 127.0.0.1 for 127.0.0.2@tcp: No matching NI
[12109.586451] LNetError: Skipped 9 previous similar messages
[12109.587612] LNetError: 21240:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.2
[12109.589403] LNetError: 21240:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Skipped 9 previous similar messages

MDS 2, MDS 4 (trevis-70vm12)

[12168.724056] LNetError: 120-3: Refusing connection from 127.0.0.1 for 127.0.0.2@tcp: No matching NI
[12168.725999] LNetError: Skipped 10 previous similar messages
[12168.727047] LNetError: 14056:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.2
[12168.728834] LNetError: 14056:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Skipped 10 previous similar messages
[12168.730585] LNetError: 11b-b: Connection to 127.0.0.2@tcp at host 127.0.0.2 on port 7988 was reset: is it running a compatible version of Lustre and is 127.0.0.2@tcp one of its NIDs?
[12168.733518] LNetError: Skipped 10 previous similar messages

MDS 1, MDS 3 (trevis-70vm11)

[12067.370561] LustreError: 13b-9: lustre-OST0000 claims to have registered, but this MGS does not know about it, preventing registration.
[12112.040026] LNetError: 120-3: Refusing connection from 127.0.0.1 for 127.0.0.2@tcp: No matching NI
[12112.041735] LNetError: Skipped 9 previous similar messages
[12112.042757] LNetError: 12115:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.2
[12112.044547] LNetError: 12115:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Skipped 9 previous similar messages
[12112.046396] LNetError: 11b-b: Connection to 127.0.0.2@tcp at host 127.0.0.2 on port 7988 was reset: is it running a compatible version of Lustre and is 127.0.0.2@tcp one of its NIDs?
[12112.049170] LNetError: Skipped 9 previous similar messages
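
The server logs above all reference 127.0.0.2@tcp, a NID on a loopback address. The ticket does not establish a root cause, but when scanning logs from other runs it may help to flag such NIDs automatically. The sketch below is illustrative only: loopback_nids is a hypothetical helper, and the pattern assumes IPv4 NIDs written as address@network, as they appear in this ticket.

```python
import ipaddress
import re

# Assumed NID shape: dotted-quad IPv4 address, '@', then an LNet network name.
NID = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})@(\w+)")


def loopback_nids(lines):
    """Return the sorted set of NIDs in the log whose address is a loopback address."""
    found = set()
    for line in lines:
        for addr, net in NID.findall(line):
            try:
                is_loopback = ipaddress.ip_address(addr).is_loopback
            except ValueError:
                # Not a valid IPv4 address (for example an octet above 255); skip it.
                continue
            if is_loopback:
                found.add(f"{addr}@{net}")
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "[12109.584743] LNetError: 120-3: Refusing connection from 127.0.0.1 "
        "for 127.0.0.2@tcp: No matching NI",
        "[12363.361555] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: "
        "operation mds_connect to node 10.9.4.51@tcp failed: rc = -11",
    ]
    print(loopback_nids(sample))
```

Only addresses followed by @network are treated as NIDs, so the bare source address 127.0.0.1 in the "Refusing connection" message is not reported.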

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_48 - Timeout occurred after 261 mins, last suite running was conf-sanity


Generated at Sat Feb 10 03:02:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.