Lustre / LU-13538

conf-sanity test_48: network issues cause test-suite timeout

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      This issue was created by maloo for Chris Horn <hornc@cray.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/55fb5861-ae41-4fa8-988f-39e94caae705

      test_48 failed with the following error:

      Timeout occurred after 261 mins, last suite running was conf-sanity
      

      Client is unable to connect to MDT0; every mds_connect attempt fails with rc = -11 (-EAGAIN):

      [12253.240233] Lustre: DEBUG MARKER: == conf-sanity test 48: too many acls on file ======================================================== 00:54:49 (1588899289)
      [12268.159764] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
      [12268.168910] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-70vm11@tcp:/lustre /mnt/lustre
      [12363.361555] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [12513.594687] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [12663.827908] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [12814.061377] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [12964.295004] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [13114.528746] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [13264.762686] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [13414.995666] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [13565.229414] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [13715.462556] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [14015.929722] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [14015.931840] LustreError: Skipped 1 previous similar message
      [14616.864428] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [14616.866551] LustreError: Skipped 3 previous similar messages
      [15217.797969] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [15217.800127] LustreError: Skipped 3 previous similar messages
      [15818.731170] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11
      [15818.733314] LustreError: Skipped 3 previous similar messages
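
The timestamps in the client log above show the first ten mds_connect failures arriving roughly 150 seconds apart, after which the logged interval widens to about 300 and then about 600 seconds as messages are skipped. A minimal Python sketch for extracting that pattern from a saved console log follows. The helper names (`parse_failures`, `retry_gaps`) and the small errno table are illustrative assumptions, not part of the Lustre test framework; the errno numbers are the Linux values.

```python
import re

# Linux errno values, hardcoded because the numbers differ on other platforms.
LINUX_ERRNO = {11: "EAGAIN", 104: "ECONNRESET"}

# Matches console lines of the form:
# [12363.361555] LustreError: 11-0: <import>: operation mds_connect to node <nid> failed: rc = -11
FAILED_OP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError:.*operation (?P<op>\w+) "
    r"to node (?P<nid>\S+) failed: rc = (?P<rc>-?\d+)"
)


def parse_failures(lines):
    """Return (timestamp, operation, nid, rc, errno_name) for each failed operation."""
    failures = []
    for line in lines:
        m = FAILED_OP.search(line)
        if m:
            rc = int(m.group("rc"))
            name = LINUX_ERRNO.get(-rc, "unknown")
            failures.append((float(m.group("ts")), m.group("op"), m.group("nid"), rc, name))
    return failures


def retry_gaps(failures):
    """Seconds between consecutive logged failures, rounded to 0.1 s."""
    stamps = [f[0] for f in failures]
    return [round(later - earlier, 1) for earlier, later in zip(stamps, stamps[1:])]


# Three lines copied from the client log, plus one "Skipped" line that must be ignored.
sample = [
    "[12363.361555] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11",
    "[12513.594687] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11",
    "[12663.827908] LustreError: 11-0: lustre-MDT0000-mdc-ffff89ffe63b2000: operation mds_connect to node 10.9.4.51@tcp failed: rc = -11",
    "[14015.931840] LustreError: Skipped 1 previous similar message",
]

failures = parse_failures(sample)
gaps = retry_gaps(failures)
print(failures[0][4], gaps)
```

Running the same functions over the full client console log would show where the gaps widen; those wider gaps coincide with the "Skipped N previous similar messages" lines, so they likely reflect console rate limiting rather than a change in retry timing.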
      

      All of the servers show network errors:
      OST 1, OST 2, OST 3, OST 4, OST 5, OST 6, OST 7, OST 8 (trevis-70vm10)

      [12109.584743] LNetError: 120-3: Refusing connection from 127.0.0.1 for 127.0.0.2@tcp: No matching NI
      [12109.586451] LNetError: Skipped 9 previous similar messages
      [12109.587612] LNetError: 21240:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.2
      [12109.589403] LNetError: 21240:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Skipped 9 previous similar messages
      

      MDS 2, MDS 4 (trevis-70vm12)

      [12168.724056] LNetError: 120-3: Refusing connection from 127.0.0.1 for 127.0.0.2@tcp: No matching NI
      [12168.725999] LNetError: Skipped 10 previous similar messages
      [12168.727047] LNetError: 14056:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.2
      [12168.728834] LNetError: 14056:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Skipped 10 previous similar messages
      [12168.730585] LNetError: 11b-b: Connection to 127.0.0.2@tcp at host 127.0.0.2 on port 7988 was reset: is it running a compatible version of Lustre and is 127.0.0.2@tcp one of its NIDs?
      [12168.733518] LNetError: Skipped 10 previous similar messages
      

      MDS 1, MDS 3 (trevis-70vm11)

      [12067.370561] LustreError: 13b-9: lustre-OST0000 claims to have registered, but this MGS does not know about it, preventing registration.
      [12112.040026] LNetError: 120-3: Refusing connection from 127.0.0.1 for 127.0.0.2@tcp: No matching NI
      [12112.041735] LNetError: Skipped 9 previous similar messages
      [12112.042757] LNetError: 12115:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.2
      [12112.044547] LNetError: 12115:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Skipped 9 previous similar messages
      [12112.046396] LNetError: 11b-b: Connection to 127.0.0.2@tcp at host 127.0.0.2 on port 7988 was reset: is it running a compatible version of Lustre and is 127.0.0.2@tcp one of its NIDs?
      [12112.049170] LNetError: Skipped 9 previous similar messages
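
In all three server logs the refused connections come from 127.0.0.1 and target 127.0.0.2@tcp, both of which are loopback addresses, and the HELLO read fails with -104 (ECONNRESET on Linux). The sketch below, in Python, is one way to flag such lines in a saved console log. The function names are hypothetical, and the regular expression is written only against the message text quoted in this report; it draws no conclusion about the root cause.

```python
import ipaddress
import re

# Matches: LNetError: 120-3: Refusing connection from <src> for <addr>@<net>: No matching NI
REFUSED = re.compile(
    r"LNetError: 120-3: Refusing connection from (?P<src>\S+) "
    r"for (?P<nid>[^:\s]+): No matching NI"
)


def is_loopback(text):
    """True if text parses as an IP address in a loopback range."""
    try:
        return ipaddress.ip_address(text).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname); treat as non-loopback.
        return False


def loopback_refusals(lines):
    """Return (source, target_address, network) for refusals where both ends are loopback."""
    hits = []
    for line in lines:
        m = REFUSED.search(line)
        if not m:
            continue
        src = m.group("src")
        addr, _, net = m.group("nid").partition("@")
        if is_loopback(src) and is_loopback(addr):
            hits.append((src, addr, net))
    return hits


# Two lines copied from the trevis-70vm10 log; only the first is a refusal message.
sample = [
    "[12109.584743] LNetError: 120-3: Refusing connection from 127.0.0.1 for 127.0.0.2@tcp: No matching NI",
    "[12109.587612] LNetError: 21240:0:(socklnd_cb.c:1808:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.2",
]

hits = loopback_refusals(sample)
print(hits)
```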
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_48 - Timeout occurred after 261 mins, last suite running was conf-sanity

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Maloo (maloo)