Lustre / LU-13251

conf-sanity test_116 hangs

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f9a3b2f8-4f30-11ea-a90e-52540065bddc

      test_116 failed with the following error:

      == conf-sanity test 116: big size MDT support ======================================================== 14:30:44 (1581604244)
      CMD: trevis-41vm12 which mkfs.xfs
      /sbin/mkfs.xfs
      Stopping clients: trevis-41vm10,trevis-41vm9.trevis.whamcloud.com /mnt/lustre (opts:)
      CMD: trevis-41vm10,trevis-41vm9.trevis.whamcloud.com running=\$(grep -c /mnt/lustre' ' /proc/mounts);
      if [ \$running -ne 0 ] ; then
      echo Stopping client \$(hostname) /mnt/lustre opts:;
      lsof /mnt/lustre || need_kill=no;
      if [ x != x -a x\$need_kill != xno ]; then
          pids=\$(lsof -t /mnt/lustre | sort -u);
          if [ -n \"\$pids\" ]; then
                   kill -9 \$pids;
          fi
      fi;
      while umount  /mnt/lustre 2>&1 | grep -q busy; do
          echo /mnt/lustre is still busy, wait one second && sleep 1;
      done;
      fi
      

      Console log on OSS:

      [51410.683520] Lustre: DEBUG MARKER: == conf-sanity test 116: big size MDT support ======================================================== 14:30:44 (1581604244)
      [51831.596292] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [51831.597893] LNetError: Skipped 6 previous similar messages
      [51831.598897] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [51831.600826] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 6 previous similar messages
      [51831.602520] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [51831.605249] LNetError: Skipped 6 previous similar messages
      [52106.585997] LNetError: 10606:0:(peer.c:3706:lnet_peer_ni_add_to_recoveryq_locked()) lpni 0.0.0.0@tcp added to recovery queue. Health = 0
      [52106.588260] LNetError: 10606:0:(peer.c:3706:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 2 previous similar messages
      [52456.590555] LNetError: 120-3: Refusing connection from 127.0.0.1 for 0.0.0.0@tcp: No matching NI
      [52456.592517] LNetError: Skipped 7 previous similar messages
      [52456.593570] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Error -104 reading HELLO from 127.0.0.1
      [52456.595610] LNetError: 10598:0:(socklnd_cb.c:1817:ksocknal_recv_hello()) Skipped 7 previous similar messages
      [52456.597393] LNetError: 11b-b: Connection to 0.0.0.0@tcp at host 0.0.0.0 on port 7988 was reset: is it running a compatible version of Lustre and is 0.0.0.0@tcp one of its NIDs?
      [52456.600207] LNetError: Skipped 7 previous similar messages
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      conf-sanity test_116 - Timeout occurred after 917 mins, last suite running was conf-sanity

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Maloo (maloo)