
[LU-10073] lnet-selftest test_smoke: lst Error found

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.11.0, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
    • Environment: trevis, full, x86_64 servers, ppc clients
      servers: el7.4, ldiskfs, branch master, v2.10.53.1, b3642
      clients: el7.4, branch master, v2.10.53.1, b3642
    • Severity: 3

    Description

      https://testing.whamcloud.com/test_sets/87032fec-9d50-11e7-b778-5254006e85c2

      Seen previously in 2.9 testing (LU-6622).

      From test_log:

      Batch is stopped
      12345-10.9.0.84@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      12345-10.9.0.85@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      c:
      Total 2 error nodes in c
      12345-10.9.5.24@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      12345-10.9.5.25@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      s:
      Total 2 error nodes in s
      session is ended
      Total 2 error nodes in c
      Total 2 error nodes in s
      

      and

      Started clients trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4: 
      CMD: trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4 mount | grep /mnt/lustre' '
      10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
      10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
      CMD: trevis-77vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
      trevis-77vm4: h2tcp: deprecated, use h2nettype instead
      trevis-77vm4: trevis-77vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
       lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5289:error()
        = /usr/lib64/lustre/tests/lnet-selftest.sh:153:check_lst_err()
        = /usr/lib64/lustre/tests/lnet-selftest.sh:179:test_smoke()
        = /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
        = /usr/lib64/lustre/tests/lnet-selftest.sh:182:main()
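
      The error summary above is lst output; per the trace, check_lst_err() in lnet-selftest.sh (line 153) turns those non-zero ping/RPC error counts into the FAIL. Below is a minimal sketch of driving an equivalent lst session by hand, using the group names (c, s) and NIDs from the log; the test parameters and timings are assumptions, not copied from lnet-selftest.sh:

        # run on the console node after 'modprobe lnet_selftest' on all four nodes
        export LST_SESSION=$$
        lst new_session smoke
        lst add_group c 10.9.0.84@tcp 10.9.0.85@tcp      # client group from the log
        lst add_group s 10.9.5.24@tcp 10.9.5.25@tcp      # server group from the log
        lst add_batch b
        lst add_test --batch b --from c --to s ping               # ping test
        lst add_test --batch b --from c --to s brw read size=1M   # bulk read test
        lst run b
        sleep 60
        lst stop b                 # prints "Batch is stopped"
        lst show_error c s         # prints the "[Session ... errors] [RPC: ...]" summary
        lst end_session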
      

      Attachments

        1. perf-kernel-121.svg
          945 kB
        2. perf-kernel-122.svg
          681 kB
        3. perf-kernel-123.svg
          1.14 MB
        4. perf-kernel-124.svg
          757 kB
        5. perf-kernel-vm1.svg
          118 kB
        6. perf-kernel-vm2.svg
          131 kB
        7. perf-kernel-vm3.svg
          201 kB
        8. perf-kernel-vm4.svg
          189 kB


          Activity

            [LU-10073] lnet-selftest test_smoke: lst Error found

            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46037
            Subject: LU-10073 tests: re-enable lnet selftest smoke test 4.4+ kernels
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 29bb14ccb62a1bccf066d558d35658fb57ffca11

            gerrit Gerrit Updater added a comment - "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46037 Subject: LU-10073 tests: re-enable lnet selftest smoke test 4.4+ kernels Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 29bb14ccb62a1bccf066d558d35658fb57ffca11
            pjones Peter Jones added a comment -

            James,

            It looks like this is an area that you are still looking into.

            Peter


            gerrit Gerrit Updater added a comment -

            James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/38857
            Subject: LU-10073 tests: re-enable lnet selftest smoke test for PPC + ARM
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ff47b8bfb0d507c8a75338b7ddfde4eef99d5bb6

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37450/
            Subject: LU-10073 tests: skip test smoke for PPC
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5ab2220687f4ef3a1d5b435f1e34f808723a9bf5

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37450
            Subject: LU-10073 tests: skip test smoke for PPC
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c28c9cabd71b8b0d4e45e909acfd4c797176ed59
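
            The actual change is in the review above; purely as an illustration, skipping the smoke test on PPC clients in lnet-selftest.sh would typically use the test-framework skip helper, along these lines:

                # illustrative sketch only, not the contents of https://review.whamcloud.com/37450
                if [[ $(uname -m) == ppc64* ]]; then
                    skip "smoke test is unreliable on PPC clients (LU-10073)" && return 0
                fi
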
            yujian Jian Yu added a comment -

            The failure also occurred on RHEL 7.7 client + server testing session: https://testing.whamcloud.com/test_sets/10d17588-3bb7-11ea-adca-52540065bddc

            yujian Jian Yu added a comment - +1 on RHEL 8.1 client testing: https://testing.whamcloud.com/test_sets/e47aaeaa-3ba7-11ea-bb75-52540065bddc
            hornc Chris Horn added a comment - +1 on master https://testing.whamcloud.com/test_sessions/26e84ad7-8e0a-4307-a1f8-1c5281550588
            yujian Jian Yu added a comment -

            The failure occurred on RHEL 8.0 vm client+server against master branch:
            https://testing.whamcloud.com/test_sets/380744c0-c709-11e9-a25b-52540065bddc


            ashehata Amir Shehata (Inactive) added a comment -

            I set up two nodes with Ubuntu 18 (4.15.0-45-generic) and two nodes with CentOS 7.6 (3.10.0-957.21.3.el7_lustre.x86_64).

            We couldn't install RHEL8 on the physical nodes, so the setup is not 100% the same as the VM one. Once the RHEL8 installation issue is resolved, I'll attempt the test again.

            The exact same script that failed on the VMs passed on the physical setup.

            I collected the flame graphs below; there are no significant differences between the Ubuntu 18 and CentOS 7.6 nodes with regard to softirq handling:

            Ubuntu 18 client: perf-kernel-121.svg
            Ubuntu 18 client: perf-kernel-122.svg
            RHEL 7.6 server: perf-kernel-123.svg
            RHEL 7.6 server: perf-kernel-124.svg
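
            For reference, graphs like these are usually produced with perf plus the FlameGraph scripts; the sampling rate, duration, and file names below are assumptions, not taken from the comment:

                # on each node: sample all CPUs with kernel call graphs while the test runs
                perf record -F 99 -a -g -- sleep 60
                perf script > out.perf
                # stackcollapse-perf.pl and flamegraph.pl are from Brendan Gregg's FlameGraph repo
                ./stackcollapse-perf.pl out.perf > out.folded
                ./flamegraph.pl out.folded > perf-kernel-121.svg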

            This issue appears to be localized to VM setups. As far as I know, there have been no reports of this test failing on a physical setup.

            The VMs are started on a RHEL7.5 host. Could there be an interaction issue between host and VM? One thing to try is to deploy the VMs on a RHEL8 host and run the test there. Currently there is a problem installing RHEL8 on the physical nodes; I will try the test once that issue is resolved.


            ashehata Amir Shehata (Inactive) added a comment -

            I believe so too.

            I tried RHEL8 across all the VMs and the problem persists.

            I then increased the lnet_selftest RPC timeout to 256 seconds, and the test passed, i.e. no RPC errors or drops.

            I measured the time it takes to complete RPCs from the lnet_selftest perspective and noticed the following behavior:

            1. With one test in the batch, RPCs take at most 1 second to complete.
            2. With two tests in the batch, I see RPCs taking close to 10 seconds.
            3. As I increase the number of tests in the batch, RPCs take longer and longer to complete. With 40+ tests in the batch (which is what lnet-selftest.sh does) I see RPCs taking up to 130 seconds to complete (a sketch of scaling a batch this way follows below).
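
            A rough sketch of scaling the batch with lst, reusing the c/s groups and session from the sketch in the description; the loop bound and sleep are arbitrary:

                lst add_batch scale
                for n in $(seq 1 40); do   # 40+ tests, roughly what the smoke batch in lnet-selftest.sh ends up with
                    lst add_test --batch scale --from c --to s brw read size=1M
                done
                lst run scale; sleep 60; lst stop scale
                lst show_error c s         # expired RPCs show up in the "[RPC: ...]" counters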

            I then went to the previous setup with 2 RHEL8 clients and 2 RHEL7.6 servers, captured performance data using perf, and generated flame graphs:

            perf-kernel-vm1.svg
            perf-kernel-vm2.svg
            perf-kernel-vm3.svg
            perf-kernel-vm4.svg

            There appears to be a key difference in the flame graphs captured on the RHEL8 VMs vs the RHEL7.6 VMs: ksoftirqd/1 appears significantly less often on the RHEL7.6 VMs (~43 samples) than on the RHEL8 VMs (~7000 samples).
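
            A quick way to get such per-thread sample counts from collapsed stacks, assuming folded files like the out.folded above (each line is "comm;frame;... count"):

                # sum the samples of every stack whose comm is a ksoftirqd thread
                awk '/^ksoftirqd/ { sum += $NF } END { print sum + 0 }' out.folded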

            My next steps are:

            1. Attempt to reproduce this on physical nodes.
            2. Investigate why interrupt handling on RHEL 8 happens so much more frequently, and whether this is limited to VMs or affects physical machines as well (a rough softirq comparison sketch follows below).
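
            One rough way to compare softirq activity between the RHEL 7.6 and RHEL 8 VMs is to snapshot the per-CPU counters around a run; the file names and workload here are placeholders:

                cat /proc/softirqs > softirqs.before
                lst run scale; sleep 60; lst stop scale     # or any workload under test
                cat /proc/softirqs > softirqs.after
                diff softirqs.before softirqs.after         # large NET_RX/TIMER deltas indicate heavier softirq load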

             


            People

              Assignee: simmonsja James A Simmons
              Reporter: jcasper James Casper (Inactive)
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: