Lustre / LU-10073

lnet-selftest test_smoke: lst Error found

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.16.0
    • Affects Versions: Lustre 2.11.0, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
    • Environment: trevis, full, x86_64 servers, ppc clients
      servers: el7.4, ldiskfs, branch master, v2.10.53.1, b3642
      clients: el7.4, branch master, v2.10.53.1, b3642
    • Severity: 3

    Description

      https://testing.whamcloud.com/test_sets/87032fec-9d50-11e7-b778-5254006e85c2

      Seen previously in 2.9 testing (LU-6622).

      From test_log:

      Batch is stopped
      12345-10.9.0.84@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      12345-10.9.0.85@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      c:
      Total 2 error nodes in c
      12345-10.9.5.24@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      12345-10.9.5.25@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
      s:
      Total 2 error nodes in s
      session is ended
      Total 2 error nodes in c
      Total 2 error nodes in s
      

      and

      Started clients trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4: 
      CMD: trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4 mount | grep /mnt/lustre' '
      10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
      10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
      CMD: trevis-77vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
      trevis-77vm4: h2tcp: deprecated, use h2nettype instead
      trevis-77vm4: trevis-77vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
       lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5289:error()
        = /usr/lib64/lustre/tests/lnet-selftest.sh:153:check_lst_err()
        = /usr/lib64/lustre/tests/lnet-selftest.sh:179:test_smoke()
        = /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
        = /usr/lib64/lustre/tests/lnet-selftest.sh:182:main()
      

      Attachments

        1. perf-kernel-121.svg
          945 kB
        2. perf-kernel-122.svg
          681 kB
        3. perf-kernel-123.svg
          1.14 MB
        4. perf-kernel-124.svg
          757 kB
        5. perf-kernel-vm1.svg
          118 kB
        6. perf-kernel-vm2.svg
          131 kB
        7. perf-kernel-vm3.svg
          201 kB
        8. perf-kernel-vm4.svg
          189 kB


          Activity


            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37450
            Subject: LU-10073 tests: skip test smoke for PPC
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c28c9cabd71b8b0d4e45e909acfd4c797176ed59

            yujian Jian Yu added a comment -

            The failure also occurred on RHEL 7.7 client + server testing session: https://testing.whamcloud.com/test_sets/10d17588-3bb7-11ea-adca-52540065bddc

            yujian Jian Yu added a comment - +1 on RHEL 8.1 client testing: https://testing.whamcloud.com/test_sets/e47aaeaa-3ba7-11ea-bb75-52540065bddc
            hornc Chris Horn added a comment - +1 on master https://testing.whamcloud.com/test_sessions/26e84ad7-8e0a-4307-a1f8-1c5281550588
            yujian Jian Yu added a comment -

            The failure occurred on RHEL 8.0 vm client+server against master branch:
            https://testing.whamcloud.com/test_sets/380744c0-c709-11e9-a25b-52540065bddc


            ashehata Amir Shehata (Inactive) added a comment -

            I set up two nodes with Ubuntu 18 (4.15.0-45-generic) and two nodes with CentOS 7.6 (3.10.0-957.21.3.el7_lustre.x86_64).

            We couldn't install RHEL8 on the physical nodes, so the setup is not 100% the same as the VM one. Once the RHEL8 installation issue is resolved, I'll attempt the test again.

            The exact same script that failed on the VMs passed on the physical setup.

            I collected the flame graphs below, and there are no significant differences between Ubuntu 18 and CentOS 7.6 with regard to softirq handling.

            Ubuntu 18 client: perf-kernel-121.svg
            Ubuntu 18 client: perf-kernel-122.svg
            RHEL 7.6 server: perf-kernel-123.svg
            RHEL 7.6 server: perf-kernel-124.svg

            This issue appears to be localized to VM setups. As far as I know, there have been no reports of this test failing on a physical setup.

            The VMs are started on a RHEL7.5 host. Could there be an interaction issue between host and VM? One thing to try is to deploy the VMs on a RHEL8 host and rerun the test. Currently there is a problem installing RHEL8 on physical nodes; I will try the test once that issue is resolved.


            ashehata Amir Shehata (Inactive) added a comment -

            I believe so too.

            I tried RHEL8 across all the VMs and the problem persists.

            I then increased the lnet_selftest RPC timeout to 256 seconds, and the test passed, i.e. no RPC errors or drops.

            I measured the time it takes to complete RPCs from the lnet_selftest perspective and noticed the following behavior:

            1. With one test in the batch, RPCs take at most 1 second to complete.
            2. With two tests in the batch, I see RPCs taking close to 10 seconds.
            3. As I increase the number of tests in the batch, RPCs take longer and longer to complete. With 40+ tests in the batch (which is what lnet-selftest.sh does) I see RPCs taking up to 130 seconds to complete; see the batching sketch below.
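
            For context, lnet-selftest.sh builds one large batch by repeatedly calling lst add_test. A minimal sketch of that batching pattern (the node IDs, concurrency, and sizes here are illustrative, not the script's exact parameters):

            # lst requires a session key in the environment
            export LST_SESSION=$$
            lst new_session --timeout 300 smoke
            lst add_group c 10.2.4.93@tcp 10.2.4.94@tcp     # client group
            lst add_group s 10.2.4.95@tcp 10.2.4.96@tcp     # server group
            lst add_batch bulk_rw
            # each add_test appends another test to the same batch; lnet-selftest.sh
            # does this dozens of times, which is where the RPC latency builds up
            lst add_test --batch bulk_rw --concurrency 8 --from c --to s ping
            lst add_test --batch bulk_rw --concurrency 8 --from c --to s brw read size=1M
            lst add_test --batch bulk_rw --concurrency 8 --from c --to s brw write size=4K
            lst run bulk_rw
            lst stat c s        # watch the error/dropped/expired counters while it runs
            lst stop bulk_rw
            lst end_session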

            I then went back to the previous setup with 2 RHEL8 clients and 2 RHEL7.6 servers, captured performance data using perf, and generated flame graphs:

            perf-kernel-vm1.svg
            perf-kernel-vm2.svg
            perf-kernel-vm3.svg
            perf-kernel-vm4.svg

            There appears to be a key difference between the flame graphs captured on the RHEL8 VMs and the RHEL7.6 VMs: ksoftirqd/1 appears far less often on the RHEL7.6 VMs (~43 samples) than on the RHEL8 VMs (~7000 samples).
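
            The flame graphs above were produced with perf plus Brendan Gregg's FlameGraph scripts. A sketch of the capture procedure, run on each VM while the batch was executing (the sampling frequency, duration, and FlameGraph checkout path are assumptions, not the exact commands used here):

            perf record -F 99 -a -g -- sleep 60             # sample all CPUs with call graphs for 60s
            perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded
            ./FlameGraph/flamegraph.pl out.folded > perf-kernel-vm1.svg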

            My next steps are

            1. Attempt and reproduce this on physical nodes
            2. Investigate why interrupt handling on RHEL 8 happens so much more frequently. Is it only on VMs, or on physical machines as well?

             


            adilger Andreas Dilger added a comment -

            After moving onyx-30vm[1-2] from RHEL8 clients to RHEL7.6 clients the issue is not reproducible any longer.

            This is also true of ARM and Ubuntu clients - they cause constant failures, but x86 clients do not. My thinking is that this relates to the specific kernel version used, and it just happens that ARM, Ubuntu, and RHEL8 are newer kernels that somehow changed the kernel socket interfaces.


            ashehata Amir Shehata (Inactive) added a comment -

            After moving onyx-30vm[1-2] from RHEL8 clients to RHEL7.6 clients the issue is not reproducible any longer.

            Will start backing out socklnd patches to see if any of them is the problem.

             


            ashehata Amir Shehata (Inactive) added a comment -

            I was able to reproduce this on onyx-121vm[1-4]. I then ran lnetctl set transaction_timeout 60 once lnet_selftest was loaded, and when I re-ran lnet-selftest.sh the problem did not occur. I pushed the same change for testing with autotest and the problem still happened. Once the test failed I took over the cluster, onyx-30vm[1-4], and ran lnet-selftest.sh manually (the same way I did on onyx-121vm[1-4]) and it failed. I made sure that both clusters were set up with the same kernel and the same Lustre build:

            onyx-121vm[1-2] 
            ----------------
            lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
            kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
            lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
            kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
            lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
            kernel-4.18.0-80.4.2.el8_0.x86_64
            kernel-tools-libs-4.18.0-80.el8.x86_64
            kernel-tools-4.18.0-80.el8.x86_64
            kernel-4.18.0-80.el8.x86_64
            kernel-core-4.18.0-80.4.2.el8_0.x86_64
            kernel-core-4.18.0-80.el8.x86_64
            kernel-modules-4.18.0-80.el8.x86_64
            kernel-headers-4.18.0-80.el8.x86_64
            kernel-modules-4.18.0-80.4.2.el8_0.x86_64
            
            onyx-121vm[3-4]
            ----------------
            kernel-tools-libs-3.10.0-957.21.3.el7_lustre.x86_64
            kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
            lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
            python-perf-3.10.0-957.21.3.el7_lustre.x86_64
            kernel-tools-3.10.0-957.21.3.el7_lustre.x86_64
            kernel-devel-3.10.0-957.21.3.el7_lustre.x86_64
            lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
            lustre-2.12.56_1_g07b40bd-1.el7.x86_64
            lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
            kernel-3.10.0-957.21.3.el7_lustre.x86_64
            kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
            kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
            kernel-headers-3.10.0-957.21.3.el7_lustre.x86_6

             

            onyx-30vm[1-2]
            ---------------
            kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
            lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
            lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
            lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
            kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
            kernel-modules-4.18.0-80.4.2.el8_0.x86_64
            kernel-headers-4.18.0-80.el8.x86_64
            kernel-core-4.18.0-80.el8.x86_64
            kernel-modules-4.18.0-80.el8.x86_64
            kernel-tools-libs-4.18.0-80.el8.x86_64
            kernel-4.18.0-80.el8.x86_64
            kernel-core-4.18.0-80.4.2.el8_0.x86_64
            kernel-tools-4.18.0-80.el8.x86_64
            kernel-4.18.0-80.4.2.el8_0.x86_64
            
            
            onyx-30vm[3-4]
            --------------
            lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
            kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
            kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
            lustre-2.12.56_1_g07b40bd-1.el7.x86_64
            lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
            kernel-3.10.0-957.21.3.el7_lustre.x86_64
            lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
            kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
            kernel-headers-3.10.0-957.1.3.el7.x86_64
            kernel-tools-libs-3.10.0-957.1.3.el7.x86_64
            kernel-3.10.0-957.1.3.el7.x86_64
            kernel-tools-3.10.0-957.1.3.el7.x86_64
            kernel-devel-3.10.0-957.1.3.el7.x86_64
            
            

            The cluster set up by autotest has some differences in installed RPMs, but nothing that I would expect to cause this problem.

            To make sure it's not the transaction_timeout (health code) that is causing this, I turned the transaction timeout up to 9000 seconds, turned off retries, turned off health sensitivity, and reran the test (see the sketch below). It still failed on the autotest-provisioned cluster.
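
            For reference, a sketch of those settings via lnetctl (assuming the health tunables available in 2.12-era builds; the exact values are the ones described above):

            lnetctl set transaction_timeout 9000   # give transactions up to 9000s before they expire
            lnetctl set retry_count 0              # disable health-driven resends
            lnetctl set health_sensitivity 0       # don't decrement NI health on errors
            lnetctl global show                    # verify the values took effect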

            Results from onyx-121vm[1-4]:

             1 batch in stopping
            Batch is stopped
            c:
            Total 0 error nodes in c
            s:
            Total 0 error nodes in s
            session is ended
            Total 0 error nodes in c
            Total 0 error nodes in s
            onyx-121vm4: onyx-121vm4.onyx.whamcloud.com: executing lst_cleanup
            onyx-121vm3: onyx-121vm3.onyx.whamcloud.com: executing lst_cleanup
            onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
            onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
            onyx-121vm2: onyx-121vm2.onyx.whamcloud.com: executing lst_cleanup
            

            Results from onyx-30vm[1-4]:

             killing 14697 ...
            2 batch in stopping
            Batch is stopped
            12345-10.2.4.93@tcp: [Session 206 brw errors, 15 ping errors] [RPC: 117 errors, 113 dropped, 216 expired]
            12345-10.2.4.94@tcp: [Session 230 brw errors, 30 ping errors] [RPC: 62 errors, 144 dropped, 253 expired]
            c:
            Total 2 error nodes in c
            12345-10.2.4.95@tcp: [Session 232 brw errors, 15 ping errors] [RPC: 95 errors, 151 dropped, 247 expired]
            12345-10.2.4.96@tcp: [Session 239 brw errors, 30 ping errors] [RPC: 72 errors, 101 dropped, 266 expired]
            s:
            Total 2 error nodes in s
            session is ended
            Total 2 error nodes in c
            Total 2 error nodes in s
            Starting client onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2:  -o user_xattr,flock onyx-30vm4@tcp:/lustre /mnt/lustre
            Started clients onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2: 
            10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
            10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
            10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
            10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
            10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
            onyx-30vm1: onyx-30vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
            onyx-30vm2: onyx-30vm2.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
             lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:6030:error()
              = /usr/lib64/lustre/tests/lnet-selftest.sh:161:check_lst_err()
              = /usr/lib64/lustre/tests/lnet-selftest.sh:187:test_smoke()
              = /usr/lib64/lustre/tests/test-framework.sh:6332:run_one()
              = /usr/lib64/lustre/tests/test-framework.sh:6371:run_one_logged()
              = /usr/lib64/lustre/tests/test-framework.sh:6217:run_test()
              = /usr/lib64/lustre/tests/lnet-selftest.sh:190:main()

            When the test fails, I see two issues:

            1) BRW timeouts. Basically lots of

            (socklnd_cb.c:2593:ksocknal_check_peer_timeouts()) Total 2 stale ZC_REQs for peer_ni 10.2.4.95@tcp detected; the oldest(0000000025bf6991) timed out 1 secs ago, resid: 0, wmem: 977752
            
            (socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.2.4.95@tcp[2], ip 10.2.4.95:7988, with error, wanted: 523368, left: 523368, last alive is 0 secs ago

            2) lnet_selftest outputs a message stating that it's unable to get stats from one or two of the nodes.

            I tried to check the differences in HW between onyx-30 and onyx-121.

            onyx-121
            ---------
             [root@onyx-121 ~]# lscpu
            Architecture:          x86_64
            CPU op-mode(s):        32-bit, 64-bit
            Byte Order:            Little Endian
            CPU(s):                24
            On-line CPU(s) list:   0-23
            Thread(s) per core:    1
            Core(s) per socket:    12
            Socket(s):             2
            NUMA node(s):          2
            Vendor ID:             GenuineIntel
            CPU family:            6
            Model:                 62
            Model name:            Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
            Stepping:              4
            CPU MHz:               3000.058
            CPU max MHz:           3500.0000
            CPU min MHz:           1200.0000
            BogoMIPS:              5387.10
            Virtualization:        VT-x
            L1d cache:             32K
            L1i cache:             32K
            L2 cache:              256K
            L3 cache:              30720K
            NUMA node0 CPU(s):     0-11
            NUMA node1 CPU(s):     12-23
            Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
            
            [root@onyx-121 ~]# lspci | grep -i eth
            02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
            02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
            02:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
            02:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
             [root@onyx-30 ~]# lscpu
            Architecture:          x86_64
            CPU op-mode(s):        32-bit, 64-bit
            Byte Order:            Little Endian
            CPU(s):                32
            On-line CPU(s) list:   0-31
            Thread(s) per core:    2
            Core(s) per socket:    8
            Socket(s):             2
            NUMA node(s):          2
            Vendor ID:             GenuineIntel
            CPU family:            6
            Model:                 45
            Model name:            Genuine Intel(R) CPU  @ 2.60GHz
            Stepping:              5
            CPU MHz:               2999.902
            CPU max MHz:           3300.0000
            CPU min MHz:           1200.0000
            BogoMIPS:              5187.48
            Virtualization:        VT-x
            L1d cache:             32K
            L1i cache:             32K
            L2 cache:              256K
            L3 cache:              20480K
            NUMA node0 CPU(s):     0-7,16-23
            NUMA node1 CPU(s):     8-15,24-31
            Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
            
            [root@onyx-30 ~]# lspci | grep -i eth
            04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
            

            It appears that onyx-30 only has one Ethernet device. trevis-41, another cluster the test failed on, has only two Ethernet devices. onyx-121 has four.

            It's also important to note that lnet-selftest.sh adds 71 tests. Running just one test doesn't cause a problem, so it appears that batching up so many tests in one go is what triggers the failure.

            I'm investigating whether the number of interfaces accounts for the difference in behavior. So far it appears to be the most significant difference between the two clusters. Software-wise, both clusters have the same set of packages installed.


            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35540
            Subject: LU-10073 lnet: increase transaction timeout
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 07b40bdd960d579a70c933f4d17faec844b211a4


            People

              Assignee: simmonsja James A Simmons
              Reporter: jcasper James Casper (Inactive)
              Votes: 0
              Watchers: 14
