I was able to reproduce this on onyx-121vm[1-4]. I then ran lnetctl set transaction_timeout 60 once lnet_selftest was loaded, and when I re-ran lnet-selftest.sh the problem did not occur. I pushed the same change for testing with autotest and the problem still happened. Once the test failed I took over the cluster, onyx-30vm[1-4], and ran lnet-selftest.sh manually (the same way I do on onyx-121vm[1-4]) and it failed. I made sure that both clusters are set up with the same kernel and the same Lustre build (RPM lists below).
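For reference, the manual workaround on onyx-121vm[1-4] was roughly the following (a sketch; the ordering is just how I ran it, not a requirement):

# load the selftest module so LNet is up before changing the tunable
modprobe lnet_selftest
# set the LNet transaction timeout to 60 seconds
lnetctl set transaction_timeout 60
# re-run the test script
/usr/lib64/lustre/tests/lnet-selftest.sh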
onyx-121vm[1-2]
----------------
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64
onyx-121vm[3-4]
----------------
kernel-tools-libs-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
python-perf-3.10.0-957.21.3.el7_lustre.x86_64
kernel-tools-3.10.0-957.21.3.el7_lustre.x86_64
kernel-devel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.21.3.el7_lustre.x86_64
onyx-30vm[1-2]
---------------
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64
onyx-30vm[3-4]
--------------
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.1.3.el7.x86_64
kernel-tools-libs-3.10.0-957.1.3.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
kernel-tools-3.10.0-957.1.3.el7.x86_64
kernel-devel-3.10.0-957.1.3.el7.x86_64
The cluster set up by autotest has some differences in the RPMs installed, but nothing that I would expect to cause this problem.
To make sure it's not the transaction_timeout (health code) that is causing this, I turned the transaction timeout up to 9000 seconds, turned off retries, turned off health sensitivity, and reran the test. It still failed on the autotest-configured cluster.
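The settings used to take the health code out of the picture were along these lines (a sketch of the lnetctl tunables involved):

# effectively disable the transaction timeout
lnetctl set transaction_timeout 9000
# turn off message retries
lnetctl set retry_count 0
# turn off health sensitivity so failures do not lower interface health
lnetctl set health_sensitivity 0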
Results from onyx-121vm[1-4]:
1 batch in stopping
Batch is stopped
c:
Total 0 error nodes in c
s:
Total 0 error nodes in s
session is ended
Total 0 error nodes in c
Total 0 error nodes in s
onyx-121vm4: onyx-121vm4.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm3: onyx-121vm3.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm2: onyx-121vm2.onyx.whamcloud.com: executing lst_cleanup
Results from onyx-30vm[1-4]:
killing 14697 ...
2 batch in stopping
Batch is stopped
12345-10.2.4.93@tcp: [Session 206 brw errors, 15 ping errors] [RPC: 117 errors, 113 dropped, 216 expired]
12345-10.2.4.94@tcp: [Session 230 brw errors, 30 ping errors] [RPC: 62 errors, 144 dropped, 253 expired]
c:
Total 2 error nodes in c
12345-10.2.4.95@tcp: [Session 232 brw errors, 15 ping errors] [RPC: 95 errors, 151 dropped, 247 expired]
12345-10.2.4.96@tcp: [Session 239 brw errors, 30 ping errors] [RPC: 72 errors, 101 dropped, 266 expired]
s:
Total 2 error nodes in s
session is ended
Total 2 error nodes in c
Total 2 error nodes in s
Starting client onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2: -o user_xattr,flock onyx-30vm4@tcp:/lustre /mnt/lustre
Started clients onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2:
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
onyx-30vm1: onyx-30vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
onyx-30vm2: onyx-30vm2.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6030:error()
= /usr/lib64/lustre/tests/lnet-selftest.sh:161:check_lst_err()
= /usr/lib64/lustre/tests/lnet-selftest.sh:187:test_smoke()
= /usr/lib64/lustre/tests/test-framework.sh:6332:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:6371:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:6217:run_test()
= /usr/lib64/lustre/tests/lnet-selftest.sh:190:main()
When the test fails, I see two issues:
1) BRW timeouts, basically lots of:
(socklnd_cb.c:2593:ksocknal_check_peer_timeouts()) Total 2 stale ZC_REQs for peer_ni 10.2.4.95@tcp detected; the oldest(0000000025bf6991) timed out 1 secs ago, resid: 0, wmem: 977752
(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.2.4.95@tcp[2], ip 10.2.4.95:7988, with error, wanted: 523368, left: 523368, last alive is 0 secs ago
2) lnet_selftest outputs a message stating that it is unable to get stats from one or two of the nodes.
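The per-node error counters shown in the output above can also be queried by hand against a live session, roughly like this (a sketch; "c" and "s" are the client and server group names from the output, and LST_SESSION has to match the running session):

lst show_error c     # session and RPC error counters for the client group
lst show_error s     # same for the server group
lst stat c s         # aggregate stats from both groups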
I tried to check the differences in HW between onyx-30 and onyx-121.
onyx-121
---------
[root@onyx-121 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Stepping: 4
CPU MHz: 3000.058
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5387.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
[root@onyx-121 ~]# lspci | grep -i eth
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
onyx-30
---------
[root@onyx-30 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel(R) CPU @ 2.60GHz
Stepping: 5
CPU MHz: 2999.902
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 5187.48
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
[root@onyx-30 ~]# lspci | grep -i eth
04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
It appears that onyx-30 has only one ethernet device. trevis-41, another cluster the test failed on, has only two ethernet devices. onyx-121 has four.
It's also important to note that lnet-selftest.sh adds 71 tests. Running just one test doesn't cause a problem, so it appears that batching up so many tests in one go is what triggers the failure.
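For context, this is roughly the shape of what the script builds, boiled down to a single hand-rolled batch (a sketch only; the NIDs are the onyx-30 ones from the output above, and the sizes and sleep are placeholders rather than the script's actual values):

export LST_SESSION=$$
lst new_session smoke
lst add_group s 10.2.4.95@tcp 10.2.4.96@tcp    # server group
lst add_group c 10.2.4.93@tcp 10.2.4.94@tcp    # client group
lst add_batch b
# lnet-selftest.sh issues add_test many times with varying ping/brw parameters;
# the failure only shows up when the batch carries the full set of tests
lst add_test --batch b --from c --to s ping
lst add_test --batch b --from c --to s brw write size=1M
lst add_test --batch b --from c --to s brw read size=1M
lst run b
sleep 60
lst stop b
lst show_error c s
lst end_session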
I'm investigating whether the number of interfaces accounts for the difference in behavior; so far it appears to be the most significant difference between the two clusters. Software-wise, both clusters have the same software installed.
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37450
Subject: LU-10073 tests: skip test smoke for PPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c28c9cabd71b8b0d4e45e909acfd4c797176ed59