I was able to reproduce this on onyx-121vm[1-4]. I then ran lnetctl set transaction_timeout 60 once lnet_selftest was loaded, and when I re-ran lnet-selftest.sh the problem did not occur. I pushed the same change for testing with autotest and the problem still happened. Once the test failed I took over the cluster, onyx-30vm[1-4], and ran lnet-selftest.sh manually (the same way I do on onyx-121vm[1-4]) and it failed. I made sure that both clusters are set up with the same kernel and the same Lustre build (RPM lists below).
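For reference, the manual workaround on onyx-121vm[1-4] was roughly the following (a sketch; the ordering is just how I ran it, not a requirement):

# load the selftest module so LNet is up before changing the tunable
modprobe lnet_selftest
# set the LNet transaction timeout to 60 seconds
lnetctl set transaction_timeout 60
# re-run the test script
/usr/lib64/lustre/tests/lnet-selftest.sh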
onyx-121vm[1-2]
----------------
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64
onyx-121vm[3-4]
----------------
kernel-tools-libs-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
python-perf-3.10.0-957.21.3.el7_lustre.x86_64
kernel-tools-3.10.0-957.21.3.el7_lustre.x86_64
kernel-devel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.21.3.el7_lustre.x86_64
onyx-30vm[1-2]
---------------
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64
onyx-30vm[3-4]
--------------
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.1.3.el7.x86_64
kernel-tools-libs-3.10.0-957.1.3.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
kernel-tools-3.10.0-957.1.3.el7.x86_64
kernel-devel-3.10.0-957.1.3.el7.x86_64
The cluster set up by autotest has some differences in the RPMs installed, but nothing that I would expect to cause this problem.
To make sure it's not the transaction_timeout (health code) that is causing this, I turned the transaction timeout up to 9000 seconds, turned off retries, turned off health sensitivity, and reran the test. It still failed on the autotest-configured cluster.
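The settings used to take the health code out of the picture were along these lines (a sketch of the lnetctl tunables involved):

# effectively disable the transaction timeout
lnetctl set transaction_timeout 9000
# turn off message retries
lnetctl set retry_count 0
# turn off health sensitivity so failures do not lower interface health
lnetctl set health_sensitivity 0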
Results from onyx-121vm[1-4]:
1 batch in stopping
Batch is stopped
c:
Total 0 error nodes in c
s:
Total 0 error nodes in s
session is ended
Total 0 error nodes in c
Total 0 error nodes in s
onyx-121vm4: onyx-121vm4.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm3: onyx-121vm3.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm2: onyx-121vm2.onyx.whamcloud.com: executing lst_cleanup
Results from onyx-30vm[1-4]:
killing 14697 ...
2 batch in stopping
Batch is stopped
12345-10.2.4.93@tcp: [Session 206 brw errors, 15 ping errors] [RPC: 117 errors, 113 dropped, 216 expired]
12345-10.2.4.94@tcp: [Session 230 brw errors, 30 ping errors] [RPC: 62 errors, 144 dropped, 253 expired]
c:
Total 2 error nodes in c
12345-10.2.4.95@tcp: [Session 232 brw errors, 15 ping errors] [RPC: 95 errors, 151 dropped, 247 expired]
12345-10.2.4.96@tcp: [Session 239 brw errors, 30 ping errors] [RPC: 72 errors, 101 dropped, 266 expired]
s:
Total 2 error nodes in s
session is ended
Total 2 error nodes in c
Total 2 error nodes in s
Starting client onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2: -o user_xattr,flock onyx-30vm4@tcp:/lustre /mnt/lustre
Started clients onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2:
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
onyx-30vm1: onyx-30vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
onyx-30vm2: onyx-30vm2.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6030:error()
= /usr/lib64/lustre/tests/lnet-selftest.sh:161:check_lst_err()
= /usr/lib64/lustre/tests/lnet-selftest.sh:187:test_smoke()
= /usr/lib64/lustre/tests/test-framework.sh:6332:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:6371:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:6217:run_test()
= /usr/lib64/lustre/tests/lnet-selftest.sh:190:main()
When the test fails, I see two issues:
1) BRW timeouts, basically lots of:
(socklnd_cb.c:2593:ksocknal_check_peer_timeouts()) Total 2 stale ZC_REQs for peer_ni 10.2.4.95@tcp detected; the oldest(0000000025bf6991) timed out 1 secs ago, resid: 0, wmem: 977752
(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.2.4.95@tcp[2], ip 10.2.4.95:7988, with error, wanted: 523368, left: 523368, last alive is 0 secs ago
2) lnet_selftest outputs a message stating that it is unable to get stats from one or two of the nodes.
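The per-node error counters shown in the output above can also be queried by hand against a live session, roughly like this (a sketch; "c" and "s" are the client and server group names from the output, and LST_SESSION has to match the running session):

lst show_error c     # session and RPC error counters for the client group
lst show_error s     # same for the server group
lst stat c s         # aggregate stats from both groups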
I tried to check the differences in HW between onyx-30 and onyx-121.
onyx-121
---------
[root@onyx-121 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Stepping: 4
CPU MHz: 3000.058
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5387.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
[root@onyx-121 ~]# lspci | grep -i eth
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
onyx-30
---------
[root@onyx-30 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel(R) CPU @ 2.60GHz
Stepping: 5
CPU MHz: 2999.902
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 5187.48
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
[root@onyx-30 ~]# lspci | grep -i eth
04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
It appears that onyx-30 has only one ethernet device. trevis-41, another cluster the test failed on, has only two ethernet devices. onyx-121 has four.
It's also important to note that lnet-selftest.sh adds 71 tests. Running just one test doesn't cause a problem, so it appears that batching up so many tests in one go is what triggers the failure.
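For context, this is roughly the shape of what the script builds, boiled down to a single hand-rolled batch (a sketch only; the NIDs are the onyx-30 ones from the output above, and the sizes and sleep are placeholders rather than the script's actual values):

export LST_SESSION=$$
lst new_session smoke
lst add_group s 10.2.4.95@tcp 10.2.4.96@tcp    # server group
lst add_group c 10.2.4.93@tcp 10.2.4.94@tcp    # client group
lst add_batch b
# lnet-selftest.sh issues add_test many times with varying ping/brw parameters;
# the failure only shows up when the batch carries the full set of tests
lst add_test --batch b --from c --to s ping
lst add_test --batch b --from c --to s brw write size=1M
lst add_test --batch b --from c --to s brw read size=1M
lst run b
sleep 60
lst stop b
lst show_error c s
lst end_session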
I'm investigating whether the number of interfaces accounts for the difference in behavior; so far it appears to be the most significant difference between the two clusters. Software-wise, both clusters have the same software installed.
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37450
Subject: LU-10073 tests: skip test smoke for PPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c28c9cabd71b8b0d4e45e909acfd4c797176ed59