[LU-10073] lnet-selftest test_smoke: lst Error found Created: 04/Oct/17  Updated: 20/Jan/24  Resolved: 31/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: James Casper Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: LTS12, arm, ppc, rhel8, ubuntu
Environment:

trevis, full, x86_64 servers, ppc clients
servers: el7.4, ldiskfs, branch master, v2.10.53.1, b3642
clients: el7.4, branch master, v2.10.53.1, b3642


Attachments: File perf-kernel-121.svg     File perf-kernel-122.svg     File perf-kernel-123.svg     File perf-kernel-124.svg     File perf-kernel-vm1.svg     File perf-kernel-vm2.svg     File perf-kernel-vm3.svg     File perf-kernel-vm4.svg    
Issue Links:
Related
is related to LU-10157 LNET_MAX_IOV hard coded to 256 Resolved
is related to LU-11389 lnet-selftest test smoke fails with ‘... Resolved
is related to LU-11878 sanity test 103b: OOM because of too ... Resolved
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
is related to LU-13063 sanity test 411 times out for RHEL8.1 Open
is related to LU-12269 Support RHEL 8.0 Resolved
is related to LU-15781 Ubuntu 22.04 LTS release support Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.whamcloud.com/test_sets/87032fec-9d50-11e7-b778-5254006e85c2

Seen previously in 2.9 testing (LU-6622).

From test_log:

Batch is stopped
12345-10.9.0.84@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
12345-10.9.0.85@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
c:
Total 2 error nodes in c
12345-10.9.5.24@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
12345-10.9.5.25@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
s:
Total 2 error nodes in s
session is ended
Total 2 error nodes in c
Total 2 error nodes in s

and

Started clients trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4: 
CMD: trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4 mount | grep /mnt/lustre' '
10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
CMD: trevis-77vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4 
trevis-77vm4: h2tcp: deprecated, use h2nettype instead
trevis-77vm4: trevis-77vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
 lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5289:error()
  = /usr/lib64/lustre/tests/lnet-selftest.sh:153:check_lst_err()
  = /usr/lib64/lustre/tests/lnet-selftest.sh:179:test_smoke()
  = /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
  = /usr/lib64/lustre/tests/lnet-selftest.sh:182:main()


 Comments   
Comment by Andreas Dilger [ 27/Nov/18 ]

Still seeing these failures on a regular basis - 50 failures in the past week, which is about 23% of all runs, increasing to 45% of ARM runs:
https://testing.whamcloud.com/sub_tests/021c4e14-f01d-11e8-bfe1-52540065bddc
https://testing.whamcloud.com/sub_tests/0512c886-f1c7-11e8-815b-52540065bddc
https://testing.whamcloud.com/sub_tests/f7e4f2a6-edac-11e8-815b-52540065bddc
:
:

Comment by Sonia Sharma (Inactive) [ 27/Nov/18 ]

I would say these current failures are different from what is mentioned in the description of this ticket. The description shows ping errors, but all of the test failure links above show brw errors.
Here is the ticket that James opened for similar brw errors that were seen:
LU-11389

Comment by Artem Blagodarenko (Inactive) [ 12/Dec/18 ]

Happened again https://testing.whamcloud.com/test_sets/e1d460c4-f7e7-11e8-b67f-52540065bddc

Comment by Sonia Sharma (Inactive) [ 13/Dec/18 ]

Again, this looks more related to LU-11389 than to the ping errors reported in this ticket's description.

But in this particular case, I also see socklnd error messages in the logs, like the ones below, which I did not see in the other reported cases:

LNetError: 19630:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.8.62@tcp[2], ip 10.9.8.62:1021, with error, wanted: 811328, left: 811328, last alive is 5 secs ago

LNet: 2475:0:(rpc.c:1250:srpc_send_rpc()) Remote error 74 at 12345-10.9.8.62@tcp, unlink bulk buffer in case peer didn't initiate bulk transfer

And these:

00000800:00020000:0.0:1543943185.853200:0:19630:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.8.62@tcp[2], ip 10.9.8.62:1021, with error, wanted: 811328, left: 811328, last alive is 5 secs ago
00000400:00000100:0.0:1543943185.858360:0:19630:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -5 type 2, RPC errors 1
00000400:00000100:0.0:1543943185.865739:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fb9
00000400:00000100:0.0:1543943185.866100:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fbd
00000400:00000100:0.0:1543943185.866322:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fc1
00000400:00000100:0.0:1543943185.866328:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fc5
00000400:00000100:0.0:1543943185.866333:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fc9
00000800:00000100:0.0:1543943185.866336:0:19632:0:(socklnd_cb.c:972:ksocknal_launch_packet()) No usable routes to 12345-10.9.8.62@tcp

Investigating these errors.

Comment by Andreas Dilger [ 20/Dec/18 ]

Still seeing lnet-selftest failures on master after patch https://review.whamcloud.com/33231 "LU-11389 lnet: increase lnet transaction timeout" landed.

This particular failure is with an Ubuntu 18.04 client for a regular review-ldiskfs test run:
https://testing.whamcloud.com/test_sets/9e8fb032-040f-11e9-b970-52540065bddc

Since bug LU-11389 has been closed for 2.12, and autotest is flagging this ticket for all of the current test failures, let's just use this ticket for the BRW errors currently being seen and reported here.

The following messages are in the MDS log:

[ 7418.687141] LNetError: 14039:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.230@tcp[2], ip 10.9.4.230:1021, with error, wanted: 1013616, left: 1013616, last alive is 0 secs ago
[ 7418.688754] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.9.4.230@tcp: -5
[ 7418.688761] LustreError: 5373:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer to 12345-10.9.4.230@tcp has failed: -5
[ 7469.603687] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.230@tcp: -4
[ 7469.605632] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) Skipped 2 previous similar messages
[ 7469.607239] LustreError: 5373:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.230@tcp has failed: -5
[ 7469.609038] LustreError: 5373:0:(brw_test.c:389:brw_server_rpc_done()) Skipped 2 previous similar messages
[ 7481.461431] LNet: 5389:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.230@tcp, timeout 64.
[ 7481.463637] LustreError: 5373:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.230@tcp failed with -110
[ 7595.562071] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.9.4.231@tcp: -5
[ 7595.562075] LNetError: 14039:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.231@tcp[2], ip 10.9.4.231:1021, with error, wanted: 487400, left: 487400, last alive is 0 secs ago

Comment by Andreas Dilger [ 20/Dec/18 ]

I'm actually seeing both the ping and BRW timeout errors in the same run, so it doesn't seem like they are different problems in the end. Some logs from https://testing.whamcloud.com/sub_tests/4acfe152-0102-11e9-b970-52540065bddc :

[ 6714.891483] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.091612] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.292141] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.492122] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.692075] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.892115] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.092499] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.291542] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.491536] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.691533] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6780.283118] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.35@tcp, timeout 64.
[ 6780.286872] LustreError: 27910:0:(ping_test.c:132:ping_client_done_rpc()) Unable to ping 12345-10.9.5.35@tcp (0): -110
[ 7260.303131] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.5.35@tcp, timeout 64.
[ 7260.303351] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Skipped 29 previous similar messages
[ 7260.303813] LustreError: 27910:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.5.35@tcp failed with -110
[ 7268.313051] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.5.35@tcp, timeout 64.
[ 7268.313209] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Skipped 80 previous similar messages
[ 7268.313397] LustreError: 27910:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.5.35@tcp failed with -110
[ 7268.313470] LustreError: 27910:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 80 previous similar messages
[ 8519.063017] LNetError: 20589:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[ 8520.507784] Lustre: 27908:0:(ping_test.c:79:ping_client_fini()) 30 pings have failed.

Over the past 4 weeks we have seen about a 10% failure rate in lnet-selftest overall for x86/RHEL 7.5, jumping up to 20% for ARM (4.14 kernel) and 100% for Ubuntu 18.04 (4.15 kernel) and PPC kernels. I think this issue needs more attention.

Comment by Sonia Sharma (Inactive) [ 21/Dec/18 ]

I investigated the lnet-selftest failure seen with this patch (LU-7159): https://testing.whamcloud.com/test_sets/e1d460c4-f7e7-11e8-b67f-52540065bddc. I wasn't able to reproduce it using the same build on two ARM nodes by just running selftest between them, so I tried changing the order of the tests in Gerrit by listing these test parameters:

Test-Parameters: clientarch=aarch64 testlist=node-provisioning-1,lustre-initialization-1,lnet-selftest,sanity

With the order of lnet-selftest and sanity changed, i.e. running lnet-selftest before sanity, lnet-selftest passes but sanity test 103b crashes.
So I think the issue is not with lnet-selftest itself in this particular failure case.

Still investigating selftest failure on Ubuntu.

Comment by Andreas Dilger [ 06/Feb/19 ]

The test_103b failure is a known issue - LU-11878. I'd be reasonably content if we changed the test order to run lnet-selftest before sanity, since that also seems more logical.

Comment by Andreas Dilger [ 06/Feb/19 ]

I pushed patch https://review.whamcloud.com/34201 to change the order of the tests for review-ldiskfs and review-ldiskfs-arm so that lnet-selftest runs before sanity. Not ideal, but it may allow ARM testing to be more reliable.

Comment by James A Simmons [ 26/Feb/19 ]

I see this in Ubuntu 18 testing as well.

Comment by Andreas Dilger [ 04/Mar/19 ]

Sonia, lnet-selftest is still failing even after 34201 landed, putting it at the start of the test sessions.

Comment by Sonia Sharma (Inactive) [ 04/Mar/19 ]

Are the failures on ARM as well? Any recent test failure I can look at?

Comment by Andreas Dilger [ 04/Mar/19 ]

It seems more prevalent on ARM, but it also happens on x86 clients/servers; please see:

https://testing.whamcloud.com/sub_tests/query?utf8=%E2%9C%93&warn%5Bnotice%5D=&test_set_script_id=c24874b2-4a56-11e0-a7f6-52540025f9af&sub_test_script_id=c252c2c8-4a56-11e0-a7f6-52540025f9af&status%5B%5D=FAIL&horizon=518400&commit=Update+results&buggable_class=SubTest

Comment by Sonia Sharma (Inactive) [ 06/Mar/19 ]

I checked the logs of the various failure cases in that link. There are two kinds of logs, depending on whether they show brw errors or ping errors. We also have cases where both ping errors and brw errors are present.

One example of the ping error case:
https://testing.whamcloud.com/sub_tests/92ecef78-3e26-11e9-9720-52540065bddc

The ping error logs look like the following. These logs just suggest that pings are timing out.

00000400:00000100:1.0:1551576978.818367:2480:6658:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.9.5.36@tcp portal 52 match 11025656312733141039 offset 0 length 160: 4
00000400:00000100:1.0:1551576978.818373:2480:6658:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 
00000400:00000400:0.0:1551577048.385694:1680:16992:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.35@tcp, timeout 64.
00000400:00000400:0.0:1551577048.385846:1680:16992:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.36@tcp, timeout 64.
00000400:00000400:0.0:1551577048.385848:1680:16992:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.35@tcp, timeout 64
00000400:00000100:0.0:1551577048.385892:2032:16988:0:(rpc.c:1146:srpc_client_rpc_done()) Client RPC done: service 12, peer 12345-10.9.5.35@tcp, status SWI_STATE_REQUEST_SENT:1:-4
00000001:00020000:0.0:1551577048.385895:2432:16988:0:(ping_test.c:132:ping_client_done_rpc()) Unable to ping 12345-10.9.5.35@tcp (0): -110
00000400:00000100:0.0:1551577048.385970:2032:16988:0:(rpc.c:1146:srpc_client_rpc_done()) Client RPC done: service 12, peer 12345-10.9.5.36@tcp, status SWI_STATE_REQUEST_SENT:1:-4
00000001:00020000:0.0:1551577048.385972:2432:16988:0:(ping_test.c:132:ping_client_done_rpc()) Unable to ping 12345-10.9.5.36@tcp (1): -110

I tried to reproduce locally again but could not. 

At this point, I would like to check whether there is any issue with the way Maloo is setting up lnet-selftest, or with the network connectivity between the nodes the selftest is running on; for example, whether the IP addresses of the nodes are picked up correctly.

I think it would be best to first rule out any issue with the way Maloo is running this test.
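
To help rule that out, a minimal manual session along the following lines exercises just ping and brw between one client and one server, independently of the harness (a sketch; the NIDs are placeholders and the lnet_selftest module must be loaded on every node):

# run from a console node; NIDs below are placeholders for the actual client/server NIDs
export LST_SESSION=$$
lst new_session read_write
lst add_group clients 10.9.5.35@tcp
lst add_group servers 10.9.5.25@tcp
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers ping
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
sleep 30
lst show_error clients servers
lst stop bulk_rw
lst end_session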

Comment by Minh Diep [ 22/Mar/19 ]

+1 on b2_12 but not on ARM
https://testing.whamcloud.com/test_sets/4ad11d4c-4c6b-11e9-a256-52540065bddc

Comment by Andreas Dilger [ 28/Mar/19 ]

Have hit this five times in a row on non-ARM review-ldiskfs when testing Ubuntu 18.04 on the client for patch https://review.whamcloud.com/34456, which only changes the Debian build files:
https://testing.whamcloud.com/test_sessions/34b7097c-f756-48b9-8e32-26d9ff4122a6
https://testing.whamcloud.com/test_sessions/270c5ece-da24-4307-9001-6989c1a8c5c5
https://testing.whamcloud.com/test_sessions/6706e3d9-698a-4346-98cd-3b4a6c45d6e0
https://testing.whamcloud.com/test_sessions/c3e2410d-f9c5-4aba-996e-371dcb810afb
https://testing.whamcloud.com/test_sessions/2151adcf-c785-46c4-973f-683f1f7561c7

It looks like Minh's recent failure is Ubuntu 18.04 as well, so that may be a better way to debug this issue.

Comment by Gerrit Updater [ 29/Mar/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34543
Subject: LU-10073 tests: stop running smoke test for ARM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c890e0700c671258c6c3105b1eaa05c331ebca3e

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34543/
Subject: LU-10073 tests: stop running smoke test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ddf3c0416790f74e10abc39543843e0de49b176e

Comment by Peter Jones [ 08/Apr/19 ]

Landed for 2.13

Comment by James Nunez (Inactive) [ 08/Apr/19 ]

The patch that landed adds test smoke to the ALWAYS_EXCEPT list.

We do not have a solution for these failures. Thus, I'm reopening the ticket.
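
For context, the exclusion goes through the test framework's ALWAYS_EXCEPT mechanism in lnet-selftest.sh, roughly along these lines (an illustrative sketch, not the literal patch):

# lnet-selftest.sh (sketch): add "smoke" to the exception list so the framework skips it
ALWAYS_EXCEPT="$LNET_SELFTEST_EXCEPT smoke"   # LU-10073: skip smoke until the failures are understood
build_test_filter                             # test-framework.sh helper that turns ALWAYS_EXCEPT into skipped tests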

Comment by Andreas Dilger [ 06/May/19 ]

The lnet-selftest has failed 5x in a row with patch https://review.whamcloud.com/34800 "LU-11960 build: Add missing libssl-dev DEB package" when running an Ubuntu 18.04 client, so it seems like Ubuntu is a 100% reproducer for this issue, and should allow debugging what is going wrong.

Comment by Gerrit Updater [ 06/May/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34814
Subject: LU-10073 tests: stop running smoke test
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 8e3aa2e7ca5e35a68100aa564f901f49868b8a84

Comment by Gerrit Updater [ 10/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34814/
Subject: LU-10073 tests: stop running smoke test
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 0411b8139e18c243547535b5733b6408461ec72a

Comment by Minh Diep [ 08/Jul/19 ]

also hit it on el8 client https://testing.whamcloud.com/test_sessions/39b4428a-bf50-4283-b6c2-bcc389e99b19

Comment by Joseph Gmitter (Inactive) [ 09/Jul/19 ]

Hi Amir,

Can you look into this failure?

Thanks.
Joe

Comment by Peter Jones [ 12/Jul/19 ]

simmonsja is this issue something on your radar?

Comment by James A Simmons [ 12/Jul/19 ]

Does this happen on Power with RHEL 3.10 kernels?

Comment by Peter Jones [ 12/Jul/19 ]

I think that the PPC tests that hit it were running the same kernel as ARM, 4.14.

Comment by Gerrit Updater [ 17/Jul/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35540
Subject: LU-10073 lnet: increase transaction timeout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07b40bdd960d579a70c933f4d17faec844b211a4

Comment by Amir Shehata (Inactive) [ 18/Jul/19 ]

I was able to reproduce this on onyx-121vm[1-4]. I then ran lnetctl set transaction_timeout 60 once lnet_selftest was loaded, and when I re-ran lnet-selftest.sh the problem did not occur. I pushed the same change for testing with autotest and the problem still happened. Once the test failed I took over the cluster, onyx-30vm[1-4], and ran lnet-selftest.sh manually (the same way I did on onyx-121vm[1-4]) and it failed. I made sure that both clusters were set up with the same kernel and the same Lustre build.

onyx-121vm[1-2] 
----------------
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64

onyx-121vm[3-4]
----------------
kernel-tools-libs-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
python-perf-3.10.0-957.21.3.el7_lustre.x86_64
kernel-tools-3.10.0-957.21.3.el7_lustre.x86_64
kernel-devel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.21.3.el7_lustre.x86_6

 

onyx-30vm[1-2]
---------------
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64


onyx-30vm[3-4]
--------------
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.1.3.el7.x86_64
kernel-tools-libs-3.10.0-957.1.3.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
kernel-tools-3.10.0-957.1.3.el7.x86_64
kernel-devel-3.10.0-957.1.3.el7.x86_64

The cluster set up by autotest has some differences in installed RPMs, but nothing I would expect to cause this problem.

To make sure it's not the transaction_timeout (health code) that is causing this, I turned the transaction timeout up to 9000 seconds, turned off retries, turned off health sensitivity, and reran the test. It still failed on the autotest-provisioned cluster.
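
For reference, the health machinery was effectively taken out of the picture with settings along these lines (a sketch of the tunables described above):

# neutralize the LNet health code for the test run
lnetctl set transaction_timeout 9000   # effectively no LNet-level transaction timeout
lnetctl set retry_count 0              # no resends from the health code
lnetctl set health_sensitivity 0       # do not penalize interface health on errors
lnetctl global show                    # confirm the values took effect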

Results from onyx-121vm[1-4]:

 1 batch in stopping
Batch is stopped
c:
Total 0 error nodes in c
s:
Total 0 error nodes in s
session is ended
Total 0 error nodes in c
Total 0 error nodes in s
onyx-121vm4: onyx-121vm4.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm3: onyx-121vm3.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm2: onyx-121vm2.onyx.whamcloud.com: executing lst_cleanup

Results from onyx-30vm[1-4]:

 killing 14697 ...
2 batch in stopping
Batch is stopped
12345-10.2.4.93@tcp: [Session 206 brw errors, 15 ping errors] [RPC: 117 errors, 113 dropped, 216 expired]
12345-10.2.4.94@tcp: [Session 230 brw errors, 30 ping errors] [RPC: 62 errors, 144 dropped, 253 expired]
c:
Total 2 error nodes in c
12345-10.2.4.95@tcp: [Session 232 brw errors, 15 ping errors] [RPC: 95 errors, 151 dropped, 247 expired]
12345-10.2.4.96@tcp: [Session 239 brw errors, 30 ping errors] [RPC: 72 errors, 101 dropped, 266 expired]
s:
Total 2 error nodes in s
session is ended
Total 2 error nodes in c
Total 2 error nodes in s
Starting client onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2:  -o user_xattr,flock onyx-30vm4@tcp:/lustre /mnt/lustre
Started clients onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2: 
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
onyx-30vm1: onyx-30vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
onyx-30vm2: onyx-30vm2.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
 lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6030:error()
  = /usr/lib64/lustre/tests/lnet-selftest.sh:161:check_lst_err()
  = /usr/lib64/lustre/tests/lnet-selftest.sh:187:test_smoke()
  = /usr/lib64/lustre/tests/test-framework.sh:6332:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:6371:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:6217:run_test()
  = /usr/lib64/lustre/tests/lnet-selftest.sh:190:main()

When the test fails, I see two issues:

1) BRW timeouts. Basically lots of

(socklnd_cb.c:2593:ksocknal_check_peer_timeouts()) Total 2 stale ZC_REQs for peer_ni 10.2.4.95@tcp detected; the oldest(0000000025bf6991) timed out 1 secs ago, resid: 0, wmem: 977752

(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.2.4.95@tcp[2], ip 10.2.4.95:7988, with error, wanted: 523368, left: 523368, last alive is 0 secs ago

2) lnet_selftest outputs a message stating that it's unable to get stats from 1 or 2 of the nodes.

I tried to check the differences in HW between onyx-30 and onyx-121.

onyx-121
---------
 [root@onyx-121 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Stepping:              4
CPU MHz:               3000.058
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
BogoMIPS:              5387.10
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11
NUMA node1 CPU(s):     12-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

[root@onyx-121 ~]# lspci | grep -i eth
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
 [root@onyx-30 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Model name:            Genuine Intel(R) CPU  @ 2.60GHz
Stepping:              5
CPU MHz:               2999.902
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              5187.48
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts

[root@onyx-30 ~]# lspci | grep -i eth
04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

It appears that onyx-30 has only one Ethernet device. trevis-41, another cluster the test failed on, has only two Ethernet devices. onyx-121 has four.

It's also important to note that lnet-selftest.sh adds 71 tests. Running just one test doesn't cause a problem, so it appears that batching up so many tests in one go causes the problem to occur.
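
Schematically, the smoke test queues many concurrent brw and ping tests into a single batch before starting it, along these lines (a sketch, not the literal script; the group names, sizes, and concurrency are illustrative):

# many add_test calls go into one batch, and everything runs at once on "lst run"
lst add_batch b
for size in 4k 128k 1M; do
    lst add_test --batch b --concurrency 8 --from c_grp --to s_grp brw read size=$size
    lst add_test --batch b --concurrency 8 --from c_grp --to s_grp brw write size=$size
done
lst add_test --batch b --concurrency 8 --from c_grp --to s_grp ping
lst run b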

I'm investigating whether the number of interfaces accounts for the difference in behavior. So far it appears to be the most significant difference between the two clusters. Software-wise, both clusters have the same set of software installed.

Comment by Amir Shehata (Inactive) [ 19/Jul/19 ]

After moving onyx-30vm[1-2] from RHEL8 clients to RHEL7.6 clients the issue is not reproducible any longer.

Will start backing out socklnd patches to see if any of them is the problem.

 

Comment by Andreas Dilger [ 20/Jul/19 ]

After moving onyx-30vm[1-2] from RHEL8 clients to RHEL7.6 clients the issue is not reproducible any longer.

This is also true of ARM and Ubuntu clients - they cause constant failures, but x86 clients do not. My thinking is that this relates to the specific kernel version used, and it just happens that ARM, Ubuntu, and RHEL8 are newer kernels that somehow changed the kernel socket interfaces.

Comment by Amir Shehata (Inactive) [ 25/Jul/19 ]

I believe so too.

I tried RHEL8 across all the VMs and the problem persists.

I then increased the lnet_selftest RPC timeout to 256 seconds, and the test passed, i.e. no RPC errors or drops.

I measured the time it takes to complete RPCs from the lnet_selftest perspective and noticed the following behavior:

  1. with one test in the batch, RPCs take a max of 1 second to complete
  2. with two tests in the batch I see RPCs taking close to 10 seconds
  3. As I increase the number of tests in the batch, RPCs take longer and longer to complete. With 40+ tests in the batch (which is what lnet-selftest.sh does) I see RPCs taking up to 130 seconds to complete.

I then went back to the previous setup with 2 RHEL8 clients and 2 RHEL7.6 servers, captured performance data using perf, and generated flame graphs:

perf-kernel-vm1.svg
perf-kernel-vm2.svg
perf-kernel-vm3.svg
perf-kernel-vm4.svg

There appears to be a key difference between the flame graphs captured on the RHEL8 VMs and the RHEL7.6 VMs: ksoftirqd/1 appears significantly less often on the RHEL7.6 VMs (~43 samples) than on the RHEL8 VMs (~7000 samples).
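
For reference, kernel flame graphs like the attached ones can be produced with a capture sequence along these lines (a sketch; it assumes Brendan Gregg's FlameGraph scripts are available, and the exact options used here may have differed):

# sample kernel stacks on all CPUs for 60 seconds, then fold them into a flame graph
perf record -F 99 -a -g -- sleep 60
perf script > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > perf-kernel-vm1.svg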

My next steps are:

  1. Attempt to reproduce this on physical nodes
  2. Investigate why interrupt handling on RHEL 8 is happening so much more frequently. Is it only on VMs, or on physical machines as well?

 

Comment by Amir Shehata (Inactive) [ 26/Jul/19 ]

I set up two nodes with Ubuntu 18 (4.15.0-45-generic) and two nodes with CentOS 7.6 (3.10.0-957.21.3.el7_lustre.x86_64).

We couldn't install RHEL8 on the physical nodes, so the setup is not 100% the same as the VM one. Once the RHEL8 installation is resolved I'll attempt the test again.

Running the exact same script that failed on the VMs passed on the physical setup.

I collected the flame graphs below, and there are no significant differences between Ubuntu 18 and CentOS 7.6 with regard to softirq handling.

Ubuntu 18 client: perf-kernel-121.svg
Ubuntu 18 client: perf-kernel-122.svg
RHEL 7.6 server: perf-kernel-123.svg
RHEL 7.6 server: perf-kernel-124.svg

This issue appears to be localized to VM setups. As far as I know there have not been reports of this test failing on a physical setup.

The VMs are started on a RHEL7.5 host. Could there be an interaction issue between the host and the VMs? One thing to try is to deploy the VMs on a RHEL8 host and run the test there. Currently there is a problem installing RHEL8 on physical nodes; I will try the test once that issue is resolved.

Comment by Jian Yu [ 27/Aug/19 ]

The failure occurred on a RHEL 8.0 VM client+server against the master branch:
https://testing.whamcloud.com/test_sets/380744c0-c709-11e9-a25b-52540065bddc

Comment by Chris Horn [ 18/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sessions/26e84ad7-8e0a-4307-a1f8-1c5281550588

Comment by Jian Yu [ 21/Jan/20 ]

+1 on RHEL 8.1 client testing: https://testing.whamcloud.com/test_sets/e47aaeaa-3ba7-11ea-bb75-52540065bddc

Comment by Jian Yu [ 21/Jan/20 ]

The failure also occurred on RHEL 7.7 client + server testing session: https://testing.whamcloud.com/test_sets/10d17588-3bb7-11ea-adca-52540065bddc

Comment by Gerrit Updater [ 05/Feb/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37450
Subject: LU-10073 tests: skip test smoke for PPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c28c9cabd71b8b0d4e45e909acfd4c797176ed59

Comment by Gerrit Updater [ 20/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37450/
Subject: LU-10073 tests: skip test smoke for PPC
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5ab2220687f4ef3a1d5b435f1e34f808723a9bf5

Comment by Gerrit Updater [ 08/Jun/20 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/38857
Subject: LU-10073 tests: re-enable lnet selftest smoke test for PPC + ARM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ff47b8bfb0d507c8a75338b7ddfde4eef99d5bb6

Comment by Peter Jones [ 18/Oct/20 ]

James

It looks like this is an area that you are still looking into

Peter

Comment by Gerrit Updater [ 10/Jan/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46037
Subject: LU-10073 tests: re-enable lnet selftest smoke test 4.4+ kernels
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 29bb14ccb62a1bccf066d558d35658fb57ffca11

Comment by James A Simmons [ 11/Jan/22 ]

The good news is that in my testing of LNet selftest on Ubuntu I didn't see any issues with newer kernels. Enabling it for RHEL8 in Maloo also passed. Most likely some recently resolved LNet bug fixed this issue. Due to the lack of ARM/PPC systems available for testing, I can't prove that it is working on those platforms.

Comment by Xinliang Liu [ 12/Jan/22 ]

Let me check again, James. At the time of the last patch update I did not encounter this issue. I will run this test suite on ARM again on the master branch.

Comment by Xinliang Liu [ 12/Jan/22 ]

Tested with the patch on arm64. It passes.

$ ./auster  -vr lnet-selftest
...Batch is stopped
c:
Total 0 error nodes in c
s:
Total 0 error nodes in s
session is ended
Total 0 error nodes in c
Total 0 error nodes in s
lustre-aio: xxxx clients: 'lustre-aio lustre-aio mds-01'
lustre-aio: 1xxxx clients: 'lustre-aio
lustre-aio: mds-01'
lustre-aio: 2xxxx clients: 2 'lustre-aio,mds-01'
lustre-aio: lustre-aio: executing lst_cleanup
mds-01: liuxl-mds-test-01.novalocal: executing lst_cleanup
PASS smoke (322s) 

Comment by James A Simmons [ 12/Jan/22 ]

Once PowerPC comes back I will test on that platform. I suspect this issue is resolved but we will have to wait and see.

Comment by Gerrit Updater [ 18/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46037/
Subject: LU-10073 tests: re-enable lnet selftest smoke test 4.4+ kernels
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 24424791ac7233e393a32be189fe77102857653b

Comment by James A Simmons [ 18/Jan/22 ]

We still have Power8 to test on to verify it works.

Comment by Gerrit Updater [ 18/Jan/22 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46169
Subject: LU-10073 tests: re-enable lnet selftest smoke test for PPC + ARM
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 187eef8fd9cc5dce0014afe7db45c60caf7ee605

Comment by Gerrit Updater [ 30/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/38857/
Subject: LU-10073 tests: re-enable lnet selftest smoke test for PPC + ARM
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c25a4e1fd92523247a0fa5a5c2809321765df263
