[LU-10073] lnet-selftest test_smoke: lst Error found Created: 04/Oct/17 Updated: 20/Jan/24 Resolved: 31/May/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Casper | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12, arm, ppc, rhel8, ubuntu | ||
| Environment: |
trevis, full, x86_64 servers, ppc clients |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
https://testing.whamcloud.com/test_sets/87032fec-9d50-11e7-b778-5254006e85c2

Seen previously in 2.9 testing (

From test_log:

Batch is stopped
12345-10.9.0.84@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
12345-10.9.0.85@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
c: Total 2 error nodes in c
12345-10.9.5.24@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
12345-10.9.5.25@tcp: [Session 0 brw errors, 30 ping errors] [RPC: 0 errors, 0 dropped, 30 expired]
s: Total 2 error nodes in s
session is ended
Total 2 error nodes in c
Total 2 error nodes in s

and

Started clients trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4:
CMD: trevis-77vm3.trevis.hpdd.intel.com,trevis-77vm4 mount | grep /mnt/lustre' '
10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.9.5.25@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
CMD: trevis-77vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/openmpi/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all\" 4
trevis-77vm4: h2tcp: deprecated, use h2nettype instead
trevis-77vm4: trevis-77vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5289:error()
= /usr/lib64/lustre/tests/lnet-selftest.sh:153:check_lst_err()
= /usr/lib64/lustre/tests/lnet-selftest.sh:179:test_smoke()
= /usr/lib64/lustre/tests/test-framework.sh:5565:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5604:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5451:run_test()
= /usr/lib64/lustre/tests/lnet-selftest.sh:182:main() |
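For reference, the failure above is raised by check_lst_err() in lnet-selftest.sh (see the trace dump). A minimal sketch of the kind of check it performs is shown below; it assumes the script simply scans the captured lst session output for non-zero error counts, and the function and variable names here are illustrative, not the actual test-framework code.

# Hypothetical sketch of an lst error check (names are illustrative):
check_lst_err_sketch() {
        local log=$1    # file capturing the lst session/stat summary output

        # Lines such as "c: Total 2 error nodes in c" indicate failed nodes.
        if grep -qE "Total [1-9][0-9]* error nodes" "$log"; then
                error "lst Error found"   # test-framework.sh error(), as named in the trace dump
        fi
}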
| Comments |
| Comment by Andreas Dilger [ 27/Nov/18 ] |
|
Still seeing these failures on a regular basis - 50 failures in the past week, which is about 23% of all runs, increasing to 45% of ARM runs: |
| Comment by Sonia Sharma (Inactive) [ 27/Nov/18 ] |
|
I would say these current failures are different from what is described in this ticket. The description shows ping errors, but in all of the test failure links above the errors are brw errors. |
| Comment by Artem Blagodarenko (Inactive) [ 12/Dec/18 ] |
|
Happened again https://testing.whamcloud.com/test_sets/e1d460c4-f7e7-11e8-b67f-52540065bddc |
| Comment by Sonia Sharma (Inactive) [ 13/Dec/18 ] |
|
This again looks more related to  but in this particular case I also see socklnd error messages in the logs, like the ones below, which I did not see in the other reported cases:
LNetError: 19630:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.8.62@tcp[2], ip 10.9.8.62:1021, with error, wanted: 811328, left: 811328, last alive is 5 secs ago
LNet: 2475:0:(rpc.c:1250:srpc_send_rpc()) Remote error 74 at 12345-10.9.8.62@tcp, unlink bulk buffer in case peer didn't initiate bulk transfer
And these:

00000800:00020000:0.0:1543943185.853200:0:19630:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.8.62@tcp[2], ip 10.9.8.62:1021, with error, wanted: 811328, left: 811328, last alive is 5 secs ago
00000400:00000100:0.0:1543943185.858360:0:19630:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -5 type 2, RPC errors 1
00000400:00000100:0.0:1543943185.865739:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fb9
00000400:00000100:0.0:1543943185.866100:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fbd
00000400:00000100:0.0:1543943185.866322:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fc1
00000400:00000100:0.0:1543943185.866328:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fc5
00000400:00000100:0.0:1543943185.866333:0:19632:0:(lib-move.c:3856:lnet_parse_reply()) 10.9.4.72@tcp: Dropping REPLY from 12345-10.9.8.62@tcp for invalid MD 0x156d2737da0adbc9.0x483fc9
00000800:00000100:0.0:1543943185.866336:0:19632:0:(socklnd_cb.c:972:ksocknal_launch_packet()) No usable routes to 12345-10.9.8.62@tcp

Investigating these errors. |
| Comment by Andreas Dilger [ 20/Dec/18 ] |
|
Still seeing lnet-selftest failures on master after patch https://review.whamcloud.com/33231. This particular failure is with an Ubuntu 18.04 client for a regular review-ldiskfs test run. Since bug

The following messages are in the MDS log:

[ 7418.687141] LNetError: 14039:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.230@tcp[2], ip 10.9.4.230:1021, with error, wanted: 1013616, left: 1013616, last alive is 0 secs ago
[ 7418.688754] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.9.4.230@tcp: -5
[ 7418.688761] LustreError: 5373:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer to 12345-10.9.4.230@tcp has failed: -5
[ 7469.603687] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.230@tcp: -4
[ 7469.605632] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) Skipped 2 previous similar messages
[ 7469.607239] LustreError: 5373:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.230@tcp has failed: -5
[ 7469.609038] LustreError: 5373:0:(brw_test.c:389:brw_server_rpc_done()) Skipped 2 previous similar messages
[ 7481.461431] LNet: 5389:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.230@tcp, timeout 64.
[ 7481.463637] LustreError: 5373:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.230@tcp failed with -110
[ 7595.562071] LustreError: 5373:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.9.4.231@tcp: -5
[ 7595.562075] LNetError: 14039:0:(socklnd.c:1679:ksocknal_destroy_conn()) Completing partial receive from 12345-10.9.4.231@tcp[2], ip 10.9.4.231:1021, with error, wanted: 487400, left: 487400, last alive is 0 secs ago |
| Comment by Andreas Dilger [ 20/Dec/18 ] |
|
I'm actually seeing both the ping and BRW timeout errors in the same run, so it doesn't seem like they are different problems in the end. Some logs from https://testing.whamcloud.com/sub_tests/4acfe152-0102-11e9-b970-52540065bddc :

[ 6714.891483] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.091612] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.292141] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.492122] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.692075] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6715.892115] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.092499] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.291542] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.491536] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6716.691533] LNet: 27908:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[ 6780.283118] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.35@tcp, timeout 64.
[ 6780.286872] LustreError: 27910:0:(ping_test.c:132:ping_client_done_rpc()) Unable to ping 12345-10.9.5.35@tcp (0): -110
[ 7260.303131] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.5.35@tcp, timeout 64.
[ 7260.303351] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Skipped 29 previous similar messages
[ 7260.303813] LustreError: 27910:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.5.35@tcp failed with -110
[ 7268.313051] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.5.35@tcp, timeout 64.
[ 7268.313209] LNet: 27912:0:(rpc.c:1072:srpc_client_rpc_expired()) Skipped 80 previous similar messages
[ 7268.313397] LustreError: 27910:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.5.35@tcp failed with -110
[ 7268.313470] LustreError: 27910:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 80 previous similar messages
[ 8519.063017] LNetError: 20589:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[ 8520.507784] Lustre: 27908:0:(ping_test.c:79:ping_client_fini()) 30 pings have failed.

Over the past 4 weeks we are seeing about 10% failures in lnet-selftest overall and for x86/RHEL 7.5 in general, jumping up to 20% for ARM (4.14 kernel) and 100% failures for Ubuntu 18.04 (4.15 kernel) and PPC kernels. I think this issue needs more attention. |
| Comment by Sonia Sharma (Inactive) [ 21/Dec/18 ] |
|
I investigated the lnet_selftest failure from https://testing.whamcloud.com/test_sets/e1d460c4-f7e7-11e8-b67f-52540065bddc with a test patch using:

Test-Parameters: clientarch=aarch64 testlist=node-provisioning-1,lustre-initialization-1,lnet-selftest,sanity

With the change in the order of execution of lnet_selftest and sanity, i.e. running lnet_selftest before sanity, lnet_selftest passes but test 103b in sanity crashes. Still investigating the selftest failure on Ubuntu. |
| Comment by Andreas Dilger [ 06/Feb/19 ] |
|
The test_103b failure is a known issue - |
| Comment by Andreas Dilger [ 06/Feb/19 ] |
|
I pushed patch https://review.whamcloud.com/34201 to change the order of the tests for review-ldiskfs and review-ldiskfs-arm so that lnet-selftest runs before sanity. Not ideal, but it may allow ARM testing to be more reliable. |
| Comment by James A Simmons [ 26/Feb/19 ] |
|
I see this in Ubuntu 18 testing as well. |
| Comment by Andreas Dilger [ 04/Mar/19 ] |
|
Sonia, lnet-selftest is still failing even after patch 34201 landed, which moved it to the start of the test sessions. |
| Comment by Sonia Sharma (Inactive) [ 04/Mar/19 ] |
|
Are the failures for ARM as well? Any recent test-failure I can look at? |
| Comment by Andreas Dilger [ 04/Mar/19 ] |
|
It seems more prevalent on ARM, but it also happens with x86 clients and servers; please see: |
| Comment by Sonia Sharma (Inactive) [ 06/Mar/19 ] |
|
I checked the logs of the various failure cases in that link. There are two kinds of logs, depending on whether we see brw errors or ping errors, and we also have cases where both ping errors and brw errors are present. One example of the ping error case: the ping error logs look like the following, and just suggest that pings are timing out.

00000400:00000100:1.0:1551576978.818367:2480:6658:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from 12345-10.9.5.36@tcp portal 52 match 11025656312733141039 offset 0 length 160: 4
00000400:00000100:1.0:1551576978.818373:2480:6658:0:(lib-move.c:3753:lnet_parse_put()) Dropping PUT from
00000400:00000400:0.0:1551577048.385694:1680:16992:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.35@tcp, timeout 64.
00000400:00000400:0.0:1551577048.385846:1680:16992:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.36@tcp, timeout 64.
00000400:00000400:0.0:1551577048.385848:1680:16992:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 12, peer 12345-10.9.5.35@tcp, timeout 64
00000400:00000100:0.0:1551577048.385892:2032:16988:0:(rpc.c:1146:srpc_client_rpc_done()) Client RPC done: service 12, peer 12345-10.9.5.35@tcp, status SWI_STATE_REQUEST_SENT:1:-4
00000001:00020000:0.0:1551577048.385895:2432:16988:0:(ping_test.c:132:ping_client_done_rpc()) Unable to ping 12345-10.9.5.35@tcp (0): -110
00000400:00000100:0.0:1551577048.385970:2032:16988:0:(rpc.c:1146:srpc_client_rpc_done()) Client RPC done: service 12, peer 12345-10.9.5.36@tcp, status SWI_STATE_REQUEST_SENT:1:-4
00000001:00020000:0.0:1551577048.385972:2432:16988:0:(ping_test.c:132:ping_client_done_rpc()) Unable to ping 12345-10.9.5.36@tcp (1): -110

I tried to reproduce locally again but could not. At this point I would like to check whether there is any issue with the way maloo sets up lnet-selftest, or with the network connectivity between the nodes that the selftest is running on; for example, whether the IP addresses of the nodes are picked up correctly. I think it would be best to first rule out any issue with the way maloo is running this test. |
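A quick way to rule out the basic NID/connectivity questions raised above is to check the configured NIDs and ping the peers over LNet directly from each test node. This is a hedged sketch using standard lctl/lnetctl commands; the NIDs are the ones quoted in the logs above and are only illustrative.

# On each node, list the NIDs LNet actually configured
# (should match the addresses the test session expects, e.g. 10.9.5.35@tcp):
$ lctl list_nids

# From a client node, verify LNet-level reachability of the peers that
# reported ping/brw errors:
$ lctl ping 10.9.5.35@tcp
$ lctl ping 10.9.5.36@tcp

# With 2.12+ LNet health, peer state can also be inspected:
$ lnetctl peer show -v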
| Comment by Minh Diep [ 22/Mar/19 ] |
|
+1 on b2_12 but not on ARM |
| Comment by Andreas Dilger [ 28/Mar/19 ] |
|
Have hit this five times in a row on non-ARM review-ldiskfs when testing with an Ubuntu 18.04 client, for patch https://review.whamcloud.com/34456, which only changes the Debian build files. It looks like Minh's recent failure is on Ubuntu 18.04 as well, so that may be a better way to debug this issue. |
| Comment by Gerrit Updater [ 29/Mar/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34543 |
| Comment by Gerrit Updater [ 08/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34543/ |
| Comment by Peter Jones [ 08/Apr/19 ] |
|
Landed for 2.13 |
| Comment by James Nunez (Inactive) [ 08/Apr/19 ] |
|
The patch that landed adds the smoke test to the ALWAYS_EXCEPT list. We do not have a solution for these failures, so I'm reopening the ticket. |
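For context, Lustre test scripts skip known-bad subtests by listing them in ALWAYS_EXCEPT near the top of the script. Below is a minimal, illustrative sketch of what such an exclusion looks like in lnet-selftest.sh; the variable name and exact form are an assumption about the usual convention, not a quote of the landed patch.

# Illustrative sketch, near the top of lustre/tests/lnet-selftest.sh:
ALWAYS_EXCEPT="$LNET_SELFTEST_EXCEPT"
# LU-10073: skip the smoke test until the lst errors are understood
ALWAYS_EXCEPT+=" smoke"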
| Comment by Andreas Dilger [ 06/May/19 ] |
|
lnet-selftest has failed 5x in a row with patch https://review.whamcloud.com/34800. |
| Comment by Gerrit Updater [ 06/May/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34814 |
| Comment by Gerrit Updater [ 10/May/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34814/ |
| Comment by Minh Diep [ 08/Jul/19 ] |
|
also hit it on el8 client https://testing.whamcloud.com/test_sessions/39b4428a-bf50-4283-b6c2-bcc389e99b19 |
| Comment by Joseph Gmitter (Inactive) [ 09/Jul/19 ] |
|
Hi Amir, can you look into this failure? Thanks. |
| Comment by Peter Jones [ 12/Jul/19 ] |
|
simmonsja is this issue something on your radar? |
| Comment by James A Simmons [ 12/Jul/19 ] |
|
Does this happen on Power with RHEL 3.10 kernels? |
| Comment by Peter Jones [ 12/Jul/19 ] |
|
I think that the ppc tests that hit it were running the same kernel as Arm - 4.14 |
| Comment by Gerrit Updater [ 17/Jul/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35540 |
| Comment by Amir Shehata (Inactive) [ 18/Jul/19 ] |
|
I was able to reproduce this on onyx-121vm[1-4]. I then ran "lnetctl set transaction_timeout 60" once lnet_selftest was loaded, and when I re-ran lnet-selftest.sh the problem did not occur. I pushed the same change for testing with autotest and the problem still happens. Once the test failed I took over the cluster, onyx-30vm[1-4], and ran lnet-selftest.sh manually (the same way I do on onyx-121vm[1-4]) and it failed. I made sure that both clusters are set up with the same kernel and the same Lustre build:

onyx-121vm[1-2]
----------------
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64

onyx-121vm[3-4]
----------------
kernel-tools-libs-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
python-perf-3.10.0-957.21.3.el7_lustre.x86_64
kernel-tools-3.10.0-957.21.3.el7_lustre.x86_64
kernel-devel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.21.3.el7_lustre.x86_6
onyx-30vm[1-2]
---------------
kmod-lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el8.x86_64
lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kmod-lustre-client-tests-2.12.56_1_g07b40bd-1.el8.x86_64
kernel-modules-4.18.0-80.4.2.el8_0.x86_64
kernel-headers-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.el8.x86_64
kernel-modules-4.18.0-80.el8.x86_64
kernel-tools-libs-4.18.0-80.el8.x86_64
kernel-4.18.0-80.el8.x86_64
kernel-core-4.18.0-80.4.2.el8_0.x86_64
kernel-tools-4.18.0-80.el8.x86_64
kernel-4.18.0-80.4.2.el8_0.x86_64

onyx-30vm[3-4]
--------------
lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-tests-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-2.12.56_1_g07b40bd-1.el7.x86_64
lustre-iokit-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-3.10.0-957.21.3.el7_lustre.x86_64
lustre-osd-ldiskfs-mount-2.12.56_1_g07b40bd-1.el7.x86_64
kmod-lustre-osd-ldiskfs-2.12.56_1_g07b40bd-1.el7.x86_64
kernel-headers-3.10.0-957.1.3.el7.x86_64
kernel-tools-libs-3.10.0-957.1.3.el7.x86_64
kernel-3.10.0-957.1.3.el7.x86_64
kernel-tools-3.10.0-957.1.3.el7.x86_64
kernel-devel-3.10.0-957.1.3.el7.x86_64

The cluster set up by autotest has some differences in the rpms installed, but nothing I would think would cause this problem.

To make sure it's not the transaction_timeout (health code) causing this, I turned the transaction timeout up to 9000 seconds, turned off retries, turned off health sensitivity, and reran the test. It still failed on the autotest-provisioned cluster.

Results from onyx-121vm[1-4]:

1 batch in stopping
Batch is stopped
c: Total 0 error nodes in c
s: Total 0 error nodes in s
session is ended
Total 0 error nodes in c
Total 0 error nodes in s
onyx-121vm4: onyx-121vm4.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm3: onyx-121vm3.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm1: onyx-121vm1.onyx.whamcloud.com: executing lst_cleanup
onyx-121vm2: onyx-121vm2.onyx.whamcloud.com: executing lst_cleanup

Results from onyx-30vm[1-4]:

killing 14697 ...
2 batch in stopping
Batch is stopped
12345-10.2.4.93@tcp: [Session 206 brw errors, 15 ping errors] [RPC: 117 errors, 113 dropped, 216 expired]
12345-10.2.4.94@tcp: [Session 230 brw errors, 30 ping errors] [RPC: 62 errors, 144 dropped, 253 expired]
c: Total 2 error nodes in c
12345-10.2.4.95@tcp: [Session 232 brw errors, 15 ping errors] [RPC: 95 errors, 151 dropped, 247 expired]
12345-10.2.4.96@tcp: [Session 239 brw errors, 30 ping errors] [RPC: 72 errors, 101 dropped, 266 expired]
s: Total 2 error nodes in s
session is ended
Total 2 error nodes in c
Total 2 error nodes in s
Starting client onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2: -o user_xattr,flock onyx-30vm4@tcp:/lustre /mnt/lustre
Started clients onyx-30vm1,onyx-30vm1.onyx.whamcloud.com,onyx-30vm2:
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
10.2.4.96@tcp:/lustre on /mnt/lustre type lustre (rw,flock,user_xattr,lazystatfs)
onyx-30vm1: onyx-30vm1.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
onyx-30vm2: onyx-30vm2.onyx.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
lnet-selftest test_smoke: @@@@@@ FAIL: lst Error found
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6030:error()
= /usr/lib64/lustre/tests/lnet-selftest.sh:161:check_lst_err()
= /usr/lib64/lustre/tests/lnet-selftest.sh:187:test_smoke()
= /usr/lib64/lustre/tests/test-framework.sh:6332:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:6371:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:6217:run_test()
= /usr/lib64/lustre/tests/lnet-selftest.sh:190:main()

When the test fails, I see two issues:

1) BRW timeouts. Basically lots of:
(socklnd_cb.c:2593:ksocknal_check_peer_timeouts()) Total 2 stale ZC_REQs for peer_ni 10.2.4.95@tcp detected; the oldest(0000000025bf6991) timed out 1 secs ago, resid: 0, wmem: 977752
(socklnd.c:1665:ksocknal_destroy_conn()) Completing partial receive from 12345-10.2.4.95@tcp[2], ip 10.2.4.95:7988, with error, wanted: 523368, left: 523368, last alive is 0 secs ago
2) lnet_selftest outputs a message stating that it is unable to get stats from 1 or 2 of the nodes.

I tried to check the differences in hardware between onyx-30 and onyx-121.

onyx-121
---------
[root@onyx-121 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Stepping: 4
CPU MHz: 3000.058
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5387.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11
NUMA node1 CPU(s): 12-23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

[root@onyx-121 ~]# lspci | grep -i eth
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.2 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.3 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

[root@onyx-30 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel(R) CPU @ 2.60GHz
Stepping: 5
CPU MHz: 2999.902
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 5187.48
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts

[root@onyx-30 ~]# lspci | grep -i eth
04:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

It appears that onyx-30 only has one ethernet device. trevis-41, which is another cluster the test failed on, has only 2 ethernet devices; onyx-121 has 4.

It's also important to note that lnet-selftest.sh adds 71 tests. Running just 1 test doesn't cause a problem, so it appears that batching up so many tests in one go causes the problem to occur. I'm investigating whether the number of interfaces accounts for the difference in behavior; so far this appears to be the most significant difference between the two clusters. Software-wise, both clusters have the same set of software installed. |
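The LNet tuning steps described in the comment above (first raising the transaction timeout, then effectively disabling the health/retry logic to rule it out) can be reproduced with lnetctl. This is a hedged sketch of those commands as described in the comment; the values are the ones quoted above, the commands assume a 2.12+ LNet with the health feature, and running lnet-selftest.sh directly may need the usual test-framework environment set up first.

# First experiment: raise the LNet transaction timeout after loading selftest
$ modprobe lnet_selftest
$ lnetctl set transaction_timeout 60

# Second experiment: rule out the health code entirely
$ lnetctl set transaction_timeout 9000   # effectively never expire
$ lnetctl set retry_count 0              # turn off retries
$ lnetctl set health_sensitivity 0       # turn off health sensitivity

# Then re-run the test
$ sh /usr/lib64/lustre/tests/lnet-selftest.sh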
| Comment by Amir Shehata (Inactive) [ 19/Jul/19 ] |
|
After moving onyx-30vm[1-2] from RHEL8 clients to RHEL7.6 clients the issue is not reproducible any longer. Will start backing out socklnd patches to see if any of them is the problem.
|
| Comment by Andreas Dilger [ 20/Jul/19 ] |
This is also true of ARM and Ubuntu clients - they cause constant failures, but x86 clients do not. My thinking is that this relates to the specific kernel version used, and it just happens that ARM, Ubuntu, and RHEL8 are newer kernels that somehow changed the kernel socket interfaces. |
| Comment by Amir Shehata (Inactive) [ 25/Jul/19 ] |
|
I believe so too. I tried RHEL8 across all the VMs and the problem persists. I then increased the lnet_selftest RPC timeout to 256 seconds and the test passed, i.e. no RPC errors or drops. I measured the time it takes to complete RPCs from the lnet_selftest perspective and I noticed the following behavior:
I then went back to the previous setup with 2 RHEL8 clients and 2 RHEL7.6 servers, captured performance data using perf, and generated flame graphs (perf-kernel-vm1.svg). There appears to be a key difference in the flame graphs captured on the RHEL8 VMs vs the RHEL7.6 VMs: ksoftirqd/1 appears significantly less often on the RHEL7.6 VMs (~43 samples) than on the RHEL8 VMs (~7000 samples). My next steps are:
|
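For anyone reproducing the profiling step above, this is a hedged sketch of the standard perf plus FlameGraph workflow that produces kernel flame graphs like perf-kernel-vm1.svg. The sampling rate, duration, and output names are illustrative, and the stackcollapse-perf.pl/flamegraph.pl scripts are Brendan Gregg's external FlameGraph tools, not part of Lustre.

# Sample kernel stacks system-wide while lnet-selftest.sh is running
# (99 Hz for 60 seconds; values are illustrative):
$ perf record -F 99 -a -g -- sleep 60

# Fold the stacks and render an SVG with the FlameGraph scripts
# (https://github.com/brendangregg/FlameGraph):
$ perf script | ./stackcollapse-perf.pl > out.folded
$ ./flamegraph.pl out.folded > perf-kernel-vm1.svg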
| Comment by Amir Shehata (Inactive) [ 26/Jul/19 ] |
|
I set up two nodes with Ubuntu 18 (4.15.0-45-generic) and two nodes with CentOS 7.6 (3.10.0-957.21.3.el7_lustre.x86_64). We couldn't install RHEL8 on the physical nodes, so the setup is not 100% the same as the VM one; once the RHEL8 installation issue is resolved I'll attempt that test again. Running the exact same script which failed on the VMs passed on the physical setup. I collected the flame graphs below and there are no significant differences between Ubuntu 18 and CentOS 7.6 with regard to softirq handling.

Ubuntu 18 client: perf-kernel-121.svg

This issue appears to be localized to VM setups; as far as I know there have been no reports of this test failing on a physical setup. The VMs are started on a RHEL7.5 host. Could there be an interaction issue between host and VM? One thing to try is to deploy the VMs on a RHEL8 host and run the test there. Currently there is a problem installing RHEL8 on physical nodes; I will try the test once this is resolved. |
| Comment by Jian Yu [ 27/Aug/19 ] |
|
The failure occurred on RHEL 8.0 vm client+server against master branch: |
| Comment by Chris Horn [ 18/Oct/19 ] |
|
+1 on master https://testing.whamcloud.com/test_sessions/26e84ad7-8e0a-4307-a1f8-1c5281550588 |
| Comment by Jian Yu [ 21/Jan/20 ] |
|
+1 on RHEL 8.1 client testing: https://testing.whamcloud.com/test_sets/e47aaeaa-3ba7-11ea-bb75-52540065bddc |
| Comment by Jian Yu [ 21/Jan/20 ] |
|
The failure also occurred on RHEL 7.7 client + server testing session: https://testing.whamcloud.com/test_sets/10d17588-3bb7-11ea-adca-52540065bddc |
| Comment by Gerrit Updater [ 05/Feb/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37450 |
| Comment by Gerrit Updater [ 20/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37450/ |
| Comment by Gerrit Updater [ 08/Jun/20 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/38857 |
| Comment by Peter Jones [ 18/Oct/20 ] |
|
James, it looks like this is an area that you are still looking into. Peter |
| Comment by Gerrit Updater [ 10/Jan/22 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46037 |
| Comment by James A Simmons [ 11/Jan/22 ] |
|
The good news is that in my testing of LNet selftest on Ubuntu I didn't see any issues with newer kernels. Enabling it for RHEL8 in maloo also passed. Most likely a recent LNet bug fix resolved this issue. Due to the lack of ARM / PPC hardware available for testing, I can't prove that it is working on those platforms. |
| Comment by Xinliang Liu [ 12/Jan/22 ] |
|
Let me check again, James. At the time of the last patch update I didn't encounter this issue. I will run this test suite on ARM again on the master branch. |
| Comment by Xinliang Liu [ 12/Jan/22 ] |
|
Tested with the patch on arm64. It passes.

$ ./auster -vr lnet-selftest
...
Batch is stopped
c: Total 0 error nodes in c
s: Total 0 error nodes in s
session is ended
Total 0 error nodes in c
Total 0 error nodes in s
lustre-aio: xxxx clients: 'lustre-aio lustre-aio mds-01'
lustre-aio: 1xxxx clients: 'lustre-aio lustre-aio: mds-01'
lustre-aio: 2xxxx clients: 2 'lustre-aio,mds-01'
lustre-aio: lustre-aio: executing lst_cleanup
mds-01: liuxl-mds-test-01.novalocal: executing lst_cleanup
PASS smoke (322s) |
| Comment by James A Simmons [ 12/Jan/22 ] |
|
Once PowerPC comes back I will test on that platform. I suspect this issue is resolved but we will have to wait and see. |
| Comment by Gerrit Updater [ 18/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46037/ |
| Comment by James A Simmons [ 18/Jan/22 ] |
|
We still have Power8 to test on to verify it works. |
| Comment by Gerrit Updater [ 18/Jan/22 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/46169 |
| Comment by Gerrit Updater [ 30/May/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/38857/ |