[LU-11389] lnet-setltest test smoke fails with ‘lst Error found’ Created: 17/Sep/18  Updated: 19/Dec/18  Resolved: 05/Dec/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Sonia Sharma (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10073 lnet-selftest test_smoke: lst Error f... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lnet-selftest test_smoke fails with brw errors.

In the test_log, we see that there are several brw errors

c:
Total 0 error nodes in c
12345-10.9.4.62@tcp: [Session 2 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 2 expired]
12345-10.9.4.63@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired]
s:
Total 2 error nodes in s
session is ended
Total 0 error nodes in c
Total 2 error nodes in s

In MDS1,3 (vm9) console log, we see

 [121187.820321] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.824636] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.827249] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.829223] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.836986] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121187.839282] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121187.841880] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121187.844418] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110

In the OST (vm8) console log, we see similar errors

[121123.915520] LNet: 14142:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[121203.866263] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121203.867961] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121203.868014] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121203.871967] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110

We don’t see any errors in the client’s (vm6) console logs, but we do see errors in the second client’s (vm7) console log:

[121224.824446] LNet: 19344:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[121231.318601] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.320803] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
[121231.322989] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.325008] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
[121231.327235] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.329236] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
[121231.332035] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.334216] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5

This error may be different from LU-10073 since that ticket is for ping errors and these errors are brw errors and no ping errors are seen. Although they may have the same root cause.

In other recent lnet-selftest test smoke failures, https://testing.whamcloud.com/test_sets/e28426f6-ba77-11e8-b86b-52540065bddc, we also see dropped RPCs. From the test_log,

Batch is stopped
12345-10.9.4.250@tcp: [Session 33 brw errors, 0 ping errors] [RPC: 0 errors, 12 dropped, 33 expired]
12345-10.9.4.251@tcp: [Session 5 brw errors, 0 ping errors] [RPC: 0 errors, 2 dropped, 5 expired]
c:
Total 2 error nodes in c
12345-10.9.4.252@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired]
12345-10.9.4.253@tcp: [Session 1 brw errors, 0 ping errors] [RPC: 0 errors, 38 dropped, 1 expired]
12345-10.9.4.254@tcp: [Session 6 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 6 expired]
s:
Total 3 error nodes in s
session is ended
Total 2 error nodes in c
Total 3 error nodes in s

In the MDS (vm4) console log, we see

[22157.131485] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22157.133877] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4
[22157.136025] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22157.138193] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4
[22157.140326] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22189.464106] LNet: 24932:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.251@tcp, timeout 64.
[22189.467885] LustreError: 24882:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.251@tcp failed with -110
[22263.166107] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[22263.170894] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[22263.170903] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.170912] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.178407] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[22263.178497] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.178505] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.184865] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
…
[22263.323774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.327177] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.330541] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.333795] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.338011] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.341268] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.344774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.348318] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22769.266186] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4
[22769.268133] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.251@tcp has failed: -5
[22769.270187] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4


 Comments   
Comment by James Nunez (Inactive) [ 19/Sep/18 ]

Sonia,

Would you please investigate these failures?

Thank you

Comment by James Nunez (Inactive) [ 20/Sep/18 ]

A few other comments on this issue:
All these failures are see for the full test group and none see for review-ldiskfs testing.

There were no lnet-selftest errors of this kind in July 2018. We had 52 errors in August for the master branch.

For master, it looks like the first of these errors started on 2018-08-21 with 2.11.54.11; https://testing.whamcloud.com/test_sets/391e400e-a539-11e8-a5f2-52540065bddc . We had some similar looking failures in early August, but they were interop testing or ppc testing.

Comment by James A Simmons [ 20/Sep/18 ]

This is due to lnet-selftest being broken with the new lnet health check system. This is a know problem.

Comment by Sonia Sharma (Inactive) [ 20/Sep/18 ]

Yes, lnet-selftest with health landing is not backward compatible anymore. So this should affect interop testing but not if just the current selftest module is being run with master

Comment by Sonia Sharma (Inactive) [ 25/Sep/18 ]

Atleast in one of the failure case - https://testing.whamcloud.com/test_sets/97e1a268-bacc-11e8-b86b-52540065bddc
I see this - 

00000400:00000100:1.0:1537223442.459967:0:19291:0:(lib-move.c:2754:lnet_finalize_expired_responses()) Response timed out: md = ffff8a3826d47800
00000001:00020000:1.0:1537223442.460023:0:28328:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.81@tcp: -4
00000400:00000100:1.0:1537223442.462309:0:28328:0:(rpc.c:905:srpc_server_rpc_done()) Server RPC ffff8a37ce122400 done: service brw_test, peer 12345-10.9.4.81@tcp, status SWI_STATE_BULK_STARTED:-5
00000001:00020000:1.0:1537223442.462317:0:28328:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.81@tcp has failed: -5
00000400:00000100:1.0:1537223442.650594:0:19288:0:(lib-move.c:3783:lnet_parse_reply()) 10.9.4.83@tcp: Dropping REPLY from 12345-10.9.4.81@tcp for invalid MD 0x1555319176943a6c.0x1ff0165
00000400:00000400:1.0:1537223897.120951:0:28330:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.81@tcp, timeout 64.
00000400:00000400:1.0:1537223897.123778:0:28330:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.81@tcp, timeout 64.
00000400:00000100:0.0:1537223897.123999:0:28328:0:(rpc.c:1146:srpc_client_rpc_done()) Client RPC done: service 11, peer 12345-10.9.4.81@tcp, status SWI_STATE_REQUEST_SENT:1:-4
00000001:00020000:0.0:1537223897.124015:0:28328:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.81@tcp failed with -110

This makes me think it might me because of the low transaction timeout value . We have this in lnet_monitor_thread()

»·······»·······if (wakeup_counter >= lnet_transaction_timeout / 2) {                                                                                                            
3293 »·······»·······»·······lnet_finalize_expired_responses(false);                                                                                                                  
3294 »·······»·······»·······wakeup_counter = 0;                                                                                                                                      
3295 »·······»·······}

I will upload a test patch with increased transaction timeout value and we can see if that affects the number of failure cases.

Comment by Gerrit Updater [ 25/Sep/18 ]

Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33231
Subject: LU-11389 lnet: debug selftest failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ad4b6882aa18df9bae668056e73ac7e98baee6b7

Comment by Andreas Dilger [ 27/Nov/18 ]

Still seeing these failures on a regular basis - 50 failures in the past week, which is about 23% of all runs, increasing to 45% of ARM runs:
https://testing.whamcloud.com/sub_tests/021c4e14-f01d-11e8-bfe1-52540065bddc
https://testing.whamcloud.com/sub_tests/0512c886-f1c7-11e8-815b-52540065bddc
https://testing.whamcloud.com/sub_tests/f7e4f2a6-edac-11e8-815b-52540065bddc
:
:

Comment by Bruno Faccini (Inactive) [ 03/Dec/18 ]

+1 at https://testing.whamcloud.com/test_sessions/47f9f586-4a90-4174-a076-b9b808107082

Comment by Gerrit Updater [ 05/Dec/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33231/
Subject: LU-11389 lnet: increase lnet transaction timeout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 73fdd1579d87c3fdaef71edaada0c4518cb00492

Comment by Peter Jones [ 05/Dec/18 ]

Landed for 2.12

Generated at Sat Feb 10 02:43:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.