[LU-11389] lnet-setltest test smoke fails with ‘lst Error found’ Created: 17/Sep/18 Updated: 19/Dec/18 Resolved: 05/Dec/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Sonia Sharma (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
lnet-selftest test_smoke fails with brw errors. In the test_log, we see that there are several brw errors c: Total 0 error nodes in c 12345-10.9.4.62@tcp: [Session 2 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 2 expired] 12345-10.9.4.63@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired] s: Total 2 error nodes in s session is ended Total 0 error nodes in c Total 2 error nodes in s In MDS1,3 (vm9) console log, we see [121187.820321] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64. [121187.824636] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64. [121187.827249] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64. [121187.829223] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64. [121187.836986] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110 [121187.839282] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110 [121187.841880] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110 [121187.844418] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110 In the OST (vm8) console log, we see similar errors [121123.915520] LNet: 14142:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer [121203.866263] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64. [121203.867961] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64. [121203.868014] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110 [121203.871967] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110 We don’t see any errors in the client’s (vm6) console logs, but we do see errors in the second client’s (vm7) console log: [121224.824446] LNet: 19344:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer [121231.318601] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4 [121231.320803] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5 [121231.322989] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4 [121231.325008] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5 [121231.327235] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4 [121231.329236] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5 [121231.332035] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4 [121231.334216] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5 This error may be different from In other recent lnet-selftest test smoke failures, https://testing.whamcloud.com/test_sets/e28426f6-ba77-11e8-b86b-52540065bddc, we also see dropped RPCs. From the test_log, Batch is stopped 12345-10.9.4.250@tcp: [Session 33 brw errors, 0 ping errors] [RPC: 0 errors, 12 dropped, 33 expired] 12345-10.9.4.251@tcp: [Session 5 brw errors, 0 ping errors] [RPC: 0 errors, 2 dropped, 5 expired] c: Total 2 error nodes in c 12345-10.9.4.252@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired] 12345-10.9.4.253@tcp: [Session 1 brw errors, 0 ping errors] [RPC: 0 errors, 38 dropped, 1 expired] 12345-10.9.4.254@tcp: [Session 6 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 6 expired] s: Total 3 error nodes in s session is ended Total 2 error nodes in c Total 3 error nodes in s In the MDS (vm4) console log, we see [22157.131485] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22157.133877] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4 [22157.136025] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22157.138193] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4 [22157.140326] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22189.464106] LNet: 24932:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.251@tcp, timeout 64. [22189.467885] LustreError: 24882:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.251@tcp failed with -110 [22263.166107] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) [22263.170894] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) [22263.170903] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125 [22263.170912] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22263.178407] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) [22263.178497] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125 [22263.178505] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22263.184865] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) … [22263.323774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125 [22263.327177] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22263.330541] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125 [22263.333795] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22263.338011] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125 [22263.341268] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22263.344774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125 [22263.348318] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5 [22769.266186] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4 [22769.268133] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.251@tcp has failed: -5 [22769.270187] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4 |
| Comments |
| Comment by James Nunez (Inactive) [ 19/Sep/18 ] |
|
Sonia, Would you please investigate these failures? Thank you |
| Comment by James Nunez (Inactive) [ 20/Sep/18 ] |
|
A few other comments on this issue: There were no lnet-selftest errors of this kind in July 2018. We had 52 errors in August for the master branch. For master, it looks like the first of these errors started on 2018-08-21 with 2.11.54.11; https://testing.whamcloud.com/test_sets/391e400e-a539-11e8-a5f2-52540065bddc . We had some similar looking failures in early August, but they were interop testing or ppc testing. |
| Comment by James A Simmons [ 20/Sep/18 ] |
|
This is due to lnet-selftest being broken with the new lnet health check system. This is a know problem. |
| Comment by Sonia Sharma (Inactive) [ 20/Sep/18 ] |
|
Yes, lnet-selftest with health landing is not backward compatible anymore. So this should affect interop testing but not if just the current selftest module is being run with master |
| Comment by Sonia Sharma (Inactive) [ 25/Sep/18 ] |
|
Atleast in one of the failure case - https://testing.whamcloud.com/test_sets/97e1a268-bacc-11e8-b86b-52540065bddc 00000400:00000100:1.0:1537223442.459967:0:19291:0:(lib-move.c:2754:lnet_finalize_expired_responses()) Response timed out: md = ffff8a3826d47800 00000001:00020000:1.0:1537223442.460023:0:28328:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.81@tcp: -4 00000400:00000100:1.0:1537223442.462309:0:28328:0:(rpc.c:905:srpc_server_rpc_done()) Server RPC ffff8a37ce122400 done: service brw_test, peer 12345-10.9.4.81@tcp, status SWI_STATE_BULK_STARTED:-5 00000001:00020000:1.0:1537223442.462317:0:28328:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.81@tcp has failed: -5 00000400:00000100:1.0:1537223442.650594:0:19288:0:(lib-move.c:3783:lnet_parse_reply()) 10.9.4.83@tcp: Dropping REPLY from 12345-10.9.4.81@tcp for invalid MD 0x1555319176943a6c.0x1ff0165 00000400:00000400:1.0:1537223897.120951:0:28330:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.81@tcp, timeout 64. 00000400:00000400:1.0:1537223897.123778:0:28330:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.81@tcp, timeout 64. 00000400:00000100:0.0:1537223897.123999:0:28328:0:(rpc.c:1146:srpc_client_rpc_done()) Client RPC done: service 11, peer 12345-10.9.4.81@tcp, status SWI_STATE_REQUEST_SENT:1:-4 00000001:00020000:0.0:1537223897.124015:0:28328:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.81@tcp failed with -110 This makes me think it might me because of the low transaction timeout value . We have this in lnet_monitor_thread() »·······»·······if (wakeup_counter >= lnet_transaction_timeout / 2) { 3293 »·······»·······»·······lnet_finalize_expired_responses(false); 3294 »·······»·······»·······wakeup_counter = 0; 3295 »·······»·······} I will upload a test patch with increased transaction timeout value and we can see if that affects the number of failure cases. |
| Comment by Gerrit Updater [ 25/Sep/18 ] |
|
Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33231 |
| Comment by Andreas Dilger [ 27/Nov/18 ] |
|
Still seeing these failures on a regular basis - 50 failures in the past week, which is about 23% of all runs, increasing to 45% of ARM runs: |
| Comment by Bruno Faccini (Inactive) [ 03/Dec/18 ] |
|
+1 at https://testing.whamcloud.com/test_sessions/47f9f586-4a90-4174-a076-b9b808107082 |
| Comment by Gerrit Updater [ 05/Dec/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33231/ |
| Comment by Peter Jones [ 05/Dec/18 ] |
|
Landed for 2.12 |