Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.12.0
-
None
-
3
Description
lnet-selftest test_smoke fails with brw errors.
In the test_log, we see several brw errors:
c:
Total 0 error nodes in c
12345-10.9.4.62@tcp: [Session 2 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 2 expired]
12345-10.9.4.63@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired]
s:
Total 2 error nodes in s
session is ended
Total 0 error nodes in c
Total 2 error nodes in s
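The per-node counters in this summary can be pulled out mechanically when comparing several test runs. The following is a minimal sketch, assuming the summary lines keep exactly the layout shown in the test_log above; the names `NODE_RE` and `parse_node_errors` are hypothetical helpers and not part of lnet-selftest.

```python
import re

# Matches one per-node summary entry as it appears in the test_log above.
# The layout is assumed from this ticket's logs, not from the lst sources.
NODE_RE = re.compile(
    r"(?P<nid>\S+@\w+): "
    r"\[Session (?P<brw>\d+) brw errors, (?P<ping>\d+) ping errors\] "
    r"\[RPC: (?P<rpc>\d+) errors, (?P<dropped>\d+) dropped, (?P<expired>\d+) expired\]"
)


def parse_node_errors(text):
    """Return {nid: {counter_name: int}} for every node entry found in text."""
    result = {}
    for match in NODE_RE.finditer(text):
        counters = {k: int(v) for k, v in match.groupdict().items() if k != "nid"}
        result[match.group("nid")] = counters
    return result


sample = (
    "12345-10.9.4.62@tcp: [Session 2 brw errors, 0 ping errors] "
    "[RPC: 0 errors, 0 dropped, 2 expired] "
    "12345-10.9.4.63@tcp: [Session 7 brw errors, 0 ping errors] "
    "[RPC: 0 errors, 0 dropped, 7 expired]"
)
errors = parse_node_errors(sample)
for nid, counters in errors.items():
    print(nid, counters)
```

Because the function scans with `finditer`, it works whether the entries are on separate lines or run together on one line.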
In the MDS1,3 (vm9) console log, we see:
[121187.820321] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.824636] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.827249] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.829223] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.836986] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121187.839282] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121187.841880] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121187.844418] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
In the OST (vm8) console log, we see similar errors:
[121123.915520] LNet: 14142:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[121203.866263] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121203.867961] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121203.868014] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
[121203.871967] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
We don’t see any errors in the client’s (vm6) console logs, but we do see errors in the second client’s (vm7) console log:
[121224.824446] LNet: 19344:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[121231.318601] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.320803] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
[121231.322989] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.325008] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
[121231.327235] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.329236] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
[121231.332035] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
[121231.334216] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
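To compare the console logs of the different nodes, it can help to count entries by the function that reported them. The sketch below assumes the console entry header layout seen in this ticket (`[timestamp] Level: pid:N:(file:line:func())`); `ENTRY_RE` and `count_by_function` are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Header of one console entry, as it appears in the logs quoted in this ticket.
ENTRY_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] "
    r"(?P<level>LNetError|LustreError|LNet): "
    r"(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)"
)


def count_by_function(text):
    """Count console entries per (level, reporting function)."""
    return Counter((m.group("level"), m.group("func")) for m in ENTRY_RE.finditer(text))


sample = (
    "[121224.824446] LNet: 19344:0:(rpc.c:612:srpc_service_add_buffers()) "
    "waiting for adding buffer "
    "[121231.318601] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) "
    "BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4 "
    "[121231.320803] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) "
    "Bulk transfer from 12345-10.9.4.63@tcp has failed: -5"
)
counts = count_by_function(sample)
for key, n in counts.items():
    print(key, n)
```

Running this over each node's console log gives a quick view of which side reports expirations and which side reports bulk transfer failures.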
This failure may be different from LU-10073: that ticket covers ping errors, whereas here we see only brw errors and no ping errors. The two may nevertheless share the same root cause.
In other recent lnet-selftest test_smoke failures, such as https://testing.whamcloud.com/test_sets/e28426f6-ba77-11e8-b86b-52540065bddc, we also see dropped RPCs. From the test_log:
Batch is stopped
12345-10.9.4.250@tcp: [Session 33 brw errors, 0 ping errors] [RPC: 0 errors, 12 dropped, 33 expired]
12345-10.9.4.251@tcp: [Session 5 brw errors, 0 ping errors] [RPC: 0 errors, 2 dropped, 5 expired]
c:
Total 2 error nodes in c
12345-10.9.4.252@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired]
12345-10.9.4.253@tcp: [Session 1 brw errors, 0 ping errors] [RPC: 0 errors, 38 dropped, 1 expired]
12345-10.9.4.254@tcp: [Session 6 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 6 expired]
s:
Total 3 error nodes in s
session is ended
Total 2 error nodes in c
Total 3 error nodes in s
In the MDS (vm4) console log, we see:
[22157.131485] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22157.133877] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4
[22157.136025] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22157.138193] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4
[22157.140326] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22189.464106] LNet: 24932:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.251@tcp, timeout 64.
[22189.467885] LustreError: 24882:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.251@tcp failed with -110
[22263.166107] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[22263.170894] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[22263.170903] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.170912] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.178407] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
[22263.178497] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.178505] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.184865] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
…
[22263.323774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.327177] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.330541] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.333795] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.338011] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.341268] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22263.344774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
[22263.348318] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
[22769.266186] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4
[22769.268133] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.251@tcp has failed: -5
[22769.270187] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4
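The numeric codes in these console messages (-110, -4, -5, -125) are negated Linux errno values. The sketch below maps them to their symbolic names using a small hard-coded table of standard Linux values, so it gives the same answer on any platform. The helper `describe` is hypothetical, and the generic errno text does not necessarily describe how the selftest code uses each value.

```python
# Standard Linux errno values for the codes seen in the logs above.
# Hard-coded on purpose: Python's errno module reports the values of the
# machine running the script, which differ on non-Linux systems.
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    5: ("EIO", "Input/output error"),
    110: ("ETIMEDOUT", "Connection timed out"),
    125: ("ECANCELED", "Operation canceled"),
}


def describe(code):
    """Return a readable description of a (negative) kernel return code."""
    name, text = LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in table"))
    return f"{code}: {name} ({text})"


for code in (-110, -4, -5, -125):
    print(describe(code))
```

Read this way, the client-side `brw_client_done_rpc()` failures are timeouts (-110), which is consistent with the "Client RPC expired" messages that precede them.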