LU-11389

lnet-selftest test_smoke fails with ‘lst Error found’

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.12.0
    • Fix Version/s: Lustre 2.12.0
    • Labels:
      None
    • Severity:
      3

      Description

      lnet-selftest test_smoke fails with brw errors.

      In the test_log, we see several brw errors:

      c:
      Total 0 error nodes in c
      12345-10.9.4.62@tcp: [Session 2 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 2 expired]
      12345-10.9.4.63@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired]
      s:
      Total 2 error nodes in s
      session is ended
      Total 0 error nodes in c
      Total 2 error nodes in s
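
The per-node lines in the summary above follow a regular layout, so they can be turned into counters for comparison across test runs. The following is a minimal sketch, assuming the layout is exactly as shown in this test_log; the function name `parse_lst_errors` and the counter keys are hypothetical and are not part of lst or the test framework.

```python
import re

# One per-node line of the lst error summary, as it appears in the test_log.
NODE_LINE = re.compile(
    r"(?P<nid>\S+): \[Session (?P<brw>\d+) brw errors, (?P<ping>\d+) ping errors\] "
    r"\[RPC: (?P<rpc_err>\d+) errors, (?P<dropped>\d+) dropped, (?P<expired>\d+) expired\]"
)


def parse_lst_errors(text):
    """Return {nid: {counter_name: int}} for each per-node summary line in text."""
    result = {}
    for line in text.splitlines():
        m = NODE_LINE.search(line)
        if m:
            fields = m.groupdict()
            nid = fields.pop("nid")
            result[nid] = {name: int(value) for name, value in fields.items()}
    return result


# Sample input copied from the test_log excerpt in this ticket.
sample = """\
12345-10.9.4.62@tcp: [Session 2 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 2 expired]
12345-10.9.4.63@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired]
"""

stats = parse_lst_errors(sample)
for nid, counts in stats.items():
    print(nid, counts)
```

In this excerpt the brw error count equals the expired RPC count on each node, which is consistent with the failures being timeouts rather than dropped RPCs.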
      

      In the MDS1,3 (vm9) console log, we see:

      [121187.820321] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
      [121187.824636] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
      [121187.827249] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
      [121187.829223] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
      [121187.836986] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
      [121187.839282] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
      [121187.841880] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
      [121187.844418] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
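
Because the same message repeats many times in each console log, it can help to tally lines by subsystem and reporting function before comparing nodes. This is a rough sketch that assumes every line has the shape seen in the excerpts in this ticket (`[timestamp] Subsystem: pid:0:(file:line:function()) message`); the names `CONSOLE_LINE` and `summarize` are illustrative only.

```python
import re
from collections import Counter

# Shape of a console line as seen in this ticket:
# [timestamp] Subsystem: pid:N:(file:line:function()) message
CONSOLE_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<subsys>\w+): (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.-]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)"
)


def summarize(log_text):
    """Count console lines per (subsystem, function) pair; skip unmatched lines."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CONSOLE_LINE.search(line)
        if m:
            counts[(m["subsys"], m["func"])] += 1
    return counts


# Sample lines copied from the MDS1,3 (vm9) excerpt above.
sample = """\
[121187.820321] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.824636] LNet: 14521:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
[121187.836986] LustreError: 14519:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
"""

summary = summarize(sample)
for (subsys, func), n in summary.most_common():
    print(f"{n:4d}  {subsys}  {func}")
```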
      

      In the OST (vm8) console log, we see similar errors:

      [121123.915520] LNet: 14142:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
      [121203.866263] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
      [121203.867961] LNet: 14145:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.61@tcp, timeout 64.
      [121203.868014] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
      [121203.871967] LustreError: 14143:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.61@tcp failed with -110
      

      We don’t see any errors in the first client’s (vm6) console log, but we do see errors in the second client’s (vm7) console log:

      [121224.824446] LNet: 19344:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
      [121231.318601] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
      [121231.320803] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
      [121231.322989] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
      [121231.325008] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
      [121231.327235] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
      [121231.329236] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
      [121231.332035] LustreError: 19345:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4
      [121231.334216] LustreError: 19345:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.63@tcp has failed: -5
      

      This error may be different from LU-10073: that ticket covers ping errors, whereas these are brw errors and no ping errors are seen. The two may still share the same root cause.

      In other recent lnet-selftest test_smoke failures, for example https://testing.whamcloud.com/test_sets/e28426f6-ba77-11e8-b86b-52540065bddc, we also see dropped RPCs. From the test_log:

      Batch is stopped
      12345-10.9.4.250@tcp: [Session 33 brw errors, 0 ping errors] [RPC: 0 errors, 12 dropped, 33 expired]
      12345-10.9.4.251@tcp: [Session 5 brw errors, 0 ping errors] [RPC: 0 errors, 2 dropped, 5 expired]
      c:
      Total 2 error nodes in c
      12345-10.9.4.252@tcp: [Session 7 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 7 expired]
      12345-10.9.4.253@tcp: [Session 1 brw errors, 0 ping errors] [RPC: 0 errors, 38 dropped, 1 expired]
      12345-10.9.4.254@tcp: [Session 6 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 6 expired]
      s:
      Total 3 error nodes in s
      session is ended
      Total 2 error nodes in c
      Total 3 error nodes in s
      

      In the MDS (vm4) console log, we see:

      [22157.131485] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22157.133877] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4
      [22157.136025] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22157.138193] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -4
      [22157.140326] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22189.464106] LNet: 24932:0:(rpc.c:1072:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-10.9.4.251@tcp, timeout 64.
      [22189.467885] LustreError: 24882:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-10.9.4.251@tcp failed with -110
      [22263.166107] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      [22263.170894] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      [22263.170903] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
      [22263.170912] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22263.178407] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      [22263.178497] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
      [22263.178505] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22263.184865] LNetError: 6241:0:(lib-msg.c:794:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)
      …
      [22263.323774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
      [22263.327177] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22263.330541] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
      [22263.333795] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22263.338011] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
      [22263.341268] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22263.344774] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125
      [22263.348318] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.250@tcp has failed: -5
      [22769.266186] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4
      [22769.268133] LustreError: 24882:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.9.4.251@tcp has failed: -5
      [22769.270187] LustreError: 24882:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.9.4.251@tcp: -4
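
The trailing numbers in the LustreError lines above (-110, -4, -5, -125) look like negated Linux errno values. As a reading aid, the sketch below maps them to their standard Linux names; the table is limited to the codes that appear in this ticket, and the helper name `explain` is hypothetical. It assumes the code is the last token on the line, which holds for the brw_test.c messages shown here but not for the lib-msg.c lines.

```python
import re

# Linux errno values for the codes that appear in the logs of this ticket.
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    5: ("EIO", "Input/output error"),
    110: ("ETIMEDOUT", "Connection timed out"),
    125: ("ECANCELED", "Operation canceled"),
}

# A negative integer at the very end of the line.
TRAILING_CODE = re.compile(r"(-\d+)\s*$")


def explain(line):
    """Return (code, name, text) for a trailing negative code, or None if absent."""
    m = TRAILING_CODE.search(line)
    if not m:
        return None
    code = int(m.group(1))
    name, text = LINUX_ERRNO.get(-code, ("UNKNOWN", "not in table"))
    return code, name, text


# Lines copied from the console log excerpts in this ticket.
lines = [
    "BRW RPC to 12345-10.9.4.61@tcp failed with -110",
    "BRW bulk WRITE failed for RPC from 12345-10.9.4.63@tcp: -4",
    "Bulk transfer from 12345-10.9.4.250@tcp has failed: -5",
    "BRW bulk WRITE failed for RPC from 12345-10.9.4.250@tcp: -125",
]
for line in lines:
    print(explain(line), "<-", line)
```

Read this way, the -110 failures line up with the "Client RPC expired ... timeout 64" messages, although the mapping alone does not identify the root cause.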
      

              People

              • Assignee: sharmaso Sonia Sharma (Inactive)
              • Reporter: jamesanunez James Nunez