[LU-17101] sanity-lnet test_220: timeout - route goes down Created: 08/Sep/23  Updated: 16/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/a5dba607-7817-42ca-9075-4c7880f9082c

test_220 failed with the following error:

Timeout occurred after 369 minutes, last suite running was sanity-lnet

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/97846 - 4.18.0-425.10.1.el8_7.aarch64
servers: https://build.whamcloud.com/job/lustre-reviews/97846 - 4.18.0-477.15.1.el8_lustre.x86_64

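The route to tcp2 through gateway 10.240.44.207@tcp1 goes down while lnet_selftest is running, and the bulk transfer then fails with -113: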

[Thu Sep  7 20:51:51 2023] Lustre: DEBUG MARKER: Start LST rw
[Thu Sep  7 20:51:51 2023] LNet: 1042010:0:(rpc.c:641:srpc_service_add_buffers()) waiting for adding buffer
[Thu Sep  7 20:51:51 2023] LNet: 943043:0:(rpc.c:641:srpc_service_add_buffers()) waiting for adding buffer
[Thu Sep  7 20:52:05 2023] LNetError: 1068901:0:(lib-lnet.h:1305:lnet_set_route_aliveness()) route to tcp2 through 10.240.44.207@tcp1 has gone from up to down
[Thu Sep  7 20:52:05 2023] LNetError: 1068901:0:(lib-lnet.h:1305:lnet_set_route_aliveness()) Skipped 1 previous similar message
[Thu Sep  7 20:52:06 2023] LNetError: 943043:0:(lib-move.c:2341:lnet_handle_find_routed_path()) no route to 10.240.45.24@tcp2 from <?>
[Thu Sep  7 20:52:06 2023] Lustre: DEBUG MARKER: lst stop brw_rw
[Thu Sep  7 20:52:07 2023] Lustre: DEBUG MARKER: lst stop brw_rw
[Thu Sep  7 20:52:07 2023] Lustre: DEBUG MARKER: Stop LST rw
[Thu Sep  7 20:52:07 2023] LNetError: 1042010:0:(lib-move.c:2341:lnet_handle_find_routed_path()) no route to 10.240.45.24@tcp2 from 10.240.44.206@tcp1
[Thu Sep  7 20:52:07 2023] LNetError: 1042010:0:(lib-move.c:2341:lnet_handle_find_routed_path()) Skipped 1 previous similar message
[Thu Sep  7 20:52:07 2023] LustreError: 1042010:0:(brw_test.c:388:brw_server_rpc_done()) Bulk transfer from 12345-10.240.45.24@tcp2 has failed: -113
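
For triaging similar runs, the route-state and no-route messages can be pulled out of a console log with a small script. The following is a minimal sketch, not part of the test suite: the function name `scan_lnet_events` and the regular expressions are assumptions written against the message formats visible in the excerpt above, and may need adjusting for other Lustre versions.

```python
import re

# Hypothetical patterns, derived only from the log lines quoted in this ticket.
ROUTE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+LNetError:.*lnet_set_route_aliveness\(\)\)\s+"
    r"route to (?P<net>\S+) through (?P<gw>\S+) "
    r"has gone from (?P<old>\w+) to (?P<new>\w+)"
)
NO_ROUTE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+LNetError:.*lnet_handle_find_routed_path\(\)\)\s+"
    r"no route to (?P<dst>\S+) from (?P<src>\S+)"
)


def scan_lnet_events(lines):
    """Return a list of dicts describing route transitions and no-route errors."""
    events = []
    for line in lines:
        m = ROUTE_RE.match(line)
        if m:
            events.append({"kind": "route", **m.groupdict()})
            continue
        m = NO_ROUTE_RE.match(line)
        if m:
            events.append({"kind": "no_route", **m.groupdict()})
    return events


# Sample input copied from the console log in this ticket.
SAMPLE = [
    "[Thu Sep  7 20:52:05 2023] LNetError: 1068901:0:(lib-lnet.h:1305:lnet_set_route_aliveness()) route to tcp2 through 10.240.44.207@tcp1 has gone from up to down",
    "[Thu Sep  7 20:52:05 2023] LNetError: 1068901:0:(lib-lnet.h:1305:lnet_set_route_aliveness()) Skipped 1 previous similar message",
    "[Thu Sep  7 20:52:06 2023] LNetError: 943043:0:(lib-move.c:2341:lnet_handle_find_routed_path()) no route to 10.240.45.24@tcp2 from <?>",
    "[Thu Sep  7 20:52:07 2023] LNetError: 1042010:0:(lib-move.c:2341:lnet_handle_find_routed_path()) no route to 10.240.45.24@tcp2 from 10.240.44.206@tcp1",
]

if __name__ == "__main__":
    for event in scan_lnet_events(SAMPLE):
        print(event)
```

The "Skipped N previous similar message" lines are intentionally ignored, so the event count is a lower bound on how often each message was actually emitted. On Linux, errno 113 is EHOSTUNREACH ("No route to host"), which is consistent with the bulk transfer failing right after the route is marked down.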

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_220 - Timeout occurred after 369 minutes, last suite running was sanity-lnet



 Comments   
Comment by Andreas Dilger [ 19/Sep/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/577acae6-b6ef-4498-8be0-b2c176a7675f

Comment by Andreas Dilger [ 16/Jan/24 ]

+1 on master: https://testing.whamcloud.com/test_sessions/a43e4b8d-ee24-4dbd-a7e7-7a2c35f0ae93
