[LU-17103] sanity-lnet test_207: timed out Created: 09/Sep/23  Updated: 09/Nov/23  Resolved: 08/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-17207 LNet: race between monitor thread sto... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5c0452f5-d96f-4c8d-9d27-04b71f59c2ee

test_207 failed with the following error:

Timeout occurred after 400 minutes, last suite running was sanity-lnet

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/96852 - 4.18.0-425.10.1.el8_7.aarch64
servers: https://build.whamcloud.com/job/lustre-reviews/96852 - 4.18.0-477.15.1.el8_lustre.x86_64

It looks like this started failing on 2023-08-11, before patch https://review.whamcloud.com/51921 "LU-17006 lnet: set up routes for going across subnets" landed, and then started failing for everyone once that patch landed on master on 2023-08-31.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lnet test_207 - Timeout occurred after 400 minutes, last suite running was sanity-lnet



 Comments   
Comment by Andreas Dilger [ 09/Sep/23 ]

Serguei, could you please investigate or reassign? It looks like this is a test regression from landing your routes-across-subnets patch.

Comment by Andreas Dilger [ 09/Sep/23 ]

https://testing.whamcloud.com/search?horizon=7776000&status%5B%5D=TIMEOUT&test_set_script_id=a2b1c4b2-b449-11e9-b88c-52540065bddc&sub_test_script_id=2df9a444-4299-4232-9b37-fc5fe71b6614&source=sub_tests#redirect

Comment by Gerrit Updater [ 11/Sep/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52341
Subject: LU-17103 socklnd: revert LU-17006 routing setup change
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ab2b4363141dd11d3dd6e16c4076afce359dbb77

Comment by Serguei Smirnov [ 22/Sep/23 ]

So far, all failed cases have a similar trace:

[Tue Sep 19 20:54:39 2023] task:lnetctl         state:D stack:    0 pid:949425 ppid:948947 flags:0x00004080
[Tue Sep 19 20:54:39 2023] Call Trace:
[Tue Sep 19 20:54:39 2023]  __schedule+0x2d1/0x870
[Tue Sep 19 20:54:39 2023]  schedule+0x55/0xf0
[Tue Sep 19 20:54:39 2023]  schedule_timeout+0x197/0x300
[Tue Sep 19 20:54:39 2023]  ? __next_timer_interrupt+0xf0/0xf0
[Tue Sep 19 20:54:39 2023]  ? prepare_to_wait_event+0xa0/0x180
[Tue Sep 19 20:54:39 2023]  lnet_ping_md_unlink+0xd0/0x190 [lnet]
[Tue Sep 19 20:54:39 2023]  ? init_wait_var_entry+0x50/0x50
[Tue Sep 19 20:54:39 2023]  lnet_ping_target_update+0x490/0x580 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_add_net_common+0x28b/0x490 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_dyn_add_ni+0x24c/0x2f0 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_genl_parse_local_ni.isra.47+0x190/0x1180 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_net_cmd+0x2fe/0x6e0 [lnet]
[Tue Sep 19 20:54:39 2023]  genl_family_rcv_msg_doit.isra.17+0x113/0x150
[Tue Sep 19 20:54:39 2023]  genl_family_rcv_msg+0xb7/0x170
[Tue Sep 19 20:54:39 2023]  ? lnet_dyn_del_net+0x1f0/0x1f0 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_shutdown_lndnets+0x300/0x300 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_counters_get+0x140/0x140 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_ping_event_handler+0x130/0x130 [lnet]
[Tue Sep 19 20:54:39 2023]  genl_rcv_msg+0x47/0xa0
[Tue Sep 19 20:54:39 2023]  ? genl_family_rcv_msg+0x170/0x170
[Tue Sep 19 20:54:39 2023]  netlink_rcv_skb+0x4c/0x130
[Tue Sep 19 20:54:39 2023]  genl_rcv+0x24/0x40
[Tue Sep 19 20:54:39 2023]  netlink_unicast+0x19a/0x230
[Tue Sep 19 20:54:39 2023]  netlink_sendmsg+0x204/0x3d0
[Tue Sep 19 20:54:39 2023]  sock_sendmsg+0x50/0x60
[Tue Sep 19 20:54:39 2023]  ____sys_sendmsg+0x22a/0x250
[Tue Sep 19 20:54:39 2023]  ? copy_msghdr_from_user+0x5c/0x90
[Tue Sep 19 20:54:39 2023]  ? ____sys_recvmsg+0xb0/0x150
[Tue Sep 19 20:54:39 2023]  ___sys_sendmsg+0x7c/0xc0
[Tue Sep 19 20:54:39 2023]  ? copy_msghdr_from_user+0x5c/0x90
[Tue Sep 19 20:54:39 2023]  ? ___sys_recvmsg+0x89/0xc0
[Tue Sep 19 20:54:39 2023]  ? __wake_up_common_lock+0x89/0xc0
[Tue Sep 19 20:54:39 2023]  __sys_sendmsg+0x57/0xa0
[Tue Sep 19 20:54:39 2023]  do_syscall_64+0x5b/0x1b0
[Tue Sep 19 20:54:39 2023]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

[Tue Sep 19 20:54:39 2023] task:monitor_thread  state:D stack:    0 pid:16801 ppid:     2 flags:0x80004080
[Tue Sep 19 20:54:39 2023] Call Trace:
[Tue Sep 19 20:54:39 2023]  __schedule+0x2d1/0x870
[Tue Sep 19 20:54:39 2023]  schedule+0x55/0xf0
[Tue Sep 19 20:54:39 2023]  schedule_preempt_disabled+0xa/0x10
[Tue Sep 19 20:54:39 2023]  __mutex_lock.isra.7+0x349/0x420
[Tue Sep 19 20:54:39 2023]  ? vsnprintf+0x380/0x520
[Tue Sep 19 20:54:39 2023]  lnet_peerni_by_nid_locked+0x5f/0x100 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_select_pathway+0xd7/0x1ae0 [lnet]
[Tue Sep 19 20:54:39 2023]  ? cfs_trace_unlock_tcd+0x20/0x70 [libcfs]
[Tue Sep 19 20:54:39 2023]  ? libcfs_debug_msg+0x98d/0xb10 [libcfs]
[Tue Sep 19 20:54:39 2023]  lnet_send+0x6d/0x1e0 [lnet]
[Tue Sep 19 20:54:39 2023]  LNetGet+0x384/0x880 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_send_ping+0x116/0x1f0 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_monitor_thread+0x670/0x670 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_monitor_thread+0x670/0x670 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_recover_local_nis.constprop.40+0x326/0x740 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_monitor_thread+0x3b0/0x670 [lnet]
[Tue Sep 19 20:54:39 2023]  ? update_load_avg+0x7e/0x710
[Tue Sep 19 20:54:39 2023]  ? update_load_avg+0x7e/0x710
[Tue Sep 19 20:54:39 2023]  ? set_next_entity+0xb5/0x1e0
[Tue Sep 19 20:54:39 2023]  ? __raw_spin_unlock_irq+0x5/0x20
[Tue Sep 19 20:54:39 2023]  ? finish_task_switch+0x86/0x2e0
[Tue Sep 19 20:54:39 2023]  ? __schedule+0x2d9/0x870
[Tue Sep 19 20:54:39 2023]  ? lnet_recover_local_nis.constprop.40+0x740/0x740 [lnet]
[Tue Sep 19 20:54:39 2023]  kthread+0x134/0x150
[Tue Sep 19 20:54:39 2023]  ? set_kthread_struct+0x50/0x50
[Tue Sep 19 20:54:39 2023]  ret_from_fork+0x35/0x40

 

Comment by Chris Horn [ 22/Sep/23 ]

From https://testing.whamcloud.com/test_logs/d7f79d84-b2fa-4a99-95cd-2d55f67fa07c/show_text

00000400:00000200:1.0:1693543306.254808:1168:7233:0:(peer.c:3885:lnet_peer_send_push()) peer 10.240.44.204@tcp
00000400:00000200:1.0:1693543306.254811:992:7233:0:(peer.c:4085:lnet_peer_discovery()) peer 10.240.44.204@tcp(00000000e0dde996) state 0x571 rc 0
00000800:00000100:0.0:1693543356.094024:1280:7229:0:(socklnd_cb.c:430:ksocknal_txlist_done()) Deleting packet type 1 len 48 0@<0:0>->10.240.44.204@tcp
00000400:00000200:0.0:1693543356.094053:1664:7229:0:(lib-msg.c:819:lnet_health_check()) health check: 10.240.44.209@tcp->10.240.44.204@tcp: PUT: LOCAL_TIMEOUT
00000400:00000200:0.0:1693543356.094078:1792:7229:0:(lib-msg.c:620:lnet_resend_msg_locked()) 10.240.44.209@tcp->10.240.44.204@tcp:PUT:LOCAL_TIMEOUT - queuing msg (000000007e1d92a1) for resend
00000400:00000200:1.0:1693543356.094161:1744:7234:0:(lib-move.c:3437:lnet_resend_pending_msgs_locked()) resending <?>->12345-10.240.44.204@tcp: PUT recovery 0 try# 1
00000400:00000200:1.0:1693543356.094182:2288:7234:0:(lib-move.c:2945:lnet_handle_send_case_locked()) Source ANY to MR:  10.240.44.204@tcp local destination
00000400:00000200:1.0:1693543356.094214:2496:7234:0:(lib-move.c:1422:lnet_select_peer_ni()) sd_best_lpni = 10.240.44.204@tcp
00000400:00000200:1.0:1693543356.094222:2448:7234:0:(lib-move.c:1895:lnet_handle_send()) 10.240.44.209@tcp NI seq info: [98481:98481:256:4294967295] 10.240.44.204@tcp LPNI seq info [10:10:8:4294967295]
00000400:00000200:1.0:1693543356.094234:2448:7234:0:(lib-move.c:1997:lnet_handle_send()) rspt_next_hop_nid = 10.240.44.204@tcp
00000400:00000200:1.0:1693543356.094244:2448:7234:0:(lib-move.c:2012:lnet_handle_send()) TRACE: 10.240.44.209@tcp(10.240.44.209@tcp:<?>) -> 10.240.44.204@tcp(10.240.44.204@tcp:10.240.44.204@tcp) <?> : PUT try# 1
00000800:00000200:1.0:1693543356.094252:1952:7234:0:(socklnd_cb.c:1007:ksocknal_send()) sending 48 bytes in 1 frags to 12345-10.240.44.204@tcp
00000800:00000200:1.0:1693543356.094270:2304:7234:0:(socklnd.c:246:ksocknal_find_peer_locked()) got peer_ni [00000000d7cebc4a] -> 12345-10.240.44.204@tcp (2)
00000800:00000200:1.1:1693543356.094277:2304:7234:0:(socklnd.c:246:ksocknal_find_peer_locked()) got peer_ni [00000000d7cebc4a] -> 12345-10.240.44.204@tcp (2)
00000400:00000100:1.0:1693543380.174252:1376:7224:0:(acceptor.c:109:lnet_connect_console_error()) Connection to 10.240.44.204@tcp at host 10.240.44.204:7988 took too long: that node may be hung or experiencing high load.
00000400:00000200:1.0:1693543380.174267:1344:7224:0:(router.c:1691:lnet_notify()) 10.240.44.209@tcp notifying 10.240.44.204@tcp: down
00000800:00000100:1.0:1693543380.174288:1264:7224:0:(socklnd_cb.c:430:ksocknal_txlist_done()) Deleting packet type 1 len 48 0@<0:0>->10.240.44.204@tcp
00000400:00000200:1.0:1693543380.174311:1648:7224:0:(lib-msg.c:819:lnet_health_check()) health check: 10.240.44.209@tcp->10.240.44.204@tcp: PUT: LOCAL_TIMEOUT
00000400:00000200:1.0:1693543380.174332:1776:7224:0:(lib-msg.c:620:lnet_resend_msg_locked()) 10.240.44.209@tcp->10.240.44.204@tcp:PUT:LOCAL_TIMEOUT - queuing msg (000000007e1d92a1) for resend

The PUSH is on the resend queue, but the monitor thread is stuck trying to acquire the mutex. The lnetctl process holding the mutex is waiting for this PUSH to be finalized: a deadlock.

Comment by Gerrit Updater [ 22/Sep/23 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52479
Subject: LU-17103 lnet: Avoid deadlock when updating ping buffer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6ad2cc51c56b8d0b4745ef585a10665b925ef0f6

Comment by Gerrit Updater [ 27/Sep/23 ]

"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52522
Subject: LU-17103 lnet: use workqueue for lnd ping buffer updates
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d95f9f0ff460f8ca794d8cadf4b481afce881fe8
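The workqueue idea behind this patch can be sketched in userspace terms. This is an assumption-laden analogue, NOT the kernel patch: `update_ping_buffer`, `worker`, and the flag names are invented, and a plain pthread plays the role of a kernel workqueue. The point is that the slow part of the ping buffer update is queued to a separate context, so the caller releases the mutex immediately instead of blocking under it, which breaks the cycle with the monitor thread.

```c
/*
 * Userspace sketch of deferring the blocking update to a worker
 * (schedule_work() analogue). NOT the actual Lustre/LNet code.
 */
#include <pthread.h>

static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;
static int work_queued, work_done;

/* stand-in for the deferred ping buffer update running off a workqueue */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&q_lock);
    while (!work_queued)
        pthread_cond_wait(&q_cv, &q_lock);
    work_queued = 0;
    work_done = 1;            /* the slow unlink/update happens here */
    pthread_mutex_unlock(&q_lock);
    return NULL;
}

/* caller: queue the update instead of waiting for it under api_mutex */
int update_ping_buffer(void)
{
    pthread_t t;

    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&api_mutex);
    pthread_mutex_lock(&q_lock);
    work_queued = 1;          /* schedule_work() analogue */
    pthread_cond_signal(&q_cv);
    pthread_mutex_unlock(&q_lock);
    pthread_mutex_unlock(&api_mutex);  /* no blocking wait under the mutex */

    pthread_join(t, NULL);    /* demo only: reap the worker to check result */
    return work_done;
}
```

Because nothing sleeps while `api_mutex` is held, a monitor-thread analogue contending for the same mutex would acquire it promptly rather than deadlocking.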

Comment by Gerrit Updater [ 03/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52479/
Subject: LU-17103 lnet: Avoid deadlock when updating ping buffer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3ca6ba39a21cfebc81bbe7f889c486bb82bb563a

Comment by Gerrit Updater [ 08/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52522/
Subject: LU-17103 lnet: use workqueue for lnd ping buffer updates
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a8e66b899afc0bc09285d426e3acef9d3d1f7694

Comment by Peter Jones [ 08/Nov/23 ]

Merged for 2.16

Generated at Sat Feb 10 03:32:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.