[LU-17103] sanity-lnet test_207: timed out Created: 09/Sep/23 Updated: 09/Nov/23 Resolved: 08/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5c0452f5-d96f-4c8d-9d27-04b71f59c2ee

test_207 failed with the following error:

Timeout occurred after 400 minutes, last suite running was sanity-lnet

Test session details:

It looks like this started failing on 2023-08-11 before patch https://review.whamcloud.com/51921 |
| Comments |
| Comment by Andreas Dilger [ 09/Sep/23 ] |
|
Serguei, could you please investigate/reassign. It looks like this is a test regression from landing your routes across subnets patch. |
| Comment by Andreas Dilger [ 09/Sep/23 ] |
| Comment by Gerrit Updater [ 11/Sep/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52341 |
| Comment by Serguei Smirnov [ 22/Sep/23 ] |
|
So far, all failed cases have a similar trace:

[Tue Sep 19 20:54:39 2023] task:lnetctl state:D stack: 0 pid:949425 ppid:948947 flags:0x00004080
[Tue Sep 19 20:54:39 2023] Call Trace:
[Tue Sep 19 20:54:39 2023]  __schedule+0x2d1/0x870
[Tue Sep 19 20:54:39 2023]  schedule+0x55/0xf0
[Tue Sep 19 20:54:39 2023]  schedule_timeout+0x197/0x300
[Tue Sep 19 20:54:39 2023]  ? __next_timer_interrupt+0xf0/0xf0
[Tue Sep 19 20:54:39 2023]  ? prepare_to_wait_event+0xa0/0x180
[Tue Sep 19 20:54:39 2023]  lnet_ping_md_unlink+0xd0/0x190 [lnet]
[Tue Sep 19 20:54:39 2023]  ? init_wait_var_entry+0x50/0x50
[Tue Sep 19 20:54:39 2023]  lnet_ping_target_update+0x490/0x580 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_add_net_common+0x28b/0x490 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_dyn_add_ni+0x24c/0x2f0 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_genl_parse_local_ni.isra.47+0x190/0x1180 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_net_cmd+0x2fe/0x6e0 [lnet]
[Tue Sep 19 20:54:39 2023]  genl_family_rcv_msg_doit.isra.17+0x113/0x150
[Tue Sep 19 20:54:39 2023]  genl_family_rcv_msg+0xb7/0x170
[Tue Sep 19 20:54:39 2023]  ? lnet_dyn_del_net+0x1f0/0x1f0 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_shutdown_lndnets+0x300/0x300 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_counters_get+0x140/0x140 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_ping_event_handler+0x130/0x130 [lnet]
[Tue Sep 19 20:54:39 2023]  genl_rcv_msg+0x47/0xa0
[Tue Sep 19 20:54:39 2023]  ? genl_family_rcv_msg+0x170/0x170
[Tue Sep 19 20:54:39 2023]  netlink_rcv_skb+0x4c/0x130
[Tue Sep 19 20:54:39 2023]  genl_rcv+0x24/0x40
[Tue Sep 19 20:54:39 2023]  netlink_unicast+0x19a/0x230
[Tue Sep 19 20:54:39 2023]  netlink_sendmsg+0x204/0x3d0
[Tue Sep 19 20:54:39 2023]  sock_sendmsg+0x50/0x60
[Tue Sep 19 20:54:39 2023]  ____sys_sendmsg+0x22a/0x250
[Tue Sep 19 20:54:39 2023]  ? copy_msghdr_from_user+0x5c/0x90
[Tue Sep 19 20:54:39 2023]  ? ____sys_recvmsg+0xb0/0x150
[Tue Sep 19 20:54:39 2023]  ___sys_sendmsg+0x7c/0xc0
[Tue Sep 19 20:54:39 2023]  ? copy_msghdr_from_user+0x5c/0x90
[Tue Sep 19 20:54:39 2023]  ? ___sys_recvmsg+0x89/0xc0
[Tue Sep 19 20:54:39 2023]  ? __wake_up_common_lock+0x89/0xc0
[Tue Sep 19 20:54:39 2023]  __sys_sendmsg+0x57/0xa0
[Tue Sep 19 20:54:39 2023]  do_syscall_64+0x5b/0x1b0
[Tue Sep 19 20:54:39 2023]  entry_SYSCALL_64_after_hwframe+0x61/0xc6

[Tue Sep 19 20:54:39 2023] task:monitor_thread state:D stack: 0 pid:16801 ppid: 2 flags:0x80004080
[Tue Sep 19 20:54:39 2023] Call Trace:
[Tue Sep 19 20:54:39 2023]  __schedule+0x2d1/0x870
[Tue Sep 19 20:54:39 2023]  schedule+0x55/0xf0
[Tue Sep 19 20:54:39 2023]  schedule_preempt_disabled+0xa/0x10
[Tue Sep 19 20:54:39 2023]  __mutex_lock.isra.7+0x349/0x420
[Tue Sep 19 20:54:39 2023]  ? vsnprintf+0x380/0x520
[Tue Sep 19 20:54:39 2023]  lnet_peerni_by_nid_locked+0x5f/0x100 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_select_pathway+0xd7/0x1ae0 [lnet]
[Tue Sep 19 20:54:39 2023]  ? cfs_trace_unlock_tcd+0x20/0x70 [libcfs]
[Tue Sep 19 20:54:39 2023]  ? libcfs_debug_msg+0x98d/0xb10 [libcfs]
[Tue Sep 19 20:54:39 2023]  lnet_send+0x6d/0x1e0 [lnet]
[Tue Sep 19 20:54:39 2023]  LNetGet+0x384/0x880 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_send_ping+0x116/0x1f0 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_monitor_thread+0x670/0x670 [lnet]
[Tue Sep 19 20:54:39 2023]  ? lnet_monitor_thread+0x670/0x670 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_recover_local_nis.constprop.40+0x326/0x740 [lnet]
[Tue Sep 19 20:54:39 2023]  lnet_monitor_thread+0x3b0/0x670 [lnet]
[Tue Sep 19 20:54:39 2023]  ? update_load_avg+0x7e/0x710
[Tue Sep 19 20:54:39 2023]  ? update_load_avg+0x7e/0x710
[Tue Sep 19 20:54:39 2023]  ? set_next_entity+0xb5/0x1e0
[Tue Sep 19 20:54:39 2023]  ? __raw_spin_unlock_irq+0x5/0x20
[Tue Sep 19 20:54:39 2023]  ? finish_task_switch+0x86/0x2e0
[Tue Sep 19 20:54:39 2023]  ? __schedule+0x2d9/0x870
[Tue Sep 19 20:54:39 2023]  ? lnet_recover_local_nis.constprop.40+0x740/0x740 [lnet]
[Tue Sep 19 20:54:39 2023]  kthread+0x134/0x150
[Tue Sep 19 20:54:39 2023]  ? set_kthread_struct+0x50/0x50
[Tue Sep 19 20:54:39 2023]  ret_from_fork+0x35/0x40
|
| Comment by Chris Horn [ 22/Sep/23 ] |
|
From https://testing.whamcloud.com/test_logs/d7f79d84-b2fa-4a99-95cd-2d55f67fa07c/show_text

00000400:00000200:1.0:1693543306.254808:1168:7233:0:(peer.c:3885:lnet_peer_send_push()) peer 10.240.44.204@tcp
00000400:00000200:1.0:1693543306.254811:992:7233:0:(peer.c:4085:lnet_peer_discovery()) peer 10.240.44.204@tcp(00000000e0dde996) state 0x571 rc 0
00000800:00000100:0.0:1693543356.094024:1280:7229:0:(socklnd_cb.c:430:ksocknal_txlist_done()) Deleting packet type 1 len 48 0@<0:0>->10.240.44.204@tcp
00000400:00000200:0.0:1693543356.094053:1664:7229:0:(lib-msg.c:819:lnet_health_check()) health check: 10.240.44.209@tcp->10.240.44.204@tcp: PUT: LOCAL_TIMEOUT
00000400:00000200:0.0:1693543356.094078:1792:7229:0:(lib-msg.c:620:lnet_resend_msg_locked()) 10.240.44.209@tcp->10.240.44.204@tcp:PUT:LOCAL_TIMEOUT - queuing msg (000000007e1d92a1) for resend
00000400:00000200:1.0:1693543356.094161:1744:7234:0:(lib-move.c:3437:lnet_resend_pending_msgs_locked()) resending <?>->12345-10.240.44.204@tcp: PUT recovery 0 try# 1
00000400:00000200:1.0:1693543356.094182:2288:7234:0:(lib-move.c:2945:lnet_handle_send_case_locked()) Source ANY to MR: 10.240.44.204@tcp local destination
00000400:00000200:1.0:1693543356.094214:2496:7234:0:(lib-move.c:1422:lnet_select_peer_ni()) sd_best_lpni = 10.240.44.204@tcp
00000400:00000200:1.0:1693543356.094222:2448:7234:0:(lib-move.c:1895:lnet_handle_send()) 10.240.44.209@tcp NI seq info: [98481:98481:256:4294967295] 10.240.44.204@tcp LPNI seq info [10:10:8:4294967295]
00000400:00000200:1.0:1693543356.094234:2448:7234:0:(lib-move.c:1997:lnet_handle_send()) rspt_next_hop_nid = 10.240.44.204@tcp
00000400:00000200:1.0:1693543356.094244:2448:7234:0:(lib-move.c:2012:lnet_handle_send()) TRACE: 10.240.44.209@tcp(10.240.44.209@tcp:<?>) -> 10.240.44.204@tcp(10.240.44.204@tcp:10.240.44.204@tcp) <?> : PUT try# 1
00000800:00000200:1.0:1693543356.094252:1952:7234:0:(socklnd_cb.c:1007:ksocknal_send()) sending 48 bytes in 1 frags to 12345-10.240.44.204@tcp
00000800:00000200:1.0:1693543356.094270:2304:7234:0:(socklnd.c:246:ksocknal_find_peer_locked()) got peer_ni [00000000d7cebc4a] -> 12345-10.240.44.204@tcp (2)
00000800:00000200:1.1:1693543356.094277:2304:7234:0:(socklnd.c:246:ksocknal_find_peer_locked()) got peer_ni [00000000d7cebc4a] -> 12345-10.240.44.204@tcp (2)
00000400:00000100:1.0:1693543380.174252:1376:7224:0:(acceptor.c:109:lnet_connect_console_error()) Connection to 10.240.44.204@tcp at host 10.240.44.204:7988 took too long: that node may be hung or experiencing high load.
00000400:00000200:1.0:1693543380.174267:1344:7224:0:(router.c:1691:lnet_notify()) 10.240.44.209@tcp notifying 10.240.44.204@tcp: down
00000800:00000100:1.0:1693543380.174288:1264:7224:0:(socklnd_cb.c:430:ksocknal_txlist_done()) Deleting packet type 1 len 48 0@<0:0>->10.240.44.204@tcp
00000400:00000200:1.0:1693543380.174311:1648:7224:0:(lib-msg.c:819:lnet_health_check()) health check: 10.240.44.209@tcp->10.240.44.204@tcp: PUT: LOCAL_TIMEOUT
00000400:00000200:1.0:1693543380.174332:1776:7224:0:(lib-msg.c:620:lnet_resend_msg_locked()) 10.240.44.209@tcp->10.240.44.204@tcp:PUT:LOCAL_TIMEOUT - queuing msg (000000007e1d92a1) for resend

The PUSH is on the resendq, but the monitor thread is stuck trying to take the mutex. The lnetctl PID holding that mutex is waiting for this PUSH to be finalized. Deadlock. |
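For illustration only, here is a minimal userspace sketch of the deadlock shape described in the comment above: one thread holds a configuration mutex while it waits for a message to be finalized, and the thread that would finalize it must take the same mutex first. The names (api_mutex, config_thread, monitor_thread, msg_finalized) are stand-ins for this sketch and are not the actual LNet symbols or locking code.

/* Build with: cc -pthread deadlock_sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t api_mutex = PTHREAD_MUTEX_INITIALIZER; /* stands in for the config mutex */
static pthread_mutex_t evt_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  evt_cond  = PTHREAD_COND_INITIALIZER;
static bool msg_finalized;

/* Plays the role of lnetctl: holds the config mutex while waiting for finalization. */
static void *config_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&api_mutex);

	pthread_mutex_lock(&evt_lock);
	while (!msg_finalized)
		pthread_cond_wait(&evt_cond, &evt_lock); /* sleeps until "finalized" */
	pthread_mutex_unlock(&evt_lock);

	pthread_mutex_unlock(&api_mutex);
	return NULL;
}

/* Plays the role of the monitor thread: must take the same mutex before it can
 * service its queue, so the finalization that would wake config_thread never runs. */
static void *monitor_thread(void *arg)
{
	(void)arg;
	sleep(1);                            /* let config_thread grab api_mutex first */
	pthread_mutex_lock(&api_mutex);      /* blocks forever: deadlock */

	pthread_mutex_lock(&evt_lock);
	msg_finalized = true;                /* never reached */
	pthread_cond_signal(&evt_cond);
	pthread_mutex_unlock(&evt_lock);

	pthread_mutex_unlock(&api_mutex);
	return NULL;
}

int main(void)
{
	pthread_t cfg, mon;

	pthread_create(&cfg, NULL, config_thread, NULL);
	pthread_create(&mon, NULL, monitor_thread, NULL);
	pthread_join(cfg, NULL);             /* never returns */
	pthread_join(mon, NULL);
	return 0;
}

In the sketch, neither thread can make progress: config_thread never sees msg_finalized because monitor_thread is parked on api_mutex, which mirrors why the two stuck stacks in the earlier comment block each other.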
| Comment by Gerrit Updater [ 22/Sep/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52479 |
| Comment by Gerrit Updater [ 27/Sep/23 ] |
|
"Serguei Smirnov <ssmirnov@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52522 |
| Comment by Gerrit Updater [ 03/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52479/ |
| Comment by Gerrit Updater [ 08/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52522/ |
| Comment by Peter Jones [ 08/Nov/23 ] |
|
Merged for 2.16 |