Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.13.0
-
None
-
3
Description
On master, when running conf-sanity I often see mount stuck in the following stack
trace:
n:lustre-release# stack1 llog
29833 llog_process_th
[<ffffffffc06be64b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
[<ffffffffc06be930>] LNetPrimaryNID+0x70/0x1a0 [lnet]
[<ffffffffc0990ade>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[<ffffffffc098518c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[<ffffffffc09580c2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
[<ffffffffc09587c3>] client_import_add_conn+0x13/0x20 [ptlrpc]
[<ffffffffc074efa9>] class_add_conn+0x419/0x680 [obdclass]
[<ffffffffc0750bc6>] class_process_config+0x19b6/0x27e0 [obdclass]
[<ffffffffc0753644>] class_config_llog_handler+0x934/0x14d0 [obdclass]
[<ffffffffc0717904>] llog_process_thread+0x834/0x1550 [obdclass]
[<ffffffffc071902f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
[<ffffffff810b252f>] kthread+0xcf/0xe0
[<ffffffff816b8798>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
conf-sanity has some tests that use bogus NIDs like 1.2.3.4 and 4.3.2.1. These are actually real IPv4 addresses, but AFAICT they just discard all packets. I can see that the discovery thread cancels discovery on these peers, but the llog_process_thread seems to stay in lnet_discover_peer_locked() for up to 60 seconds afterwards. Looking at the code, I can't see how it would get woken up in this case. Why doesn't lnet_peer_cancel_discovery() wake up the waiters on lp_dc_waitq? Or why don't we use schedule_timeout() with the discovery/transaction timeout in lnet_discover_peer_locked()?
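To make the two questions above concrete, here is a minimal userspace sketch of the pattern being asked about. It is not LNet code: it uses POSIX threads in place of kernel wait queues, and struct peer, peer_init(), peer_wait_discovery() and peer_cancel_discovery() are hypothetical names invented for illustration. It assumes the waiter should return as soon as discovery is cancelled, and should otherwise give up after a bounded timeout instead of sleeping indefinitely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a peer with a discovery wait queue. */
struct peer {
	pthread_mutex_t lock;
	pthread_cond_t  dc_waitq;   /* plays the role of lp_dc_waitq */
	bool            discovered; /* discovery completed */
	bool            cancelled;  /* discovery was cancelled */
};

void peer_init(struct peer *p)
{
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->dc_waitq, NULL);
	p->discovered = false;
	p->cancelled = false;
}

/*
 * Wait for discovery with a bounded timeout.
 * Returns 0 if discovered, -ECANCELED if cancelled, -ETIMEDOUT otherwise.
 */
int peer_wait_discovery(struct peer *p, int timeout_ms)
{
	struct timespec deadline;
	int rc = 0;

	assert(timeout_ms >= 0);

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&p->lock);
	/* Re-check the state after every wakeup; stop on timeout or error. */
	while (!p->discovered && !p->cancelled && rc == 0)
		rc = pthread_cond_timedwait(&p->dc_waitq, &p->lock, &deadline);

	if (p->discovered)
		rc = 0;
	else if (p->cancelled)
		rc = -ECANCELED;
	else
		rc = -ETIMEDOUT;
	pthread_mutex_unlock(&p->lock);

	return rc;
}

/* Cancel discovery and wake every waiter so none sleeps until the timeout. */
void peer_cancel_discovery(struct peer *p)
{
	pthread_mutex_lock(&p->lock);
	p->cancelled = true;
	pthread_cond_broadcast(&p->dc_waitq);
	pthread_mutex_unlock(&p->lock);
}
```

In this sketch the broadcast in peer_cancel_discovery() corresponds to the first suggestion (wake the waiters on cancel), and the timed wait corresponds to the second (a bounded sleep in the style of schedule_timeout()). Either one alone would bound how long the waiter stays blocked; together, a cancel is noticed immediately and a lost wakeup still ends at the deadline.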
Issue Links
- is duplicated by
  - LU-12416 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount.lustre:11956] (Resolved)
- is related to
  - LU-12442 recovery-small test_136: mounts stuck in lnet_discover_peer_locked() (Resolved)
  - LU-12519 sanity-sec test 31 crashes with ASSERTION( list_empty(&lp->lp_peer_nets) ) (Resolved)
- is related to
  - LU-12424 LNet MR routing: possible loop when discovery is off (Reopened)