[LU-12122] Deadlock with check_routers_before_use and discovery Created: 26/Mar/19  Updated: 03/Jan/20  Resolved: 26/Mar/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

I found this issue when testing Amir's new patches (see https://review.whamcloud.com/#/c/33651/9), but I believe the issue exists in current master.

LNetNIInit() calls lnet_monitor_thr_start() -> lnet_router_post_mt_start() -> lnet_wait_known_routerstate()

lnet_wait_known_routerstate() will wait indefinitely until all gateways have been discovered.

However, the discovery thread is not started until after lnet_monitor_thr_start() returns. Thus, LNet never finishes starting.
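
The ordering problem can be modelled in a few lines. This is a toy sketch, not Lustre code: `start_lnet`, the `discovery_first` flag and the timeout are assumptions made for illustration. The timeout exists only so the demonstration terminates; the real wait described above is unbounded.

```python
import threading


def start_lnet(discovery_first, wait_timeout=0.2):
    """Toy model of the startup order (hypothetical, not the LNet API)."""
    discovered = threading.Event()  # set once the gateway is discovered
    # Stand-in for the discovery thread: it simply marks the gateway known.
    worker = threading.Thread(target=discovered.set)

    if discovery_first:
        # Corrected order: discovery is running before anyone waits on it.
        worker.start()

    # Stand-in for the wait on router state during monitor thread startup.
    known = discovered.wait(timeout=wait_timeout)
    if not known:
        # Reported order: the waiter blocks, so startup never reaches the
        # point where the discovery thread would have been started.
        return "hung"

    worker.join()
    return "started"


if __name__ == "__main__":
    print(start_lnet(discovery_first=False))
    print(start_lnet(discovery_first=True))
```

With `discovery_first=False` the wait can never be satisfied, because the only thread able to satisfy it is started after the wait returns.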

Logs slowly fill with:

[7073564.123980] LNetError: 31952:0:(router.c:873:lnet_check_routers()) Failed to discover router 192.168.2.26@tcp4

Reproduced on a simple three-node VM setup.

The LNet configuration:

sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01 cat /etc/modprobe.d/lnet.conf | dshbak -c
----------------
sles15s01
----------------
options lnet networks="tcp(eth0)"
options lnet routes="tcp4 192.168.2.26@tcp"
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15c01
----------------
options lnet ip2nets="tcp4(eth0) 192.168.*.*; tcp99(eth0) 192.168.*.*"
options lnet routes="tcp 192.168.2.26@tcp4"
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15build01
----------------
options lnet ip2nets="tcp(eth0) 192.168.*.*; tcp4(eth1) 192.168.*.*; tcp99(eth1) 192.168.*.*"
options lnet forwarding=enabled
options lnet lnet_peer_discovery_disabled=0
sles15build01:/etc/modprobe.d #

Attempt to start LNet:

sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01
pdsh> modprobe lnet
pdsh> lctl net up
sles15build01: LNET configured

LNet can start on the node acting as a router, but it hangs indefinitely on the other two nodes.



Comments
Comment by Chris Horn [ 26/Mar/19 ]

In trying out a quick fix for this issue, I noticed that 'lctl net up' will hang if check_routers_before_use is enabled but the routers don't have lnet loaded.
I think what ought to happen is that we attempt discovery to each router once and set them up or down as appropriate,
rather than wait forever.
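
A minimal sketch of the behaviour suggested here, assuming a caller-supplied `discover` callable that returns True when a router answers. Both `check_routers_once` and `discover` are hypothetical names for illustration and do not correspond to any LNet function.

```python
def check_routers_once(routers, discover):
    """Attempt discovery once per router and record the result.

    routers:  iterable of NID strings
    discover: hypothetical callable, returns True if the router responds
    """
    state = {}
    for nid in routers:
        try:
            ok = bool(discover(nid))
        except OSError:
            # A failed attempt marks the router down instead of retrying.
            ok = False
        state[nid] = "up" if ok else "down"
    return state


if __name__ == "__main__":
    # Pretend only one gateway is reachable.
    reachable = {"192.168.2.26@tcp4"}
    print(check_routers_once(
        ["192.168.2.26@tcp4", "192.168.2.27@tcp4"],
        lambda nid: nid in reachable,
    ))
```

Each router gets exactly one attempt, so the loop always terminates, and routers that do not respond are left marked down rather than blocking startup.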

Comment by Chris Horn [ 26/Mar/19 ]

I must've gotten mixed up when checking whether this bug exists in master. The switch from the router_checker_thread to the monitoring thread is in master, but this bug is only introduced by Amir's patch to replace the router pings with discovery under LU-11299. Since that patch is still under review, I will close this ticket and share my feedback in the code review.

Comment by Malcolm Haak - NCI (Inactive) [ 02/Jan/20 ]

This needs to be reopened.

LU-11299 was committed and the bug does not appear to be addressed in the final version of LU-11299.

We had all of production go down due to Lustre servers not being able to resolve the status of LNet routers.

Comment by Peter Jones [ 02/Jan/20 ]

mhaakddn, rather than reopening ancient similar tickets, please open a new ticket with the details of the incident for analysis.

Comment by Chris Horn [ 02/Jan/20 ]

mhaak, perhaps you experienced LU-13001?

Comment by Malcolm Haak - NCI (Inactive) [ 03/Jan/20 ]

@Chris Horn,

That would appear to be it.
