[LU-12122] Deadlock with check_routers_before_use and discovery Created: 26/Mar/19 Updated: 03/Jan/20 Resolved: 26/Mar/19 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I found this issue when testing Amir's new patches (see https://review.whamcloud.com/#/c/33651/9), but I believe the issue exists in current master.

LNetNIInit() calls lnet_monitor_thr_start() -> lnet_router_post_mt_start() -> lnet_wait_known_routerstate().

lnet_wait_known_routerstate() will wait indefinitely until all gateways have been discovered. However, the discovery thread is not started until after lnet_monitor_thr_start() returns. Thus, LNet never finishes starting. Logs slowly fill with:

[7073564.123980] LNetError: 31952:0:(router.c:873:lnet_check_routers()) Failed to discover router 192.168.2.26@tcp4

Reproduced on a simple three-node VM setup. The LNet configuration:

sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01 cat /etc/modprobe.d/lnet.conf | dshbak -c
----------------
sles15s01
----------------
options lnet networks="tcp(eth0)"
options lnet routes="tcp4 192.168.2.26@tcp"
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15c01
----------------
options lnet ip2nets="tcp4(eth0) 192.168.*.*; tcp99(eth0) 192.168.*.*"
options lnet routes="tcp 192.168.2.26@tcp4"
options lnet lnet_peer_discovery_disabled=0
options lnet check_routers_before_use=1
----------------
sles15build01
----------------
options lnet ip2nets="tcp(eth0) 192.168.*.*; tcp4(eth1) 192.168.*.*; tcp99(eth1) 192.168.*.*"
options lnet forwarding=enabled
options lnet lnet_peer_discovery_disabled=0
sles15build01:/etc/modprobe.d #

Attempt to start LNet:

sles15build01:/etc/modprobe.d # pdsh -w sles15build01,sles15c01,sles15s01
pdsh> modprobe lnet
pdsh> lctl net up
sles15build01: LNET configured

LNet can start on the node acting as a router, but it hangs indefinitely on the other two nodes. |
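The ordering problem described above can be modelled outside the kernel. The following is a minimal Python sketch, not LNet code: `simulate_startup`, `router_state_known`, and `discovery_thread` are hypothetical names, and a timeout stands in for the indefinite wait so that the demonstration terminates.

```python
import threading


def simulate_startup(start_discovery_first: bool, wait_timeout: float = 0.2) -> bool:
    """Return True if the modelled startup completes, False if the wait times out."""
    # Hypothetical flag standing in for "all gateways have a known state".
    router_state_known = threading.Event()

    def discovery_thread() -> None:
        # Stand-in for the discovery thread: in this model it is the only
        # code that can mark the gateways as discovered.
        router_state_known.set()

    worker = threading.Thread(target=discovery_thread)

    if start_discovery_first:
        worker.start()

    # Stand-in for lnet_wait_known_routerstate(). The ticket describes the
    # real wait as indefinite; the timeout here only keeps the demo finite.
    completed = router_state_known.wait(timeout=wait_timeout)

    if not start_discovery_first:
        # Reached only because the modelled wait gave up.
        worker.start()
    worker.join()
    return completed


if __name__ == "__main__":
    print("discovery started after the wait: ", simulate_startup(False))
    print("discovery started before the wait:", simulate_startup(True))
```

In this model the wait can only succeed when the thread that sets the flag is already running, which is the dependency the description identifies between lnet_wait_known_routerstate() and the discovery thread.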
| Comments |
| Comment by Chris Horn [ 26/Mar/19 ] |
|
While trying out a quick fix for this issue, I noticed that 'lctl net up' will hang if check_routers_before_use is enabled but the routers don't have lnet loaded. |
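As an illustration of why an unresponsive router stalls an unbounded wait, here is a small Python sketch of a wait with a deadline. It is an assumption for discussion, not the quick fix mentioned in the comment and not LNet's behaviour: `wait_for_routers`, `probe`, and the second NID in the usage example are hypothetical.

```python
import time
from typing import Callable, Dict, List


def wait_for_routers(
    probe: Callable[[str], bool],
    gateways: List[str],
    deadline_s: float,
    poll_interval_s: float = 0.01,
) -> Dict[str, bool]:
    """Poll each gateway until all respond or the deadline passes.

    Returns a mapping of gateway NID -> whether its state became known.
    """
    known = {gw: False for gw in gateways}
    stop_at = time.monotonic() + deadline_s
    while True:
        for gw in gateways:
            if not known[gw]:
                # probe() is a hypothetical callback; a router without
                # lnet loaded would simply never answer.
                known[gw] = probe(gw)
        if all(known.values()) or time.monotonic() >= stop_at:
            return known
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # 192.168.2.99@tcp4 is a made-up NID standing in for a router that is down.
    result = wait_for_routers(
        probe=lambda gw: gw == "192.168.2.26@tcp4",
        gateways=["192.168.2.26@tcp4", "192.168.2.99@tcp4"],
        deadline_s=0.05,
    )
    print(result)
```

Without the deadline check, the loop would never return while any gateway stays silent, which matches the hang observed when the routers do not have lnet loaded.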
| Comment by Chris Horn [ 26/Mar/19 ] |
|
I must've gotten mixed up when checking whether this bug exists in master. The switch from the router_checker_thread to the monitoring thread is in master, but this bug is only introduced by Amir's patch to replace the router pings with discovery under |
| Comment by Malcolm Haak - NCI (Inactive) [ 02/Jan/20 ] |
|
This needs to be reopened.
We had all of production go down because the Lustre servers could not resolve the status of the LNet routers. |
| Comment by Peter Jones [ 02/Jan/20 ] |
|
mhaakddn, rather than reopening ancient similar tickets, please open a new ticket with the details of the incident for analysis. |
| Comment by Chris Horn [ 02/Jan/20 ] |
| Comment by Malcolm Haak - NCI (Inactive) [ 03/Jan/20 ] |
|
@Chris Horn, That would appear to be it. |