[LU-13001] check_routers_before_use causes LNet to hang indefinitely if any router is down Created: 22/Nov/19  Updated: 06/Jan/21  Resolved: 14/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Severity: 3

 Description   

Historically, check_routers_before_use would cause LNet
initialization to pause until all routers had been pinged once.

This behavior was changed in commit
fe17e9b8370affe063769b880f02b9190584baaa from LU-11298. Now, LNet
waits until discovery completes on all routers, with no upper bound
on the wait. This is problematic: if even one router is down, its
discovery never completes and LNet initialization stalls forever.
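
The difference between the two wait conditions can be illustrated with a
small Python model. This is only a sketch, not LNet code: the Router
class and the functions attempt_discovery, wait_until_all_discovered and
wait_for_single_attempt are hypothetical names invented for this
example, and the max_rounds bound exists only so that the sketch
terminates.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Hypothetical stand-in for a configured router (not an LNet type)."""
    name: str
    alive: bool
    attempts: int = 0         # discovery attempts made so far
    discovered: bool = False  # True once an attempt has succeeded


def attempt_discovery(router):
    """Make one discovery attempt; it succeeds only if the router is up."""
    router.attempts += 1
    router.discovered = router.alive


def wait_until_all_discovered(routers, max_rounds):
    """Model of the problematic wait: retry until every router is discovered.

    max_rounds is an assumption added so the sketch terminates; the wait
    described in this ticket has no such bound.
    Returns the round in which all routers were discovered, or None if
    the caller would still be waiting after max_rounds.
    """
    for round_no in range(1, max_rounds + 1):
        for router in routers:
            if not router.discovered:
                attempt_discovery(router)
        if all(router.discovered for router in routers):
            return round_no
    return None


def wait_for_single_attempt(routers):
    """Model of waiting for one discovery attempt per router.

    Returns True once every router has been tried at least once,
    whether or not the attempt succeeded.
    """
    for router in routers:
        if router.attempts == 0:
            attempt_discovery(router)
    return all(router.attempts >= 1 for router in routers)
```

With one live router and one down router, wait_until_all_discovered
never reaches its exit condition and returns None when the artificial
bound runs out, while wait_for_single_attempt returns after a single
pass and leaves the down router marked as not discovered.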



 Comments   
Comment by Gerrit Updater [ 22/Nov/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/36820
Subject: LU-13001 lnet: Wait for single discovery attempt of routers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9282b4f0c0198407d285bfa4c05f4dc5fa82b60c

Comment by Gerrit Updater [ 14/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36820/
Subject: LU-13001 lnet: Wait for single discovery attempt of routers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d45a032d9a5c6929f62e00e75d8fb0103cc0fbb4

Comment by Peter Jones [ 14/Dec/19 ]

Landed for 2.14

Comment by Malcolm Haak - NCI (Inactive) [ 03/Jan/20 ]

I believe this can be hit on 2.12.x. Is there any plan to backport to b2_12?

Comment by Jay Lan (Inactive) [ 05/Mar/20 ]

2.13.0 was affected.

Comment by Mahmoud Hanafi [ 06/Jan/21 ]

Can we get a backport to 2.12?

 
