[LU-14081] Filesystem timeout alignment Created: 28/Oct/20  Updated: 28/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Attachments: PDF File 13_lnet_tunings_lyashkov.pdf     Microsoft PowerPoint LAD-devel-2014.pptx    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There is an implicit dependency among the various timeouts in the system: ldlm, Adaptive Timeout (AT), LNet transaction timeout, LND timeout, and the underlying protocol timeout (TCP or IB). Ideally, the lower layers should time out before the upper layers. Currently, however, the timeouts are tuned independently. This can lead to a situation where the Adaptive Timeout fires first, forcing all the memory descriptors (MDs) to be cleaned up. Since the LNet/LND connection is still up at that point, we could then receive messages that reference MDs which have already been freed.

It would be better to devise a method to keep these timeout values in sync. One method is bottom-up, where the LND timeout determines the AT min. Another approach is top-down, where the AT min determines the LND timeout.
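
To make the two directions concrete, here is a minimal sketch in Python. It is illustrative only and is not Lustre code: the function names, the `margin` parameter, and the simple "add or subtract a margin" rule are all assumptions made for this example, and the real relationship between the layers may need to account for retries and other factors.

```python
# Hypothetical sketch: none of these names exist in Lustre or LNet.
# All values are in seconds.

def at_min_from_lnd(lnd_timeout: int, margin: int = 1) -> int:
    """Bottom-up: derive AT min so that the LND times out first."""
    if lnd_timeout <= 0 or margin <= 0:
        raise ValueError("lnd_timeout and margin must be positive")
    return lnd_timeout + margin


def lnd_from_at_min(at_min: int, margin: int = 1) -> int:
    """Top-down: derive the LND timeout so that it expires before AT min."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    lnd_timeout = at_min - margin
    if lnd_timeout <= 0:
        raise ValueError("at_min is too small to leave room for the LND timeout")
    return lnd_timeout


def is_aligned(timeouts: list[int]) -> bool:
    """Check that timeouts, listed from lowest layer to highest,
    are strictly increasing, so each lower layer expires first."""
    return all(low < high for low, high in zip(timeouts, timeouts[1:]))
```

Either direction yields the same invariant: the lower-layer value stays strictly below the upper-layer value. The choice is mainly about which parameter the administrator is expected to tune and which one is derived.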

We need to investigate the best approach.

A few presentations on the subject are attached to the ticket.

