[LU-15245] getxattr can lead to MDS thread exhaustion and deadlock Created: 17/Nov/21  Updated: 05/May/22  Resolved: 06/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement
Priority: Major
Reporter: Patrick Farrell
Assignee: Patrick Farrell
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
Rank (Obsolete): 9223372036854775807

 Description   

When SELinux is enabled, a getxattr becomes part of lookup.

So, this sequence occurs:

--------
Client A performs lookup on resource X, gets a lock - call it lock A - on resource X.
Client A starts getxattr request on resource X, getxattr takes a reference on lock A.  Lock A now can't be cancelled until the getxattr is complete.
[Pause client A here]

Client B attempts a modifying operation on resource X, so it requests a conflicting lock on resource X.
Call it lock B.
Lock B is now waiting behind lock A.  The waiting is done by an MDT worker thread, so this thread is now 'consumed' - it's busy and can't handle any other requests.

Now we have clients C, D, E, F, G, H, and so on.
They also want locks on resource X.
All of these locks (lock C, lock D, etc.) wait behind lock B.  (Some of them may be modifying locks, some may not be; it doesn't matter.)
Each of these waiting locks 'consumes' an MDT worker thread, because the thread has to do the waiting.

With enough clients, every MDT worker thread can be consumed like this.

So now, go back to client A:
The getxattr request arrives at the MDT.
No threads are available to service the getxattr request, so it waits.

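Client A is now holding lock A, but it cannot complete the getxattr that holds the reference on it, because every thread that could service the getxattr is itself waiting behind lock A.  So, eventually client A is evicted because it can't give the lock back.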
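
The sequence above can be sketched as a toy model. This is not Lustre code: the class, its fields, and the thread count of 4 are all hypothetical and chosen only for illustration. It assumes every lock request arriving while lock A is referenced pins one worker thread, and that a request finding no idle thread simply queues.

```python
from collections import deque


class ToyMDT:
    """Toy MDT: a fixed pool of worker threads and one request queue.

    Hypothetical model for illustration; none of these names exist in Lustre.
    """

    def __init__(self, num_threads):
        self.idle_threads = num_threads
        self.waiting_locks = []   # lock requests pinning a thread behind lock A
        self.queued = deque()     # requests that found no idle thread

    def submit(self, client, kind):
        """Hand a request to the MDT and report what happened to it."""
        if self.idle_threads == 0:
            # Nothing can pick the request up; it sits in the queue.
            self.queued.append((client, kind))
            return "queued"
        if kind == "lock":
            # The conflicting lock cannot be granted while lock A is
            # referenced, and the thread handling it does the waiting.
            self.idle_threads -= 1
            self.waiting_locks.append(client)
            return "blocked"
        # A getxattr that gets a thread finishes and frees it again.
        return "serviced"


mdt = ToyMDT(num_threads=4)
# Clients B, C, D, ... request locks on resource X while lock A is pinned.
lock_results = [mdt.submit(c, "lock") for c in "BCDEFGH"]
# Client A's getxattr reaches the MDT only after the pool is exhausted.
getxattr_result = mdt.submit("A", "getxattr")
print(lock_results, getxattr_result)
```

In this model the first four lock requests each pin a thread, so the getxattr from client A ends up queued, and nothing in the model can ever free a thread for it.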


The solution is to move getxattr - and getattr for consistency, as there may be a similar bug there as well - to the MDS_READPAGE portal.
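
A rough sketch of why a separate portal helps, under the assumption that each portal is served by its own pool of threads. The function, the "default" portal label, and the pool sizes are hypothetical and used only for illustration; only the MDS_READPAGE name comes from this ticket.

```python
def run(pools, requests):
    """Toy dispatcher, not Lustre code.

    pools:    dict mapping a portal label to its number of idle threads
    requests: list of (portal, kind) tuples, kind is "lock" or "getxattr"
    Returns a list of (kind, outcome) tuples in arrival order.
    """
    idle = dict(pools)
    outcomes = []
    for portal, kind in requests:
        if idle[portal] == 0:
            # No thread on this portal can pick the request up.
            outcomes.append((kind, "queued"))
        elif kind == "lock":
            # The thread is pinned while the lock waits behind lock A.
            idle[portal] -= 1
            outcomes.append((kind, "blocked"))
        else:
            # getxattr runs to completion and returns its thread.
            outcomes.append((kind, "serviced"))
    return outcomes


pools = {"default": 4, "MDS_READPAGE": 2}
locks = [("default", "lock")] * 8

# Before: getxattr shares the portal (and threads) with the lock requests.
before = run(pools, locks + [("default", "getxattr")])
# After: getxattr is sent to the MDS_READPAGE portal instead.
after = run(pools, locks + [("MDS_READPAGE", "getxattr")])
print(before[-1], after[-1])
```

In this model the getxattr is queued when it shares a portal with the blocked lock requests and is serviced when it is routed to MDS_READPAGE, after which client A could drop its reference on lock A.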



 Comments   
Comment by Gerrit Updater [ 17/Nov/21 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45593
Subject: LU-15245 mdc: GET(X)ATTR to READPAGE portal
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ebb035756eb059b255d4c8245d42bc5d5b96bab9

Comment by Gerrit Updater [ 06/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45593/
Subject: LU-15245 mdc: GET(X)ATTR to READPAGE portal
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5552eba1451d47ce1ba6c7ca112aa4b9b2f87292

Comment by Peter Jones [ 06/Jan/22 ]

Landed for 2.15

Generated at Sat Feb 10 03:16:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.