[LU-16536] MDS umount can get stuck due to LDLM locks Created: 07/Feb/23  Updated: 29/Jun/23  Resolved: 14/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Alex Zhuravlev Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16753 replay-single: test_135 timeout Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The scenario is as follows:

  • rmdir is trying to unlink a striped directory
  • a set of LDLM locks (from different MDTs) is held by the MDT
  • a number of RPCs are sent to other MDTs to destroy the objects
  • another MDT umounts, so no reply is sent back
  • this MDT starts to umount, deactivates OSP#0 (the corresponding RPCs are interrupted), and waits for the LDLM lock in this namespace to be released
  • the lock is still held by the original MDT thread doing the rmdir
  • the other OSPs are still active, trying to reconnect to the umounted MDTs
  • deadlock
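
The circular wait in the steps above can be sketched as a toy wait-for graph. This is only an illustration of one reading of the scenario, not Lustre code: the node names and the `find_cycle` helper are hypothetical, and the last edge (the pending RPCs on the other OSPs are only interrupted once umount gets around to deactivating those OSPs) is an assumption inferred from the description.

```python
# Toy wait-for graph: each key waits for its value.
# All names are illustrative; none are Lustre identifiers.
WAITS_FOR = {
    # umount waits for the LDLM lock in the OSP#0 namespace to be released
    "umount_thread": "ldlm_lock_release",
    # the lock is held by the thread doing the rmdir
    "ldlm_lock_release": "rmdir_thread",
    # rmdir waits for replies to RPCs sent through the other OSPs,
    # which keep trying to reconnect to the umounted MDTs
    "rmdir_thread": "rpc_via_other_osp",
    # assumption: those RPCs are only interrupted when umount
    # proceeds to deactivate the remaining OSPs
    "rpc_via_other_osp": "umount_thread",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None."""
    path = []
    seen = {}  # node -> index in path
    node = start
    while node is not None:
        if node in seen:
            # Revisited a node: everything from its first visit is the cycle.
            return path[seen[node]:]
        seen[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    return None


cycle = find_cycle(WAITS_FOR, "umount_thread")
print(" -> ".join(cycle + [cycle[0]]))
```

Under this model, breaking any one edge removes the cycle, for example if the wait for lock release no longer happens before the remaining OSPs are deactivated.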


 Comments   
Comment by Gerrit Updater [ 07/Feb/23 ]

"Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49925
Subject: LU-16536 osp: don't cleanup ldlm in precleanup phase
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fcec71d596988513e1ba841eda30a7336705c400

Comment by Gerrit Updater [ 14/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49925/
Subject: LU-16536 osp: don't cleanup ldlm in precleanup phase
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eed4d4c7523c26cfc5bc230986d96b2acf152dee

Comment by Peter Jones [ 14/Feb/23 ]

Landed for 2.16
