[LU-13617] Client dead lock leads to eviction from MDS (selinux is enabled) Created: 01/Jun/20  Updated: 27/Jun/22  Resolved: 13/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0, Lustre 2.12.7, Lustre 2.12.8
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Alexander Boyko
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

One thread took a PR (protected read) DLM lock and then blocked on inode_lock while notifying the security context. Another thread held inode_lock and waited for a CW (concurrent write) lock, which conflicts with PR. After the lock timeout the client was evicted because it could not cancel the PR lock. The deadlock happens on a parent directory.
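The ABBA pattern described above can be illustrated with a minimal sketch in plain Python threading (this is not Lustre code; the lock names are stand-ins for the client's DLM lock and the VFS inode_lock). The deadlock arises when one thread holds the DLM lock while acquiring the inode lock and the other does the reverse; the fix in the patch corresponds to not holding one lock across the acquisition of the other, as shown here:

```python
import threading

# Stand-ins for the two locks in the report: "dlm" plays the role of the
# client's PR/CW DLM lock, "inode" the VFS inode_lock.  Names are
# illustrative only.
dlm = threading.Lock()
inode = threading.Lock()
done = []

def reader():
    # Deadlocking variant would take dlm, then block on inode while still
    # holding it.  The fixed ordering (mirroring the patch) does the
    # inode-protected work without holding the DLM lock across it.
    with dlm:
        pass                    # e.g. read under the PR lock
    with inode:
        done.append("reader")   # e.g. notify the security context

def writer():
    with inode:
        pass                    # e.g. set up inode state
    with dlm:
        done.append("writer")   # e.g. take the conflicting CW lock

t1 = threading.Thread(target=reader)
t2 = threading.Thread(target=writer)
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)
print(sorted(done))  # both threads finish: ['reader', 'writer']
```

With the original ordering (inode acquired inside the `with dlm:` block and vice versa) the two threads can block each other forever, which on a real client shows up as the lock callback timing out and the MDS evicting the client.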



 Comments   
Comment by Gerrit Updater [ 01/Jun/20 ]

Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/38792
Subject: LU-13617 llite: don't hold inode_lock for security notify
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3f0d01e64c5248cf9542625bf5d5d3034460912b

Comment by Gerrit Updater [ 01/Jun/20 ]

Alexander Boyko (alexander.boyko@hpe.com) uploaded a new patch: https://review.whamcloud.com/38793
Subject: LU-13617 tests: check client deadlock selinux
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5efb04c50e42abc2e017cef49e065cc089fb7323

Comment by Gerrit Updater [ 23/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38792/
Subject: LU-13617 llite: don't hold inode_lock for security notify
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f87359b51f61a4baa9bf62faebb6625d518d23b4

Comment by Gerrit Updater [ 13/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38793/
Subject: LU-13617 tests: check client deadlock selinux
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f519f22c8ba3a6de00af0bef77cae3b4b18acdab

Comment by Hans Henrik Happe [ 22/Mar/22 ]

We hit this behavior in 2.12.8: the client got evicted due to a lock timeout when SELinux is enabled. Should this patch also go into 2.12?

Comment by Alexander Boyko [ 23/Mar/22 ]

The regression was introduced by:

 Fixes: 1d44980bcb ("LU-8956 llite: set sec ctx on client's inode at create time")

git log --grep=LU-8956 --oneline origin/b2_12
1d44980 LU-8956 llite: set sec ctx on client's inode at create time

So 2.12 has the same problem as described.

Comment by Etienne Aujames [ 08/Apr/22 ]

We hit this on a robinhood node:

  • with a cleanup policy doing parallel "mv" on a folder
  • and stat/getstripe threads on the same folders

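The robinhood workload above can be sketched as a minimal reproducer of the access pattern: one thread renaming files between folders while another stats the same entries. Paths and counts here are hypothetical; on a local filesystem this only exercises the pattern, since the actual deadlock needs a SELinux-enabled Lustre client:

```python
import os, tempfile, threading

# Hypothetical stand-in directories; on a real reproduction these would
# live on a SELinux-enabled Lustre mount.
root = tempfile.mkdtemp()
src, dst = os.path.join(root, "a"), os.path.join(root, "b")
os.mkdir(src); os.mkdir(dst)
for n in range(20):
    open(os.path.join(src, f"f{n}"), "w").close()

def mover():
    # Cleanup-policy thread: "mv" every file into the other folder.
    for n in range(20):
        os.rename(os.path.join(src, f"f{n}"), os.path.join(dst, f"f{n}"))

def statter():
    # Scan thread: stat entries in both folders while the moves run.
    for _ in range(200):
        for d in (src, dst):
            for name in os.listdir(d):
                try:
                    os.stat(os.path.join(d, name))
                except FileNotFoundError:
                    pass  # lost the race with a rename; harmless here

threads = [threading.Thread(target=mover), threading.Thread(target=statter)]
for t in threads: t.start()
for t in threads: t.join()
print(len(os.listdir(dst)))  # all 20 files ended up in b/
```

On an affected client the stat path (taking the PR lock on the parent, then inode_lock to notify the security context) races with the rename path (taking inode_lock, then a conflicting CW lock), which is the deadlock from the description.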
Comment by Gerrit Updater [ 08/Apr/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47025
Subject: LU-13617 llite: don't hold inode_lock for security notify
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: bf9ca48adbb69cda4196622ed89a55778c773e85

Comment by Gerrit Updater [ 11/Apr/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47034
Subject: LU-13617 tests: check client deadlock selinux
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c0906f8c6d23382fb36eb6e7c309012048d8b67f

Comment by Etienne Aujames [ 28/Apr/22 ]

We hit this issue during maintenance regression tests (after updating Lustre clients from 2.12.6 LTS to 2.12.7 LTS on compute nodes).
The file system was not usable with SELinux enabled (in permissive or enforcing mode) on compute nodes: client evictions, MDS unreachable (many MDT threads hung waiting for LDLM locks).

We were able to reproduce the issue with 5 nodes running mdtest.

The CEA will patch Lustre clients with https://review.whamcloud.com/47034 on compute nodes and run regression tests on it.
I will let you know the results.

For now, we cannot activate selinux on clients without this patch on Lustre LTS >= 2.12.7.
The b2_12 regression seems to come from https://review.whamcloud.com/41387/ ("LU-9193 security: return security context for metadata ops").

Generated at Sat Feb 10 03:02:47 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.