Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.5.3, Lustre 2.8.0
- Labels: None
- Severity: 3
Description
A performance problem at one of our customers led us to discover that the granted LDLM locks counter (found in /proc/fs/lustre/ldlm/namespaces/mdt-fsname-MDT0000_UUID/pool/granted) is missing some decrements under conditions yet to be determined.
Over time, this causes the counter to largely exceed the value found in /proc/fs/lustre/ldlm/namespaces/mdt-fsname-MDT0000_UUID/pool/limit.
See here:
[root@prolixmds1 pool]# pwd
/proc/fs/lustre/ldlm/namespaces/mdt-scratch-MDT0000_UUID/pool
[root@prolixmds1 pool]# cat limit
3203616
[root@prolixmds1 pool]# cat granted
54882822
However, summing up the granted locks as seen by all the clients, we get only ~16k locks, which is also consistent with the slab consumption on the MDS.
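The cross-client check above can be sketched as follows. The collection command and client list are assumptions (in practice the per-client counts would come from something like `pdsh -w client[001-100] 'lctl get_param -n ldlm.namespaces.*.lock_count'`); the snippet only demonstrates the aggregation step on sample per-client values.

```shell
# Sample per-client granted-lock counts standing in for real output
# gathered from the clients (hypothetical values).
printf '%s\n' 4096 8192 4000 > /tmp/granted_sample.txt

# Sum the per-client counts into a cluster-wide total.
awk '{ sum += $1 } END { printf "total granted across clients: %d\n", sum }' \
    /tmp/granted_sample.txt
```

Comparing that total against the server-side pool/granted value is what exposed the mismatch here.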
Once above the limit, the MDS constantly tries to cancel locks, even those not above max_age. Clients then reacquire the locks, but lose time in the process (hence the performance problem).
Note that since only the counter is wrong, there is no resource over-consumption tied to this problem.
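To make the suspected failure mode concrete, here is a minimal illustration (not Lustre code) of the leak pattern: every grant increments the pool counter, but one teardown path forgets the matching decrement, so the counter keeps growing even though the locks themselves are freed. All function names here are hypothetical.

```shell
# Analogue of the pool's granted counter.
granted=0

grant()         { granted=$((granted + 1)); }
cancel()        { granted=$((granted - 1)); }   # balanced path
destroy_buggy() { :; }                          # lock freed, decrement forgotten

for i in $(seq 100); do grant; done
for i in $(seq 90);  do cancel; done
for i in $(seq 10);  do destroy_buggy; done

# All 100 locks are gone, yet the counter still reports 10 granted.
echo "granted counter: $granted"
```

Repeated over months of metadata activity, a path like `destroy_buggy` would produce exactly the divergence seen above: a huge granted count with no matching slab usage.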
We found that this problem is also seen on 2.8.
Can you help find where the leak comes from?
I also wonder if there is any relation with the last comment from Shuichi Ihara in LU-5727.
I also think Christopher Morrone pointed this out here.
Attachments
Issue Links
- is related to: LU-8634 2.8.0 MDS (layout.c:2025:__req_capsule_get()) @@@ Wrong buffer for field `quota_body' (3 of 1) in format `LDLM_INTENT_QUOTA': 0 vs. 112 (server) (Resolved)