Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Affects Version: Lustre 2.12.0
- Environment: CentOS 7.6
- Severity: 3
Description
I'm investigating a metadata slowdown we had tonight on Fir. A simple find was extremely slow. However, once I started gathering stats the performance came back, so things now seem OK. I can still see a high rate of ldlm_cancel RPCs, though, so I wanted to report it. I have a script (which I can share if needed) that takes a 5-second sample of Lustre RPCs on the MDS, and it shows a high rate of ldlm_cancel requests. I also see a lot of "Prolong DOM lock" messages in the full logs.
I'm attaching the output of my script as fir-md1-s2-lrpc-sample.log, which shows, for each NID seen in a 5-second rpctrace/rpc debug sample, the RPC types found and their counts. For example:

Total_RPC_count_for_NID NID LND# RPC_type:count,RPC_type:count,...
3718 sh-107-42-ib0.ib o2ib4 mds_close:1285,ldlm_enqueue:1213,ldlm_cancel:1220
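For context, the per-NID aggregation such a script performs could be sketched as below. This is a minimal sketch, not the attached script: the regex is an assumption based on the usual `Handling RPC pname:cluuid+ref:pid:xid:nid:opc` rpctrace line, the opcode-to-name table covers only the three opcodes shown above, and both may need adjusting for a given Lustre version.

```python
import re
from collections import Counter, defaultdict

# Small opcode-to-name table (assumed values from Lustre's wire protocol
# headers; extend/verify against your tree's lustre_idl.h).
OPC_NAMES = {35: "mds_close", 101: "ldlm_enqueue", 103: "ldlm_cancel"}

# Assumed line shape: the rpctrace "Handling RPC" message ends with
# ...:<pid>-<nid>:<opc>, where <nid> looks like 10.9.107.42@o2ib4.
RPC_RE = re.compile(r"Handling RPC .*:(?P<nid>[0-9A-Za-z.\-]+@[0-9A-Za-z]+):(?P<opc>\d+)\s*$")

def count_rpcs(lines):
    """Aggregate per-NID RPC-type counts from rpctrace debug-log lines."""
    per_nid = defaultdict(Counter)
    for line in lines:
        m = RPC_RE.search(line)
        if not m:
            continue
        nid = m.group("nid").split("-")[-1]          # drop the "pid-" prefix
        opc = int(m.group("opc"))
        per_nid[nid][OPC_NAMES.get(opc, str(opc))] += 1
    return per_nid

def report(per_nid):
    """Render lines like: Total_RPC_count NID rpc_type:count,... (busiest NID first)."""
    rows = []
    for nid, counts in sorted(per_nid.items(), key=lambda kv: -sum(kv[1].values())):
        types = ",".join(f"{t}:{n}" for t, n in counts.most_common())
        rows.append(f"{sum(counts.values())} {nid} {types}")
    return rows
```

Feeding it a dumped debug log (`lctl dk > sample.log`, then `count_rpcs(open("sample.log"))`) would produce per-NID summary lines in the same shape as the example above.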
I'm also attaching a 5-second full rpctrace/rpc debug dump of fir-md1-s2 (MDT0001 and MDT0003) as fir-md1-s2-rpctrace-rpc-20190430-5s.log.gz; I picked this server because it is the most loaded one.
I wonder whether the patch for LU-10777 (DoM performance is bad with FIO write), which just landed on master, could help us here. Or, to put the question differently: do resends trigger these ldlm_cancel RPCs?