  1. Lustre
  2. LU-12250

MDS with high rate of ldlm_cancel and Prolong DOM lock

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      I'm investigating a metadata slowdown we had tonight on Fir. A simple find was very slow. However, performance came back when I started to gather stats, so it seems OK now. I can still see a lot of ldlm_cancel RPCs, though, so I wanted to report it. I have a script (which I can share if needed) that takes a 5-second sample of Lustre RPCs on the MDS, and it shows a high rate of ldlm_cancel RPCs. I also see a lot of "Prolong DOM lock" messages in the full logs.
      I'm attaching the output of my script as fir-md1-s2-lrpc-sample.log. It lists the NIDs seen in a 5-second rpctrace/rpc debug sample, along with each RPC type found and its count, for example:

      Total_RPC_count_for_NID NID LND# RPC_type:count,RPC_type:count,...
      3718 sh-107-42-ib0.ib o2ib4 mds_close:1285,ldlm_enqueue:1213,ldlm_cancel:1220
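
To make the sample format concrete, here is a minimal Python sketch that parses one such line and converts the counts to per-second rates over the 5-second window. It is not the script mentioned above; the names `NidSample`, `parse_sample_line` and `rates_per_second` are hypothetical, and it assumes each line has exactly the four whitespace-separated fields shown in the header.

```python
from collections import Counter
from typing import Dict, NamedTuple


class NidSample(NamedTuple):
    """One line of the sample output (hypothetical structure)."""
    total: int      # Total_RPC_count_for_NID
    nid: str        # client NID
    lnd: str        # LND / network name, e.g. o2ib4
    rpcs: Counter   # RPC type -> count


def parse_sample_line(line: str) -> NidSample:
    """Parse 'total NID LND type:count,type:count,...' into a NidSample."""
    total, nid, lnd, rpc_field = line.split(None, 3)
    rpcs = Counter()
    for item in rpc_field.strip().split(","):
        # Split on the last ':' so the count is always the final component.
        rpc_type, _, count = item.rpartition(":")
        rpcs[rpc_type] += int(count)
    return NidSample(int(total), nid, lnd, rpcs)


def rates_per_second(sample: NidSample, window_s: float = 5.0) -> Dict[str, float]:
    """Convert raw counts into RPCs per second (assumes a 5 s sample window)."""
    return {rpc_type: count / window_s for rpc_type, count in sample.rpcs.items()}


line = "3718 sh-107-42-ib0.ib o2ib4 mds_close:1285,ldlm_enqueue:1213,ldlm_cancel:1220"
sample = parse_sample_line(line)

# The per-type counts should add up to the leading total.
assert sample.total == sum(sample.rpcs.values())

print(rates_per_second(sample)["ldlm_cancel"])  # 244.0
```

For the client shown above, this works out to about 244 ldlm_cancel RPCs per second, roughly matching its ldlm_enqueue and mds_close rates.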
      

      I'm also attaching a full 5-second rpctrace/rpc debug log of fir-md1-s2 (MDT0001 and MDT0003) as fir-md1-s2-rpctrace-rpc-20190430-5s.log.gz. I picked this server because it is the most loaded one.

      I wonder whether the patch for LU-10777 (DoM performance is bad with FIO write), which just landed in master, could help us in this case. Or, put differently: do resends trigger such ldlm_cancel RPCs?

          People

            Assignee: Oleg Drokin (green)
            Reporter: Stephane Thiell (sthiell)