Lustre / LU-3308

large readdir chunk size slows unlink/"rm -r" performance

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.1.6, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.12.0
    • 3
    • 8195

    Description

      Shared-directory unlinks under metabench seem significantly slower with Lustre 2.x clients than with Lustre 1.8.x clients.

      512 processes, 8 ppn, 2.1 server
      1.8.6 clients: 18k unlinks/s
      2.3 clients: 7k unlinks/s

      Creates are comparable at 28k creates/s.

      Mdtest shows no such regression. Digging into metabench a little, it seems that when deleting files, each metabench process (I am told) calls readdir, selects "the next" file, and deletes it, so the processes are effectively racing each other, whereas mdtest deterministically chooses the files to delete based on process ID.
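The two deletion patterns described above can be sketched as follows. This is a minimal illustration only, not the actual metabench or mdtest code: the function names, the `f00000`-style file naming, and the rank/stride partitioning are all assumptions made for the example.

```python
import os


def make_files(directory, count):
    """Create `count` empty files named f00000, f00001, ... (hypothetical naming)."""
    for i in range(count):
        open(os.path.join(directory, f"f{i:05d}"), "w").close()


def racing_unlink(directory):
    """Racing pattern: re-read the directory, take the first entry, unlink it.

    Every process runs the same loop against the same shared directory,
    so several processes may pick the same entry; losing that race shows
    up as FileNotFoundError and the process simply tries again.
    """
    removed = 0
    while True:
        with os.scandir(directory) as entries:
            entry = next(entries, None)
        if entry is None:
            return removed
        try:
            os.unlink(entry.path)
            removed += 1
        except FileNotFoundError:
            pass  # another process unlinked this entry first


def partitioned_unlink(directory, rank, nprocs, total):
    """Deterministic pattern: each process derives its own file names from
    its rank and never needs to read the directory at all."""
    removed = 0
    for i in range(rank, total, nprocs):
        os.unlink(os.path.join(directory, f"f{i:05d}"))
        removed += 1
    return removed
```

In the racing pattern every unlink is preceded by a fresh directory read, so each process repeatedly needs a current view of a directory that all the other processes are modifying; the partitioned pattern only issues unlinks.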

      This seems to imply it's directory locking contention on the MDT.

      On the clients, most of the difference in time is spent in ptlrpc_queue_wait (not sure of units):
      1.8.8: 347
      2.3: 923

      On the MDT, the big difference is mdt_object_find_lock:
      1.8.8: 51us
      2.3: 1110us
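For a quick sense of scale, the ratios implied by the figures above can be computed directly. This is plain arithmetic on the reported numbers; the variable names are made up for the example, and the throughput figures were reported for 1.8.6 clients while the timings were reported for 1.8.8.

```python
# Figures as reported above: (1.8.x value, 2.3 value).
unlinks_per_sec = (18_000, 7_000)   # client unlink throughput
queue_wait = (347, 923)             # client ptlrpc_queue_wait, units unknown
find_lock_us = (51, 1110)           # MDT mdt_object_find_lock, microseconds

throughput_drop = unlinks_per_sec[0] / unlinks_per_sec[1]
queue_wait_growth = queue_wait[1] / queue_wait[0]
find_lock_growth = find_lock_us[1] / find_lock_us[0]

print(f"unlink throughput drop:       {throughput_drop:.2f}x")
print(f"ptlrpc_queue_wait growth:     {queue_wait_growth:.2f}x")
print(f"mdt_object_find_lock growth:  {find_lock_growth:.2f}x")
```

The client-side wait grows by roughly the same factor that throughput drops (about 2.6x to 2.7x), while the MDT-side lock time grows by about 22x.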

      Also, according to the ldlm stats, the 2.3 clients seem to cause twice as many ldlm_bl_callbacks as the 1.8.8 clients.

      So apparently the 2.3 clients are holding directory locks differently than the 1.8 clients. We're still looking into this, but if anyone has thoughts we'd love to hear them.

            People

              wc-triage WC Triage
              nrutman Nathan Rutman