Lustre / LU-7988

HSM: high lock contention for cdt_llog_lock


    Description

      There is a significant lock contention problem around cdt_llog_lock when adding new HSM requests.

      # time wc -l /proc/fs/lustre/mdt/snx11133-MDT0000/hsm/actions
      219759 /proc/fs/lustre/mdt/snx11133-MDT0000/hsm/actions
      
      real    11m45.068s
      user    0m0.020s
      sys     0m21.372s
      

      Nearly 12 minutes to read the list is far too long. Such an operation should take a couple of seconds at most.
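      To repeat the measurement from a script, a minimal sketch along these lines could be used. The helper name `count_lines_timed` is hypothetical and not part of Lustre. It counts the lines of a file and reports the wall-clock time spent, much like `time wc -l`.

```python
import time


def count_lines_timed(path):
    """Count the lines in `path` and return (line_count, elapsed_seconds)."""
    start = time.monotonic()
    count = 0
    # Read in binary mode so no decoding work is included in the timing.
    with open(path, "rb") as f:
        for _ in f:
            count += 1
    return count, time.monotonic() - start
```

      Calling it with the actions file shown above (/proc/fs/lustre/mdt/snx11133-MDT0000/hsm/actions) should give the same line count as wc, with an elapsed time comparable to the "real" figure.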

      The contention appears to come from the coordinator. Every time a new request is posted, the whole list of requests is traversed under that lock. That's not a problem when there are only a handful of requests, but it doesn't scale when there are hundreds of thousands of them.
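      The scaling problem can be illustrated with a toy model. This is only a sketch of the behaviour described above, not the coordinator code: `ToyCoordinator`, `add_request` and `scan_cost` are hypothetical names, and the model assumes every new request scans all existing records while holding one lock.

```python
import threading


class ToyCoordinator:
    """Hypothetical stand-in for the coordinator's list of HSM requests."""

    def __init__(self):
        self.lock = threading.Lock()  # plays the role of cdt_llog_lock
        self.requests = []
        self.records_scanned = 0

    def add_request(self, fid):
        with self.lock:
            # Assumption: the whole list is walked before the new record
            # is appended, and the lock is held for the entire walk.
            for _existing in self.requests:
                self.records_scanned += 1
            self.requests.append(fid)


def scan_cost(n):
    """Return the number of records visited while adding n requests."""
    cdt = ToyCoordinator()
    for fid in range(n):
        cdt.add_request(fid)
    return cdt.records_scanned
```

      In this model, adding n requests visits n(n-1)/2 records in total, so the work done under the lock grows quadratically: about 5 x 10^7 visits for 10000 requests, and about 2.4 x 10^10 for a list of 219759.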

      I recompiled a CentOS 7 kernel with CONFIG_LOCK_STAT on a VM, then ran a test that created 10000 files and archived them without a copytool present. Total time was 146 seconds. Lock contention results:

      lock_stat version 0.3
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      
      [...]
      
                           &cdt->cdt_llog_lock:          6296           6296          15.45       23074.17    43436574.06          17791          27134          25.09       37745.03   138558199.24
                           -------------------
                           &cdt->cdt_llog_lock           6296          [<ffffffffa0fb096d>] cdt_llog_process+0x9d/0x3a0 [mdt]
                           -------------------
                           &cdt->cdt_llog_lock           6296          [<ffffffffa0fb096d>] cdt_llog_process+0x9d/0x3a0 [mdt]
      [...]
      

      (Time units are microseconds.)

      With waittime-total at about 43 seconds and holdtime-total at about 138 seconds (roughly 95% of the 146-second run), this is a very contended lock, far above the other locks in Lustre or in the rest of the system.
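      The figures can be checked with a few lines of arithmetic. The sketch below takes the numbers from the cdt_llog_lock row above; `summarize` is a hypothetical helper, and the comparison with the run time assumes the statistics cover the same 146-second test.

```python
US_PER_SECOND = 1_000_000


def summarize(contentions, acquisitions, wait_total_us, hold_total_us, run_seconds):
    """Derive per-event averages and totals in seconds from lock_stat counters."""
    return {
        "wait_total_s": wait_total_us / US_PER_SECOND,
        "hold_total_s": hold_total_us / US_PER_SECOND,
        # holdtime-total relative to the wall-clock length of the test
        "hold_fraction_of_run": hold_total_us / US_PER_SECOND / run_seconds,
        "mean_hold_us": hold_total_us / acquisitions,
        "mean_wait_us": wait_total_us / contentions,
        # share of acquisitions that had to wait for the lock
        "contended_fraction": contentions / acquisitions,
    }


# Values copied from the &cdt->cdt_llog_lock row of the lock_stat output.
stats = summarize(
    contentions=6296,
    acquisitions=27134,
    wait_total_us=43436574.06,
    hold_total_us=138558199.24,
    run_seconds=146,
)
```

      This gives a mean hold time of about 5.1 ms per acquisition and a mean wait of about 6.9 ms per contention, with roughly 23% of acquisitions contended.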

      AFAICS, contention is between these mechanisms:

      • adding a new request (lfs hsm_archive, ...)
      • changing a request status (WAITING->STARTED->SUCCEED)
      • removing a request (archive completed)
      • housekeeping (coordinator loop every 10 seconds)
      • dumping the list of actions from /proc

      The net result is that when there are a lot of requests, they only trickle down to the copytool, so the list keeps growing and the problem gets worse.

            People

              fzago Frank Zago (Inactive)