osd-zfs: object creation may serialize on lu_site::ls_purge_mutex

Details

    • Improvement
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.7.0
    • 16144

    Description

      LU-5331 introduced lu_site::ls_purge_mutex to serialize lu_site_purge(). But in osd-zfs, every object creation calls lu_object_limit(), which in turn calls lu_site_purge() if the cache is too big.

      Contention on the mutex can happen when multiple threads are creating objects and the cache is near the lu_cache_nr limit. In LU-5747 I saw stacks like:

       [<ffffffff8106306c>] try_to_wake_up+0x3c/0x3e0
       [<ffffffffa0f0e219>] ? echo_object_free+0x159/0x2f0 [obdecho]
       [<ffffffff81063465>] wake_up_process+0x15/0x20
       [<ffffffff8150f7e4>] __mutex_unlock_slowpath+0x44/0x60
       [<ffffffff8150f79b>] mutex_unlock+0x1b/0x20
       [<ffffffffa07a4907>] lu_site_purge+0x3f7/0x4e0 [obdclass]
       [<ffffffffa07a4e31>] lu_object_limit+0x71/0x80 [obdclass]
       [<ffffffffa07a4f93>] lu_object_find_try+0x153/0x2b0 [obdclass]
      

      This indicates contention on the mutex, so it may hurt object creation rates on osd-zfs. I don't have any data to support that yet, due to LU-5747.
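      To illustrate why this matters near the limit, here is a minimal user-space model of the call pattern described above. It is not the obdclass code: struct cache_model, model_create(), model_object_limit() and model_purge() are hypothetical names, and the model assumes the limit check runs after the new object is counted. It only counts how often a create ends up in the purge path, which is where ls_purge_mutex would be taken.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-site object cache; not a Lustre type. */
struct cache_model {
	long cached;      /* objects currently in the cache */
	long limit;       /* plays the role of lu_cache_nr */
	long purge_calls; /* times the purge path (and its mutex) was entered */
};

/* Models lu_site_purge(): drop nr objects and count the call. */
static void model_purge(struct cache_model *c, long nr)
{
	assert(nr > 0 && nr <= c->cached);
	c->purge_calls++;
	c->cached -= nr;
}

/* Models lu_object_limit(): purge only the excess over the limit. */
static void model_object_limit(struct cache_model *c)
{
	if (c->cached > c->limit)
		model_purge(c, c->cached - c->limit);
}

/* Models one object creation: cache the object, then enforce the limit. */
static void model_create(struct cache_model *c)
{
	c->cached++;
	model_object_limit(c);
}
```

      In this model, a cache that starts exactly at the limit enters the purge path on every single create, each time to evict one object, while a cache well below the limit never does. With several creating threads, each of those purge calls would queue on the same mutex.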

          Activity

            [LU-5749] osd-zfs: object creation may serialize on lu_site::ls_purge_mutex

            adilger Andreas Dilger added a comment - Fixed via patch http://review.whamcloud.com/19082 "LU-7896: do not call lu_site_purge() for single object exceed".
            yong.fan nasf (Inactive) added a comment - I hit it on master: https://testing.hpdd.intel.com/test_sets/a6c0f402-d2ed-11e4-a357-5254006e85c2
            adilger Andreas Dilger added a comment (edited) - It probably makes sense for lu_site_purge() to use mutex_trylock() and just return immediately if ls_purge_mutex is held and another thread is dropping the cache (this needs a static variable, updated by the thread holding ls_purge_mutex, indicating whether it is doing a full purge or not). There is no reason for other threads to be blocked if one is already dropping the entire cache. There is also no reason for threads to block when doing a limited cache shrink if another thread is also doing a limited shrink.
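            A rough user-space sketch of the trylock idea in the comment above, under stated assumptions. It is not the patch that was landed and not kernel code: a C11 atomic_flag stands in for mutex_trylock() on ls_purge_mutex, and struct site_model, site_purge_try(), PURGE_ALL and PURGE_WOULD_BLOCK are hypothetical names. The full_purge field plays the role of the proposed variable recording whether the current holder is dropping the whole cache.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PURGE_ALL         (-1) /* hypothetical: request to drop everything */
#define PURGE_WOULD_BLOCK (-2) /* hypothetical: caller would have to wait */

/* Hypothetical stand-in for lu_site; not the real structure. */
struct site_model {
	atomic_flag purge_busy; /* models ls_purge_mutex (set == held) */
	atomic_bool full_purge; /* holder is doing a full purge */
	int         cached;     /* objects in the cache */
};

static void site_model_init(struct site_model *s, int cached)
{
	atomic_flag_clear(&s->purge_busy);
	atomic_init(&s->full_purge, false);
	s->cached = cached;
}

/*
 * Returns the number of objects purged, 0 if the purge was skipped because
 * another purge already covers it, or PURGE_WOULD_BLOCK if a full purge was
 * requested while only a limited purge is running (real code would fall
 * back to a blocking lock there).
 */
static int site_purge_try(struct site_model *s, int nr)
{
	bool want_all = (nr == PURGE_ALL);
	int purged;

	assert(want_all || nr >= 0);

	if (atomic_flag_test_and_set(&s->purge_busy)) {
		/* Trylock failed. In this sketch full_purge may briefly lag
		 * behind the holder taking the lock. */
		if (!want_all || atomic_load(&s->full_purge))
			return 0;
		return PURGE_WOULD_BLOCK;
	}

	/* We hold the lock: publish what kind of purge this is. */
	atomic_store(&s->full_purge, want_all);
	purged = (want_all || nr > s->cached) ? s->cached : nr;
	s->cached -= purged;
	atomic_store(&s->full_purge, false);
	atomic_flag_clear(&s->purge_busy);
	return purged;
}
```

            With this shape, a creating thread that only needs a limited shrink never sleeps on the lock. The one case that still has to wait is a full purge arriving behind a limited one, since the limited purge does not guarantee the cache ends up empty.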

            People

              bzzz Alex Zhuravlev
              isaac Isaac Huang (Inactive)