[LU-5749] osd-zfs: object creation may serialize on lu_site::ls_purge_mutex Created: 16/Oct/14  Updated: 13/Feb/19  Resolved: 13/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Isaac Huang (Inactive) Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: RZ_LS, zfs

Issue Links:
Related
is related to LU-5331 qsd_handler.c:1139:qsd_op_adjust()) A... Resolved
is related to LU-7896 lu_object_limit() is called too frequ... Resolved
Rank (Obsolete): 16144

 Description   

LU-5331 introduced lu_site::ls_purge_mutex to serialize lu_site_purge(). However, in osd-zfs every object creation calls lu_object_limit(), which in turn calls lu_site_purge() if the cache is too big.

Contention on the mutex can happen when multiple threads are creating objects and the cache is near the lu_cache_nr limit. In LU-5747 I saw stacks like:

 [<ffffffff8106306c>] try_to_wake_up+0x3c/0x3e0
 [<ffffffffa0f0e219>] ? echo_object_free+0x159/0x2f0 [obdecho]
 [<ffffffff81063465>] wake_up_process+0x15/0x20
 [<ffffffff8150f7e4>] __mutex_unlock_slowpath+0x44/0x60
 [<ffffffff8150f79b>] mutex_unlock+0x1b/0x20
 [<ffffffffa07a4907>] lu_site_purge+0x3f7/0x4e0 [obdclass]
 [<ffffffffa07a4e31>] lu_object_limit+0x71/0x80 [obdclass]
 [<ffffffffa07a4f93>] lu_object_find_try+0x153/0x2b0 [obdclass]

This indicates contention on the mutex, so it may hurt object creation rates on osd-zfs. I don't have any data to support that yet, though, because of LU-5747.



 Comments   
Comment by Andreas Dilger [ 17/Oct/14 ]

It probably makes sense for lu_site_purge() to use mutex_trylock() and just return immediately if ls_purge_mutex is held and another thread is already dropping the cache. This needs a static variable, updated by the thread holding ls_purge_mutex, that indicates whether it is doing a full purge or not. There is no reason for other threads to be blocked if one is already dropping the entire cache. There is also no reason for threads to block when doing a limited cache shrink if another thread is also doing a limited shrink.
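
The trylock idea above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not the Lustre code: struct site, site_init() and site_purge() are hypothetical stand-ins, pthread_mutex_trylock() stands in for the kernel's mutex_trylock(), the flag lives in the struct rather than in a static variable, and "nr < 0 means purge everything" is a convention chosen for the sketch.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct lu_site; not the Lustre definition. */
struct site {
	pthread_mutex_t purge_mutex;	/* plays the role of ls_purge_mutex */
	atomic_bool full_purge;		/* true while the holder drops everything */
	int cached;			/* cached objects; guarded by purge_mutex */
};

static void site_init(struct site *s, int cached)
{
	pthread_mutex_init(&s->purge_mutex, NULL);
	atomic_init(&s->full_purge, false);
	s->cached = cached;
}

/*
 * Purge up to nr objects, or everything if nr < 0 (sketch convention).
 * Returns the number of objects purged, or -1 if the call was skipped
 * because another thread was already purging.
 */
static int site_purge(struct site *s, int nr)
{
	bool want_full = nr < 0;
	int purged;

	if (pthread_mutex_trylock(&s->purge_mutex) != 0) {
		/*
		 * Someone else holds the mutex. A limited shrink never
		 * waits, and a full purge does not wait behind another
		 * full purge.
		 */
		if (!want_full || atomic_load(&s->full_purge))
			return -1;
		/* Full purge behind a limited shrink: wait for our turn. */
		pthread_mutex_lock(&s->purge_mutex);
	}

	/* Tell would-be waiters what kind of purge is in progress. */
	atomic_store(&s->full_purge, want_full);
	purged = (want_full || nr > s->cached) ? s->cached : nr;
	s->cached -= purged;
	atomic_store(&s->full_purge, false);

	pthread_mutex_unlock(&s->purge_mutex);
	return purged;
}
```

With this shape, the only caller that still blocks is a full purge that arrives while a limited shrink is running; every other caller that finds the mutex held returns at once instead of queueing on it.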

Comment by nasf (Inactive) [ 26/Mar/15 ]

I hit it on master:
https://testing.hpdd.intel.com/test_sets/a6c0f402-d2ed-11e4-a357-5254006e85c2

Comment by Andreas Dilger [ 13/Feb/19 ]

Fixed via patch http://review.whamcloud.com/19082 "LU-7896: do not call lu_site_purge() for single object exceed".
