[LU-17181] lu_sites_guard sem caused a page reclaim starvation. Created: 11/Oct/23  Updated: 08/Nov/23  Resolved: 08/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The Linux MM can run several cache reclaims in parallel,
but the lu_sites_guard semaphore blocks the other threads from making progress,
especially because of the cond_resched inside the guarded section.

PID: 98822  TASK: ffff9766015e0000  CPU: 11  COMMAND: "zabbix_agent2"
 #0 [ffffbeed4e617920] __schedule+708 at ffffffff9b54e1d4
 #1 [ffffbeed4e6179b8] schedule+56 at ffffffff9b54e648
 #2 [ffffbeed4e6179c8] rwsem_down_read_slowpath+864 at ffffffff9b5511d0
 #3 [ffffbeed4e617a60] lu_cache_shrink_count+30 at ffffffffc0fb34fe [obdclass]
 #4 [ffffbeed4e617a70] do_shrink_slab+84 at ffffffff9ae74344
 #5 [ffffbeed4e617ae0] shrink_slab+190 at ffffffff9ae74b6e
 #6 [ffffbeed4e617b60] shrink_node+412 at ffffffff9ae795ec
 #7 [ffffbeed4e617be0] do_try_to_free_pages+201 at ffffffff9ae79bb9
 #8 [ffffbeed4e617c30] try_to_free_pages+239 at ffffffff9ae79fbf
 #9 [ffffbeed4e617cd0] __alloc_pages_slowpath+945 at ffffffff9aebd7b1
#10 [ffffbeed4e617dc8] __alloc_pages_nodemask+643 at ffffffff9aebe3a3
#11 [ffffbeed4e617e28] __get_free_pages+10 at ffffffff9aeb86ca

vs. the thread that is rescheduled via cond_resched while purging objects:

PID: 98811  TASK: ffff977ba2865f00  CPU: 16  COMMAND: "p_check_lustre_"
 #0 [ffffbeed6cf3f828] __schedule+708 at ffffffff9b54e1d4
 #1 [ffffbeed6cf3f8c0] preempt_schedule_common+10 at ffffffff9b54e6fa
 #2 [ffffbeed6cf3f8c8] _cond_resched+29 at ffffffff9b54e72d
 #3 [ffffbeed6cf3f8d0] mutex_lock+14 at ffffffff9b55087e
 #4 [ffffbeed6cf3f8e0] lod_striping_free+27 at ffffffffc1693a2b [lod]
 #5 [ffffbeed6cf3f900] lod_object_free+158 at ffffffffc169c43e [lod]
 #6 [ffffbeed6cf3f910] lu_object_free+216 at ffffffffc0fb2ed8 [obdclass]
 #7 [ffffbeed6cf3f978] lu_site_purge_objects+982 at ffffffffc0fb5d16 [obdclass]
 #8 [ffffbeed6cf3fa18] lu_cache_shrink_scan+146 at ffffffffc0fb5fe2 [obdclass]
 #9 [ffffbeed6cf3fa70] do_shrink_slab+300 at ffffffff9ae7441c
#10 [ffffbeed6cf3fae0] shrink_slab+190 at ffffffff9ae74b6e


 Comments   
Comment by Gerrit Updater [ 11/Oct/23 ]

"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52627
Subject: LU-17181 misc: don't block reclaim threads
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6c1ed8ddd5c6d7d35283dd6217c95fe90c4d1a2e

Comment by Gerrit Updater [ 08/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52627/
Subject: LU-17181 misc: don't block reclaim threads
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2c97684db9d9286a2916420138529b4fbd0e4bbe

Comment by Peter Jones [ 08/Nov/23 ]

Landed for 2.16

Generated at Sat Feb 10 03:33:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.