LU-7997: RCU stalls waiting for lu_sites_guard mutex in lu_cache_shrink_count


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: Lustre 2.7.0 clients on SLES12
    • Severity: 3

    Description

      The lu_cache_shrink_count algorithm introduced by LU-6365 does not scale well as the number of processors increases. In low memory conditions, many processes calling into lu_cache_shrink concurrently trigger RCU stalls. Most of the processes are waiting on the lu_sites_guard mutex. The process holding the mutex is executing in ls_stats_read.

      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 201}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 175}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 116}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 253}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 194}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 21}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 207}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 230}  (t=60004 jiffies g=111081 c=111080 q=22552)
      c0-0c1s14n0 INFO: rcu_sched detected stalls on CPUs/tasks: { 230} (detected by 265, t=60005 jiffies, g=111081, c=111080, q=22552)
      
      
      c0-0c1s14n0 CPU: 182 PID: 47501 Comm: mem_seg_registe Tainted: P           O  3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
      c0-0c1s14n0 RIP: 0010:[<ffffffffa04f5a51>]  [<ffffffffa04f5a51>] lprocfs_stats_collect+0xb1/0x180 [obdclass]
      c0-0c1s14n0 Call Trace:
      c0-0c1s14n0 [<ffffffffa05188d9>] ls_stats_read+0x19/0x50 [obdclass]
      c0-0c1s14n0 [<ffffffffa051a66c>] lu_cache_shrink_count+0x5c/0x120 [obdclass]
      c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
      c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
      c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
      c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
      c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
      c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
      c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
      c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
      c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
      c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
      c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
      c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
      c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
      c0-0c1s14n0 [<000000002013d7a7>] 0x2013d7a6
      
      
      c0-0c1s14n0 NMI backtrace for cpu 116
      c0-0c1s14n0 CPU: 116 PID: 47508 Comm: mem_seg_registe Tainted: P           O  3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
      c0-0c1s14n0 RIP: 0010:[<ffffffff810895fa>]  [<ffffffff810895fa>] osq_lock+0x5a/0xb0
      c0-0c1s14n0 Call Trace:
      c0-0c1s14n0 [<ffffffff8149614a>] __mutex_lock_slowpath+0x5a/0x1a0
      c0-0c1s14n0 [<ffffffff814962a7>] mutex_lock+0x17/0x27
      c0-0c1s14n0 [<ffffffffa051a636>] lu_cache_shrink_count+0x26/0x120 [obdclass]
      c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
      c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
      c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
      c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
      c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
      c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
      c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
      c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
      c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
      c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
      c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
      c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
      c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
      

      As the number of CPUs grows, summing the LU_SS_LRU_LEN counter across all CPUs is not
      significantly faster than summing counters across the hash buckets, as was done prior
      to the LU-6365 patch. Processes that need memory bottleneck waiting to acquire the
      lu_sites_guard mutex.
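
      To show where the serialization happens, here is a rough schematic of the count path
      (function and symbol names are taken from the stack traces and the Lustre source; the
      exact code may differ):

      static unsigned long lu_cache_shrink_count(struct shrinker *sk,
                                                 struct shrink_control *sc)
      {
              struct lu_site *s;
              unsigned long cached = 0;

              /* Every CPU entering the shrinker serializes here. */
              mutex_lock(&lu_sites_guard);
              list_for_each_entry(s, &lu_sites, ls_linkage)
                      /* ls_stats_read() sums the per-cpu LU_SS_LRU_LEN counters. */
                      cached += ls_stats_read(s->ls_stats, LU_SS_LRU_LEN);
              mutex_unlock(&lu_sites_guard);

              return cached;
      }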

      The proposed solution is a two-pronged attack:

      1. Reduce the time spent getting the object count by replacing the LU_SS_LRU_LEN
      counter in the lu_site stats with a kernel percpu_counter (see the first sketch
      below). This shifts the overhead of summing across the CPUs from
      lu_cache_shrink_count to the functions that increment/decrement the counter. The
      summing is only done when an individual CPU's count exceeds a batch threshold, so the
      overhead along the increment/decrement paths stays small. lu_cache_shrink_count may
      return a stale value, but that is acceptable for the purposes of a shrinker. (Using
      the kernel's percpu_counter was also proposed as an improvement to the LU-6365 patch.)

      2. Increase concurrent access to the lu_sites list by changing the lu_sites_guard
      lock from a mutex to a read/write semaphore (see the second sketch below).
      lu_cache_shrink_count only reads data, so it does not need to wait for other readers.
      lu_cache_shrink_scan, which actually frees the unused objects, is still serialized.
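
      A minimal sketch of idea (1), using the kernel percpu_counter API; the container
      struct and the field/function names below (lu_site_example, ls_lru_len_counter,
      example_*) are illustrative, not the actual patch:

      #include <linux/percpu_counter.h>

      /* Illustrative container; in Lustre the counter would live in struct lu_site. */
      struct lu_site_example {
              struct percpu_counter ls_lru_len_counter;
      };

      static int example_site_init(struct lu_site_example *s)
      {
              /* Note: on kernels before 3.18 percpu_counter_init() has no gfp argument. */
              return percpu_counter_init(&s->ls_lru_len_counter, 0, GFP_NOFS);
      }

      /* The LRU add/del hot paths usually touch only the local cpu's counter; the
       * shared count is updated only when the local delta exceeds the batch threshold. */
      static void example_lru_add(struct lu_site_example *s)
      {
              percpu_counter_inc(&s->ls_lru_len_counter);
      }

      static void example_lru_del(struct lu_site_example *s)
      {
              percpu_counter_dec(&s->ls_lru_len_counter);
      }

      /* The shrinker reads a cheap, possibly slightly stale, total. */
      static unsigned long example_lru_len(struct lu_site_example *s)
      {
              return (unsigned long)percpu_counter_read_positive(&s->ls_lru_len_counter);
      }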
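
      A similar sketch of idea (2): the lu_sites_guard mutex becomes a read/write semaphore
      so that concurrent counters only take the lock for read, while the scan path keeps
      exclusive access. The helper names below are illustrative:

      #include <linux/rwsem.h>

      static DECLARE_RWSEM(lu_sites_guard);  /* was: static DEFINE_MUTEX(lu_sites_guard); */

      /* Counting only reads the lu_sites list, so readers can proceed in parallel. */
      static unsigned long count_path_sketch(void)
      {
              unsigned long cached = 0;

              down_read(&lu_sites_guard);
              /* walk lu_sites and sum each site's LRU length */
              up_read(&lu_sites_guard);
              return cached;
      }

      /* Scanning frees unused objects and remains serialized under the write lock. */
      static unsigned long scan_path_sketch(void)
      {
              unsigned long freed = 0;

              down_write(&lu_sites_guard);
              /* walk lu_sites and purge each site's LRU */
              up_write(&lu_sites_guard);
              return freed;
      }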

    People

      Assignee: Ann Koehler (Inactive)
      Reporter: Ann Koehler (Inactive)
      Votes: 0
      Watchers: 7
