[LU-7997] RCU stalls waiting for lu_sites_guard mutex in lu_cache_shrink_count Created: 07/Apr/16  Updated: 15/Aug/17  Resolved: 15/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: Ann Koehler (Inactive) Assignee: Ann Koehler (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 2.7.0 clients on SLES12


Issue Links:
Related
is related to LU-7896 lu_object_limit() is called too frequ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The lu_cache_shrink_count algorithm introduced by LU-6365 does not scale well as the number of processors increases. In low memory conditions, many processes calling into lu_cache_shrink concurrently trigger RCU stalls. Most of the processes are waiting on the lu_sites_guard mutex. The process holding the mutex is executing in ls_stats_read.

c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 201}  (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 175}  (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 116}  (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 253}  (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 194}  (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 21}  (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 207}  (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 230}  (t=60004 jiffies g=111081 c=111080 q=22552)
c0-0c1s14n0 INFO: rcu_sched detected stalls on CPUs/tasks: { 230} (detected by 265, t=60005 jiffies, g=111081, c=111080, q=22552)


c0-0c1s14n0 CPU: 182 PID: 47501 Comm: mem_seg_registe Tainted: P           O  3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
c0-0c1s14n0 RIP: 0010:[<ffffffffa04f5a51>]  [<ffffffffa04f5a51>] lprocfs_stats_collect+0xb1/0x180 [obdclass]
c0-0c1s14n0 Call Trace:
c0-0c1s14n0 [<ffffffffa05188d9>] ls_stats_read+0x19/0x50 [obdclass]
c0-0c1s14n0 [<ffffffffa051a66c>] lu_cache_shrink_count+0x5c/0x120 [obdclass]
c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
c0-0c1s14n0 [<000000002013d7a7>] 0x2013d7a6


c0-0c1s14n0 NMI backtrace for cpu 116
c0-0c1s14n0 CPU: 116 PID: 47508 Comm: mem_seg_registe Tainted: P           O  3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
c0-0c1s14n0 RIP: 0010:[<ffffffff810895fa>]  [<ffffffff810895fa>] osq_lock+0x5a/0xb0
c0-0c1s14n0 Call Trace:
c0-0c1s14n0 [<ffffffff8149614a>] __mutex_lock_slowpath+0x5a/0x1a0
c0-0c1s14n0 [<ffffffff814962a7>] mutex_lock+0x17/0x27
c0-0c1s14n0 [<ffffffffa051a636>] lu_cache_shrink_count+0x26/0x120 [obdclass]
c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b

As the number of CPUs grows, summing the LU_SS_LRU_LEN counters is not significantly faster than summing counters across hash buckets, as was done prior to the LU-6365 patch. Processes that need memory pile up waiting for the lu_sites_guard mutex.
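
The contended path implied by the stack traces above looks roughly like the following condensed paraphrase (not verbatim Lustre source; the lu_site internals are elided and the identifiers only approximate the real ones):

#include <linux/mutex.h>
#include <linux/list.h>
#include <linux/shrinker.h>

/* every concurrent shrinker call serializes on this mutex */
static DEFINE_MUTEX(lu_sites_guard);
static LIST_HEAD(lu_sites);

static unsigned long lu_cache_shrink_count(struct shrinker *sk,
                                           struct shrink_control *sc)
{
	struct lu_site *s;
	unsigned long cached = 0;

	mutex_lock(&lu_sites_guard);
	list_for_each_entry(s, &lu_sites, ls_linkage)
		/* ls_stats_read() -> lprocfs_stats_collect(): walks every
		 * possible CPU to sum that site's LU_SS_LRU_LEN counter,
		 * while all other shrinker callers spin on the mutex above */
		cached += ls_stats_read(s->ls_stats, LU_SS_LRU_LEN);
	mutex_unlock(&lu_sites_guard);

	return cached;
}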

The proposed solution is a two-pronged attack:

1. Reduce the time spent getting the object count by replacing the LU_SS_LRU_LEN counter in lu_sites.stats with a kernel percpu_counter. This shifts the overhead of summing across the CPUs from lu_cache_shrink_count to the functions that increment/decrement the counter. The summing is only done when an individual CPU's count exceeds a threshold, so the overhead along the increment/decrement paths is minimized. lu_cache_shrink_count may return a stale value, but this is acceptable for the purposes of a shrinker. (Using the kernel's percpu_counter was also proposed as an improvement to the LU-6365 patch.)

2. Increase concurrent access to the lu_sites list by changing the lu_sites_guard lock from a mutex to a read/write semaphore. lu_cache_shrink_count only reads data, so it does not need to exclude other readers. lu_cache_shrink_scan, which actually frees the unused objects, is still serialized. A sketch covering both changes follows the list.
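
A minimal sketch of how the two changes fit together, assuming a per-site field named ls_lru_len_counter (the exact field and helper names in the patch may differ):

#include <linux/percpu_counter.h>
#include <linux/rwsem.h>
#include <linux/shrinker.h>

static DECLARE_RWSEM(lu_sites_guard);		/* was a mutex */

/* LRU add/del paths now pay the (batched) per-CPU accounting cost */
static void lu_site_lru_add(struct lu_site *s)
{
	percpu_counter_inc(&s->ls_lru_len_counter);
}

static unsigned long lu_cache_shrink_count(struct shrinker *sk,
                                           struct shrink_control *sc)
{
	struct lu_site *s;
	unsigned long cached = 0;

	down_read(&lu_sites_guard);		/* readers no longer exclude each other */
	list_for_each_entry(s, &lu_sites, ls_linkage)
		/* cheap, possibly stale read of the pre-summed count */
		cached += percpu_counter_read_positive(&s->ls_lru_len_counter);
	up_read(&lu_sites_guard);

	return cached;
}

/* lu_cache_shrink_scan(), which actually frees objects, would still take the
 * lock for write (down_write), so the purge path remains serialized */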



 Comments   
Comment by Gerrit Updater [ 07/Apr/16 ]

Ann Koehler (amk@cray.com) uploaded a new patch: http://review.whamcloud.com/19390
Subject: LU-7997 obd: RCU stalls in lu_cache_shrink_count()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 74d033dc90d22bf26d4c10cc6f5d19933669a0c2

Comment by Ann Koehler (Inactive) [ 08/Apr/16 ]

CentOS 7 has added a GFP flag parameter to the percpu_counter_init function. I don't have ready access to a CentOS 7 system. Could someone from Intel make the necessary code changes?
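
For context, the difference is between the older two-argument percpu_counter_init() and the newer form that takes a gfp_t; a possible compat shim is sketched below (HAVE_PERCPU_COUNTER_INIT_GFP and lu_percpu_counter_init are hypothetical names, not necessarily what the patch uses):

#include <linux/percpu_counter.h>

/* HAVE_PERCPU_COUNTER_INIT_GFP stands in for a configure-time check that the
 * running kernel's percpu_counter_init() accepts a gfp_t argument */
#ifdef HAVE_PERCPU_COUNTER_INIT_GFP
# define lu_percpu_counter_init(pcc, value) \
	percpu_counter_init(pcc, value, GFP_KERNEL)
#else
# define lu_percpu_counter_init(pcc, value) \
	percpu_counter_init(pcc, value)
#endif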

Comment by Andreas Dilger [ 10/Apr/16 ]

Ann, I commented in the patch about how to fix the percpu_counter_init() function for RHEL7. There is also a patch from Alex, http://review.whamcloud.com/19082, which optimizes this same code path to reduce overhead in lu_cache.

I'd be willing to land your current patch (with updates for RHEL7) to resolve the immediate problems you are seeing. That said, I think it also makes sense to look at the lprocfs_counter_* interfaces and update them to use the percpu_* interfaces available in the kernel (internally at least), instead of completely rolling our own. That work would also be useful for cleaning up the upstream kernel client.

There is even work in the upstream kernel to add percpu_stats_* interfaces that would make this transition even simpler (https://lkml.org/lkml/2016/4/8/397), but they aren't yet landed to mainline. One option would be to copy these new interfaces for older vendor kernels used by Lustre until a configure check shows them to be available.

Comment by Ann Koehler (Inactive) [ 11/Apr/16 ]

Andreas, Thank you for all your comments.

I made the change needed for RHEL7 and just submitted a new patch version. I'm relying on Maloo for testing.

http://review.whamcloud.com/19082 applies to the free-object phase of the shrinker. All our problems have been in counting the number of freeable objects. Whenever I've dug this count out of dumps or traced the value, it has been 0. I don't think the lu_cache shrinker frees anything very often, or at least I haven't been able to figure out the use case where it does. Nevertheless, the stalls we see occur when there is nothing to free. So the improvements are needed in the count path rather than the purge path.

I agree with you that replacing the lprocfs_counter_* interfaces is a good idea. How about a new bug for that work? With my current patch, I was focused only on improving the changes I did for LU-6365.

Comment by Gerrit Updater [ 14/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19390/
Subject: LU-7997 obd: RCU stalls in lu_cache_shrink_count()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 73c54401dad20b9cb927b68d0f1068a2bd8f9319

Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ]

Patch has landed to master for 2.9.0.
