Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.7.0
- Labels: None
- Environment: Lustre 2.7.0 clients on SLES12
- Severity: 3
Description
The lu_cache_shrink_count algorithm introduced by LU-6365 does not scale well as the number of processors increases. In low memory conditions, many processes calling into lu_cache_shrink concurrently trigger RCU stalls. Most of the processes are waiting on the lu_sites_guard mutex. The process holding the mutex is executing in ls_stats_read.
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 201} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 175} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 116} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 253} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 194} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 21} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 207} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 230} (t=60004 jiffies g=111081 c=111080 q=22552)
c0-0c1s14n0 INFO: rcu_sched detected stalls on CPUs/tasks: { 230} (detected by 265, t=60005 jiffies, g=111081, c=111080, q=22552)
c0-0c1s14n0 CPU: 182 PID: 47501 Comm: mem_seg_registe Tainted: P O 3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
c0-0c1s14n0 RIP: 0010:[<ffffffffa04f5a51>] [<ffffffffa04f5a51>] lprocfs_stats_collect+0xb1/0x180 [obdclass]
c0-0c1s14n0 Call Trace:
c0-0c1s14n0 [<ffffffffa05188d9>] ls_stats_read+0x19/0x50 [obdclass]
c0-0c1s14n0 [<ffffffffa051a66c>] lu_cache_shrink_count+0x5c/0x120 [obdclass]
c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
c0-0c1s14n0 [<000000002013d7a7>] 0x2013d7a6
c0-0c1s14n0 NMI backtrace for cpu 116
c0-0c1s14n0 CPU: 116 PID: 47508 Comm: mem_seg_registe Tainted: P O 3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
c0-0c1s14n0 RIP: 0010:[<ffffffff810895fa>] [<ffffffff810895fa>] osq_lock+0x5a/0xb0
c0-0c1s14n0 Call Trace:
c0-0c1s14n0 [<ffffffff8149614a>] __mutex_lock_slowpath+0x5a/0x1a0
c0-0c1s14n0 [<ffffffff814962a7>] mutex_lock+0x17/0x27
c0-0c1s14n0 [<ffffffffa051a636>] lu_cache_shrink_count+0x26/0x120 [obdclass]
c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
As the number of CPUs grows, summing the LU_SS_LRU_LEN counters is not significantly faster than summing counters across the hash buckets, as was done prior to the LU-6365 patch. Processes that need memory bottleneck waiting to acquire the lu_sites_guard mutex.
The proposed solution is a two-pronged attack:
1. Reduce the time spent getting the object count by replacing the LU_SS_LRU_LEN counter in lu_sites.stats with a kernel percpu_counter. This shifts the overhead of summing across the CPUs from lu_cache_shrink_count to the functions that increment/decrement the counter. The summing is only done when an individual per-CPU count exceeds a threshold, so the overhead along the increment/decrement paths is minimized. lu_cache_shrink_count may return a stale value, but this is acceptable for the purposes of a shrinker. (Using the kernel's percpu_counter was also proposed as an improvement to the LU-6365 patch.) A sketch of this approach follows the list below.
2. Increase concurrent access to the lu_sites list by changing the lu_sites_guard lock from a mutex to a read/write semaphore. lu_cache_shrink_count only reads data, so it does not need to wait for other readers. lu_cache_shrink_scan, which actually frees the unused objects, remains serialized. A sketch of this change also follows the list below.
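The following is a minimal sketch of the first prong, not the actual Lustre patch: the aggregate LRU length is kept in a kernel percpu_counter so the shrinker's count callback can read an approximate total without taking a global lock. The demo_* names are illustrative only.

#include <linux/gfp.h>
#include <linux/percpu_counter.h>
#include <linux/shrinker.h>

static struct percpu_counter lru_len;

static int demo_counter_init(void)
{
	/* recent kernels take a GFP argument here; older ones omit it */
	return percpu_counter_init(&lru_len, 0, GFP_KERNEL);
}

/* object-add path: cheap per-CPU increment, folded into the global
 * sum only when the per-CPU delta exceeds the counter's batch size */
static void demo_object_added(void)
{
	percpu_counter_inc(&lru_len);
}

/* object-free path */
static void demo_object_removed(void)
{
	percpu_counter_dec(&lru_len);
}

/* shrinker ->count_objects(): a slightly stale value is acceptable */
static unsigned long demo_shrink_count(struct shrinker *s,
				       struct shrink_control *sc)
{
	return (unsigned long)percpu_counter_read_positive(&lru_len);
}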
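And a minimal sketch of the second prong, assuming a global list of sites protected by a single guard lock: with an rw_semaphore, concurrent count callers proceed in parallel as readers, while the scan path that frees objects still takes the lock exclusively. Again, the demo_* names are illustrative, not the actual lu_object code.

#include <linux/rwsem.h>
#include <linux/list.h>
#include <linux/shrinker.h>

static DECLARE_RWSEM(demo_sites_guard);
static LIST_HEAD(demo_sites);

/* many CPUs can count concurrently; readers do not block each other */
static unsigned long demo_shrink_count(struct shrinker *s,
				       struct shrink_control *sc)
{
	unsigned long count = 0;

	down_read(&demo_sites_guard);
	/* ... walk demo_sites and add up cached-object counts ... */
	up_read(&demo_sites_guard);

	return count;
}

/* freeing objects remains serialized behind the write side */
static unsigned long demo_shrink_scan(struct shrinker *s,
				      struct shrink_control *sc)
{
	unsigned long freed = 0;

	down_write(&demo_sites_guard);
	/* ... walk demo_sites and release up to sc->nr_to_scan objects ... */
	up_write(&demo_sites_guard);

	return freed;
}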
Attachments
Issue Links
- is related to: LU-7896 lu_object_limit() is called too frequently (Resolved)