[LU-7997] RCU stalls waiting for lu_sites_guard mutex in lu_cache_shrink_count Created: 07/Apr/16 Updated: 15/Aug/17 Resolved: 15/Jun/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Ann Koehler (Inactive) | Assignee: | Ann Koehler (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre 2.7.0 clients on SLES12 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The lu_cache_shrink_count algorithm introduced by an earlier change takes the lu_sites_guard mutex and then sums the per-CPU LU_SS_LRU_LEN lprocfs counters for every lu_site in order to report the number of freeable objects. On clients with many CPUs, memory reclaim invokes this shrinker from many threads at once: one thread spends a long time in lprocfs_stats_collect() while the others spin waiting for the mutex, and the kernel reports RCU stalls:

c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 201} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 175} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 116} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 253} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 194} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 21} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 207} (t=15000 jiffies g=111081 c=111080 q=22545)
c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 230} (t=60004 jiffies g=111081 c=111080 q=22552)
c0-0c1s14n0 INFO: rcu_sched detected stalls on CPUs/tasks: { 230} (detected by 265, t=60005 jiffies, g=111081, c=111080, q=22552)
c0-0c1s14n0 CPU: 182 PID: 47501 Comm: mem_seg_registe Tainted: P O 3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
c0-0c1s14n0 RIP: 0010:[<ffffffffa04f5a51>] [<ffffffffa04f5a51>] lprocfs_stats_collect+0xb1/0x180 [obdclass]
c0-0c1s14n0 Call Trace:
c0-0c1s14n0 [<ffffffffa05188d9>] ls_stats_read+0x19/0x50 [obdclass]
c0-0c1s14n0 [<ffffffffa051a66c>] lu_cache_shrink_count+0x5c/0x120 [obdclass]
c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
c0-0c1s14n0 [<000000002013d7a7>] 0x2013d7a6
c0-0c1s14n0 NMI backtrace for cpu 116
c0-0c1s14n0 CPU: 116 PID: 47508 Comm: mem_seg_registe Tainted: P O 3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
c0-0c1s14n0 RIP: 0010:[<ffffffff810895fa>] [<ffffffff810895fa>] osq_lock+0x5a/0xb0
c0-0c1s14n0 Call Trace:
c0-0c1s14n0 [<ffffffff8149614a>] __mutex_lock_slowpath+0x5a/0x1a0
c0-0c1s14n0 [<ffffffff814962a7>] mutex_lock+0x17/0x27
c0-0c1s14n0 [<ffffffffa051a636>] lu_cache_shrink_count+0x26/0x120 [obdclass]
c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
As the number of CPUs grows, summing the per-CPU LU_SS_LRU_LEN counters is not significantly faster than summing counters across hash buckets, as was done prior to the introduction of the per-CPU counters. The proposed solution is a two-pronged attack (a sketch of the approach is shown below):
1. Reduce the time spent getting the object count by replacing the lprocfs LU_SS_LRU_LEN stats counter with a kernel percpu_counter.
2. Increase concurrent access to the lu_sites list by changing the lu_sites_guard lock from a mutex to a read/write semaphore. |
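A minimal sketch of the two changes, not the actual patch: the structure layout, the field name ls_lru_len_counter, and the simplified lu_sites list here are illustrative assumptions, but the locking and counter primitives are the standard kernel ones the description refers to.

```c
#include <linux/percpu_counter.h>
#include <linux/rwsem.h>
#include <linux/shrinker.h>
#include <linux/list.h>

/* 2. lu_sites_guard becomes a read/write semaphore so that concurrent
 *    shrinker "count" callers no longer serialize on a mutex. */
static DECLARE_RWSEM(lu_sites_guard);
static LIST_HEAD(lu_sites);

struct lu_site {
	struct list_head	ls_linkage;
	/* 1. LRU length kept in a kernel percpu_counter instead of an
	 *    lprocfs stats counter, so reading it is a cheap approximate
	 *    read rather than an lprocfs_stats_collect() over all CPUs. */
	struct percpu_counter	ls_lru_len_counter;
};

static unsigned long lu_cache_shrink_count(struct shrinker *sk,
					   struct shrink_control *sc)
{
	struct lu_site *s;
	unsigned long cached = 0;

	/* Counting only needs shared access to walk the lu_sites list. */
	down_read(&lu_sites_guard);
	list_for_each_entry(s, &lu_sites, ls_linkage)
		cached += percpu_counter_read_positive(&s->ls_lru_len_counter);
	up_read(&lu_sites_guard);

	return cached;
}
```

With this shape, the common low-memory case (many threads counting, nothing to free) proceeds in parallel under down_read(), and only operations that modify the lu_sites list need down_write().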
| Comments |
| Comment by Gerrit Updater [ 07/Apr/16 ] |
|
Ann Koehler (amk@cray.com) uploaded a new patch: http://review.whamcloud.com/19390 |
| Comment by Ann Koehler (Inactive) [ 08/Apr/16 ] |
|
CentOS 7 has added a GFP flag parameter to the percpu_counter_init function. I don't have ready access to a CentOS 7 system. Could someone from Intel make the necessary code changes? |
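For reference, mainline (and the RHEL7/CentOS 7 kernel) added a gfp_t argument to percpu_counter_init(), while older kernels take only the counter and the initial value. A small compat wrapper is one way to keep callers identical on both; HAVE_PERCPU_COUNTER_INIT_GFP is a hypothetical configure-detected macro name, not an existing Lustre symbol.

```c
#include <linux/percpu_counter.h>
#include <linux/gfp.h>

/*
 * Newer kernels: int percpu_counter_init(struct percpu_counter *, s64, gfp_t);
 * Older kernels: int percpu_counter_init(struct percpu_counter *, s64);
 * Hide the difference behind one macro so the rest of the code is unchanged.
 */
#ifdef HAVE_PERCPU_COUNTER_INIT_GFP
#define lu_percpu_counter_init(c, val)	percpu_counter_init(c, val, GFP_KERNEL)
#else
#define lu_percpu_counter_init(c, val)	percpu_counter_init(c, val)
#endif
```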
| Comment by Andreas Dilger [ 10/Apr/16 ] |
|
Ann, I commented in the patch about how to fix the percpu_counter_init() function for RHEL7. There is also a patch from Alex, http://review.whamcloud.com/19082, that optimizes this same code path to reduce overhead in lu_cache. I'd be willing to land your current patch (with updates for RHEL7) to resolve the immediate problems you are seeing.

That said, I think it also makes sense to look at the lprocfs_counter_* interfaces and update them to use the kernel's percpu_* interfaces (internally at least) instead of completely rolling our own. That work would also help clean up the upstream kernel client. There is even work in the upstream kernel to add percpu_stats_* interfaces that would make this transition even simpler (https://lkml.org/lkml/2016/4/8/397), but they haven't yet landed in mainline. One option would be to copy these new interfaces for the older vendor kernels used by Lustre until a configure check shows them to be available. |
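A rough illustration of that direction, purely as a sketch: the type and helper names below (lprocfs_pcpu_counter, lprocfs_pcpu_incr, lprocfs_pcpu_read) are hypothetical, not existing Lustre interfaces; the idea is simply an lprocfs-style counter backed internally by a kernel percpu_counter.

```c
#include <linux/percpu_counter.h>

/* Hypothetical lprocfs-style counter backed by a kernel percpu_counter:
 * increments stay per-CPU, but reads no longer walk every CPU's stats. */
struct lprocfs_pcpu_counter {
	struct percpu_counter lpc_count;
};

static inline void lprocfs_pcpu_incr(struct lprocfs_pcpu_counter *c)
{
	percpu_counter_inc(&c->lpc_count);
}

static inline s64 lprocfs_pcpu_read(struct lprocfs_pcpu_counter *c)
{
	/* Approximate, lock-free read; percpu_counter_sum() would be used
	 * where an exact value is required. */
	return percpu_counter_read_positive(&c->lpc_count);
}
```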
| Comment by Ann Koehler (Inactive) [ 11/Apr/16 ] |
|
Andreas, thank you for all your comments. I made the change needed for RHEL7 and just submitted a new patch version; I'm relying on Maloo for testing.

http://review.whamcloud.com/19082 applies to the free-object phase of the shrinker, but all of our problems have been in counting the number of freeable objects. Whenever I've dug this count out of dumps or traced the value, it has been 0. I don't think the lu_cache shrinker frees anything very often, or at least I haven't been able to figure out the use case where it does. Nevertheless, the stalls we see occur when there is nothing to free, so the improvements are needed in the count phase rather than the purge phase.

I agree with you that replacing the lprocfs_counter_* interfaces is a good idea. How about a new bug for that work? With my current patch, I was focused only on improving my earlier changes. |
| Comment by Gerrit Updater [ 14/Jun/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19390/ |
| Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ] |
|
The patch has landed to master for 2.9.0. |