LU-7997: RCU stalls waiting for lu_sites_guard mutex in lu_cache_shrink_count


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: Lustre 2.7.0 clients on SLES12
    • Severity: 3

    Description

      The lu_cache_shrink_count algorithm introduced by LU-6365 does not scale well as the number of processors increases. In low memory conditions, many processes calling into lu_cache_shrink concurrently trigger RCU stalls. Most of the processes are waiting on the lu_sites_guard mutex. The process holding the mutex is executing in ls_stats_read.

      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 201}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 175}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 116}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 253}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 194}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 21}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 207}  (t=15000 jiffies g=111081 c=111080 q=22545)
      c0-0c1s14n0 INFO: rcu_sched self-detected stall on CPU { 230}  (t=60004 jiffies g=111081 c=111080 q=22552)
      c0-0c1s14n0 INFO: rcu_sched detected stalls on CPUs/tasks: { 230} (detected by 265, t=60005 jiffies, g=111081, c=111080, q=22552)
      
      
      c0-0c1s14n0 CPU: 182 PID: 47501 Comm: mem_seg_registe Tainted: P           O  3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
      c0-0c1s14n0 RIP: 0010:[<ffffffffa04f5a51>]  [<ffffffffa04f5a51>] lprocfs_stats_collect+0xb1/0x180 [obdclass]
      c0-0c1s14n0 Call Trace:
      c0-0c1s14n0 [<ffffffffa05188d9>] ls_stats_read+0x19/0x50 [obdclass]
      c0-0c1s14n0 [<ffffffffa051a66c>] lu_cache_shrink_count+0x5c/0x120 [obdclass]
      c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
      c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
      c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
      c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
      c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
      c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
      c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
      c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
      c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
      c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
      c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
      c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
      c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
      c0-0c1s14n0 [<000000002013d7a7>] 0x2013d7a6
      
      
      c0-0c1s14n0 NMI backtrace for cpu 116
      c0-0c1s14n0 CPU: 116 PID: 47508 Comm: mem_seg_registe Tainted: P           O  3.12.51-52.31.1_1.0000.9069-cray_ari_c #1
      c0-0c1s14n0 RIP: 0010:[<ffffffff810895fa>]  [<ffffffff810895fa>] osq_lock+0x5a/0xb0
      c0-0c1s14n0 Call Trace:
      c0-0c1s14n0 [<ffffffff8149614a>] __mutex_lock_slowpath+0x5a/0x1a0
      c0-0c1s14n0 [<ffffffff814962a7>] mutex_lock+0x17/0x27
      c0-0c1s14n0 [<ffffffffa051a636>] lu_cache_shrink_count+0x26/0x120 [obdclass]
      c0-0c1s14n0 [<ffffffff81132c45>] shrink_slab_node+0x45/0x290
      c0-0c1s14n0 [<ffffffff8113393b>] shrink_slab+0x8b/0x160
      c0-0c1s14n0 [<ffffffff81136d9f>] do_try_to_free_pages+0x33f/0x4a0
      c0-0c1s14n0 [<ffffffff81136fbf>] try_to_free_pages+0xbf/0x150
      c0-0c1s14n0 [<ffffffff8112b205>] __alloc_pages_nodemask+0x6a5/0xb00
      c0-0c1s14n0 [<ffffffff8116ab80>] alloc_pages_vma+0xa0/0x180
      c0-0c1s14n0 [<ffffffff8114c6ea>] handle_mm_fault+0x8ba/0xb60
      c0-0c1s14n0 [<ffffffff8114caf6>] __get_user_pages+0x166/0x5b0
      c0-0c1s14n0 [<ffffffff8114cf92>] get_user_pages+0x52/0x60
      c0-0c1s14n0 [<ffffffff8103f182>] get_user_pages_fast+0xb2/0x1b0
      c0-0c1s14n0 [<ffffffffa019b23d>] kgni_mem_set_pages+0xfd/0x1710 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019c8a5>] kgni_mem_register_pin_pages+0x55/0x2f0 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa019d850>] kgni_mem_seg_register_pin+0xd10/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01a02ee>] kgni_mem_register+0x158e/0x3160 [kgni_ari]
      c0-0c1s14n0 [<ffffffffa01d1ab2>] kgni_ioctl+0xd02/0x1520 [kgni_ari]
      c0-0c1s14n0 [<ffffffff8119476d>] do_vfs_ioctl+0x2dd/0x4b0
      c0-0c1s14n0 [<ffffffff81194985>] SyS_ioctl+0x45/0x80
      c0-0c1s14n0 [<ffffffff8149faf2>] system_call_fastpath+0x16/0x1b
      

      As the number of CPUs grows, summing the LU_SS_LRU_LEN counter across all CPUs is not
      significantly faster than summing counters across the hash buckets, as was done prior
      to the LU-6365 patch. Processes that need memory bottleneck waiting to acquire the
      lu_sites_guard mutex.
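
      To show where the serialization happens, here is a rough schematic of the count path
      (function and symbol names are taken from the stack traces and the Lustre source; the
      exact code may differ):

      static unsigned long lu_cache_shrink_count(struct shrinker *sk,
                                                 struct shrink_control *sc)
      {
              struct lu_site *s;
              unsigned long cached = 0;

              /* Every CPU entering the shrinker serializes here. */
              mutex_lock(&lu_sites_guard);
              list_for_each_entry(s, &lu_sites, ls_linkage)
                      /* ls_stats_read() sums the per-cpu LU_SS_LRU_LEN counters. */
                      cached += ls_stats_read(s->ls_stats, LU_SS_LRU_LEN);
              mutex_unlock(&lu_sites_guard);

              return cached;
      }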

      The proposed solution is a two-pronged attack:

      1. Reduce the time spent getting the object count by replacing the LU_SS_LRU_LEN
      counter in the lu_site stats with a kernel percpu_counter (see the first sketch
      below). This shifts the overhead of summing across the CPUs from
      lu_cache_shrink_count to the functions that increment/decrement the counter. The
      summing is only done when an individual CPU's count exceeds a batch threshold, so the
      overhead along the increment/decrement paths stays small. lu_cache_shrink_count may
      return a stale value, but that is acceptable for the purposes of a shrinker. (Using
      the kernel's percpu_counter was also proposed as an improvement to the LU-6365 patch.)

      2. Increase concurrent access to the lu_sites list by changing the lu_sites_guard
      lock from a mutex to a read/write semaphore (see the second sketch below).
      lu_cache_shrink_count only reads data, so it does not need to wait for other readers.
      lu_cache_shrink_scan, which actually frees the unused objects, is still serialized.
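
      A minimal sketch of idea (1), using the kernel percpu_counter API; the container
      struct and the field/function names below (lu_site_example, ls_lru_len_counter,
      example_*) are illustrative, not the actual patch:

      #include <linux/percpu_counter.h>

      /* Illustrative container; in Lustre the counter would live in struct lu_site. */
      struct lu_site_example {
              struct percpu_counter ls_lru_len_counter;
      };

      static int example_site_init(struct lu_site_example *s)
      {
              /* Note: on kernels before 3.18 percpu_counter_init() has no gfp argument. */
              return percpu_counter_init(&s->ls_lru_len_counter, 0, GFP_NOFS);
      }

      /* The LRU add/del hot paths usually touch only the local cpu's counter; the
       * shared count is updated only when the local delta exceeds the batch threshold. */
      static void example_lru_add(struct lu_site_example *s)
      {
              percpu_counter_inc(&s->ls_lru_len_counter);
      }

      static void example_lru_del(struct lu_site_example *s)
      {
              percpu_counter_dec(&s->ls_lru_len_counter);
      }

      /* The shrinker reads a cheap, possibly slightly stale, total. */
      static unsigned long example_lru_len(struct lu_site_example *s)
      {
              return (unsigned long)percpu_counter_read_positive(&s->ls_lru_len_counter);
      }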
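
      A similar sketch of idea (2): the lu_sites_guard mutex becomes a read/write semaphore
      so that concurrent counters only take the lock for read, while the scan path keeps
      exclusive access. The helper names below are illustrative:

      #include <linux/rwsem.h>

      static DECLARE_RWSEM(lu_sites_guard);  /* was: static DEFINE_MUTEX(lu_sites_guard); */

      /* Counting only reads the lu_sites list, so readers can proceed in parallel. */
      static unsigned long count_path_sketch(void)
      {
              unsigned long cached = 0;

              down_read(&lu_sites_guard);
              /* walk lu_sites and sum each site's LRU length */
              up_read(&lu_sites_guard);
              return cached;
      }

      /* Scanning frees unused objects and remains serialized under the write lock. */
      static unsigned long scan_path_sketch(void)
      {
              unsigned long freed = 0;

              down_write(&lu_sites_guard);
              /* walk lu_sites and purge each site's LRU */
              up_write(&lu_sites_guard);
              return freed;
      }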

    People

      Assignee: Ann Koehler (Inactive)
      Reporter: Ann Koehler (Inactive)
      Votes: 0
      Watchers: 7
