[LU-1784] freeing cached clean pages is slow Created: 23/Aug/12  Updated: 08/Feb/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Ned Bass Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: llnl

Attachments: PDF File 574.1.pdf     PDF File ll_releasepage.pdf    
Severity: 3
Rank (Obsolete): 10135

 Description   

Compared to NFS it seems to take Lustre a long time to free clean pages from the cache. Speeding this up may help client performance on a number of fronts since it's such a common operation. Giving up memory more quickly may also improve client stability under low memory conditions.

For example, after reading 50 GB of data from NFS to /dev/null, it takes 3.7 seconds to drop caches.

# grove617 /root > dd if=/var/dumps/bigfile of=/dev/null bs=1M > /dev/null 2>&1
# grove617 /root > grep ^Cached /proc/meminfo ;  time echo 3 > /proc/sys/vm/drop_caches ; grep ^Cached /proc/meminfo 
Cached:         49492808 kB

real    0m3.707s
user    0m0.000s
sys     0m3.696s
Cached:            59436 kB

A similar test from Lustre takes about 7 times as long.

# grove617 /root > for x in `seq 129 154` ; do dd if=/p/lstest/bass6/fio/read.$x.0 of=/dev/null bs=1M >/dev/null 2>&1 ; done                                                           
# grove617 /root > grep ^Cached /proc/meminfo ;  time echo 3 > /proc/sys/vm/drop_caches ; grep ^Cached /proc/meminfo 
Cached:         47703020 kB

real    0m26.961s
user    0m0.000s
sys     0m26.870s
Cached:            59768 kB

Oprofile data for the Lustre test:

vma      samples  cum. samples  %        cum. %     linenr info                 app name                 symbol name
ffffffff8115ff60 60729    60729         10.3025  10.3025    slab.c:3836                 vmlinux                  kmem_cache_free
0000000000001810 37739    98468          6.4023  16.7048    lvfs_lib.c:111              lvfs.ko                  lprocfs_counter_sub
ffffffff81272930 25490    123958         4.3243  21.0290    radix-tree.c:1237           vmlinux                  radix_tree_delete
0000000000015cc0 21352    145310         3.6223  24.6513    osc_page.c:767              osc.ko                   osc_lru_del
0000000000000000 21289    166599         3.6116  28.2629    (no location information)   lustre                   /lustre
0000000000051030 18837    185436         3.1956  31.4586    cl_page.c:719               obdclass.ko              cl_vmpage_page
ffffffff8127d650 18123    203559         3.0745  34.5331    list_debug.c:45             vmlinux                  list_del
ffffffff811602a0 16864    220423         2.8609  37.3940    slab.c:3516                 vmlinux                  free_block
0000000000051d00 16518    236941         2.8022  40.1962    cl_page.c:284               obdclass.ko              cl_page_free
0000000000051200 15149    252090         2.5700  42.7662    cl_page.c:1124              obdclass.ko              cl_page_delete0
000000000004eb90 14321    266411         2.4295  45.1957    cl_page.c:1236              obdclass.ko              cl_page_export
0000000000052150 13936    280347         2.3642  47.5599    cl_page.c:651               obdclass.ko              cl_page_put
ffffffff8126c5e0 13737    294084         2.3304  49.8903    dec_and_lock.c:21           vmlinux                  _atomic_dec_and_lock
ffffffff8127d6f0 11844    305928         2.0093  51.8996    list_debug.c:22             vmlinux                  __list_add

and for NFS:

vma      samples  cum. samples  %        cum. %     linenr info                 app name                 symbol name
ffffffff8116b0f0 9239     9239           9.6682   9.6682    memcontrol.c:2362           vmlinux                  __mem_cgroup_uncharge_common
ffffffff81129490 7402     16641          7.7458  17.4140    vmscan.c:427                vmlinux                  __remove_mapping
ffffffff811243d0 7250     23891          7.5868  25.0008    page_alloc.c:1152           vmlinux                  free_hot_cold_page
ffffffff811234f0 6747     30638          7.0604  32.0612    page_alloc.c:592            vmlinux                  free_pcppages_bulk
ffffffff81127af0 5401     36039          5.6519  37.7131    swap.c:438                  vmlinux                  release_pages
ffffffff81272930 5302     41341          5.5483  43.2614    radix-tree.c:1237           vmlinux                  radix_tree_delete
ffffffff811330d0 4319     45660          4.5196  47.7810    vmstat.c:282                vmlinux                  __dec_zone_state
ffffffff811284b0 4049     49709          4.2371  52.0181    truncate.c:173              vmlinux                  invalidate_inode_page
ffffffff81110a20 3666     53375          3.8363  55.8544    filemap.c:794               vmlinux                  find_get_pages
ffffffff8110fdb0 3601     56976          3.7683  59.6226    filemap.c:527               vmlinux                  page_waitqueue
ffffffff8127d650 3367     60343          3.5234  63.1461    list_debug.c:45             vmlinux                  list_del
ffffffff81110e10 3007     63350          3.1467  66.2927    filemap.c:579               vmlinux                  unlock_page
ffffffff81167cd0 2976     66326          3.1142  69.4070    memcontrol.c:731            vmlinux                  mem_cgroup_del_lru_list
ffffffff81120e00 2883     69209          3.0169  72.4239    page_alloc.c:5350           vmlinux                  get_pageblock_flags_group
ffffffff81128720 2686     71895          2.8108  75.2347    truncate.c:330              vmlinux                  invalidate_mapping_pages
ffffffff81091010 2193     74088          2.2949  77.5295    wait.c:251                  vmlinux                  __wake_up_bit
ffffffff81167440 2175     76263          2.2760  79.8056    bit_spinlock.h:11           vmlinux                  bit_spin_lock
ffffffff81271c40 2100     78363          2.1975  82.0031    radix-tree.c:820            vmlinux                  __lookup
...
ffffffff8115ff60 182      89326          0.1905  93.4754    slab.c:3836                 vmlinux                  kmem_cache_free

It's interesting that kmem_cache_free accounts for 10% of CPU cycles in Lustre, but only 0.2% for NFS. It also seems we're spending a lot of time updating lprocfs counters.
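For reference, a minimal sketch of how a profile like the one above could be collected with the legacy opcontrol front end; the vmlinux debuginfo path is an assumption and varies by distribution, and pointing opreport at the module directory is what typically lets it resolve symbols inside lustre.ko instead of reporting "(no location information)":

opcontrol --init
opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
opcontrol --start
echo 3 > /proc/sys/vm/drop_caches          # the operation being profiled
opcontrol --stop
opcontrol --dump
opreport --symbols --image-path=/lib/modules/$(uname -r) | head -20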



 Comments   
Comment by Ned Bass [ 24/Aug/12 ]

I made this plot of the ll_releasepage() call graph with data from systemtap to help visualize where the time goes. It doesn't really show any egregious offenders, but it shows that Lustre does a lot of work overall to release a page. cl_env_nested_get(), cl_page_delete(), and cl_page_put() account for most of the time, with cl_page_delete() taking the longest. The instrumentation seriously inflates the timing, however; by my estimate a normal ll_releasepage() call takes about 2 us, compared to 750 us in this graph.
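As an aside, a minimal systemtap sketch for timing the release path without building the full call graph; the probe point assumes the llite code is linked into lustre.ko, and the instrumentation overhead still skews the numbers:

stap -e '
global start, lat, n
probe module("lustre").function("ll_releasepage") {
    start[tid()] = gettimeofday_us()
}
probe module("lustre").function("ll_releasepage").return {
    if (tid() in start) {
        lat <<< gettimeofday_us() - start[tid()]
        delete start[tid()]
        n++
    }
}
probe end { if (n) print(@hist_log(lat)) }
' -c 'sysctl -w vm.drop_caches=3'

This prints a log-scale latency histogram of the ll_releasepage() calls made while the caches are being dropped.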

Comment by Peter Jones [ 26/Aug/12 ]

Yujian

Could you please look into this one?

Thanks

Peter

Comment by Peter Jones [ 27/Aug/12 ]

I had not realized that this was CLIO-related. Jinshan is already looking into CLIO performance so it makes sense for him to consider this one too

Comment by Prakash Surya (Inactive) [ 27/Aug/12 ]

Ned, since systemtap "seriously inflates the timing", you might want to look into ftrace. IIRC, ftrace isn't as intrusive in terms of function execution run time. Systemtap adds an order of magnitude to the time spent in each function (using its callgraph example stap) when compared to the ftrace data on the same call paths.

Comment by Ned Bass [ 27/Aug/12 ]

I tried ftrace but I wasn't able to display data for the lustre modules. I probably just haven't found the right invocation.
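In case it helps, a sketch of an invocation that should restrict ftrace to the Lustre module's functions; the ':mod:' filter syntax comes from the stock ftrace documentation, and the debugfs mount point may differ on some systems:

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo ':mod:lustre' > set_ftrace_filter     # only trace symbols from lustre.ko
echo function_graph > current_tracer
echo 1 > tracing_on
echo 3 > /proc/sys/vm/drop_caches          # exercise the page-release path
echo 0 > tracing_on
head -50 trace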

Comment by Jinshan Xiong (Inactive) [ 27/Aug/12 ]

It's interesting that kmem_cache_free accounts for 10% of CPU cycles in Lustre, but only 0.2% for NFS. It also seems we're spending a lot of time updating lprocfs counters.

CLIO uses a lot of kmem_cache objects to manage pages, for example osc_page_kmem for osc_page and lov_page_kmem for lov_page, etc.
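A quick way to see those per-layer page objects on a client while the cache is populated; the cache names below are guesses based on the naming above and may differ between releases:

# hypothetical check: one slab entry per CLIO layer that wraps a cached page
grep -E 'cl_page|osc_page|lov_page|vvp_page' /proc/slabinfo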

Comment by Nathan Rutman [ 07/Sep/12 ]

We've noticed the same thing. Extra swapping causes significant slowdown in file reads.

I'm attaching some graphs of memory usage over time, in 2.x (dropping the cache between tests and not) and 1.8.6; the differences are very obvious.

Comment by Nathan Rutman [ 07/Sep/12 ]

See also
http://comments.gmane.org/gmane.comp.file-systems.lustre.user/11327

Comment by Christopher Morrone [ 07/Sep/12 ]

I think those things might be somewhat different than what Ned describes, which is very specifically about the speed at which clean cache pages can be reclaimed.

The first issue that you mention may or may not be related to this ticket; it could also be that the client's memory reclaim hooks are broken in CLIO. Some more information (scales on the graph's X axis, description of the IO load at the time) might help me see a clearer connection.

The problem discussed in that email thread, the max_cached_mb limit no longer being enforced in CLIO, is definitely a known issue. See LU-744 and patch 2514, which is in progress.
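For reference, that limit is exposed as a per-client tunable; the parameter name below matches current clients, though its enforcement is exactly what LU-744 reports as broken on CLIO clients:

lctl get_param llite.*.max_cached_mb        # inspect the client page cache limit
lctl set_param llite.*.max_cached_mb=4096   # lower it (not enforced until LU-744 is fixed)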
