[LU-1784] freeing cached clean pages is slow Created: 23/Aug/12 Updated: 08/Feb/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10135 |
| Description |
|
Compared to NFS, it seems to take Lustre a long time to free clean pages from the cache. Speeding this up may help client performance on a number of fronts, since it's such a common operation. Giving up memory more quickly may also improve client stability under low-memory conditions.

For example, if I cat 50 GB of data to /dev/null from NFS, it takes 3.7 seconds to drop caches:

    # grove617 /root > dd if=/var/dumps/bigfile of=/dev/null bs=1M > /dev/null 2>&1
    # grove617 /root > grep ^Cached /proc/meminfo ; time echo 3 > /proc/sys/vm/drop_caches ; grep ^Cached /proc/meminfo
    Cached:        49492808 kB

    real    0m3.707s
    user    0m0.000s
    sys     0m3.696s
    Cached:           59436 kB

A similar test from Lustre takes about 7 times as long:

    # grove617 /root > for x in `seq 129 154` ; do dd if=/p/lstest/bass6/fio/read.$x.0 of=/dev/null bs=1M >/dev/null 2>&1 ; done
    # grove617 /root > grep ^Cached /proc/meminfo ; time echo 3 > /proc/sys/vm/drop_caches ; grep ^Cached /proc/meminfo
    Cached:        47703020 kB

    real    0m26.961s
    user    0m0.000s
    sys     0m26.870s
    Cached:           59768 kB

Oprofile data for the Lustre test:

    vma              samples  cum. samples  %        cum. %   linenr info                app name     symbol name
    ffffffff8115ff60 60729     60729        10.3025  10.3025  slab.c:3836                vmlinux      kmem_cache_free
    0000000000001810 37739     98468         6.4023  16.7048  lvfs_lib.c:111             lvfs.ko      lprocfs_counter_sub
    ffffffff81272930 25490    123958         4.3243  21.0290  radix-tree.c:1237          vmlinux      radix_tree_delete
    0000000000015cc0 21352    145310         3.6223  24.6513  osc_page.c:767             osc.ko       osc_lru_del
    0000000000000000 21289    166599         3.6116  28.2629  (no location information)  lustre       /lustre
    0000000000051030 18837    185436         3.1956  31.4586  cl_page.c:719              obdclass.ko  cl_vmpage_page
    ffffffff8127d650 18123    203559         3.0745  34.5331  list_debug.c:45            vmlinux      list_del
    ffffffff811602a0 16864    220423         2.8609  37.3940  slab.c:3516                vmlinux      free_block
    0000000000051d00 16518    236941         2.8022  40.1962  cl_page.c:284              obdclass.ko  cl_page_free
    0000000000051200 15149    252090         2.5700  42.7662  cl_page.c:1124             obdclass.ko  cl_page_delete0
    000000000004eb90 14321    266411         2.4295  45.1957  cl_page.c:1236             obdclass.ko  cl_page_export
    0000000000052150 13936    280347         2.3642  47.5599  cl_page.c:651              obdclass.ko  cl_page_put
    ffffffff8126c5e0 13737    294084         2.3304  49.8903  dec_and_lock.c:21          vmlinux      _atomic_dec_and_lock
    ffffffff8127d6f0 11844    305928         2.0093  51.8996  list_debug.c:22            vmlinux      __list_add

and for NFS:

    vma              samples  cum. samples  %        cum. %   linenr info        app name  symbol name
    ffffffff8116b0f0 9239      9239          9.6682   9.6682  memcontrol.c:2362  vmlinux   __mem_cgroup_uncharge_common
    ffffffff81129490 7402     16641          7.7458  17.4140  vmscan.c:427       vmlinux   __remove_mapping
    ffffffff811243d0 7250     23891          7.5868  25.0008  page_alloc.c:1152  vmlinux   free_hot_cold_page
    ffffffff811234f0 6747     30638          7.0604  32.0612  page_alloc.c:592   vmlinux   free_pcppages_bulk
    ffffffff81127af0 5401     36039          5.6519  37.7131  swap.c:438         vmlinux   release_pages
    ffffffff81272930 5302     41341          5.5483  43.2614  radix-tree.c:1237  vmlinux   radix_tree_delete
    ffffffff811330d0 4319     45660          4.5196  47.7810  vmstat.c:282       vmlinux   __dec_zone_state
    ffffffff811284b0 4049     49709          4.2371  52.0181  truncate.c:173     vmlinux   invalidate_inode_page
    ffffffff81110a20 3666     53375          3.8363  55.8544  filemap.c:794      vmlinux   find_get_pages
    ffffffff8110fdb0 3601     56976          3.7683  59.6226  filemap.c:527      vmlinux   page_waitqueue
    ffffffff8127d650 3367     60343          3.5234  63.1461  list_debug.c:45    vmlinux   list_del
    ffffffff81110e10 3007     63350          3.1467  66.2927  filemap.c:579      vmlinux   unlock_page
    ffffffff81167cd0 2976     66326          3.1142  69.4070  memcontrol.c:731   vmlinux   mem_cgroup_del_lru_list
    ffffffff81120e00 2883     69209          3.0169  72.4239  page_alloc.c:5350  vmlinux   get_pageblock_flags_group
    ffffffff81128720 2686     71895          2.8108  75.2347  truncate.c:330     vmlinux   invalidate_mapping_pages
    ffffffff81091010 2193     74088          2.2949  77.5295  wait.c:251         vmlinux   __wake_up_bit
    ffffffff81167440 2175     76263          2.2760  79.8056  bit_spinlock.h:11  vmlinux   bit_spin_lock
    ffffffff81271c40 2100     78363          2.1975  82.0031  radix-tree.c:820   vmlinux   __lookup
    ...
    ffffffff8115ff60 182      89326          0.1905  93.4754  slab.c:3836        vmlinux   kmem_cache_free

It's interesting that kmem_cache_free accounts for 10% of CPU cycles in Lustre, but only 0.2% for NFS. It also seems we're spending a lot of time updating lprocfs counters. |
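For reference, the comparison can be scripted and run under oprofile in one pass. This is a minimal sketch only: the opcontrol/opreport flags and the vmlinux debuginfo path are assumptions, not taken from this ticket, and the file list is a placeholder to be adjusted for the filesystem under test.

    #!/bin/bash
    # Hypothetical reproduction of the timed drop_caches test under oprofile.
    FILES=/p/lstest/bass6/fio/read.*.0      # placeholder: any ~50 GB of files on the mount

    # Populate the page cache with clean, read-only data.
    for f in $FILES; do
        dd if="$f" of=/dev/null bs=1M > /dev/null 2>&1
    done

    # Profile only the cache-drop phase.  The vmlinux path must match the
    # running kernel's debuginfo package.
    opcontrol --init
    opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
    opcontrol --start

    grep ^Cached /proc/meminfo
    time echo 3 > /proc/sys/vm/drop_caches
    grep ^Cached /proc/meminfo

    opcontrol --stop
    opcontrol --dump
    opreport --symbols | head -30            # per-symbol profile like the tables above
    opcontrol --shutdown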
| Comments |
| Comment by Ned Bass [ 24/Aug/12 ] |
|
I made this plot of the ll_readpage() call graph with data from systemtap to help visualize where the time goes. It doesn't really show any egregious offenders, but it shows that Lustre does a lot of work overall to release a page. cl_env_nested_get(), cl_page_delete(), and cl_page_put() account for most of the time, with cl_page_delete() taking the longest. The instrumentation seriously inflates the timing, however; by my estimate a normal ll_readpage() call takes about 2 us, compared to 750 us in this graph. |
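As a point of reference, per-call timing of the kind quoted above can be gathered with a short SystemTap one-liner along these lines. This is a sketch only: it assumes ll_readpage() lives in the lustre.ko client module with debuginfo installed, and the probes themselves add exactly the sort of overhead the comment describes.

    # Average ll_readpage() latency via entry/return probes (sketch).
    stap -e '
    global start, total, calls
    probe module("lustre").function("ll_readpage") {
        start[tid()] = gettimeofday_us()
    }
    probe module("lustre").function("ll_readpage").return {
        if (tid() in start) {
            total += gettimeofday_us() - start[tid()]
            calls++
            delete start[tid()]
        }
    }
    probe end {
        printf("ll_readpage: %d calls, avg %d us\n", calls, calls ? total / calls : 0)
    }'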
| Comment by Peter Jones [ 26/Aug/12 ] |
|
Yujian, could you please look into this one? Thanks, Peter |
| Comment by Peter Jones [ 27/Aug/12 ] |
|
I had not realized that this was CLIO-related. Jinshan is already looking into CLIO performance, so it makes sense for him to consider this one too. |
| Comment by Prakash Surya (Inactive) [ 27/Aug/12 ] |
|
Ned, since systemtap "seriously inflates the timing", you might want to look into ftrace. IIRC, ftrace isn't as intrusive from a function execution run time perspective. Systemtap adds an order of magnitude to the time spent in each function (using its callgraph example stap script) when compared to the ftrace data on the same call paths. |
| Comment by Ned Bass [ 27/Aug/12 ] |
|
I tried ftrace but I wasn't able to display data for the lustre modules. I probably just haven't found the right invocation. |
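For what it's worth, ftrace can usually be limited to a module's functions with the ':mod:' filter command. A possible invocation is sketched below; it assumes debugfs is mounted at /sys/kernel/debug and uses the client module names that appear in the profile above (lustre, obdclass, osc), which may need adjusting.

    cd /sys/kernel/debug/tracing

    # Restrict tracing to the Lustre client modules seen in the oprofile data.
    echo '*:mod:lustre'   >  set_ftrace_filter
    echo '*:mod:obdclass' >> set_ftrace_filter
    echo '*:mod:osc'      >> set_ftrace_filter

    echo function_graph > current_tracer
    echo 1 > tracing_on

    echo 3 > /proc/sys/vm/drop_caches        # the operation being measured

    echo 0 > tracing_on
    head -50 trace                           # call graph with per-function times
    echo nop > current_tracer                # clean up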
| Comment by Jinshan Xiong (Inactive) [ 27/Aug/12 ] |
CLIO uses a lot of kmem_cache objects to manage pages, for example osc_page_kmem for osc_page and lov_page_kmem for lov_page, etc. |
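One cheap way to see those per-layer caches at work is to watch them in /proc/slabinfo while the cache is dropped. A sketch, using only the cache names mentioned above and assuming the slab allocator has not merged or renamed them:

    # Watch the per-layer page caches while clean pages are reclaimed.
    watch -n1 "grep -E 'osc_page_kmem|lov_page_kmem' /proc/slabinfo"

    # In another shell, trigger the reclaim being timed in this ticket:
    echo 3 > /proc/sys/vm/drop_caches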
| Comment by Nathan Rutman [ 07/Sep/12 ] |
|
We've noticed the same thing. Extra swapping causes significant slowdown in file reads. I'm attaching some graphs of memory usage over time, in 2.x (dropping the cache between tests and not) and 1.8.6; the differences are very obvious. |
| Comment by Nathan Rutman [ 07/Sep/12 ] |
|
See also |
| Comment by Christopher Morrone [ 07/Sep/12 ] |
|
I think those things might be somewhat different from what Ned describes, which is very specifically about the speed at which clean cache pages can be reclaimed. The first issue that you mention may or may not be related to this ticket; it could also be that the client's memory reclaim hooks are broken in CLIO. Some more information (the scale on the graphs' X axis, a description of the I/O load at the time) might help me see a clearer connection. The problem discussed in that email thread about the max_cache_mb limit no longer working in CLIO is definitely a known issue. See
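Assuming the limit referred to is the llite max_cached_mb tunable (the parameter name on 2.x clients; "max_cache_mb" in the thread is presumably shorthand for the same thing), it can be inspected and lowered with lctl:

    # Assumed parameter: llite.*.max_cached_mb, the client-side cap on cached
    # pages, in MB.  The value below is illustrative only.
    lctl get_param llite.*.max_cached_mb
    lctl set_param llite.*.max_cached_mb=4096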