[LU-6365] Eliminate unnecessary loop in lu_cache_shrink to improve performance Created: 13/Mar/15 Updated: 18/Jul/16 Resolved: 10/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Ann Koehler (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
Customer has a test application that tries to allocate 20000 2M huge pages. After the node has be up and running for some period of time and fragmentation has occurred, the allocation takes several minutes before failing. # time aprun -m20000h -L 2168 -n 2 -N 2 -ss -cc cpu -j 2 -d 24 ./prog2 > > Started at 11:34:02 > > Node Name Huge Page(MB) MemFree(MB) Small0 (MB) Small1(MB) Huge0(MB) Huge1(MB) > 0 nid02168 40000 19908 9814 1109 0 8398 > > Allocated 19800MB > At address 10000080000 > Application 48079344 is crashing. ATP analysis proceeding... > > ATP Stack walkback for Rank 0 starting: > start_thread@0x2aaaaee6f805 > _new_slave_entry@0x2aaaab3b0f79 > memcheck__cray$mt$p0001@prog2.f90:103 > sub_@sub.f90:5 > sub2_@sub2.f90:9 > touch_@touch.f90:1 > ATP Stack walkback for Rank 0 done > Process died with signal 7: 'Bus error' > Forcing core dumps of ranks 0, 1 [skip] > [NID 02168] 2015-03-09 11:34:13 Apid 48079344: Huge page could not be allocated. Process terminated via bus error. > Application 48079344 exit codes: 139 > Application 48079344 resources: utime ~92s, stime ~9s, Rss ~123412, inblocks ~628, outblocks ~98 > > real 2m19.012s > user 0m0.984s > sys 0m0.172s In researching the slowness, I noticed that the lu_cache_shrinker is taking noticeably longer to execute than the other shrinkers. Kernel tracing - second column is time
<...>-35127 [013] 2965136.085636: shrink_slab <-do_try_to_free_pages
<...>-35127 [013] 2965136.085636: down_read_trylock <-shrink_slab
<...>-35127 [013] 2965136.085636: shrink_dcache_memory <-shrink_slab
<...>-35127 [013] 2965136.085636: shrink_icache_memory <-shrink_slab
<...>-35127 [013] 2965136.085636: shrink_dqcache_memory <-shrink_slab
<...>-35127 [013] 2965136.085637: lu_cache_shrink <-shrink_slab
<...>-35127 [013] 2965136.085990: enc_pools_shrink <-shrink_slab
<...>-35127 [013] 2965136.085990: ldlm_pools_srv_shrink <-shrink_slab
<...>-35127 [013] 2965136.085994: ldlm_pools_cli_shrink <-shrink_slab
<...>-35127 [013] 2965136.086004: up_read <-shrink_slab
<...>-35127 [013] 2965136.086004: _cond_resched <-shrink_slab
<...>-35127 [013] 2965136.086006: shrink_slab <-do_try_to_free_pages
Most of the time in lu_cache_shrink is spent repeatedly getting a spinlock in lu_site_stats_get(). > <...>-46696 [012] 4654974.718236: cfs_hash_hd_hhead_size <-lu_site_stats_get > <...>-46696 [012] 4654974.718237: _raw_spin_lock <-cfs_hash_spin_lock The lu_cache_shrink algothithm loops over the entries in lu_sites (8 in this customer's case) calling lu_site_stats_get for each to compute the number of freeable objects. lu_site_stats_get itself loops over each bucket in the ls_obj_hash (256 buckets) to summing the lengths of the lru list in each bucket. The debug data suggests that most of the time taken by lu_cache_shrink is spent in lu_site_stats_get. This overhead can be eliminated simply by keeping an aggregated total of the lsb_lru_len from all buckets in the lu_site struct. In other words, keep a running count of total lru objects rather recomputing the total each time lu_cache_shrink is called. |
| Comments |
| Comment by Gerrit Updater [ 13/Mar/15 ] |
|
Ann Koehler (amk@cray.com) uploaded a new patch: http://review.whamcloud.com/14066 |
| Comment by Ann Koehler (Inactive) [ 13/Mar/15 ] |
| Comment by Ann Koehler (Inactive) [ 14/Jul/15 ] |
|
The following performance stats were collected from a single client with 2 Lustre file systems mounted. The data is from the Ftrace kernel tracing facility. The traces are for calls to lu_cache_shrink while trying to allocate hugepages with the command: echo 50000 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages Function profile output without patch:
Function Hit Time Avg s^2
-------- --- ---- --- ---
shrink_slab 380928 112320774 us 294.860 us 29.819 us
lu_cache_shrink 380928 103623988 us 272.030 us 25.932 us
lu_site_stats_get 761856 102356451 us 134.351 us 11.891 us
ldlm_pools_shrink 761856 7272808 us 9.546 us 18.779 us
ldlm_pools_cli_shrink 380928 5328679 us 13.988 us 1.275 us
ldlm_pools_srv_shrink 380928 2113786 us 5.549 us 0.492 us
enc_pools_shrink 380928 428753.4 us 1.125 us 0.098 us
shrink_icache_memory 380928 24950.44 us 0.065 us 0.006 us
shrink_dcache_memory 380928 24717.95 us 0.064 us 0.004 us
shrink_dqcache_memory 380928 23187.41 us 0.060 us 0.005 us
Function profile output with patch:
Function Hit Time Avg s^2
-------- --- ---- --- ---
shrink_slab 421376 9721834 us 23.071 us 25.307 us
ldlm_pools_shrink 842752 7500651 us 8.900 us 14.925 us
ldlm_pools_cli_shrink 421376 5375698 us 12.757 us 2.718 us
ldlm_pools_srv_shrink 421376 2302504 us 5.464 us 0.514 us
lu_cache_shrink 421376 635577.7 us 1.508 us 0.127 us
enc_pools_shrink 421376 486304.1 us 1.154 us 0.099 us
shrink_icache_memory 421378 30136.85 us 0.071 us 20.470 us
shrink_dcache_memory 421376 26503.42 us 0.062 us 0.003 us
shrink_dqcache_memory 421376 25776.27 us 0.061 us 0.004 us
Function graph output without patch:
7) | shrink_slab() {
7) 0.084 us | down_read_trylock();
7) 0.063 us | shrink_dcache_memory();
7) 0.063 us | shrink_icache_memory();
7) 0.074 us | shrink_dqcache_memory();
7) | lu_cache_shrink() {
7) ! 1160.507 us | }
7) | enc_pools_shrink() {
7) 2.748 us | }
7) | ldlm_pools_srv_shrink() {
7) + 13.010 us | }
7) | ldlm_pools_cli_shrink() {
7) + 27.887 us | }
7) ! 1210.108 us | }
Function graph output with patch:
6) | shrink_slab() {
6) 0.182 us | down_read_trylock();
6) 0.061 us | shrink_dcache_memory();
6) 0.071 us | shrink_icache_memory();
6) 0.062 us | shrink_dqcache_memory();
6) | lu_cache_shrink() {
6) 5.886 us | }
6) | enc_pools_shrink() {
6) 2.905 us | }
6) | ldlm_pools_srv_shrink() {
6) + 15.619 us | }
6) | ldlm_pools_cli_shrink() {
6) + 35.413 us | }
6) + 67.345 us | }
Multiple measures were taken. Here's a brief summary: Time in Time in %shrink_slab time
shrink_slab lu_cache_shrink in lu_cache_shrink
no patch 1210.108 us 1160.507 us 95%
no patch 623.078 us 576.420 us 92%
patch 67.345 us 5.886 us 8.7%
patch 54.808 us 2.604 us 4.75%
|
| Comment by Gerrit Updater [ 09/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14066/ |
| Comment by Peter Jones [ 10/Aug/15 ] |
|
Landed for 2.8 |
| Comment by Jay Lan (Inactive) [ 30/Oct/15 ] |
|
When trying to back port the commit to 2.5.3-fe, I got a conflict in lu_object_put(): <<<<<<< HEAD Could you provide me a back port? Or do I need to pick up another commit before this? |
| Comment by Ann Koehler (Inactive) [ 30/Oct/15 ] |
|
Attached 2.5.3 patch file. If this doesn't work, give me a copy of your lustre/obdclass/lu_object.c and I'll make the necessary changes. |
| Comment by Jay Lan (Inactive) [ 30/Oct/15 ] |
|
The conflict I had was caused by two LU's that I do not have: |
| Comment by Ann Koehler (Inactive) [ 11/Nov/15 ] |
|
There's no dependency on (Sorry for the slow reply. I was out of the office and not reading e-mail.) |