[LU-9749] Reduce overhead for ll_do_fast_read Created: 07/Jul/17 Updated: 22/Jul/17 Resolved: 22/Jul/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
For small read sizes (1k and below), creating a cl_env can be over 90% of the time spent doing ll_do_fast_read. ll_do_fast_read doesn't really need a cl_env, it only uses it for some debug printing. Not prepping a cl_env for ll_do_fast_reads improves read rates (for already cached data) for 1 byte reads by ~20x, and 1k reads by ~10x. 4k reads are improved by ~5x. Patch coming shortly. |
| Comments |
| Comment by Gerrit Updater [ 07/Jul/17 ] |
|
Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/27970 |
| Comment by Patrick Farrell (Inactive) [ 17/Jul/17 ] |
|
4 threads of dd, as described in commit message. This is current master (with the lc_version fix to avoid refilling cached envs every time): Samples: 132K of event 'cpu-clock', Event count (approx.): 33152500000
Children Self Command Shared Object Symbol ◆
- 82.92% 0.09% dd [kernel.kallsyms] [k] system_call_fastpath ▒
- 82.82% system_call_fastpath ▒
- 73.17% sys_read ▒
- 71.71% vfs_read ▒
- 68.77% ll_file_read ▒
- 42.16% ll_file_aio_read ▒
- 15.10% cl_env_put ▒
4.70% _raw_read_lock ▒
- 10.15% cl_env_get ▒
3.58% _raw_read_lock ▒
0.80% __list_del_entry ▒
- 6.65% generic_file_aio_read ▒
+ 2.11% __find_get_page ▒
0.92% put_page ▒
0.70% touch_atime ▒
0.52% file_read_actor ▒
- 3.28% ll_cl_add ▒
1.60% _raw_write_lock ▒
- 1.87% ll_cl_remove ▒
0.67% _raw_write_lock ▒
+ 0.90% ll_stats_ops_tally ▒
- 11.90% cl_env_put ▒
3.00% _raw_read_lock ▒
0.51% lu_context_exit ▒
- 11.71% cl_env_get ▒
4.06% _raw_read_lock ▒
1.34% __list_del_entry ▒
+ 1.43% rw_verify_area ▒
0.63% fget_light ▒
+ 8.79% sys_write ▒
0.75% fget_light ▒
+ 80.65% 1.31% dd libc-2.17.so [.] __GI___libc_read ▒
+ 73.88% 1.32% dd [kernel.kallsyms] [k] sys_read ▒
+ 71.78% 0.72% dd [kernel.kallsyms] [k] vfs_read ▒
+ 68.82% 1.48% dd [kernel.kallsyms] [k] ll_file_read ▒
Tip: Search options using a keyword: perf report -h <keyword>
Notice all the time spent looking up the cl_env, etc. |
| Comment by Patrick Farrell (Inactive) [ 17/Jul/17 ] |
|
Here's with the current patch: Samples: 125K of event 'cpu-clock', Event count (approx.): 31399500000
Children Self Command Shared Object Symbol ◆
- 60.22% 0.24% dd [kernel.kallsyms] [k] system_call_fastpath ▒
- 59.98% system_call_fastpath ▒
- 40.61% sys_read ▒
- 37.48% vfs_read ▒
- 29.90% ll_file_read ▒
- 25.80% ll_file_aio_read ▒
- 16.66% generic_file_aio_read ▒
+ 7.20% __find_get_page ▒
2.13% put_page ▒
1.38% touch_atime ▒
1.07% file_read_actor ▒
0.61% copy_user_generic_unrolled ▒
+ 2.68% ll_stats_ops_tally ▒
0.75% ll_file_get_iov_count ▒
0.53% iov_iter_advance ▒
+ 3.01% rw_verify_area ▒
1.70% fsnotify ▒
1.27% fget_light ▒
+ 17.66% sys_write ▒
1.56% fget_light ▒
+ 57.28% 2.79% dd libc-2.17.so [.] __GI___libc_read ▒
+ 41.80% 2.66% dd [kernel.kallsyms] [k] sys_read ▒
+ 37.55% 2.16% dd [kernel.kallsyms] [k] vfs_read ▒
+ 34.48% 3.03% dd libc-2.17.so [.] __GI___libc_write ▒
+ 29.98% 3.43% dd [kernel.kallsyms] [k] ll_file_read ▒
+ 26.02% 4.53% dd [kernel.kallsyms] [k] ll_file_aio_read ▒
+ 19.25% 2.96% dd [kernel.kallsyms] [k] sys_write ▒
+ 16.86% 3.66% dd [kernel.kallsyms] [k] generic_file_aio_read ▒
+ 14.38% 1.60% dd [kernel.kallsyms] [k] vfs_write ▒
+ 14.20% 6.25% dd [kernel.kallsyms] [k] fsnotify ▒
+ 8.68% 0.38% dd [kernel.kallsyms] [k] sysret_audit ▒
+ 7.90% 7.90% dd [kernel.kallsyms] [k] system_call_after_swapgs▒
+ 7.73% 4.76% dd [kernel.kallsyms] [k] __audit_syscall_exit ▒
+ 7.27% 4.35% dd [kernel.kallsyms] [k] __find_get_page ▒
+ 4.68% 0.52% dd [kernel.kallsyms] [k] auditsys ▒
Tip: See assembly instructions with percentage: perf annotate <symbol>
Much better. |
| Comment by Patrick Farrell (Inactive) [ 17/Jul/17 ] |
|
And here's with the version suggested by Jinshan, with cl_env_percpu_get used. Timings are identical to above, and code's a little simpler: Samples: 176K of event 'cpu-clock', Event count (approx.): 44168250000
Children Self Command Shared Object Symbol ◆
- 60.22% 0.25% dd [kernel.kallsyms] [k] system_call_fastpath ▒
- 59.97% system_call_fastpath ▒
- 40.04% sys_read ▒
- 36.98% vfs_read ▒
- 29.83% ll_file_read ▒
- 25.86% ll_file_aio_read ▒
- 17.08% generic_file_aio_read ▒
+ 8.25% __find_get_page ▒
2.18% put_page ▒
1.33% touch_atime ▒
1.02% file_read_actor ▒
- 2.41% ll_stats_ops_tally ▒
1.72% lprocfs_counter_add ▒
0.77% ll_file_get_iov_count ▒
0.52% iov_iter_advance ▒
+ 2.90% rw_verify_area ▒
1.49% fsnotify ▒
1.21% fget_light ▒
+ 18.21% sys_write ▒
1.58% fget_light ▒
+ 56.70% 2.80% dd libc-2.17.so [.] __GI___libc_read ▒
+ 41.41% 2.86% dd [kernel.kallsyms] [k] sys_read ▒
+ 37.04% 2.07% dd [kernel.kallsyms] [k] vfs_read ▒
+ 34.36% 2.84% dd libc-2.17.so [.] __GI___libc_write ▒
+ 29.90% 3.28% dd [kernel.kallsyms] [k] ll_file_read ▒
+ 26.09% 4.50% dd [kernel.kallsyms] [k] ll_file_aio_read ▒
+ 19.95% 3.06% dd [kernel.kallsyms] [k] sys_write ▒
+ 17.33% 3.24% dd [kernel.kallsyms] [k] generic_file_aio_read ▒
+ 15.20% 1.91% dd [kernel.kallsyms] [k] vfs_write ▒
+ 13.48% 5.68% dd [kernel.kallsyms] [k] fsnotify ▒
+ 8.65% 0.43% dd [kernel.kallsyms] [k] sysret_audit ▒
+ 8.32% 5.64% dd [kernel.kallsyms] [k] __find_get_page ▒
+ 7.69% 4.83% dd [kernel.kallsyms] [k] __audit_syscall_exit ▒
+ 7.65% 7.65% dd [kernel.kallsyms] [k] system_call_after_swapg▒
+ 5.21% 2.32% dd [kernel.kallsyms] [k] rw_verify_area ▒
Tip: To change sampling frequency to 100 Hz: perf record -F 100
|
| Comment by Gerrit Updater [ 22/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27970/ |
| Comment by Minh Diep [ 22/Jul/17 ] |
|
l |