[LU-9749] Reduce overhead for ll_do_fast_read Created: 07/Jul/17  Updated: 22/Jul/17  Resolved: 22/Jul/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: Patrick Farrell (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

For small read sizes (1k and below), creating a cl_env can be over 90% of the time spent doing ll_do_fast_read.

ll_do_fast_read doesn't really need a cl_env, it only uses it for some debug printing. Not prepping a cl_env for ll_do_fast_reads improves read rates (for already cached data) for 1 byte reads by ~20x, and 1k reads by ~10x. 4k reads are improved by ~5x.

Patch coming shortly.



 Comments   
Comment by Gerrit Updater [ 07/Jul/17 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/27970
Subject: LU-9749 llite: Reduce overhead for ll_do_fast_read
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f5fd490e35662395f59111358ad4eb09c5d9f3c5

Comment by Patrick Farrell (Inactive) [ 17/Jul/17 ]

4 threads of dd, as described in commit message.

This is current master (with the lc_version fix to avoid refilling cached envs every time):

Samples: 132K of event 'cpu-clock', Event count (approx.): 33152500000                  
  Children      Self  Command       Shared Object       Symbol                         ◆
-   82.92%     0.09%  dd            [kernel.kallsyms]   [k] system_call_fastpath       ▒
   - 82.82% system_call_fastpath                                                       ▒
      - 73.17% sys_read                                                                ▒
         - 71.71% vfs_read                                                             ▒
            - 68.77% ll_file_read                                                      ▒
               - 42.16% ll_file_aio_read                                               ▒
                  - 15.10% cl_env_put                                                  ▒
                       4.70% _raw_read_lock                                            ▒
                  - 10.15% cl_env_get                                                  ▒
                       3.58% _raw_read_lock                                            ▒
                       0.80% __list_del_entry                                          ▒
                  - 6.65% generic_file_aio_read                                        ▒
                     + 2.11% __find_get_page                                           ▒
                       0.92% put_page                                                  ▒
                       0.70% touch_atime                                               ▒
                       0.52% file_read_actor                                           ▒
                  - 3.28% ll_cl_add                                                    ▒
                       1.60% _raw_write_lock                                           ▒
                  - 1.87% ll_cl_remove                                                 ▒
                       0.67% _raw_write_lock                                           ▒
                  + 0.90% ll_stats_ops_tally                                           ▒
               - 11.90% cl_env_put                                                     ▒
                    3.00% _raw_read_lock                                               ▒
                    0.51% lu_context_exit                                              ▒
               - 11.71% cl_env_get                                                     ▒
                    4.06% _raw_read_lock                                               ▒
                    1.34% __list_del_entry                                             ▒
            + 1.43% rw_verify_area                                                     ▒
           0.63% fget_light                                                            ▒
      + 8.79% sys_write                                                                ▒
        0.75% fget_light                                                               ▒
+   80.65%     1.31%  dd            libc-2.17.so        [.] __GI___libc_read           ▒
+   73.88%     1.32%  dd            [kernel.kallsyms]   [k] sys_read                   ▒
+   71.78%     0.72%  dd            [kernel.kallsyms]   [k] vfs_read                   ▒
+   68.82%     1.48%  dd            [kernel.kallsyms]   [k] ll_file_read               ▒
Tip: Search options using a keyword: perf report -h <keyword>

Notice all the time spent looking up the cl_env, etc.

Comment by Patrick Farrell (Inactive) [ 17/Jul/17 ]

Here's with the current patch:

Samples: 125K of event 'cpu-clock', Event count (approx.): 31399500000                  
  Children      Self  Command          Shared Object       Symbol                      ◆
-   60.22%     0.24%  dd               [kernel.kallsyms]   [k] system_call_fastpath    ▒
   - 59.98% system_call_fastpath                                                       ▒
      - 40.61% sys_read                                                                ▒
         - 37.48% vfs_read                                                             ▒
            - 29.90% ll_file_read                                                      ▒
               - 25.80% ll_file_aio_read                                               ▒
                  - 16.66% generic_file_aio_read                                       ▒
                     + 7.20% __find_get_page                                           ▒
                       2.13% put_page                                                  ▒
                       1.38% touch_atime                                               ▒
                       1.07% file_read_actor                                           ▒
                       0.61% copy_user_generic_unrolled                                ▒
                  + 2.68% ll_stats_ops_tally                                           ▒
                    0.75% ll_file_get_iov_count                                        ▒
                    0.53% iov_iter_advance                                             ▒
            + 3.01% rw_verify_area                                                     ▒
              1.70% fsnotify                                                           ▒
           1.27% fget_light                                                            ▒
      + 17.66% sys_write                                                               ▒
        1.56% fget_light                                                               ▒
+   57.28%     2.79%  dd               libc-2.17.so        [.] __GI___libc_read        ▒
+   41.80%     2.66%  dd               [kernel.kallsyms]   [k] sys_read                ▒
+   37.55%     2.16%  dd               [kernel.kallsyms]   [k] vfs_read                ▒
+   34.48%     3.03%  dd               libc-2.17.so        [.] __GI___libc_write       ▒
+   29.98%     3.43%  dd               [kernel.kallsyms]   [k] ll_file_read            ▒
+   26.02%     4.53%  dd               [kernel.kallsyms]   [k] ll_file_aio_read        ▒
+   19.25%     2.96%  dd               [kernel.kallsyms]   [k] sys_write               ▒
+   16.86%     3.66%  dd               [kernel.kallsyms]   [k] generic_file_aio_read   ▒
+   14.38%     1.60%  dd               [kernel.kallsyms]   [k] vfs_write               ▒
+   14.20%     6.25%  dd               [kernel.kallsyms]   [k] fsnotify                ▒
+    8.68%     0.38%  dd               [kernel.kallsyms]   [k] sysret_audit            ▒
+    7.90%     7.90%  dd               [kernel.kallsyms]   [k] system_call_after_swapgs▒
+    7.73%     4.76%  dd               [kernel.kallsyms]   [k] __audit_syscall_exit    ▒
+    7.27%     4.35%  dd               [kernel.kallsyms]   [k] __find_get_page         ▒
+    4.68%     0.52%  dd               [kernel.kallsyms]   [k] auditsys                ▒
Tip: See assembly instructions with percentage: perf annotate <symbol>

Much better.

Comment by Patrick Farrell (Inactive) [ 17/Jul/17 ]

And here's with the version suggested by Jinshan, with cl_env_percpu_get used. Timings are identical to above, and code's a little simpler:

Samples: 176K of event 'cpu-clock', Event count (approx.): 44168250000                  
  Children      Self  Command          Shared Object        Symbol                     ◆
-   60.22%     0.25%  dd               [kernel.kallsyms]    [k] system_call_fastpath   ▒
   - 59.97% system_call_fastpath                                                       ▒
      - 40.04% sys_read                                                                ▒
         - 36.98% vfs_read                                                             ▒
            - 29.83% ll_file_read                                                      ▒
               - 25.86% ll_file_aio_read                                               ▒
                  - 17.08% generic_file_aio_read                                       ▒
                     + 8.25% __find_get_page                                           ▒
                       2.18% put_page                                                  ▒
                       1.33% touch_atime                                               ▒
                       1.02% file_read_actor                                           ▒
                  - 2.41% ll_stats_ops_tally                                           ▒
                       1.72% lprocfs_counter_add                                       ▒
                    0.77% ll_file_get_iov_count                                        ▒
                    0.52% iov_iter_advance                                             ▒
            + 2.90% rw_verify_area                                                     ▒
              1.49% fsnotify                                                           ▒
           1.21% fget_light                                                            ▒
      + 18.21% sys_write                                                               ▒
        1.58% fget_light                                                               ▒
+   56.70%     2.80%  dd               libc-2.17.so         [.] __GI___libc_read       ▒
+   41.41%     2.86%  dd               [kernel.kallsyms]    [k] sys_read               ▒
+   37.04%     2.07%  dd               [kernel.kallsyms]    [k] vfs_read               ▒
+   34.36%     2.84%  dd               libc-2.17.so         [.] __GI___libc_write      ▒
+   29.90%     3.28%  dd               [kernel.kallsyms]    [k] ll_file_read           ▒
+   26.09%     4.50%  dd               [kernel.kallsyms]    [k] ll_file_aio_read       ▒
+   19.95%     3.06%  dd               [kernel.kallsyms]    [k] sys_write              ▒
+   17.33%     3.24%  dd               [kernel.kallsyms]    [k] generic_file_aio_read  ▒
+   15.20%     1.91%  dd               [kernel.kallsyms]    [k] vfs_write              ▒
+   13.48%     5.68%  dd               [kernel.kallsyms]    [k] fsnotify               ▒
+    8.65%     0.43%  dd               [kernel.kallsyms]    [k] sysret_audit           ▒
+    8.32%     5.64%  dd               [kernel.kallsyms]    [k] __find_get_page        ▒
+    7.69%     4.83%  dd               [kernel.kallsyms]    [k] __audit_syscall_exit   ▒
+    7.65%     7.65%  dd               [kernel.kallsyms]    [k] system_call_after_swapg▒
+    5.21%     2.32%  dd               [kernel.kallsyms]    [k] rw_verify_area         ▒
Tip: To change sampling frequency to 100 Hz: perf record -F 100
Comment by Gerrit Updater [ 22/Jul/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27970/
Subject: LU-9749 llite: Reduce overhead for ll_do_fast_read
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c084c6215851d238d14b0d414374b6b55c91f525

Comment by Minh Diep [ 22/Jul/17 ]

l

Generated at Sat Feb 10 02:28:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.