[LU-16433] single client performance regression in SSF workload Created: 24/Dec/22  Updated: 18/Feb/23  Resolved: 03/Jan/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2
Fix Version/s: Lustre 2.16.0, Lustre 2.15.2

Type: Bug Priority: Minor
Reporter: Shuichi Ihara Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre-2.15.2, Rocky Linux 8.6 (4.18.0-372.32.1.el8_6.x86_64), OFED-5.4-3.6.8.1


Issue Links:
Related
is related to LU-15959 support for SLES 15 SP4 Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A client performance regression was found in 2.15.2-RC1 (commit:e21498bcaa).
The tested workload is a single client running SSF (single shared file) I/O from 16 processes.

# mpirun -np 16 ior -a POSIX -i 1 -d 10 -w -r -b 16g -t 1m -C -Q 17 -e -vv -o //exafs/d0/d1/d2/ost_stripe/file 

lustre-2.15.1

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     2489.25    2489.28    0.006428    16777216   1024.00    0.000936   105.31     0.000238   105.31     0   
read      4176       4176       0.003803    16777216   1024.00    0.001695   62.77      3.92       62.77      0   
write     2423.58    2423.60    0.006452    16777216   1024.00    0.000586   108.16     2.45       108.16     1   
read      4197       4197       0.003652    16777216   1024.00    0.001982   62.46      3.98       62.46      1   
write     2502.32    2502.34    0.006375    16777216   1024.00    0.000404   104.76     0.305282   104.76     2   
read      4211       4211       0.003683    16777216   1024.00    0.001679   62.25      3.99       62.25      2   

Max Write: 2502.32 MiB/sec (2623.88 MB/sec)
Max Read:  4211.19 MiB/sec (4415.75 MB/sec)

lustre-2.15.2-RC1

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     2103.65    2103.68    0.007142    16777216   1024.00    0.001769   124.61     7.60       124.61     0   
read      4204       4204       0.003159    16777216   1024.00    0.001461   62.35      10.59      62.35      0   
write     2169.58    2169.69    0.006903    16777216   1024.00    0.000912   120.82     7.72       120.83     1   
read      4282       4282       0.003722    16777216   1024.00    0.137671   61.22      2.78       61.22      1   
write     2133.24    2133.25    0.007500    16777216   1024.00    0.000380   122.88     3.60       122.89     2   
read      4088       4088       0.003689    16777216   1024.00    0.001053   64.13      3.68       64.13      2  

Max Write: 2169.58 MiB/sec (2274.97 MB/sec)
Max Read:  4282.19 MiB/sec (4490.20 MB/sec)

This is a ~14% write performance regression in 2.15.2-RC1 compared to lustre-2.15.1; read bandwidth is essentially unchanged.

After investigation, 'git bisect' identified commit 6d4559f6b948a93aaf5e94c4eb47cd9ebcf7ba95 ("LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1]") as the cause of this performance regression.

Here is another test result after reverting the patch "LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1]" from lustre-2.15.2-RC1; it confirms that performance returned to the same level as 2.15.1.

lustre-2.15.2-RC1 + reverted commit:6d4559f6b9 (LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1])

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     2497.41    2497.44    0.006407    16777216   1024.00    0.001115   104.97     0.000791   104.97     0   
read      4217       4217       0.003773    16777216   1024.00    0.001680   62.16      3.37       62.16      0   
write     2471.13    2471.14    0.006475    16777216   1024.00    0.000375   106.08     0.000292   106.08     1   
read      4083       4083       0.003765    16777216   1024.00    0.001659   64.20      3.23       64.20      1   
write     2457.91    2457.92    0.006509    16777216   1024.00    0.000412   106.65     0.010367   106.65     2   
read      4163       4163       0.003771    16777216   1024.00    0.001909   62.97      6.35       62.97      2   

Max Write: 2497.41 MiB/sec (2618.72 MB/sec)
Max Read:  4217.39 MiB/sec (4422.25 MB/sec)


 Comments   
Comment by Peter Jones [ 24/Dec/22 ]

Jian

Is this something that can be avoided in the LU-15959 change?

Peter

Comment by Jian Yu [ 24/Dec/22 ]

In patch https://review.whamcloud.com/47924 ("LU-15959 kernel: new kernel [SLES15 SP4 5.14.21-150400.24.18.1]"), the following changes are related:

lustre/llite/vvp_internal.h
-#ifndef HAVE_ACCOUNT_PAGE_DIRTIED_EXPORT
+#if !defined(HAVE_ACCOUNT_PAGE_DIRTIED_EXPORT) || \
+    defined(HAVE_KALLSYMS_LOOKUP_NAME)
 extern unsigned int (*vvp_account_page_dirtied)(struct page *page,
                                                struct address_space *mapping);
 #endif
lustre/llite/vvp_io.c
/* kernels without HAVE_KALLSYMS_LOOKUP_NAME also don't have account_dirty_page
 * exported, and if we can't access that symbol, we can't do page dirtying in
 * batch (taking the xarray lock only once) so we just fall back to a looped
 * call to __set_page_dirty_nobuffers
 */
#ifndef HAVE_KALLSYMS_LOOKUP_NAME
	for (i = 0; i < count; i++)
		__set_page_dirty_nobuffers(pvec->pages[i]);
#else
+       /*
+        * In kernel 5.14.21, kallsyms_lookup_name is defined but
+        * account_page_dirtied is not exported.
+        */
+       if (!vvp_account_page_dirtied) {
+               for (i = 0; i < count; i++)
+                       __set_page_dirty_nobuffers(pvec->pages[i]);
+               goto end;
+       }
+

In the Rocky Linux 8.6 kernel 4.18.0-372.32.1.el8_6.x86_64, both account_page_dirtied and kallsyms_lookup_name are exported. So, I need to change the check of vvp_account_page_dirtied to use HAVE_ACCOUNT_PAGE_DIRTIED_EXPORT. This resolves the client performance regression issue on Rocky Linux 8.6.
However, for the SLES15 SP4 client, I'm not sure how to resolve the issue, since account_page_dirtied is not exported there and we have to use __set_page_dirty_nobuffers.
I'm working on a patch to fix the issue on Rocky Linux 8.6.
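
For reference, a condensed sketch of the selection logic described above. This is illustrative, not the actual lustre/llite code: the helper names vvp_resolve_page_dirtied() and vvp_set_pagevec_dirty_sketch() are made up, while the HAVE_* macros are the real Lustre configure checks shown in the diff.

/*
 * Condensed, illustrative sketch of the three cases discussed above.
 * Helper names are made up; only the HAVE_* configure macros and the
 * kernel symbols match the real code.
 */
#include <linux/mm.h>
#include <linux/pagevec.h>
#include <linux/kallsyms.h>

/* resolved once, e.g. at module init */
static unsigned int (*vvp_account_page_dirtied)(struct page *page,
						struct address_space *mapping);

static void vvp_resolve_page_dirtied(void)
{
#if defined(HAVE_ACCOUNT_PAGE_DIRTIED_EXPORT)
	/* e.g. Rocky Linux 8.6: the symbol is exported, use it directly */
	vvp_account_page_dirtied = (void *)account_page_dirtied;
#elif defined(HAVE_KALLSYMS_LOOKUP_NAME)
	/* "intermediate" kernels: find the unexported symbol at runtime */
	vvp_account_page_dirtied =
		(void *)kallsyms_lookup_name("account_page_dirtied");
#endif
	/* may remain NULL, e.g. SLES15 SP4 5.14.21, where neither works */
}

static void vvp_set_pagevec_dirty_sketch(struct pagevec *pvec)
{
	int i;

	if (!vvp_account_page_dirtied) {
		/* slow path: dirty each page individually */
		for (i = 0; i < pagevec_count(pvec); i++)
			__set_page_dirty_nobuffers(pvec->pages[i]);
		return;
	}

	/*
	 * Fast path (body elided): account every page as dirty via
	 * vvp_account_page_dirtied() while taking the mapping's xarray
	 * lock only once; losing this path is the regression above.
	 */
}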

Comment by Gerrit Updater [ 25/Dec/22 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49512
Subject: LU-16433 llite: define and check vvp_account_page_dirtied
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 11b721714311ce9f11a596eaa13c368d27096d96

Comment by Shuichi Ihara [ 27/Dec/22 ]

Confirmed that patch https://review.whamcloud.com/c/fs/lustre-release/+/49512 solves the problem and performance is back.
lustre-2.15.2-RC1 + patch https://review.whamcloud.com/c/fs/lustre-release/+/49512

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     2440.73    2440.76    0.006555    16777216   1024.00    0.001139   107.40     0.000249   107.40     0   
read      4027       4027       0.003897    16777216   1024.00    0.001635   65.09      3.60       65.09      0   
write     2427.14    2427.15    0.006584    16777216   1024.00    0.000384   108.00     0.126996   108.01     1   
read      4132       4132       0.003715    16777216   1024.00    0.001663   63.44      5.11       63.44      1   
write     2421.75    2421.76    0.006581    16777216   1024.00    0.000384   108.25     1.39       108.25     2   
read      4082       4082       0.003875    16777216   1024.00    0.001668   64.22      3.72       64.22      2   

Max Write: 2440.73 MiB/sec (2559.29 MB/sec)
Max Read:  4132.11 MiB/sec (4332.83 MB/sec)

Comment by Gerrit Updater [ 28/Dec/22 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49520
Subject: LU-16433 llite: check vvp_account_page_dirtied
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: b95cf135117cd24ef5403aa111ae82fd14215efb

Comment by Gerrit Updater [ 03/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49512/
Subject: LU-16433 llite: check vvp_account_page_dirtied
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 61c4c2b5e5d7d919149921d913322586ba5ebcab

Comment by Gerrit Updater [ 03/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49520/
Subject: LU-16433 llite: check vvp_account_page_dirtied
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 1c6e03a53cb374c10cf2d9e5a22fdb304f81e8bf

Comment by Peter Jones [ 03/Jan/23 ]

Landed for 2.16

Comment by Andreas Dilger [ 16/Feb/23 ]

Should this issue be re-opened to investigate/address the performance loss for newer kernels?

I don't think it is only SLES15 SP4 that is affected, but any kernel since Linux 5.2 where account_page_dirtied() is not exported, like Ubuntu 22.04 and RHEL 9.x. The patch landed here defers the problem for as long as kallsyms_lookup_name() can work around the missing export, but that function is also removed in newer kernels.

There should be some way that we can work with the new page cache more efficiently for large page ranges, since that is what xarray and folios are supposed to be for...

Comment by Patrick Farrell [ 16/Feb/23 ]

We could re-open it, but as it stands, xarray is just a re-API of the radix tree, and non-single-page-folios aren't supported in the page cache yet.  Setting folios aside, last I checked, the operations we'd need to do much in batch aren't exported.

At the very least, my focus is on the DIO stuff - I'm more interested in pushing buffered I/O through the DIO path once unaligned support is fully working.  That would offer much larger gains.  (Not that it's not worth working on the buffered path, but ...)

So re-opening is probably a decent idea, but I wouldn't prioritize it.

Comment by Patrick Farrell [ 17/Feb/23 ]

sihara, whether we re-open this or not, be aware this problem exists in Linux 5.2 and newer (and there is no obvious way to fix it). So, as Andreas said, Ubuntu 22.04+ and RHEL 9.

Comment by Shaun Tancheff [ 17/Feb/23 ]

I would note that 2.15.2-RC1 does not have the LU-16433 fix. Could you check whether applying it fixes the performance regression?

Comment by Patrick Farrell [ 17/Feb/23 ]

Shaun,

I don't totally understand your question - the performance regression is about whether or not we have access to the necessary symbols to do things in batch. This patch fixes it for some 'intermediate' kernels, where we can still use kallsyms_lookup_name() to find non-exported symbols, but that's gone in newer kernels. So we know exactly why and where the regression is occurring.

If HPE is interested in avoiding the regression on intermediate kernels for 2.15, you could push the patch to b2_15 and I think we'd be happy to land it.  But we have no solution for the latest kernels.

Comment by Andreas Dilger [ 18/Feb/23 ]

Shaun, I see patch https://review.whamcloud.com/49520 "LU-16433 llite: check vvp_account_page_dirtied" on b2_15, which is the cherry-picked version of Jian's 49512 patch that fixed the problem on master. It looks like it was included in 2.15.2, so you just need to update your tree.

Comment by Shaun Tancheff [ 18/Feb/23 ]

Sorry, I didn't read through the collapsed comments.

Patrick is correct. After the removal of kallsyms_lookup_name(), we do not have a way to acquire account_page_dirtied / folio_account_dirtied directly.

On the plus side, it looks like we might be able to 'vectorize' folio_account_dirtied and provide a local vvp_account_dirtied_folios() for those kernels.

There is now a vvp_set_folio_dirty_batched() under LU-16577 that may be useful.
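
For illustration only, the per-folio fallback such kernels are limited to might look like the sketch below, using just the exported folio_mark_dirty(); the function name vvp_set_folios_dirty_looped() is hypothetical, not from any landed patch.

/* Hypothetical fallback for folio-based kernels where neither
 * account_page_dirtied/folio_account_dirtied nor kallsyms_lookup_name()
 * is available: each folio_mark_dirty() call takes the mapping's locks
 * separately, so there is no batching.
 */
#include <linux/mm.h>
#include <linux/pagevec.h>

static void vvp_set_folios_dirty_looped(struct folio_batch *fbatch)
{
	unsigned int i;

	for (i = 0; i < folio_batch_count(fbatch); i++)
		folio_mark_dirty(fbatch->folios[i]);
}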
