[LU-14424] write performance regression in Lustre-2.14.0-RC1 Created: 11/Feb/21  Updated: 22/Feb/21  Resolved: 17/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Critical
Reporter: Shuichi Ihara Assignee: Shuichi Ihara
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9679 Prepare lustre for adoption into the ... Resolved
is related to LU-14443 review-dne-ssk test session failed: E... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

While running the performance regression tests, I found a write performance regression in Lustre-2.14.0-RC1 with a 4K transfer size.
Here are a reproducer and the test results.

# mpirun -np $NP -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --allow-run-as-root /work/tools/bin/ior -w -r -t 4k -b $((256/NP)) -e -v -F -o /ai400x/ior.out/file
Client Version     NP  Write(MB/s)  Read(MB/s)
Lustre-2.13.0       1          803        3293
Lustre-2.14.0-RC1   1          529        3092
Lustre-2.13.0      16         6962       12021
Lustre-2.14.0-RC1  16         5127       11951
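
For completeness, a minimal sketch of a wrapper that runs the reproducer at both process counts and extracts only the IOR summary lines (the mpirun/ior options are copied from the command above; adjust paths and interconnect settings for your environment):

#!/bin/bash
# Sketch: run the reproducer at NP=1 and NP=16 and print only the
# IOR bandwidth summary lines.
for NP in 1 16; do
    echo "== NP=$NP =="
    mpirun -np $NP -mca btl_openib_if_include mlx5_1:1 \
        -x UCX_NET_DEVICES=mlx5_1:1 --allow-run-as-root \
        /work/tools/bin/ior -w -r -t 4k -b $((256/NP)) -e -v -F \
        -o /ai400x/ior.out/file | grep -E 'Max (Write|Read)'
done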


 Comments   
Comment by Shuichi Ihara [ 11/Feb/21 ]

I will narrow this down and find which patch causes the problem.
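
For reference, a minimal sketch of how such a bisection can be driven with git bisect; the good/bad revisions and the 700 MB/s pass threshold below are illustrative, not the values actually used:

#!/bin/bash
# Illustrative bisect driver: a commit is "bad" when single-thread 4K
# write bandwidth falls below a threshold. Rebuilding and reloading the
# client modules between steps is elided here.
git bisect start
git bisect bad HEAD            # e.g. master at Lustre-2.14.0-RC1
git bisect good 2.13.0         # last known-good tag (name is illustrative)
git bisect run sh -c '
    make -j"$(nproc)" >/dev/null 2>&1 || exit 125   # skip unbuildable commits
    BW=$(mpirun -np 1 --allow-run-as-root /work/tools/bin/ior \
        -w -t 4k -b 256 -e -F -o /ai400x/ior.out/file |
        awk "/Max Write/ {print int(\$3)}")
    [ "$BW" -ge 700 ]          # below 700 MB/s => regression present
'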

Comment by Shuichi Ihara [ 12/Feb/21 ]

Here is the patch where we started to see this performance regression on the master branch.

commit 80e21cce3dd6748fd760786cafe9c26d502fd74f
Author: NeilBrown <neilb@suse.com>
Date:   Thu Dec 13 11:32:56 2018 +1100

    LU-9679 osc: simplify osc_extent_find()
    
    osc_extent_find() contains some code with the same functionality as
    osc_extent_merge().  So replace that code with a call to
    osc_extent_merge().
    
    This requires that we set cur->oe_grants earlier, as
    osc_extent_merge() needs that.
    
    Also:
    
     - fix a pre-existing bug - osc_extent_merge() should never try to
       merge two extends with different ->oe_mppr as later alignment
       checks can get confused.
     - Remove a redundant list_del_init() which is already included in
       __osc_extent_remove().
    
    Linux-Commit: 85ebb57ddc5b ("lustre: osc: simplify osc_extent_find()")

Here are the test results before and after commit 80e21cce3d:

80e21cce3d LU-9679 osc: simplify osc_extent_find()
Max Write: 540.67 MiB/sec (566.93 MB/sec)

9d914f9cc7 LU-13711 build: fix typo on SSL dependency for Ubuntu
Max Write: 797.24 MiB/sec (835.96 MB/sec)
Comment by Gerrit Updater [ 12/Feb/21 ]

Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41498
Subject: LU-14424 Revert "LU-9679 osc: simplify osc_extent_find()"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 94649b2121449fb80bf9d9b971db9be539db76b1

Comment by Andreas Dilger [ 12/Feb/21 ]

Some additional notes from my investigation:

  • the test workload is IOR file-per-process with 4KB buffered writes, and it looks like a single client with 1 and 16 threads
  • at 800 MiB/s that is 200k x 4KB writes/s, so the osc_extent_merge() function is called between 200k and 400k times per second (depending on whether it can do both front and back merges), so any extra overhead in this path will affect the bottom line
  • it isn't yet totally clear whether the problem is just more CPU overhead in osc_extent_merge(), or a subtle bug that is preventing extents from being merged and so sending more RPCs over the wire. It would be hard for plain CPU overhead alone to explain a 250MB/s slowdown, but this is a hot path, and CPU overhead is a major contributor to the IOPS number. Bad RPC formation would be the easier explanation for this performance delta.
  • going from in-line coding of the checks to calling osc_extent_merge() added a number of extra calls, which makes me wonder if we can improve performance by reducing overhead further in this call path

In terms of debugging the source of the slowdown there are a couple things to check:

  • look at osd-ldiskfs.*.brw_stats and/or osc.*.rpc_stats to see if the RPCs are badly formed (see the sketch below)
  • look at a flame graph from before and after to see where the extra CPU overhead is coming from
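
A minimal sketch of that data collection, assuming perf is available and using Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph); writing to a Lustre stats file clears it:

# Clear the histograms first, on the client and OSS respectively:
lctl set_param osc.*.rpc_stats=clear          # client
lctl set_param osd-ldiskfs.*.brw_stats=clear  # OSS

# ... run the IOR reproducer ...

lctl get_param osc.*.rpc_stats                # pages-per-RPC histogram
lctl get_param osd-ldiskfs.*.brw_stats        # on-disk I/O histograms

# Flame graph of client CPU during a 30s window of the run:
perf record -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > write-path.svg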
Comment by Gerrit Updater [ 13/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41498/
Subject: LU-14424 Revert "LU-9679 osc: simplify osc_extent_find()"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b592f75446ccd8fea790de4156478fd057c82019

Comment by James A Simmons [ 13/Feb/21 ]

While a revert works around this issue, as Andreas pointed out this is hot code, so further improvements could be made here.

Comment by Peter Jones [ 17/Feb/21 ]

Landed for 2.14 RC3

Comment by Neil Brown [ 19/Feb/21 ]

I believe the problem was that cur->oe_state wasn't being initialised properly.

I've uploaded a revised version of the reverted patch at https://review.whamcloud.com/41691

Comment by Shuichi Ihara [ 22/Feb/21 ]

Lustre-2.14.0-RC2

[root@ec01 ~]# lctl get_param llite.*.stats osc.*OST0001*.stats
llite.ai400x-ffff9f61e8aac000.stats=
snapshot_time             1613969894.227596637 secs.nsecs
read_bytes                67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
write_bytes               67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
read                      67108864 samples [usec] 0 19225 6728336 507600924
write                     67108864 samples [usec] 3 36876 447558029 554690588441
open                      2 samples [usec] 2 152 154 23108
close                     2 samples [usec] 157 179 336 56690
seek                      134217728 samples [usec] 0 178 95834 1085920
fsync                     1 samples [usec] 6210 6210 6210 38564100
getattr                   3 samples [usec] 18 385 568 175774
unlink                    2 samples [usec] 301 310 611 186701
mknod                     1 samples [usec] 309 309 309 95481
statfs                    1 samples [usec] 111 111 111 12321
inode_permission          26 samples [usec] 0 257 278 66138

osc.ai400x-OST0001-osc-ffff9f61e8aac000.stats=
snapshot_time             1613969894.227650896 secs.nsecs
req_waittime              73654 samples [usec] 127 50969 207491668 745114450474
req_active                73654 samples [reqs] 1 17 89410 125650
ldlm_extent_enqueue       1 samples [reqs] 1 1 1 1
read_bytes                16385 samples [bytes] 4096 16777216 274877906944 4611685881021988864
write_bytes               57262 samples [bytes] 4096 16777216 274877906944 1330662844261203968
ost_read                  16385 samples [usec] 218 50969 83226452 426080555072
ost_write                 57262 samples [usec] 176 35894 124263923 319033596483
ost_connect               2 samples [usec] 130 211 341 61421
ost_disconnect            2 samples [usec] 136 137 273 37265
ost_sync                  1 samples [usec] 402 402 402 161604
ldlm_cancel               1 samples [usec] 150 150 150 22500

[root@es400nvx1-vm1 ~]# lctl get_param osd-ldiskfs.*OST0001*.brw_stats
osd-ldiskfs.ai400x-OST0001.brw_stats=
snapshot_time:         1613969983.077008582 (secs.nsecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1:		         1   0   0   |   41   0   0
2:		         0   0   0   |   51   0   0
4:		         0   0   0   |   24   0   0
8:		         0   0   0   |    8   0   0
16:		         0   0   0   |    7   0   0
32:		         0   0   0   |    4   0   0
64:		         0   0   0   |    8   0   0
128:		         0   0   0   |    5   0   0
256:		         0   0   0   |    2   0   0
512:		         0   0   0   |    6   0   0
1K:		         0   0   0   |    7   0   0
2K:		         0   0   0   | 57025  99  99
4K:		     16384  99 100   |   75   0 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:		     16385 100 100   | 57255  99  99
1:		         0   0 100   |    8   0 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
0:		     16371  99  99   | 57225  99  99
1:		        14   0 100   |   38   0 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios         % cum %
1:		         1   0   0   |  150   0   0
2:		         0   0   0   |    6   0   0
3:		         0   0   0   |    6   0   0
4:		         0   0   0   |    1   0   0
5:		         0   0   0   | 57006  99  99
6:		         0   0   0   |   16   0  99
7:		         0   0   0   |    0   0  99
8:		         0   0   0   |    2   0  99
9:		         0   0   0   |    3   0  99
10:		         0   0   0   |   16   0  99
11:		         0   0   0   |    2   0  99
12:		         0   0   0   |    2   0  99
13:		         0   0   0   |    3   0  99
14:		         0   0   0   |    3   0  99
15:		         0   0   0   |    2   0  99
16:		     16384  99 100   |   44   0  99
17:		         0   0 100   |    1   0 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:		     16336   6   6   | 57148  19  19
2:		     16336   6  12   | 57173  19  39
3:		     16336   6  18   | 57131  19  59
4:		     16336   6  24   | 57103  19  79
5:		     16336   6  31   | 57096  19  99
6:		     16339   6  37   |   91   0  99
7:		     16338   6  43   |   75   0  99
8:		     16346   6  49   |   75   0  99
9:		     16355   6  56   |   73   0  99
10:		     16361   6  62   |   72   0  99
11:		     16368   6  68   |   56   0  99
12:		     16364   6  74   |   56   0  99
13:		     16373   6  81   |   54   0  99
14:		     16370   6  87   |   51   0  99
15:		     16371   6  93   |   48   0  99
16:		     16367   6  99   |   46   0  99
17:		        54   0  99   |    6   0  99
18:		        54   0  99   |    5   0  99
19:		        50   0  99   |    5   0  99
20:		        43   0  99   |    4   0  99
21:		        39   0  99   |    4   0  99
22:		        42   0  99   |    3   0  99
23:		        38   0  99   |    3   0  99
24:		        30   0  99   |    3   0  99
25:		        19   0  99   |    2   0  99
26:		        16   0  99   |    2   0  99
27:		        16   0  99   |    1   0  99
28:		        16   0  99   |    1   0  99
29:		        16   0  99   |    1   0  99
30:		        16   0  99   |    1   0  99
31:		        64   0 100   |    2   0 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %
1:		        96   0   0   | 57179  99  99
2:		     16255  99  99   |   63   0  99
4:		        25   0  99   |   11   0  99
8:		         9   0 100   |    6   0  99
16:		         0   0 100   |    4   0 100

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
4K:		         1   0   0   |   41   0   0
8K:		         0   0   0   |   51   0   0
16K:		         0   0   0   |   26   0   0
32K:		         0   0   0   |   10   0   0
64K:		         0   0   0   |   10   0   0
128K:		         0   0   0   |    7   0   0
256K:		         0   0   0   |   30   0   0
512K:		         0   0   0   |   34   0   0
1M:		    262144  99 100   | 286182  99 100

Lustre-2.14.0-RC3 (after the revert)

[root@ec01 llite]# lctl get_param llite.*.stats osc.*OST0001*.stats
llite.ai400x-ffff9f67d7c2b800.stats=
snapshot_time             1613971351.373233917 secs.nsecs
read_bytes                67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
write_bytes               67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
read                      67108864 samples [usec] 0 17888 7516054 452029440
write                     67108864 samples [usec] 3 22195 291816283 14370801533
ioctl                     1 samples [reqs]
open                      4 samples [usec] 2 182 440 66138
close                     4 samples [usec] 78 260 650 122374
seek                      134217728 samples [usec] 0 53 130227 1512001
fsync                     1 samples [usec] 17794 17794 17794 316626436
readdir                   8 samples [usec] 0 293 307 85925
getattr                   30 samples [usec] 0 1565 2334 2653018
unlink                    1 samples [usec] 366 366 366 133956
mknod                     1 samples [usec] 363 363 363 131769
statfs                    1 samples [usec] 106 106 106 11236
inode_permission          42 samples [usec] 0 28 72 938
osc.ai400x-OST0001-osc-ffff9f67d7c2b800.stats=
snapshot_time             1613971351.373308591 secs.nsecs
req_waittime              33094 samples [usec] 98 61857 168002250 975823530146
req_active                33094 samples [reqs] 1 18 46781 85103
ldlm_extent_enqueue       1 samples [reqs] 1 1 1 1
read_bytes                16385 samples [bytes] 4096 16777216 274877906944 4611685881021988864
write_bytes               16704 samples [bytes] 4096 16777216 274877906944 4607087317889318912
ost_read                  16385 samples [usec] 157 61857 82488488 433515710132
ost_write                 16704 samples [usec] 188 60092 85512821 542307585031
ost_connect               1 samples [usec] 115 115 115 13225
ost_disconnect            1 samples [usec] 148 148 148 21904
ost_sync                  1 samples [usec] 395 395 395 156025
ldlm_cancel               1 samples [usec] 185 185 185 34225

[root@es400nvx1-vm1 ~]# lctl get_param osd-ldiskfs.*OST0001*.brw_stats
osd-ldiskfs.ai400x-OST0001.brw_stats=
snapshot_time:         1613971438.900406702 (secs.nsecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1:		         1   0   0   |   76   0   0
2:		         0   0   0   |   80   0   0
4:		         0   0   0   |   22   0   1
8:		         0   0   0   |    4   0   1
16:		         0   0   0   |   22   0   1
32:		         0   0   0   |   32   0   1
64:		         0   0   0   |    8   0   1
128:		         0   0   0   |    6   0   1
256:		         0   0   0   |   21   0   1
512:		         0   0   0   |   33   0   1
1K:		         0   0   0   |    6   0   1
2K:		         0   0   0   |    8   0   1
4K:		     16384  99 100   | 16390  98 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:		     16385 100 100   | 16680  99  99
1:		         0   0 100   |   28   0 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
0:		     16333  99  99   | 16637  99  99
1:		        52   0 100   |   57   0  99
2:		         0   0 100   |   14   0 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios         % cum %
1:		         1   0   0   |  271   1   1
2:		         0   0   0   |   33   0   1
3:		         0   0   0   |    2   0   1
4:		         0   0   0   |    4   0   1
5:		         0   0   0   |    1   0   1
6:		         0   0   0   |    2   0   1
7:		         0   0   0   |    1   0   1
8:		         0   0   0   |    3   0   1
9:		         0   0   0   |    5   0   1
10:		         0   0   0   |    8   0   1
11:		         0   0   0   |    9   0   2
12:		         0   0   0   |    4   0   2
13:		         0   0   0   |    7   0   2
14:		         0   0   0   |    9   0   2
15:		         0   0   0   |   13   0   2
16:		     16384  99 100   | 16336  97 100

                           read      |     write
disk I/Os in flight    ios   % cum % |  ios         % cum %
1:		     16353   6   6   | 16506   6   6
2:		     16352   6  12   | 16478   6  12
3:		     16357   6  18   | 16439   6  18
4:		     16357   6  24   | 16401   6  25
5:		     16359   6  31   | 16383   6  31
6:		     16357   6  37   | 16382   6  37
7:		     16361   6  43   | 16381   6  43
8:		     16364   6  49   | 16381   6  50
9:		     16363   6  56   | 16384   6  56
10:		     16369   6  62   | 16382   6  62
11:		     16371   6  68   | 16377   6  68
12:		     16368   6  74   | 16369   6  75
13:		     16370   6  81   | 16370   6  81
14:		     16368   6  87   | 16367   6  87
15:		     16368   6  93   | 16357   6  93
16:		     16371   6  99   | 16349   6  99
17:		        28   0  99   |   21   0  99
18:		        32   0  99   |   20   0  99
19:		        31   0  99   |   19   0  99
20:		        30   0  99   |   15   0  99
21:		        32   0  99   |   15   0  99
22:		        27   0  99   |   14   0  99
23:		        26   0  99   |   13   0  99
24:		        22   0  99   |   10   0  99
25:		        20   0  99   |    8   0  99
26:		        10   0  99   |    6   0  99
27:		        10   0  99   |    6   0  99
28:		         9   0  99   |    6   0  99
29:		        12   0  99   |    4   0  99
30:		         8   0  99   |    2   0  99
31:		        40   0 100   |    2   0 100

                           read      |     write
I/O time (1/1000s)     ios   % cum % |  ios         % cum %
1:		        32   0   0   |  323   1   1
2:		     16325  99  99   | 16278  97  99
4:		        21   0  99   |   95   0  99
8:		         7   0 100   |   10   0  99
16:		         0   0 100   |    2   0 100

                           read      |     write
disk I/O size          ios   % cum % |  ios         % cum %
4K:		         1   0   0   |   77   0   0
8K:		         0   0   0   |   80   0   0
16K:		         0   0   0   |   22   0   0
32K:		         0   0   0   |    4   0   0
64K:		         0   0   0   |   28   0   0
128K:		         0   0   0   |   43   0   0
256K:		         0   0   0   |   29   0   0
512K:		         0   0   0   |   49   0   0
1M:		    262144  99 100   | 262135  99 100

The average write latency at the llite layer in lustre-2.14.0-RC2 was significantly higher than after the revert (lustre-2.14.0-RC3):
6.6 usec (447558029/67108864) in lustre-2.14.0-RC2 vs 4.3 usec (291816283/67108864) in lustre-2.14.0-RC3.
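
That average falls straight out of the llite stats lines above (field 2 is the sample count, field 7 the accumulated microseconds), e.g.:

# Sketch: mean write latency from the llite stats shown above.
lctl get_param -n llite.*.stats |
    awk '$1 == "write" && $4 == "[usec]" {printf "avg write: %.2f usec\n", $7/$2}'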
The other thing: in this case 16MB RPCs were allowed, but I didn't see many 16MB RPCs in lustre-2.14.0-RC2; writes mostly went out at up to 8MB. After the revert (lustre-2.14.0-RC3), the RPC size went up to 16MB when sent to the server.
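
A quick way to confirm this on the client (max_pages_per_rpc is the standard OSC tunable; with 4KB pages, 4096 pages per RPC corresponds to 16MB):

# Sketch: check the RPC size limit and the sizes actually sent.
lctl get_param osc.*OST0001*.max_pages_per_rpc
# The "pages per rpc" histogram should peak at 4K pages (16MB) when RPCs
# are well formed; compare with the brw_stats "pages per bulk r/w" above
# (2K pages = 8MB in RC2 vs 4K pages = 16MB in RC3).
lctl get_param osc.*OST0001*.rpc_stats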

Let me test Neil's new patch and update the results on LU-9679.
