Lustre / LU-14424

write performance regression in Lustre-2.14.0-RC1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.14.0
    • Affects Version: Lustre 2.14.0
    • Labels: None
    • Severity: 2

    Description

      While I was running the performance regression tests, I found a write performance regression in Lustre-2.14.0-RC1 with 4K transfer size.
      Here are a reproducer and test results.

      # mpirun -np $NP -mca btl_openib_if_include mlx5_1:1 -x UCX_NET_DEVICES=mlx5_1:1 --allow-run-as-root /work/tools/bin/ior -w -r -t 4k -b $((256/NP)) -e -v -F -o /ai400x/ior.out/file
      
      Client Version     NP  Write(MB/s)  Read(MB/s)
      Lustre-2.13.0       1          803        3293
      Lustre-2.14.0-RC1   1          529        3092
      Lustre-2.13.0      16         6962       12021
      Lustre-2.14.0-RC1  16         5127       11951
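
      As a quick sketch, the relative write slowdown in the table above works out to roughly a third of the bandwidth at NP=1 and a quarter at NP=16:

```python
# Relative write slowdown computed from the IOR table above.
# Keys are NP (process count); values are (2.13.0, 2.14.0-RC1) write MB/s.
results = {
    1:  (803, 529),
    16: (6962, 5127),
}

for np_procs, (baseline, rc1) in results.items():
    slowdown = 100 * (baseline - rc1) / baseline
    print(f"NP={np_procs}: {slowdown:.1f}% slower")  # 34.1% and 26.4%
```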

Attachments

Issue Links

Activity

[LU-14424] write performance regression in Lustre-2.14.0-RC1

            Lustre-2.14.0-RC2

            [root@ec01 ~]# lctl get_param llite.*.stats osc.*OST0001*.stats
            llite.ai400x-ffff9f61e8aac000.stats=
            snapshot_time             1613969894.227596637 secs.nsecs
            read_bytes                67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
            write_bytes               67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
            read                      67108864 samples [usec] 0 19225 6728336 507600924
            write                     67108864 samples [usec] 3 36876 447558029 554690588441
            open                      2 samples [usec] 2 152 154 23108
            close                     2 samples [usec] 157 179 336 56690
            seek                      134217728 samples [usec] 0 178 95834 1085920
            fsync                     1 samples [usec] 6210 6210 6210 38564100
            getattr                   3 samples [usec] 18 385 568 175774
            unlink                    2 samples [usec] 301 310 611 186701
            mknod                     1 samples [usec] 309 309 309 95481
            statfs                    1 samples [usec] 111 111 111 12321
            inode_permission          26 samples [usec] 0 257 278 66138
            
            osc.ai400x-OST0001-osc-ffff9f61e8aac000.stats=
            snapshot_time             1613969894.227650896 secs.nsecs
            req_waittime              73654 samples [usec] 127 50969 207491668 745114450474
            req_active                73654 samples [reqs] 1 17 89410 125650
            ldlm_extent_enqueue       1 samples [reqs] 1 1 1 1
            read_bytes                16385 samples [bytes] 4096 16777216 274877906944 4611685881021988864
            write_bytes               57262 samples [bytes] 4096 16777216 274877906944 1330662844261203968
            ost_read                  16385 samples [usec] 218 50969 83226452 426080555072
            ost_write                 57262 samples [usec] 176 35894 124263923 319033596483
            ost_connect               2 samples [usec] 130 211 341 61421
            ost_disconnect            2 samples [usec] 136 137 273 37265
            ost_sync                  1 samples [usec] 402 402 402 161604
            ldlm_cancel               1 samples [usec] 150 150 150 22500
            
            [root@es400nvx1-vm1 ~]# lctl get_param osd-ldiskfs.*OST0001*.brw_stats
            osd-ldiskfs.ai400x-OST0001.brw_stats=
            snapshot_time:         1613969983.077008582 (secs.nsecs)
            
                                       read      |     write
            pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
            1:		         1   0   0   |   41   0   0
            2:		         0   0   0   |   51   0   0
            4:		         0   0   0   |   24   0   0
            8:		         0   0   0   |    8   0   0
            16:		         0   0   0   |    7   0   0
            32:		         0   0   0   |    4   0   0
            64:		         0   0   0   |    8   0   0
            128:		         0   0   0   |    5   0   0
            256:		         0   0   0   |    2   0   0
            512:		         0   0   0   |    6   0   0
            1K:		         0   0   0   |    7   0   0
            2K:		         0   0   0   | 57025  99  99
            4K:		     16384  99 100   |   75   0 100
            
                                       read      |     write
            discontiguous pages    rpcs  % cum % |  rpcs        % cum %
            0:		     16385 100 100   | 57255  99  99
            1:		         0   0 100   |    8   0 100
            
                                       read      |     write
            discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
            0:		     16371  99  99   | 57225  99  99
            1:		        14   0 100   |   38   0 100
            
                                       read      |     write
            disk fragmented I/Os   ios   % cum % |  ios         % cum %
            1:		         1   0   0   |  150   0   0
            2:		         0   0   0   |    6   0   0
            3:		         0   0   0   |    6   0   0
            4:		         0   0   0   |    1   0   0
            5:		         0   0   0   | 57006  99  99
            6:		         0   0   0   |   16   0  99
            7:		         0   0   0   |    0   0  99
            8:		         0   0   0   |    2   0  99
            9:		         0   0   0   |    3   0  99
            10:		         0   0   0   |   16   0  99
            11:		         0   0   0   |    2   0  99
            12:		         0   0   0   |    2   0  99
            13:		         0   0   0   |    3   0  99
            14:		         0   0   0   |    3   0  99
            15:		         0   0   0   |    2   0  99
            16:		     16384  99 100   |   44   0  99
            17:		         0   0 100   |    1   0 100
            
                                       read      |     write
            disk I/Os in flight    ios   % cum % |  ios         % cum %
            1:		     16336   6   6   | 57148  19  19
            2:		     16336   6  12   | 57173  19  39
            3:		     16336   6  18   | 57131  19  59
            4:		     16336   6  24   | 57103  19  79
            5:		     16336   6  31   | 57096  19  99
            6:		     16339   6  37   |   91   0  99
            7:		     16338   6  43   |   75   0  99
            8:		     16346   6  49   |   75   0  99
            9:		     16355   6  56   |   73   0  99
            10:		     16361   6  62   |   72   0  99
            11:		     16368   6  68   |   56   0  99
            12:		     16364   6  74   |   56   0  99
            13:		     16373   6  81   |   54   0  99
            14:		     16370   6  87   |   51   0  99
            15:		     16371   6  93   |   48   0  99
            16:		     16367   6  99   |   46   0  99
            17:		        54   0  99   |    6   0  99
            18:		        54   0  99   |    5   0  99
            19:		        50   0  99   |    5   0  99
            20:		        43   0  99   |    4   0  99
            21:		        39   0  99   |    4   0  99
            22:		        42   0  99   |    3   0  99
            23:		        38   0  99   |    3   0  99
            24:		        30   0  99   |    3   0  99
            25:		        19   0  99   |    2   0  99
            26:		        16   0  99   |    2   0  99
            27:		        16   0  99   |    1   0  99
            28:		        16   0  99   |    1   0  99
            29:		        16   0  99   |    1   0  99
            30:		        16   0  99   |    1   0  99
            31:		        64   0 100   |    2   0 100
            
                                       read      |     write
            I/O time (1/1000s)     ios   % cum % |  ios         % cum %
            1:		        96   0   0   | 57179  99  99
            2:		     16255  99  99   |   63   0  99
            4:		        25   0  99   |   11   0  99
            8:		         9   0 100   |    6   0  99
            16:		         0   0 100   |    4   0 100
            
                                       read      |     write
            disk I/O size          ios   % cum % |  ios         % cum %
            4K:		         1   0   0   |   41   0   0
            8K:		         0   0   0   |   51   0   0
            16K:		         0   0   0   |   26   0   0
            32K:		         0   0   0   |   10   0   0
            64K:		         0   0   0   |   10   0   0
            128K:		         0   0   0   |    7   0   0
            256K:		         0   0   0   |   30   0   0
            512K:		         0   0   0   |   34   0   0
            1M:		    262144  99 100   | 286182  99 100
            

            Lustre-2.14.0-RC3 (after reverted patch)

            [root@ec01 llite]# lctl get_param llite.*.stats osc.*OST0001*.stats
            llite.ai400x-ffff9f67d7c2b800.stats=
            snapshot_time             1613971351.373233917 secs.nsecs
            read_bytes                67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
            write_bytes               67108864 samples [bytes] 4096 4096 274877906944 1125899906842624
            read                      67108864 samples [usec] 0 17888 7516054 452029440
            write                     67108864 samples [usec] 3 22195 291816283 14370801533
            ioctl                     1 samples [reqs]
            open                      4 samples [usec] 2 182 440 66138
            close                     4 samples [usec] 78 260 650 122374
            seek                      134217728 samples [usec] 0 53 130227 1512001
            fsync                     1 samples [usec] 17794 17794 17794 316626436
            readdir                   8 samples [usec] 0 293 307 85925
            getattr                   30 samples [usec] 0 1565 2334 2653018
            unlink                    1 samples [usec] 366 366 366 133956
            mknod                     1 samples [usec] 363 363 363 131769
            statfs                    1 samples [usec] 106 106 106 11236
            inode_permission          42 samples [usec] 0 28 72 938
            osc.ai400x-OST0001-osc-ffff9f67d7c2b800.stats=
            snapshot_time             1613971351.373308591 secs.nsecs
            req_waittime              33094 samples [usec] 98 61857 168002250 975823530146
            req_active                33094 samples [reqs] 1 18 46781 85103
            ldlm_extent_enqueue       1 samples [reqs] 1 1 1 1
            read_bytes                16385 samples [bytes] 4096 16777216 274877906944 4611685881021988864
            write_bytes               16704 samples [bytes] 4096 16777216 274877906944 4607087317889318912
            ost_read                  16385 samples [usec] 157 61857 82488488 433515710132
            ost_write                 16704 samples [usec] 188 60092 85512821 542307585031
            ost_connect               1 samples [usec] 115 115 115 13225
            ost_disconnect            1 samples [usec] 148 148 148 21904
            ost_sync                  1 samples [usec] 395 395 395 156025
            ldlm_cancel               1 samples [usec] 185 185 185 34225
            
            [root@es400nvx1-vm1 ~]# lctl get_param osd-ldiskfs.*OST0001*.brw_stats
            osd-ldiskfs.ai400x-OST0001.brw_stats=
            snapshot_time:         1613971438.900406702 (secs.nsecs)
            
                                       read      |     write
            pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
            1:		         1   0   0   |   76   0   0
            2:		         0   0   0   |   80   0   0
            4:		         0   0   0   |   22   0   1
            8:		         0   0   0   |    4   0   1
            16:		         0   0   0   |   22   0   1
            32:		         0   0   0   |   32   0   1
            64:		         0   0   0   |    8   0   1
            128:		         0   0   0   |    6   0   1
            256:		         0   0   0   |   21   0   1
            512:		         0   0   0   |   33   0   1
            1K:		         0   0   0   |    6   0   1
            2K:		         0   0   0   |    8   0   1
            4K:		     16384  99 100   | 16390  98 100
            
                                       read      |     write
            discontiguous pages    rpcs  % cum % |  rpcs        % cum %
            0:		     16385 100 100   | 16680  99  99
            1:		         0   0 100   |   28   0 100
            
                                       read      |     write
            discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
            0:		     16333  99  99   | 16637  99  99
            1:		        52   0 100   |   57   0  99
            2:		         0   0 100   |   14   0 100
            
                                       read      |     write
            disk fragmented I/Os   ios   % cum % |  ios         % cum %
            1:		         1   0   0   |  271   1   1
            2:		         0   0   0   |   33   0   1
            3:		         0   0   0   |    2   0   1
            4:		         0   0   0   |    4   0   1
            5:		         0   0   0   |    1   0   1
            6:		         0   0   0   |    2   0   1
            7:		         0   0   0   |    1   0   1
            8:		         0   0   0   |    3   0   1
            9:		         0   0   0   |    5   0   1
            10:		         0   0   0   |    8   0   1
            11:		         0   0   0   |    9   0   2
            12:		         0   0   0   |    4   0   2
            13:		         0   0   0   |    7   0   2
            14:		         0   0   0   |    9   0   2
            15:		         0   0   0   |   13   0   2
            16:		     16384  99 100   | 16336  97 100
            
                                       read      |     write
            disk I/Os in flight    ios   % cum % |  ios         % cum %
            1:		     16353   6   6   | 16506   6   6
            2:		     16352   6  12   | 16478   6  12
            3:		     16357   6  18   | 16439   6  18
            4:		     16357   6  24   | 16401   6  25
            5:		     16359   6  31   | 16383   6  31
            6:		     16357   6  37   | 16382   6  37
            7:		     16361   6  43   | 16381   6  43
            8:		     16364   6  49   | 16381   6  50
            9:		     16363   6  56   | 16384   6  56
            10:		     16369   6  62   | 16382   6  62
            11:		     16371   6  68   | 16377   6  68
            12:		     16368   6  74   | 16369   6  75
            13:		     16370   6  81   | 16370   6  81
            14:		     16368   6  87   | 16367   6  87
            15:		     16368   6  93   | 16357   6  93
            16:		     16371   6  99   | 16349   6  99
            17:		        28   0  99   |   21   0  99
            18:		        32   0  99   |   20   0  99
            19:		        31   0  99   |   19   0  99
            20:		        30   0  99   |   15   0  99
            21:		        32   0  99   |   15   0  99
            22:		        27   0  99   |   14   0  99
            23:		        26   0  99   |   13   0  99
            24:		        22   0  99   |   10   0  99
            25:		        20   0  99   |    8   0  99
            26:		        10   0  99   |    6   0  99
            27:		        10   0  99   |    6   0  99
            28:		         9   0  99   |    6   0  99
            29:		        12   0  99   |    4   0  99
            30:		         8   0  99   |    2   0  99
            31:		        40   0 100   |    2   0 100
            
                                       read      |     write
            I/O time (1/1000s)     ios   % cum % |  ios         % cum %
            1:		        32   0   0   |  323   1   1
            2:		     16325  99  99   | 16278  97  99
            4:		        21   0  99   |   95   0  99
            8:		         7   0 100   |   10   0  99
            16:		         0   0 100   |    2   0 100
            
                                       read      |     write
            disk I/O size          ios   % cum % |  ios         % cum %
            4K:		         1   0   0   |   77   0   0
            8K:		         0   0   0   |   80   0   0
            16K:		         0   0   0   |   22   0   0
            32K:		         0   0   0   |    4   0   0
            64K:		         0   0   0   |   28   0   0
            128K:		         0   0   0   |   43   0   0
            256K:		         0   0   0   |   29   0   0
            512K:		         0   0   0   |   49   0   0
            1M:		    262144  99 100   | 262135  99 100
            

            The average write latency at the llite layer in lustre-2.14.0-RC2 was significantly higher than after the patch was reverted (lustre-2.14.0-RC3):
            6.6 usec (447558029/67108864) in lustre-2.14.0-RC2 vs 4.3 usec (291816283/67108864) in lustre-2.14.0-RC3.
            The other thing: 16M RPCs were allowed in this case, but I didn't see many 16M RPCs in lustre-2.14.0-RC2; writes mostly went out in RPCs of up to 8MB. After the patch was reverted (lustre-2.14.0-RC3), the RPC size went up to 16MB before being sent to the server.
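
            Both figures can be recomputed from the raw counters quoted above. A small sketch, assuming the stat-line field layout visible in the output (name, count, "samples", unit, min, max, sum, sum-of-squares):

```python
# Recompute the averages quoted above from the lctl stats lines.
# Assumed field layout: name  count samples [unit]  min max sum sumsq
def stat_avg(line):
    parts = line.split()
    count, total = int(parts[1]), int(parts[6])
    return total / count

rc2 = "write 67108864 samples [usec] 3 36876 447558029 554690588441"
rc3 = "write 67108864 samples [usec] 3 22195 291816283 14370801533"
print(f"RC2 avg llite write latency: {stat_avg(rc2):.1f} usec")  # 6.7
print(f"RC3 avg llite write latency: {stat_avg(rc3):.1f} usec")  # 4.3

# Average BRW write RPC size from the osc write_bytes counters
# (sum of bytes / number of write RPCs):
print(f"RC2 avg write RPC: {274877906944 / 57262 / 2**20:.1f} MiB")  # ~4.6
print(f"RC3 avg write RPC: {274877906944 / 16704 / 2**20:.1f} MiB")  # ~15.7
```

            The RPC-size averages match the brw_stats histograms: RC2 peaks at 2K pages per bulk (8MB) while RC3 peaks at 4K pages (16MB).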

            Let me test Neil's new patch and update results on LU-9679.

            sihara Shuichi Ihara added a comment
            neilb Neil Brown added a comment -

            I believe the problem was that cur->oe_state wasn't being initialised properly.

            I've uploaded a revised version of the reverted patch at https://review.whamcloud.com/41691

             

            pjones Peter Jones added a comment -

            Landed for 2.14 RC3


            simmonsja James A Simmons added a comment -

            While a revert works around this issue, as Andreas pointed out this is hot code, so more improvements could be made here.

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41498/
            Subject: LU-14424 Revert "LU-9679 osc: simplify osc_extent_find()"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b592f75446ccd8fea790de4156478fd057c82019


            Some added notes here from my investigation:

            • the test workload is IOR file-per-process with 4KB buffered writes, and it looks like a single client with 1 and 16 threads
            • at 800MiB/s that is 200k x 4KB writes/s, so the osc_extent_merge() function is called between 200k/s and 400k/s (depending on whether it can do both front and back merges), so any extra overhead in this path will affect the bottom line
            • it isn't yet totally clear whether the problem is just more CPU overhead in osc_extent_merge(), or whether there is a subtle bug that prevents extents from being merged and sends more RPCs over the wire. Plain CPU overhead alone would be hard-pressed to explain a 250MB/s slowdown, but this is a hot path, and CPU overhead is a major contributor to the IOPS number. Bad RPC formation would be an easier explanation for this performance delta.
            • going from in-line coding of the checks to calling osc_extent_merge() added a number of extra calls, which makes me wonder if we can improve performance by reducing more overhead in this call path
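
            The call-rate estimate in the second bullet is straightforward arithmetic:

```python
# At ~800 MiB/s of 4 KiB buffered writes, the client completes
# roughly 200k write() calls per second; osc_extent_merge() then
# runs 200k-400k times/s depending on front and/or back merge attempts.
mib_per_sec = 800
writes_per_sec = mib_per_sec * 1024 // 4   # 4 KiB per write
print(writes_per_sec)  # 204800
```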

            In terms of debugging the source of the slowdown there are a couple things to check:

            • look at osd-ldiskfs.*.brw_stats and/or osc.*.rpc_stats to see if it is badly formed RPCs
            • look at the flame graph before/after to see where the extra CPU overhead is coming from
            adilger Andreas Dilger added a comment

            Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41498
            Subject: LU-14424 Revert "LU-9679 osc: simplify osc_extent_find()"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 94649b2121449fb80bf9d9b971db9be539db76b1


            Here is the patch where we started to see this perf regression in the master branch.

            commit 80e21cce3dd6748fd760786cafe9c26d502fd74f
            Author: NeilBrown <neilb@suse.com>
            Date:   Thu Dec 13 11:32:56 2018 +1100
            
                LU-9679 osc: simplify osc_extent_find()
                
                osc_extent_find() contains some code with the same functionality as
                osc_extent_merge().  So replace that code with a call to
                osc_extent_merge().
                
                This requires that we set cur->oe_grants earlier, as
                osc_extent_merge() needs that.
                
                Also:
                
                 - fix a pre-existing bug - osc_extent_merge() should never try to
                   merge two extends with different ->oe_mppr as later alignment
                   checks can get confused.
                 - Remove a redundant list_del_init() which is already included in
                   __osc_extent_remove().
                
                Linux-Commit: 85ebb57ddc5b ("lustre: osc: simplify osc_extent_find()")
            

            Here are the test results before/after commit 80e21cce3d.

            80e21cce3d LU-9679 osc: simplify osc_extent_find()
            Max Write: 540.67 MiB/sec (566.93 MB/sec)
            
            9d914f9cc7 LU-13711 build: fix typo on SSL dependency for Ubuntu
            Max Write: 797.24 MiB/sec (835.96 MB/sec)
            
            sihara Shuichi Ihara added a comment

            sihara Shuichi Ihara added a comment -

            I will narrow down and find which patch causes the problem.

            People

              Assignee: sihara Shuichi Ihara
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: