Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.15.0

    Description

      Seeing debug messages on the console during large RPC writes:

      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1947:osd_trans_start()) work-OST0004: credits 24347 > trans_max 12800
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1876:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1883:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/15/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1893:osd_trans_dump_creds())   write: 2/24182/0, punch: 0/0/0, quota 5/149/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1900:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1907:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
      

      These are likely caused by the patches from LU-14134 landing, which changed the way transaction sizes are calculated (in theory reducing the transaction size, but not always).

      We might need an interface to check how the bytes-per-extent allocation changes as the filesystem fills up, and also a debug message showing how many credits are calculated in osd_declare_write_commit().
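
      Something along these lines could track such a value (a minimal userspace sketch, not Lustre code; the 7/8 decay factor and all names are illustrative assumptions):

      #include <stdio.h>

      /*
       * Minimal sketch: keep a decayed running average of "bytes allocated
       * per extent" so credit estimates can shrink as the filesystem fills
       * up and allocations become more fragmented.
       */
      static unsigned long extent_bytes_avg = 1048576;  /* start optimistic: 1 MiB */

      static void update_extent_bytes(unsigned long bytes, unsigned int extents)
      {
              unsigned long sample = bytes / (extents ? extents : 1);

              /* exponential moving average: 7/8 old value + 1/8 new sample */
              extent_bytes_avg = (extent_bytes_avg * 7 + sample) / 8;
      }

      static unsigned int estimate_extents(unsigned long write_bytes)
      {
              /* expected number of extents for a write of this size */
              return (write_bytes + extent_bytes_avg - 1) / extent_bytes_avg;
      }

      int main(void)
      {
              unsigned int chunks[] = { 2, 3, 5 };  /* fragments seen per 4 MiB write */

              for (int i = 0; i < 3; i++) {
                      update_extent_bytes(4UL << 20, chunks[i]);
                      printf("avg bytes/extent %lu, est. extents for a 4 MiB write: %u\n",
                             extent_bytes_avg, estimate_extents(4UL << 20));
              }
              return 0;
      }

      The comments below show the landed patch exposing this kind of value as the osd-ldiskfs.*.extent_bytes_allocation parameter.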

          Activity

            [LU-14641] per extents bytes allocation stats

            I'm wondering if I should try installing https://review.whamcloud.com/43232 "LU-14438 ldiskfs: improvements to mballoc" to see if that is improving the initial allocations or not, although this is already doing pretty good considering how little free space is available.

            adilger Andreas Dilger added a comment

            Even though that OST is almost 98% full, the free space is relatively unfragmented (88% of the free space is in chunks of 4MB or larger):

            e2freefrag /dev/vgmyth/lvmythost3 
            Device: /dev/vgmyth/lvmythost3
            Blocksize: 4096 bytes
            /dev/vgmyth/lvmythost3: Operation not supported while calling fsmap
            Total blocks: 1460371456
            Free blocks: 91182550 (6.2%)
            
            Min. free extent: 4 KB 
            Max. free extent: 3869388 KB
            Avg. free extent: 5676 KB
            Num. free extent: 63654
            
            HISTOGRAM OF FREE EXTENT SIZES:
            Extent Size Range :  Free extents   Free Blocks  Percent
                4K...    8K-  :            80            80    0.00%
                8K...   16K-  :           103           247    0.00%
               16K...   32K-  :           180           949    0.00%
               32K...   64K-  :           368          4098    0.00%
               64K...  128K-  :           667         15328    0.02%
              128K...  256K-  :          1177         54808    0.06%
              256K...  512K-  :          1850        175030    0.19%
              512K... 1024K-  :          3523        686778    0.75%
                1M...    2M-  :          7456       2293985    2.52%
                2M...    4M-  :          9451       6161759    6.76%
                4M...    8M-  :         19705      28308110   31.05%
                8M...   16M-  :         19017      43864690   48.11%
               16M...   32M-  :             5         26840    0.03%
               32M...   64M-  :            12        147600    0.16%
               64M...  128M-  :             9        229679    0.25%
              128M...  256M-  :            11        544954    0.60%
              256M...  512M-  :            23       2171087    2.38%
              512M... 1024M-  :             9       1600006    1.75%
                1G...    2G-  :             6       2256410    2.47%
                2G...    4G-  :             2       1799598    1.97%
            

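            A quick check of the histogram above (summing the free-block counts for the 4MB-and-larger buckets) confirms the roughly 88% figure; a small standalone check:

            #include <stdio.h>

            int main(void)
            {
                    /* free-block counts for the 4M..8M through 2G..4G buckets above */
                    unsigned long ge_4mb[] = {
                            28308110, 43864690, 26840, 147600, 229679,
                            544954, 2171087, 1600006, 2256410, 1799598
                    };
                    unsigned long total_free = 91182550;  /* "Free blocks" line above */
                    unsigned long sum = 0;

                    for (int i = 0; i < 10; i++)
                            sum += ge_4mb[i];

                    /* prints roughly 88.8% */
                    printf("free space in >=4MB chunks: %lu of %lu blocks (%.1f%%)\n",
                           sum, total_free, 100.0 * sum / total_free);
                    return 0;
            }
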
            However, that makes me wonder whether the allocator is actually doing a bad job for large writes and using up small chunks of space. Checking some recently-written files, I see that the initial allocations are relatively small, but once the file is over about 4 MB in size it gets good allocations, so this could be related to the way the file is being written by the client (slowly, over an hour):

            filefrag -v /myth/tv/2063_20210509184200.ts
            Filesystem type is: bd00bd0
            File size of /myth/tv/2063_20210509184200.ts is 2535750768 (2476320 blocks of 1024 bytes)
             ext:     device_logical:        physical_offset: length:  dev: flags:
               0:        0..     351:  980128408.. 980128759:    352: 0003: net
               1:      352..     707:  980129392.. 980129747:    356: 0003: net
               2:      708..    2607:  980148224.. 980150123:   1900: 0003: net
               3:     2608..    2855:  980138756.. 980139003:    248: 0003: net
               4:     2856..    2915:  980129748.. 980129807:     60: 0003: net
               5:     2916..    3579:  980139004.. 980139667:    664: 0003: net
               6:     3580..  131071: 1320420716..1320548207: 127492: 0003: net
               7:   131072..  262143: 1320550400..1320681471: 131072: 0003: net
               8:   262144..  393215: 1320681472..1320812543: 131072: 0003: net
               9:   393216..  524287: 1320812544..1320943615: 131072: 0003: net
              10:   524288..  655359: 1320943616..1321074687: 131072: 0003: net
              11:   655360..  786431: 1321074688..1321205759: 131072: 0003: net
              12:   786432..  917503: 1321205760..1321336831: 131072: 0003: net
              13:   917504.. 1048575: 1321336832..1321467903: 131072: 0003: net
              14:  1048576.. 1179647: 1321467904..1321598975: 131072: 0003: net
              15:  1179648.. 1310719: 1321598976..1321730047: 131072: 0003: net
              16:  1310720.. 1441791: 1321730048..1321861119: 131072: 0003: net
              17:  1441792.. 1572863: 1321861120..1321992191: 131072: 0003: net
              18:  1572864.. 1703935: 1321992192..1322123263: 131072: 0003: net
              19:  1703936.. 1835007: 1322123264..1322254335: 131072: 0003: net
              20:  1835008.. 1966079: 1322254336..1322385407: 131072: 0003: net
              21:  1966080.. 2097151: 1322385408..1322516479: 131072: 0003: net
              22:  2097152.. 2228223: 1322516480..1322647551: 131072: 0003: net
              23:  2228224.. 2359295: 1322647552..1322778623: 131072: 0003: net
              24:  2359296.. 2476319: 1322778624..1322895647: 117024: 0003: last,net,eof
            /myth/tv/2063_20210509184200.ts: 8 extents found
            
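            For reference, the small leading extents in the listing above (lengths are in the 1024-byte blocks filefrag reports) cover only about 3.5 MB before the allocator switches to large extents; a quick check:

            #include <stdio.h>

            int main(void)
            {
                    /* lengths of extents 0-5 from the filefrag output, in 1 KiB blocks */
                    unsigned int small[] = { 352, 356, 1900, 248, 60, 664 };
                    unsigned int sum = 0;

                    for (int i = 0; i < 6; i++)
                            sum += small[i];

                    /* ~3580 KiB (about 3.5 MiB) is fragmented; from extent 6 onward the
                     * allocations are ~128 MiB each (131072 blocks) */
                    printf("first 6 extents cover %u KiB (%.1f MiB)\n", sum, sum / 1024.0);
                    return 0;
            }
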
            adilger Andreas Dilger added a comment

            Ignoring the MDT value (4096), the smallest OST value is 247905, which means roughly 4-5 extents per 1MB write; that is probably OK considering your space usage is as high as 98%...

            wshilong Wang Shilong (Inactive) added a comment

            Shilong, I updated my server to include your patch but am no longer seeing the journal warnings since upgrading and remounting the OSTs. I do see some uneven stats, but this may still be an issue with the application:

            # lctl get_param osd-ldiskfs.myth-*.extent_bytes_allocation
            osd-ldiskfs.myth-MDT0000.extent_bytes_allocation=4096
            osd-ldiskfs.myth-OST0000.extent_bytes_allocation=921170
            osd-ldiskfs.myth-OST0001.extent_bytes_allocation=757631
            osd-ldiskfs.myth-OST0002.extent_bytes_allocation=1036399
            osd-ldiskfs.myth-OST0003.extent_bytes_allocation=247905
            osd-ldiskfs.myth-OST0004.extent_bytes_allocation=1048250
            
            adilger Andreas Dilger added a comment
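
            Dividing a 1 MiB write by each OST's extent_bytes_allocation value gives the expected number of extents per write; a quick check using the values above:

            #include <stdio.h>

            int main(void)
            {
                    /* extent_bytes_allocation values from the lctl output above */
                    const char *ost[]     = { "OST0000", "OST0001", "OST0002", "OST0003", "OST0004" };
                    unsigned long bytes[] = {  921170,    757631,    1036399,   247905,    1048250 };

                    for (int i = 0; i < 5; i++)
                            printf("myth-%s: ~%.1f extents per 1 MiB write\n",
                                   ost[i], 1048576.0 / bytes[i]);
                    return 0;
            }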

            James, I don't think these messages have anything to do with EC; they just relate to large RPCs, "smaller" journals, and partly-fragmented free space.

            adilger Andreas Dilger added a comment

            This is the problem I'm seeing with the EC code.

            simmonsja James A Simmons added a comment

            adilger, is it possible for you to apply the latest patch to see how much it helps?

            wshilong Wang Shilong (Inactive) added a comment

            The fragmentation of the allocations can be seen fairly clearly in the osd-ldiskfs.myth-OST0001.brw_stats output:

            # dd if=/dev/zero of=/myth/tmp/ost1/f1 bs=4M count=100
            # lctl get_param osd-ldiskfs.myth-OST0001.brw_stats | less
            osd-ldiskfs.myth-OST0001.brw_stats=
                                       read      |     write
            pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
            1K:                      0   0   0   |  100  100 100
            
                                       read      |     write
            discontiguous pages    rpcs  % cum % |  rpcs        % cum %
            0:                       0   0   0   |   93  93  93
            1:                       0   0   0   |    7   7 100
            
                                       read      |     write
            discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
            0:                       0   0   0   |   90  90  90
            1:                       0   0   0   |   10   10 100
            
                                       read      |     write
            disk fragmented I/Os   ios   % cum % |  ios         % cum %
            1:                       0   0   0   |    0   0   0
            2:                       0   0   0   |   90  90  90
            3:                       0   0   0   |   10   10 100
            

            So with 100 RPCs of 4MB each, 10% of them had to allocate blocks in two separate chunks, and this generated a console warning:

            osd_handler.c:1938:osd_trans_start()) myth-OST0001: credits 1235 > trans_max 1024
            osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/15/0
            osd_handler.c:1884:osd_trans_dump_creds())   write: 2/1076/0, punch: 0/0/0, quota 5/149/0
            

            I will try to run a test with the patch tomorrow.

            adilger Andreas Dilger added a comment

            Yes, mballoc will definitely have some problems with allocations on this filesystem, but I hope that patch https://review.whamcloud.com/43232 "LU-14438 ldiskfs: improvements to mballoc" can help with this in the future. As the e2freefrag output shows, there are still many extents of 4MB or larger that could be used in this case.

            If the transaction can be restarted in case of credit shortage, then it makes sense to be more optimistic in the credit reservations. Even in a case like this where the filesystem is nearly full, it is extremely unlikely that the worst case would ever be hit.

            adilger Andreas Dilger added a comment
            wshilong Wang Shilong (Inactive) added a comment - edited

            I think that as space usage grows (e.g. to 98%), the mballoc code will not try as hard to scan block groups for the best-aligned free extent. So I am wondering whether, in this case, the system decays the extent bytes to a small value (e.g. tens of KB)? It would be nice if you could apply the above patch to watch how extent_bytes_allocation changes while the system is running.

            At the same time, I am wondering how we could fix the problem, because even where we did not hit this problem before, that does not mean the old code was correct; it simply did not reserve enough credits, because we always calculated the number of extents as the number of write fragments.

            And if we consider the worst case in the new code:

            /*
             * each extent can go into new leaf causing a split
             * 5 is max tree depth: inode + 4 index blocks
             * with blockmaps, depth is 3 at most
             */
            if (LDISKFS_I(inode)->i_flags & LDISKFS_EXTENTS_FL) {
                    /*
                     * many concurrent threads may grow tree by the time
                     * our transaction starts. so, consider 2 is a min depth
                     */
                    depth = ext_depth(inode);
                    depth = max(depth, 1) + 1;
                    newblocks += depth;
                    credits += depth * 2 * extents;

            For your 4MB RPC, extents could be 1024 at the worst; considering a depth of 4 (which becomes 5 after the max(depth, 1) + 1 adjustment), the credits here would be 5 * 2 * 1024 = 10240 blocks...
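
            A quick worked example of that arithmetic (a standalone sketch that only re-implements the formula quoted above; the 100-extent cap matches the proposal below):

            #include <stdio.h>

            /* extent-tree credits per the quoted formula:
             * depth = max(ext_depth, 1) + 1; credits += depth * 2 * extents */
            static unsigned int extent_credits(unsigned int ext_depth, unsigned int extents)
            {
                    unsigned int depth = (ext_depth > 1 ? ext_depth : 1) + 1;

                    return depth * 2 * extents;
            }

            int main(void)
            {
                    /* a 4 MiB RPC is 1024 x 4 KiB blocks, so worst case 1024 extents */
                    printf("worst case, 1024 extents: %u credits\n",
                           extent_credits(4, 1024));  /* 5 * 2 * 1024 = 10240 */
                    /* with the min(fragments, 100) cap proposed below */
                    printf("capped at 100 extents:    %u credits\n",
                           extent_credits(4, 100));   /* 5 * 2 * 100 = 1000 */
                    return 0;
            }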

            I am wondering whether we should change the code to be more optimistic, e.g. cap extents at min(number of fragments, 100) by default.

            We could restart the transaction if we are really short of credits with that estimate?

            What do you think?
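
            To illustrate the reserve-optimistically-and-restart idea, here is a minimal userspace sketch (the credit formula, the 100-extent first-attempt cap, and all names are illustrative assumptions, not the actual osd code):

            #include <stdio.h>

            /* pretend per-attempt credit estimate: optimistic first, worst case after */
            static unsigned int estimate_credits(int attempt, unsigned int fragments)
            {
                    unsigned int extents = (attempt == 0 && fragments > 100) ? 100 : fragments;

                    return 149 + 5 * 2 * extents;  /* fixed overhead + extent-tree term, illustrative */
            }

            int main(void)
            {
                    unsigned int fragments = 300;  /* an unusually fragmented write */
                    unsigned int actually_needed = 149 + 5 * 2 * fragments;

                    for (int attempt = 0; ; attempt++) {
                            unsigned int reserved = estimate_credits(attempt, fragments);

                            if (reserved >= actually_needed) {
                                    printf("attempt %d: reserved %u credits, enough\n",
                                           attempt, reserved);
                                    break;
                            }
                            /* short of credits: stop, restart the transaction and
                             * re-reserve with the less optimistic estimate */
                            printf("attempt %d: reserved %u < needed %u, restarting\n",
                                   attempt, reserved, actually_needed);
                    }
                    return 0;
            }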


            Sorry, I was copying the message on my phone and didn't see that...

            The usage of the OSTs is quite high:

            lfs df
            UUID                   1K-blocks        Used   Available Use% Mounted on
            myth-MDT0000_UUID       13523964     7974808     4762248  63% /myth[MDT:0] 
            myth-OST0000_UUID     3861381132  3662435612   121198440  97% /myth[OST:0]
            myth-OST0001_UUID     3861381132  3684771832    98804312  98% /myth[OST:1]
            myth-OST0002_UUID     5795208676  4759666520   918714980  84% /myth[OST:2]
            myth-OST0003_UUID     5795078644  5555443412   122789132  98% /myth[OST:3]
            myth-OST0004_UUID     5794464848  5434846744    67540536  99% /myth[OST:4]
            

            but there are still reasonable-sized extents available for 4MB writes:

            e2freefrag /dev/vgmyth/lvmythost1 
            Device: /dev/vgmyth/lvmythost1
            Blocksize: 4096 bytes
            /dev/vgmyth/lvmythost1: Operation not supported while calling fsmap
            Total blocks: 973078528
            Free blocks: 51836471 (5.3%)
            
            Min. free extent: 4 KB 
            Max. free extent: 116528 KB
            Avg. free extent: 5348 KB
            Num. free extent: 38696
            
            HISTOGRAM OF FREE EXTENT SIZES:
            Extent Size Range :  Free extents   Free Blocks  Percent
                4K...    8K-  :            77            77    0.00%
                8K...   16K-  :           118           275    0.00%
               16K...   32K-  :           204          1049    0.00%
               32K...   64K-  :          5694         59774    0.12%
               64K...  128K-  :           610         14045    0.03%
              128K...  256K-  :          1048         48236    0.09%
              256K...  512K-  :          1881        175273    0.34%
              512K... 1024K-  :          3585        688220    1.33%
                1M...    2M-  :          5749       1978160    3.82%
                2M...    4M-  :          7553       5375458   10.37%
                4M...    8M-  :          6924       9917254   19.13%
                8M...   16M-  :          1621       4763813    9.19%
               16M...   32M-  :          2530      14947189   28.84%
               32M...   64M-  :           905       9954228   19.20%
               64M...  128M-  :           197       3830417    7.39%
            
            adilger Andreas Dilger added a comment

            People

              Assignee: wshilong Wang Shilong (Inactive)
              Reporter: wshilong Wang Shilong (Inactive)
              Votes: 0
              Watchers: 5
