Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.15.0

    Description

      Seeing debug messages on the console during large RPC writes:

      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1947:osd_trans_start()) work-OST0004: credits 24347 > trans_max 12800
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1876:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1883:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/15/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1893:osd_trans_dump_creds())   write: 2/24182/0, punch: 0/0/0, quota 5/149/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1900:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
      Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1907:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0
      

      These are likely caused by the patches from LU-14134 landing, which changed the way transaction sizes are calculated (in theory reducing the transaction size, but not always).

      We might need an interface to check how the per-extent bytes estimate changes as the filesystem fills up, and also a debug message showing how many credits are calculated in osd_declare_write_commit().
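
      The credit blow-up in the log above comes from the declare phase estimating how many separate extents (and hence how many extent-tree, bitmap, and group-descriptor blocks) a large write may touch. The following is only a hypothetical sketch of that relationship, not the actual osd_declare_write_commit() code: the function name declare_write_credits() and the CREDITS_PER_EXTENT value are made up for illustration.

          #include <stdio.h>

          #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

          /* illustrative value only: journal credits charged per newly allocated extent */
          #define CREDITS_PER_EXTENT  6UL

          static unsigned long declare_write_credits(unsigned long write_bytes,
                                                     unsigned long extent_bytes)
          {
                  /* the fewer bytes expected per extent, the more extents (and credits) */
                  unsigned long extents = DIV_ROUND_UP(write_bytes, extent_bytes);

                  return extents * CREDITS_PER_EXTENT;
          }

          int main(void)
          {
                  /* a 16 MiB bulk write against healthy vs. fragmented free space */
                  printf("1 MiB extents:  %lu credits\n",
                         declare_write_credits(16UL << 20, 1UL << 20));
                  printf("64 KiB extents: %lu credits\n",
                         declare_write_credits(16UL << 20, 64UL << 10));
                  return 0;
          }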

    Activity

            [LU-14641] per extents bytes allocation stats

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45505
            Subject: LU-14641 osd-ldiskfs: write commit declaring improvement
            Project: fs/lustre-release
            Branch: b2_14
            Current Patch Set: 1
            Commit: 6db11a4e6e3214ac387433252986cd1850609260

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43446/
            Subject: LU-14641 osd-ldiskfs: write commit declaring improvement
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0f81c5ae973bf7fba45b6ba7f9c5f4fb1f6eadcb

            wshilong Wang Shilong (Inactive) added a comment -

            The current calculation is:

                *raw_cpu_ptr(osd->od_extent_bytes_percpu) =
                        DIV_ROUND_UP(old_bytes * (EXTENT_BYTES_DECAY - 1) +
                                     min(new_bytes, OSD_DEFAULT_EXTENT_BYTES),
                                     EXTENT_BYTES_DECAY);

            Yes, the per-extent bytes estimate is capped so it never exceeds 1 MB. This kind of calculation is more sensitive to fragmentation once it has occurred; if we changed it to feed new_bytes in directly, extent_bytes might increase much more quickly once we get one good large extent allocation.

            I'm not sure which one is better; maybe increasing OSD_DEFAULT_EXTENT_BYTES to 4 MB would be enough?
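
            To illustrate the behavioural difference described above, here is a small standalone sketch (not Lustre code) of the decaying average with and without the clamp. EXTENT_BYTES_DECAY = 64 is an assumed value chosen for illustration; the 1 MiB cap stands in for the OSD_DEFAULT_EXTENT_BYTES discussed in this ticket:

                #include <stdio.h>

                #define EXTENT_BYTES_DECAY        64            /* assumed value, for illustration only */
                #define OSD_DEFAULT_EXTENT_BYTES  (1UL << 20)   /* the 1 MiB cap discussed above */
                #define DIV_ROUND_UP(n, d)        (((n) + (d) - 1) / (d))
                #define MIN(a, b)                 ((a) < (b) ? (a) : (b))

                /* current behaviour: the new sample is clamped to OSD_DEFAULT_EXTENT_BYTES */
                static unsigned long decay_clamped(unsigned long old_bytes, unsigned long new_bytes)
                {
                        return DIV_ROUND_UP(old_bytes * (EXTENT_BYTES_DECAY - 1) +
                                            MIN(new_bytes, OSD_DEFAULT_EXTENT_BYTES),
                                            EXTENT_BYTES_DECAY);
                }

                /* proposed variant: feed new_bytes in without the clamp */
                static unsigned long decay_raw(unsigned long old_bytes, unsigned long new_bytes)
                {
                        return DIV_ROUND_UP(old_bytes * (EXTENT_BYTES_DECAY - 1) + new_bytes,
                                            EXTENT_BYTES_DECAY);
                }

                int main(void)
                {
                        unsigned long clamped = 64UL << 10;     /* start from a fragmented 64 KiB estimate */
                        unsigned long raw = 64UL << 10;
                        unsigned long sample = 8UL << 20;       /* then observe a run of good 8 MiB allocations */
                        int i;

                        for (i = 1; i <= 8; i++) {
                                clamped = decay_clamped(clamped, sample);
                                raw = decay_raw(raw, sample);
                                printf("sample %d: clamped=%lu raw=%lu\n", i, clamped, raw);
                        }
                        return 0;
                }

            With the clamp the estimate can only creep toward 1 MiB, moving by at most (1 MiB)/EXTENT_BYTES_DECAY per sample, while the unclamped variant can move by up to (8 MiB)/EXTENT_BYTES_DECAY per sample after a single large allocation, which is the "increase more quickly" effect mentioned above.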

            adilger Andreas Dilger added a comment -

            One thing I realized while running this on my home system is that osd_ldiskfs_map_write() is submitting the IO because of BIO size limits (1MB in my case), and not because the allocation is fragmented. That means that osd_extent_bytes() is always <= 1MB even when the extents are very large (128MB contiguous allocations in my case with a single-threaded writer, 8MB with multiple writers). It may be that we need to change how the extent stats are calculated?
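
            As a trivial sketch of that effect (assumed numbers, not the osd_ldiskfs_map_write() logic): when a 128 MiB contiguous allocation is written out through 1 MiB BIOs, the allocation stats only ever see 1 MiB samples, so the decaying average cannot rise above the BIO limit no matter how contiguous the allocation really is.

                #include <stdio.h>

                #define MIN(a, b)  ((a) < (b) ? (a) : (b))

                int main(void)
                {
                        unsigned long bio_max = 1UL << 20;      /* 1 MiB BIO size limit */
                        unsigned long extent = 128UL << 20;     /* one contiguous 128 MiB allocation */
                        unsigned long largest_sample = 0;
                        unsigned long submissions = 0;

                        /* the write is carved into BIO-sized submissions, and each
                         * submission is what the allocation stats get to see */
                        while (extent > 0) {
                                unsigned long chunk = MIN(extent, bio_max);

                                if (chunk > largest_sample)
                                        largest_sample = chunk;
                                submissions++;
                                extent -= chunk;
                        }
                        printf("%lu submissions, largest sample %lu bytes\n",
                               submissions, largest_sample);
                        return 0;
                }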

            adilger Andreas Dilger added a comment -

            I'm wondering if I should try installing https://review.whamcloud.com/43232 "LU-14438 ldiskfs: improvements to mballoc" to see whether that improves the initial allocations, although this is already doing pretty well considering how little free space is available.

            adilger Andreas Dilger added a comment -

            Even though that OST is almost 98% full, the free space is relatively good (88% of free space is in chunks 4MB or larger):

            e2freefrag /dev/vgmyth/lvmythost3 
            Device: /dev/vgmyth/lvmythost3
            Blocksize: 4096 bytes
            /dev/vgmyth/lvmythost3: Operation not supported while calling fsmap
            Total blocks: 1460371456
            Free blocks: 91182550 (6.2%)
            
            Min. free extent: 4 KB 
            Max. free extent: 3869388 KB
            Avg. free extent: 5676 KB
            Num. free extent: 63654
            
            HISTOGRAM OF FREE EXTENT SIZES:
            Extent Size Range :  Free extents   Free Blocks  Percent
                4K...    8K-  :            80            80    0.00%
                8K...   16K-  :           103           247    0.00%
               16K...   32K-  :           180           949    0.00%
               32K...   64K-  :           368          4098    0.00%
               64K...  128K-  :           667         15328    0.02%
              128K...  256K-  :          1177         54808    0.06%
              256K...  512K-  :          1850        175030    0.19%
              512K... 1024K-  :          3523        686778    0.75%
                1M...    2M-  :          7456       2293985    2.52%
                2M...    4M-  :          9451       6161759    6.76%
                4M...    8M-  :         19705      28308110   31.05%
                8M...   16M-  :         19017      43864690   48.11%
               16M...   32M-  :             5         26840    0.03%
               32M...   64M-  :            12        147600    0.16%
               64M...  128M-  :             9        229679    0.25%
              128M...  256M-  :            11        544954    0.60%
              256M...  512M-  :            23       2171087    2.38%
              512M... 1024M-  :             9       1600006    1.75%
                1G...    2G-  :             6       2256410    2.47%
                2G...    4G-  :             2       1799598    1.97%
            

            However, that makes me wonder whether the allocator is actually doing a bad job for large writes and using up small chunks of space? Checking some recently-written files, I see that the initial allocations are relatively small, but once the file is over 4 MB in size it uses good allocations, so that could be related to the way the file is being written by the client (slowly over an hour):

            filefrag -v /myth/tv/2063_20210509184200.ts
            Filesystem type is: bd00bd0
            File size of /myth/tv/2063_20210509184200.ts is 2535750768 (2476320 blocks of 1024 bytes)
             ext:     device_logical:        physical_offset: length:  dev: flags:
               0:        0..     351:  980128408.. 980128759:    352: 0003: net
               1:      352..     707:  980129392.. 980129747:    356: 0003: net
               2:      708..    2607:  980148224.. 980150123:   1900: 0003: net
               3:     2608..    2855:  980138756.. 980139003:    248: 0003: net
               4:     2856..    2915:  980129748.. 980129807:     60: 0003: net
               5:     2916..    3579:  980139004.. 980139667:    664: 0003: net
               6:     3580..  131071: 1320420716..1320548207: 127492: 0003: net
               7:   131072..  262143: 1320550400..1320681471: 131072: 0003: net
               8:   262144..  393215: 1320681472..1320812543: 131072: 0003: net
               9:   393216..  524287: 1320812544..1320943615: 131072: 0003: net
              10:   524288..  655359: 1320943616..1321074687: 131072: 0003: net
              11:   655360..  786431: 1321074688..1321205759: 131072: 0003: net
              12:   786432..  917503: 1321205760..1321336831: 131072: 0003: net
              13:   917504.. 1048575: 1321336832..1321467903: 131072: 0003: net
              14:  1048576.. 1179647: 1321467904..1321598975: 131072: 0003: net
              15:  1179648.. 1310719: 1321598976..1321730047: 131072: 0003: net
              16:  1310720.. 1441791: 1321730048..1321861119: 131072: 0003: net
              17:  1441792.. 1572863: 1321861120..1321992191: 131072: 0003: net
              18:  1572864.. 1703935: 1321992192..1322123263: 131072: 0003: net
              19:  1703936.. 1835007: 1322123264..1322254335: 131072: 0003: net
              20:  1835008.. 1966079: 1322254336..1322385407: 131072: 0003: net
              21:  1966080.. 2097151: 1322385408..1322516479: 131072: 0003: net
              22:  2097152.. 2228223: 1322516480..1322647551: 131072: 0003: net
              23:  2228224.. 2359295: 1322647552..1322778623: 131072: 0003: net
              24:  2359296.. 2476319: 1322778624..1322895647: 117024: 0003: last,net,eof
            /myth/tv/2063_20210509184200.ts: 8 extents found
            
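
            For reference, summing the >= 4MB buckets of the histogram above gives 31.05 + 48.11 + 0.03 + 0.16 + 0.25 + 0.60 + 2.38 + 1.75 + 2.47 + 1.97 ≈ 88.8%, which is where the "88% of free space is in chunks 4MB or larger" figure comes from.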

            wshilong Wang Shilong (Inactive) added a comment -

            Ignoring the MDT value of 4096, the smallest OST stat is 247905, which means about 5 extents for a 1M write; that is probably OK considering your space usage is as high as 98%...
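
            (For reference: 1048576 / 247905 ≈ 4.2, so at that estimate a 1MB write would be declared as roughly 4-5 extents.)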

            adilger Andreas Dilger added a comment -

            Shilong, I updated my server to include your patch, and I am no longer seeing the journal warnings since upgrading and remounting the OSTs. I do see some uneven stats, but this may still be an issue with the application:

            # lctl get_param osd-ldiskfs.myth-*.extent_bytes_allocation
            osd-ldiskfs.myth-MDT0000.extent_bytes_allocation=4096
            osd-ldiskfs.myth-OST0000.extent_bytes_allocation=921170
            osd-ldiskfs.myth-OST0001.extent_bytes_allocation=757631
            osd-ldiskfs.myth-OST0002.extent_bytes_allocation=1036399
            osd-ldiskfs.myth-OST0003.extent_bytes_allocation=247905
            osd-ldiskfs.myth-OST0004.extent_bytes_allocation=1048250

            adilger Andreas Dilger added a comment -

            James, I don't think these messages have anything to do with EC; they just relate to large RPCs, "smaller" journals, and partly-fragmented free space.

            simmonsja James A Simmons added a comment -

            This is the problem I'm seeing with the EC code.

            wshilong Wang Shilong (Inactive) added a comment -

            adilger Would it be possible for you to apply the latest patch to see how it helps?

            People

              Assignee: wshilong Wang Shilong (Inactive)
              Reporter: wshilong Wang Shilong (Inactive)
              Votes: 0
              Watchers: 5
