[LU-14641] per extents bytes allocation stats Created: 26/Apr/21  Updated: 14/Nov/21  Resolved: 12/May/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Improvement Priority: Minor
Reporter: Wang Shilong (Inactive) Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14134 reduce credits for new writing potent... Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Seeing debug messages on the console during large RPC writes:

Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1947:osd_trans_start()) work-OST0004: credits 24347 > trans_max 12800
Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1876:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1883:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/15/0
Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1893:osd_trans_dump_creds())   write: 2/24182/0, punch: 0/0/0, quota 5/149/0
Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1900:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
Apr  3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1907:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0

These are likely caused by the patches from LU-14134 landing, which changed the way transaction sizes are calculated (in theory reducing transaction size, but not always).

We might need an interface to check how the per-extent bytes estimate changes as the system fills up, and also a debug message showing how many credits are calculated in osd_declare_write_commit().
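
A minimal sketch of what such an interface might report, assuming the per-CPU samples (like the od_extent_bytes_percpu field quoted later in this ticket) are averaged for display; the seq_file wiring and names here are illustrative, not copied from the actual patch:

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/seq_file.h>

/* Hypothetical: average the per-CPU "bytes per allocated extent"
 * samples and print a single value for lctl get_param to read. */
static int extent_bytes_seq_show(struct seq_file *m, void *data)
{
        struct osd_device *osd = m->private;
        unsigned long long total = 0;
        int cpu;

        for_each_online_cpu(cpu)
                total += *per_cpu_ptr(osd->od_extent_bytes_percpu, cpu);

        seq_printf(m, "%llu\n", total / num_online_cpus());
        return 0;
}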



 Comments   
Comment by Gerrit Updater [ 26/Apr/21 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43446
Subject: LU-14641 osd-ldiskfs: extents bytes allocation stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ccad628ef9240ac52a8f81d4ab0c873ee8958b0d

Comment by Andreas Dilger [ 26/Apr/21 ]

I'm hitting this error pretty regularly on my home server with 2.14.0 (32MB journal, 4MB RPC size), so if there is some data to be collected please let me know.

osd_handler.c:1938:osd_trans_start()) myth-OST0001: credits 1327 > trans_max 1024
osd_handler.c:1867:osd_trans_dump_creds())   create: 0/0/0, destroy: 0/0/0
osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/15/0
osd_handler.c:1884:osd_trans_dump_creds())   write: 2/3116/0, punch: 0/0/0, quota 5/149/0
osd_handler.c:1891:osd_trans_dump_creds())   insert: 0/0/0, delete: 0/0/0
osd_handler.c:1898:osd_trans_dump_creds())   ref_add: 0/0/0, ref_del: 0/0/0

It would be useful if the initial osd_trans_start() message included the filesystem label, so that it could be seen whether this problem is specific to a single OST or not (e.g. one that is fragmented).

Comment by Wang Shilong (Inactive) [ 27/Apr/21 ]

adilger, the message already indicates this is "myth-OST0001". What is the space usage of this OST?

Comment by Andreas Dilger [ 27/Apr/21 ]

Sorry, I was copying the message on my phone and didn't see that...

The usage of the OSTs is quite high:

lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
myth-MDT0000_UUID       13523964     7974808     4762248  63% /myth[MDT:0] 
myth-OST0000_UUID     3861381132  3662435612   121198440  97% /myth[OST:0]
myth-OST0001_UUID     3861381132  3684771832    98804312  98% /myth[OST:1]
myth-OST0002_UUID     5795208676  4759666520   918714980  84% /myth[OST:2]
myth-OST0003_UUID     5795078644  5555443412   122789132  98% /myth[OST:3]
myth-OST0004_UUID     5794464848  5434846744    67540536  99% /myth[OST:4]

but there are still reasonable-sized extents available for 4MB writes:

e2freefrag /dev/vgmyth/lvmythost1 
Device: /dev/vgmyth/lvmythost1
Blocksize: 4096 bytes
/dev/vgmyth/lvmythost1: Operation not supported while calling fsmap
Total blocks: 973078528
Free blocks: 51836471 (5.3%)

Min. free extent: 4 KB 
Max. free extent: 116528 KB
Avg. free extent: 5348 KB
Num. free extent: 38696

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :            77            77    0.00%
    8K...   16K-  :           118           275    0.00%
   16K...   32K-  :           204          1049    0.00%
   32K...   64K-  :          5694         59774    0.12%
   64K...  128K-  :           610         14045    0.03%
  128K...  256K-  :          1048         48236    0.09%
  256K...  512K-  :          1881        175273    0.34%
  512K... 1024K-  :          3585        688220    1.33%
    1M...    2M-  :          5749       1978160    3.82%
    2M...    4M-  :          7553       5375458   10.37%
    4M...    8M-  :          6924       9917254   19.13%
    8M...   16M-  :          1621       4763813    9.19%
   16M...   32M-  :          2530      14947189   28.84%
   32M...   64M-  :           905       9954228   19.20%
   64M...  128M-  :           197       3830417    7.39%
Comment by Wang Shilong (Inactive) [ 27/Apr/21 ]

I think with space usage growing, e.g. to 98%, the mballoc code won't try its best to scan block groups for the best-aligned free extent. So I am wondering whether, in this case, the system decays the extent bytes estimate to a small value (e.g. tens of KB)? It would be nice if you could apply the above patch to watch how extent_bytes_allocation changes while the system runs.

At the same time, I am wondering how we could fix the problem. Even when we did not hit this warning, that doesn't mean the old code was correct; it simply did not reserve enough credits, because it always calculated extents as the number of write fragments.

And if we consider the worst case in the new code:

        /*
         * each extent can go into new leaf causing a split
         * 5 is max tree depth: inode + 4 index blocks
         * with blockmaps, depth is 3 at most
         */
        if (LDISKFS_I(inode)->i_flags & LDISKFS_EXTENTS_FL) {
                /*
                 * many concurrent threads may grow tree by the time
                 * our transaction starts. so, consider 2 is a min depth
                 */
                depth = ext_depth(inode);
                depth = max(depth, 1) + 1;
                newblocks += depth;
                credits += depth * 2 * extents;
        }

For your 4MB RPC, extents could be 1024 at the worst. Considering an on-disk depth of 4 (so depth = 5 after the max()+1 above), the credits here would be 5 * 2 * 1024 = 10240 blocks...
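
As a quick check on that arithmetic, a standalone sketch (hypothetical, only reproducing the extent-tree term quoted above; the real osd_declare_write_commit() calculation has more terms):

#include <stdio.h>

int main(void)
{
        int rpc_bytes = 4 * 1024 * 1024;     /* 4MB RPC */
        int extents   = rpc_bytes / 4096;    /* 1024: every 4KB block its own fragment */
        int depth     = 4 + 1;               /* on-disk depth 4, +1 for concurrent tree growth */
        int credits   = depth * 2 * extents;

        printf("worst-case extent-tree credits: %d\n", credits);  /* prints 10240 */
        return 0;
}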

I am wondering whether we should change the code to be more optimistic, e.g. cap extents at min(number of fragments, 100) by default.

We could then restart the transaction if we are really short of credits with that number of extents.

What do you think?

Comment by Andreas Dilger [ 27/Apr/21 ]

Yes, mballoc will definitely have some problems with allocations on this filesystem, but I hope that patch https://review.whamcloud.com/43232 "LU-14438 ldiskfs: improvements to mballoc" can help with this in the future. As the e2freefrag output shows, there are still many free extents of 4MB or larger that could be used in this case.

If the transaction can be restarted in case of credit shortage, then it makes sense to be more optimistic in the credit reservations. Even in a case like this where the filesystem is nearly full, it is extremely unlikely that the worst case would ever be hit.
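
A minimal sketch of that restart-on-shortage control flow, under stated assumptions: credits_for(), run_write(), and OPTIMISTIC_EXTENT_CAP are hypothetical stand-ins, not Lustre or jbd2 APIs, and only the extent-tree credit term from earlier in this ticket is modeled.

#include <stdio.h>

#define OPTIMISTIC_EXTENT_CAP 100

static int credits_for(int extents, int depth)
{
        return depth * 2 * extents;
}

static void run_write(int fragments, int depth, int needed)
{
        int extents = fragments < OPTIMISTIC_EXTENT_CAP ?
                      fragments : OPTIMISTIC_EXTENT_CAP;
        int reserved = credits_for(extents, depth);

        if (needed > reserved) {
                /* short of credits: restart with the worst-case reservation */
                reserved = credits_for(fragments, depth);
                printf("restarted with %d credits\n", reserved);
        }
        printf("completed using %d of %d credits\n", needed, reserved);
}

int main(void)
{
        run_write(1024, 5, 900);     /* common case: optimistic reservation fits */
        run_write(1024, 5, 4000);    /* heavily fragmented: restart needed */
        return 0;
}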

Comment by Andreas Dilger [ 27/Apr/21 ]

The fragmentation of the allocations can be seen fairly clearly in the osd-ldiskfs.myth-OST0001.brw_stats output:

# dd if=/dev/zero of=/myth/tmp/ost1/f1 bs=4M count=100
# lctl get_param osd-ldiskfs.myth-OST0001.brw_stats | less
osd-ldiskfs.myth-OST0001.brw_stats=
                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1K:                      0   0   0   |  100  100 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   |   93  93  93
1:                       0   0   0   |    7   7 100

                           read      |     write
discontiguous blocks   rpcs  % cum % |  rpcs        % cum %
0:                       0   0   0   |   90  90  90
1:                       0   0   0   |   10   10 100

                           read      |     write
disk fragmented I/Os   ios   % cum % |  ios         % cum %
1:                       0   0   0   |    0   0   0
2:                       0   0   0   |   90  90  90
3:                       0   0   0   |   10   10 100

So with 100x 4MB RPCs, 10% of them had to allocate blocks in 2 separate chunks, and this generated a console warning:

osd_handler.c:1938:osd_trans_start()) myth-OST0001: credits 1235 > trans_max 1024
osd_handler.c:1874:osd_trans_dump_creds())   attr_set: 1/1/0, xattr_set: 2/15/0
osd_handler.c:1884:osd_trans_dump_creds())   write: 2/1076/0, punch: 0/0/0, quota 5/149/0

I will try to run a test with the patch tomorrow.

Comment by Wang Shilong (Inactive) [ 28/Apr/21 ]

adilger, would you be able to apply the latest patch to see how it helps?

Comment by James A Simmons [ 04/May/21 ]

This is the problem I'm seeing with the EC code.

Comment by Andreas Dilger [ 04/May/21 ]

James, I don't think these messages have anything to do with EC; they just relate to large RPCs, "smaller" journals, and partly-fragmented free space.

Comment by Andreas Dilger [ 10/May/21 ]

Shilong, I updated my server to include your patch, and since upgrading and remounting the OSTs I am no longer seeing the journal warnings. I do see some uneven stats, but this may still be an issue with the application:

# lctl get_param osd-ldiskfs.myth-*.extent_bytes_allocation
osd-ldiskfs.myth-MDT0000.extent_bytes_allocation=4096
osd-ldiskfs.myth-OST0000.extent_bytes_allocation=921170
osd-ldiskfs.myth-OST0001.extent_bytes_allocation=757631
osd-ldiskfs.myth-OST0002.extent_bytes_allocation=1036399
osd-ldiskfs.myth-OST0003.extent_bytes_allocation=247905
osd-ldiskfs.myth-OST0004.extent_bytes_allocation=1048250
Comment by Wang Shilong (Inactive) [ 10/May/21 ]

Ignoring the MDT's 4096, the smallest OST value is 247905, which means about 5 extents for a 1MB write (1048576 / 247905 ≈ 4.2). That is probably OK considering your space usage is as high as 98%...

Comment by Andreas Dilger [ 10/May/21 ]

Even though that OST is almost 98% full, the free space is relatively good (88% of free space is in chunks 4MB or larger):

e2freefrag /dev/vgmyth/lvmythost3 
Device: /dev/vgmyth/lvmythost3
Blocksize: 4096 bytes
/dev/vgmyth/lvmythost3: Operation not supported while calling fsmap
Total blocks: 1460371456
Free blocks: 91182550 (6.2%)

Min. free extent: 4 KB 
Max. free extent: 3869388 KB
Avg. free extent: 5676 KB
Num. free extent: 63654

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :            80            80    0.00%
    8K...   16K-  :           103           247    0.00%
   16K...   32K-  :           180           949    0.00%
   32K...   64K-  :           368          4098    0.00%
   64K...  128K-  :           667         15328    0.02%
  128K...  256K-  :          1177         54808    0.06%
  256K...  512K-  :          1850        175030    0.19%
  512K... 1024K-  :          3523        686778    0.75%
    1M...    2M-  :          7456       2293985    2.52%
    2M...    4M-  :          9451       6161759    6.76%
    4M...    8M-  :         19705      28308110   31.05%
    8M...   16M-  :         19017      43864690   48.11%
   16M...   32M-  :             5         26840    0.03%
   32M...   64M-  :            12        147600    0.16%
   64M...  128M-  :             9        229679    0.25%
  128M...  256M-  :            11        544954    0.60%
  256M...  512M-  :            23       2171087    2.38%
  512M... 1024M-  :             9       1600006    1.75%
    1G...    2G-  :             6       2256410    2.47%
    2G...    4G-  :             2       1799598    1.97%

However, that makes me wonder whether the allocator is actually doing a bad job for large writes and using up small chunks of space. Checking some recently-written files, I see that the initial allocations are relatively small, but once the file is over 4MB in size it gets good allocations, so this could be related to the way the file is being written by the client (slowly, over an hour):

filefrag -v /myth/tv/2063_20210509184200.ts
Filesystem type is: bd00bd0
File size of /myth/tv/2063_20210509184200.ts is 2535750768 (2476320 blocks of 1024 bytes)
 ext:     device_logical:        physical_offset: length:  dev: flags:
   0:        0..     351:  980128408.. 980128759:    352: 0003: net
   1:      352..     707:  980129392.. 980129747:    356: 0003: net
   2:      708..    2607:  980148224.. 980150123:   1900: 0003: net
   3:     2608..    2855:  980138756.. 980139003:    248: 0003: net
   4:     2856..    2915:  980129748.. 980129807:     60: 0003: net
   5:     2916..    3579:  980139004.. 980139667:    664: 0003: net
   6:     3580..  131071: 1320420716..1320548207: 127492: 0003: net
   7:   131072..  262143: 1320550400..1320681471: 131072: 0003: net
   8:   262144..  393215: 1320681472..1320812543: 131072: 0003: net
   9:   393216..  524287: 1320812544..1320943615: 131072: 0003: net
  10:   524288..  655359: 1320943616..1321074687: 131072: 0003: net
  11:   655360..  786431: 1321074688..1321205759: 131072: 0003: net
  12:   786432..  917503: 1321205760..1321336831: 131072: 0003: net
  13:   917504.. 1048575: 1321336832..1321467903: 131072: 0003: net
  14:  1048576.. 1179647: 1321467904..1321598975: 131072: 0003: net
  15:  1179648.. 1310719: 1321598976..1321730047: 131072: 0003: net
  16:  1310720.. 1441791: 1321730048..1321861119: 131072: 0003: net
  17:  1441792.. 1572863: 1321861120..1321992191: 131072: 0003: net
  18:  1572864.. 1703935: 1321992192..1322123263: 131072: 0003: net
  19:  1703936.. 1835007: 1322123264..1322254335: 131072: 0003: net
  20:  1835008.. 1966079: 1322254336..1322385407: 131072: 0003: net
  21:  1966080.. 2097151: 1322385408..1322516479: 131072: 0003: net
  22:  2097152.. 2228223: 1322516480..1322647551: 131072: 0003: net
  23:  2228224.. 2359295: 1322647552..1322778623: 131072: 0003: net
  24:  2359296.. 2476319: 1322778624..1322895647: 117024: 0003: last,net,eof
/myth/tv/2063_20210509184200.ts: 8 extents found
Comment by Andreas Dilger [ 10/May/21 ]

I'm wondering if I should try installing https://review.whamcloud.com/43232 "LU-14438 ldiskfs: improvements to mballoc" to see whether it improves the initial allocations, although the allocator is already doing pretty well considering how little free space is available.

Comment by Andreas Dilger [ 11/May/21 ]

One thing I realized while running this on my home system is that osd_ldiskfs_map_write() is submitting the IO because of BIO size limits (1MB in my case), and not because the allocation is fragmented. That means that osd_extent_bytes() is always <= 1MB even when the extents are very large (128MB contiguous allocations in my case with a single-threaded writer, 8MB with multiple writers). It may be that we need to change how the extent stats are calculated?

Comment by Wang Shilong (Inactive) [ 11/May/21 ]

Current calculation is like:

        *raw_cpu_ptr(osd->od_extent_bytes_percpu) =
                DIV_ROUND_UP(old_bytes * (EXTENT_BYTES_DECAY - 1) +
                             min(new_bytes, OSD_DEFAULT_EXTENT_BYTES),
                             EXTENT_BYTES_DECAY);

Yup, the per-extent bytes estimate is capped so it never exceeds 1MB. This kind of calculation is more sensitive to fragmentation once it occurs; if we changed the min() term to just new_bytes, extent_bytes could increase more quickly after one good large extent allocation.

I have no idea which one is better; maybe increasing OSD_DEFAULT_EXTENT_BYTES to 4MB would be enough?
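
For illustration, a userspace sketch of how this decaying average behaves; EXTENT_BYTES_DECAY = 64 here is an assumed value for the example, not taken from the source.

#include <stdio.h>

#define EXTENT_BYTES_DECAY       64          /* assumed decay factor */
#define OSD_DEFAULT_EXTENT_BYTES (1U << 20)  /* assumed 1MB default/cap */

/* DIV_ROUND_UP(old * (D - 1) + min(new, 1MB), D), as quoted above */
static unsigned int decay(unsigned int old_bytes, unsigned int new_bytes)
{
        unsigned int capped = new_bytes < OSD_DEFAULT_EXTENT_BYTES ?
                              new_bytes : OSD_DEFAULT_EXTENT_BYTES;

        return (old_bytes * (EXTENT_BYTES_DECAY - 1) + capped +
                EXTENT_BYTES_DECAY - 1) / EXTENT_BYTES_DECAY;
}

int main(void)
{
        unsigned int bytes = OSD_DEFAULT_EXTENT_BYTES;
        int i;

        bytes = decay(bytes, 64 << 10);      /* one 64KB allocation barely moves it */
        printf("after one 64KB extent: %u\n", bytes);

        for (i = 0; i < 500; i++)            /* a long run decays it toward 64KB */
                bytes = decay(bytes, 64 << 10);
        printf("after 500 more: %u\n", bytes);
        return 0;
}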

 

Comment by Gerrit Updater [ 11/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43446/
Subject: LU-14641 osd-ldiskfs: write commit declaring improvement
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0f81c5ae973bf7fba45b6ba7f9c5f4fb1f6eadcb

Comment by Gerrit Updater [ 09/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45505
Subject: LU-14641 osd-ldiskfs: write commit declaring improvement
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 6db11a4e6e3214ac387433252986cd1850609260

Comment by Gerrit Updater [ 14/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" merged in patch https://review.whamcloud.com/45505/
Subject: LU-14641 osd-ldiskfs: write commit declaring improvement
Project: fs/lustre-release
Branch: b2_14
Current Patch Set:
Commit: 1734c4746d98dfa6fc6559841be8028a22465718
