Details
Type: Improvement
Priority: Minor
Resolution: Fixed
Description
Seeing debug messages on the console during large RPC writes:
Apr 3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1947:osd_trans_start()) work-OST0004: credits 24347 > trans_max 12800
Apr 3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1876:osd_trans_dump_creds()) create: 0/0/0, destroy: 0/0/0
Apr 3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1883:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 2/15/0
Apr 3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1893:osd_trans_dump_creds()) write: 2/24182/0, punch: 0/0/0, quota 5/149/0
Apr 3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1900:osd_trans_dump_creds()) insert: 0/0/0, delete: 0/0/0
Apr 3 01:29:08 foss01 kernel: Lustre: 12536:0:(osd_handler.c:1907:osd_trans_dump_creds()) ref_add: 0/0/0, ref_del: 0/0/0
These are likely caused by patches from LU-14134 landing that changed the way transaction sizes are calculated (in theory reducing transaction size, but not always).
We might need an interface to check how the per-extent bytes estimate changes as the filesystem fills up, and also a debug message to report how many credits are calculated in osd_declare_write_commit().
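As a rough illustration of the second point, something along these lines inside osd_declare_write_commit() would make the estimate visible in the debug log (a sketch only; osd_name(), frags, extents and credits here stand in for whatever the function actually has in scope):

	CDEBUG(D_INODE, "%s: declare write: %d fragments, %d extents, %d credits\n",
	       osd_name(osd), frags, extents, credits);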
Issue Links
- is related to LU-14134 "reduce credits for new writing potentially" (Resolved)
Activity
Even though that OST is almost 98% full, the free space is not badly fragmented (88% of the free space is in chunks of 4MB or larger):
e2freefrag /dev/vgmyth/lvmythost3
Device: /dev/vgmyth/lvmythost3
Blocksize: 4096 bytes
/dev/vgmyth/lvmythost3: Operation not supported while calling fsmap
Total blocks: 1460371456
Free blocks: 91182550 (6.2%)

Min. free extent: 4 KB
Max. free extent: 3869388 KB
Avg. free extent: 5676 KB
Num. free extent: 63654

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :            80            80    0.00%
    8K...   16K-  :           103           247    0.00%
   16K...   32K-  :           180           949    0.00%
   32K...   64K-  :           368          4098    0.00%
   64K...  128K-  :           667         15328    0.02%
  128K...  256K-  :          1177         54808    0.06%
  256K...  512K-  :          1850        175030    0.19%
  512K... 1024K-  :          3523        686778    0.75%
    1M...    2M-  :          7456       2293985    2.52%
    2M...    4M-  :          9451       6161759    6.76%
    4M...    8M-  :         19705      28308110   31.05%
    8M...   16M-  :         19017      43864690   48.11%
   16M...   32M-  :             5         26840    0.03%
   32M...   64M-  :            12        147600    0.16%
   64M...  128M-  :             9        229679    0.25%
  128M...  256M-  :            11        544954    0.60%
  256M...  512M-  :            23       2171087    2.38%
  512M... 1024M-  :             9       1600006    1.75%
    1G...    2G-  :             6       2256410    2.47%
    2G...    4G-  :             2       1799598    1.97%
However, that makes me wonder whether the allocator is actually doing a bad job for large writes and using up small chunks of space? Checking some recently-written files, I see that the initial allocations are relatively small, but once the file is over 4 MB in size it uses good allocations, so that could be related to the way the file is being written by the client (slowly over an hour):
filefrag -v /myth/tv/2063_20210509184200.ts
Filesystem type is: bd00bd0
File size of /myth/tv/2063_20210509184200.ts is 2535750768 (2476320 blocks of 1024 bytes)
 ext:   device_logical:       physical_offset:  length:  dev: flags:
   0:        0..     351:  980128408.. 980128759:    352: 0003: net
   1:      352..     707:  980129392.. 980129747:    356: 0003: net
   2:      708..    2607:  980148224.. 980150123:   1900: 0003: net
   3:     2608..    2855:  980138756.. 980139003:    248: 0003: net
   4:     2856..    2915:  980129748.. 980129807:     60: 0003: net
   5:     2916..    3579:  980139004.. 980139667:    664: 0003: net
   6:     3580..  131071: 1320420716..1320548207: 127492: 0003: net
   7:   131072..  262143: 1320550400..1320681471: 131072: 0003: net
   8:   262144..  393215: 1320681472..1320812543: 131072: 0003: net
   9:   393216..  524287: 1320812544..1320943615: 131072: 0003: net
  10:   524288..  655359: 1320943616..1321074687: 131072: 0003: net
  11:   655360..  786431: 1321074688..1321205759: 131072: 0003: net
  12:   786432..  917503: 1321205760..1321336831: 131072: 0003: net
  13:   917504.. 1048575: 1321336832..1321467903: 131072: 0003: net
  14:  1048576.. 1179647: 1321467904..1321598975: 131072: 0003: net
  15:  1179648.. 1310719: 1321598976..1321730047: 131072: 0003: net
  16:  1310720.. 1441791: 1321730048..1321861119: 131072: 0003: net
  17:  1441792.. 1572863: 1321861120..1321992191: 131072: 0003: net
  18:  1572864.. 1703935: 1321992192..1322123263: 131072: 0003: net
  19:  1703936.. 1835007: 1322123264..1322254335: 131072: 0003: net
  20:  1835008.. 1966079: 1322254336..1322385407: 131072: 0003: net
  21:  1966080.. 2097151: 1322385408..1322516479: 131072: 0003: net
  22:  2097152.. 2228223: 1322516480..1322647551: 131072: 0003: net
  23:  2228224.. 2359295: 1322647552..1322778623: 131072: 0003: net
  24:  2359296.. 2476319: 1322778624..1322895647: 117024: 0003: last,net,eof
/myth/tv/2063_20210509184200.ts: 8 extents found
Ignoring the MDT value of 4096, the smallest OST value is 247905, which works out to about 5 extents per 1MB write (1048576 / 247905 ≈ 4.2); that is probably OK considering your space usage is as high as 98%...
Shilong, I updated my server to include your patch but am no longer seeing the journal warnings since upgrading and remounting the OSTs. I do see some uneven stats, but this may still be an issue with the application:
# lctl get_param osd-ldiskfs.myth-*.extent_bytes_allocation
osd-ldiskfs.myth-MDT0000.extent_bytes_allocation=4096
osd-ldiskfs.myth-OST0000.extent_bytes_allocation=921170
osd-ldiskfs.myth-OST0001.extent_bytes_allocation=757631
osd-ldiskfs.myth-OST0002.extent_bytes_allocation=1036399
osd-ldiskfs.myth-OST0003.extent_bytes_allocation=247905
osd-ldiskfs.myth-OST0004.extent_bytes_allocation=1048250
James, I don't think these messages have anything to do with EC; they just relate to large RPCs, "smaller" journals, and partly-fragmented free space.
adilger, would it be possible for you to apply the latest patch to see how it helps?
The fragmentation of the allocations can be seen fairly clearly in the osd-ldiskfs.myth-OST0001.brw_stats output:
# dd if=/dev/zero of=/myth/tmp/ost1/f1 bs=4M count=100
# lctl get_param osd-ldiskfs.myth-OST0001.brw_stats | less
osd-ldiskfs.myth-OST0001.brw_stats=
                        read    |    write
pages per bulk r/w   rpcs % cum % | rpcs % cum %
1K:                     0   0   0 |  100 100 100

                        read    |    write
discontiguous pages  rpcs % cum % | rpcs % cum %
0:                      0   0   0 |   93  93  93
1:                      0   0   0 |    7   7 100

                        read    |    write
discontiguous blocks rpcs % cum % | rpcs % cum %
0:                      0   0   0 |   90  90  90
1:                      0   0   0 |   10  10 100

                        read    |    write
disk fragmented I/Os  ios % cum % |  ios % cum %
1:                      0   0   0 |    0   0   0
2:                      0   0   0 |   90  90  90
3:                      0   0   0 |   10  10 100
So with 100x 4MB RPCs, 10% of them had to allocate blocks in 2 separate chunks, and this generated a console warning:
osd_handler.c:1938:osd_trans_start()) myth-OST0001: credits 1235 > trans_max 1024
osd_handler.c:1874:osd_trans_dump_creds()) attr_set: 1/1/0, xattr_set: 2/15/0
osd_handler.c:1884:osd_trans_dump_creds()) write: 2/1076/0, punch: 0/0/0, quota 5/149/0
I will try to run a test with the patch tomorrow.
Yes, mballoc will definitely have some problems with allocations on this filesystem, but I hope the patch https://review.whamcloud.com/43232 "LU-14438 ldiskfs: improvements to mballoc" can help with this in the future. As the e2freefrag output shows, there are still many free extents of 4MB or larger that could be used in this case.
If the transaction can be restarted in case of credit shortage, then it makes sense to be more optimistic in the credit reservations. Even in a case like this where the filesystem is nearly full, it is extremely unlikely that the worst case would ever be hit.
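A minimal sketch of what "optimistic credits plus restart" could look like at the jbd2 level (illustrative only; needed_credits is a placeholder, and the two-argument jbd2_journal_extend() form is from older kernels):

	rc = jbd2_journal_extend(handle, needed_credits);
	if (rc > 0)	/* transaction full, cannot extend in place: restart with more credits */
		rc = jbd2_journal_restart(handle, needed_credits);
	if (rc != 0)	/* a real journal error */
		GOTO(out, rc);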
I think as space usage grows, e.g. to 98%, the mballoc code will no longer try as hard to scan block groups for the best-aligned free extent. So I am wondering whether, in this case, the system decays the extent bytes estimate to a small value (e.g. tens of KB)? It would be nice if you could apply the above patch and watch how extent_bytes_allocation changes while the system runs.
At the same time, I am wondering how we could fix the problem properly, because even though we did not hit this problem before, that does not mean the old code was correct; it simply did not reserve enough credits, because we always calculated extents as the number of write fragments.
And if we consider the worst case in the new code:
	/*
	 * each extent can go into new leaf causing a split
	 * 5 is max tree depth: inode + 4 index blocks
	 * with blockmaps, depth is 3 at most
	 */
	if (LDISKFS_I(inode)->i_flags & LDISKFS_EXTENTS_FL) {
		/*
		 * many concurrent threads may grow tree by the time
		 * our transaction starts. so, consider 2 is a min depth
		 */
		depth = ext_depth(inode);
		depth = max(depth, 1) + 1;
		newblocks += depth;
		credits += depth * 2 * extents;
For your 4MB RPC, extents could be 1024 in the worst case; with an extent tree depth of 4 (so 5 levels after the +1 above), the credits here would be 10240 blocks...
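Spelling out that worst case as a small worked example (the numbers are taken from the discussion above, and the variable names are only for illustration):

	/* worst case for a 4MB RPC on a 4KB-block filesystem: every block
	 * lands in its own extent and the tree is at the 5-level maximum
	 * used by the declaration code quoted above */
	int extents = (4 * 1024 * 1024) / 4096;	/* = 1024 extents */
	int depth   = 4 + 1;			/* ext_depth() of 4, plus 1 */
	int credits = depth * 2 * extents;	/* = 10240 credit blocks */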
I am wondering whether we should change the code to be more optimistic, e.g. cap extents at min(number of fragments, 100) by default, along the lines of the sketch below.
We could then restart the transaction if we really run short of credits with this extents estimate.
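A minimal sketch of that capping idea (everything here is hypothetical; OSD_DECLARE_EXTENTS_MAX and osd_declare_extents() are made-up names for illustration):

	#define OSD_DECLARE_EXTENTS_MAX	100	/* assumed optimistic default cap */

	/* declare credits for at most min(fragments, 100) extents instead of
	 * the theoretical worst case; if the allocation turns out to be more
	 * fragmented than declared, restart the transaction as sketched above */
	static int osd_declare_extents(int fragments)
	{
		return fragments < OSD_DECLARE_EXTENTS_MAX ?
		       fragments : OSD_DECLARE_EXTENTS_MAX;
	}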
What do you think?
Sorry, I was copying the message on my phone and didn't see that...
The usage of the OSTs is quite high:
lfs df
UUID                 1K-blocks        Used   Available Use% Mounted on
myth-MDT0000_UUID     13523964     7974808     4762248  63% /myth[MDT:0]
myth-OST0000_UUID   3861381132  3662435612   121198440  97% /myth[OST:0]
myth-OST0001_UUID   3861381132  3684771832    98804312  98% /myth[OST:1]
myth-OST0002_UUID   5795208676  4759666520   918714980  84% /myth[OST:2]
myth-OST0003_UUID   5795078644  5555443412   122789132  98% /myth[OST:3]
myth-OST0004_UUID   5794464848  5434846744    67540536  99% /myth[OST:4]
but there are still reasonable-sized extents available for 4MB writes:
e2freefrag /dev/vgmyth/lvmythost1
Device: /dev/vgmyth/lvmythost1
Blocksize: 4096 bytes
/dev/vgmyth/lvmythost1: Operation not supported while calling fsmap
Total blocks: 973078528
Free blocks: 51836471 (5.3%)

Min. free extent: 4 KB
Max. free extent: 116528 KB
Avg. free extent: 5348 KB
Num. free extent: 38696

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :            77            77    0.00%
    8K...   16K-  :           118           275    0.00%
   16K...   32K-  :           204          1049    0.00%
   32K...   64K-  :          5694         59774    0.12%
   64K...  128K-  :           610         14045    0.03%
  128K...  256K-  :          1048         48236    0.09%
  256K...  512K-  :          1881        175273    0.34%
  512K... 1024K-  :          3585        688220    1.33%
    1M...    2M-  :          5749       1978160    3.82%
    2M...    4M-  :          7553       5375458   10.37%
    4M...    8M-  :          6924       9917254   19.13%
    8M...   16M-  :          1621       4763813    9.19%
   16M...   32M-  :          2530      14947189   28.84%
   32M...   64M-  :           905       9954228   19.20%
   64M...  128M-  :           197       3830417    7.39%
I'm wondering if I should try installing https://review.whamcloud.com/43232 "LU-14438 ldiskfs: improvements to mballoc" to see whether that improves the initial allocations, although the allocator is already doing pretty well considering how little free space is available.