[LU-12103] Improve block allocation for large partitions Created: 25/Mar/19 Updated: 16/Feb/21 Resolved: 25/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Improvement | Priority: | Critical |
| Reporter: | Artem Blagodarenko (Inactive) | Assignee: | Artem Blagodarenko (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||||||||||||||
| Description |
|
The block allocator uses heuristics when choosing a group to allocate new blocks from. This works well in most cases, but it takes a long time on large partitions with little free space. The algorithm should be adjusted for this special case. |
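For reference, the group-scan behaviour can be observed through the mballoc statistics quoted in later comments; a minimal sketch, assuming an ldiskfs target on /dev/md0 mounted at /mnt/ldiskfs (stock ext4 exposes the same mb_stats tunable under /sys/fs/ext4/; the per-criterion "useless/skipped c(0,1,2) loops" counters quoted below come from the proposed patch, not from stock ext4):
# Enable mballoc statistics on the mounted ldiskfs device (example paths).
echo 1 > /sys/fs/ldiskfs/md0/mb_stats
# ... run the write workload ...
# The "mballoc: ... extents scanned, ... goal hits ..." summary is printed
# to the kernel log when the filesystem is unmounted.
umount /mnt/ldiskfs
dmesg | grep mballoc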
| Comments |
| Comment by Artem Blagodarenko (Inactive) [ 25/Mar/19 ] |
|
Hello adilger, what do you think about the optimisation idea from https://patchwork.ozlabs.org/patch/1054251/ ? Do you know of any other optimisations to suggest? I have attached the test I used and the testing output to this issue. Thanks. |
| Comment by Andreas Dilger [ 25/Mar/19 ] |
|
I think that in the long run a better approach would be a tree-based allocator using the extent status tree that already exists. Otherwise, searching through 3-4 million groups becomes too slow regardless of how the iteration is done. |
| Comment by Artem Blagodarenko (Inactive) [ 25/Mar/19 ] |
|
adilger, thank you for the fast answer! I like this long-run idea. We have encountered very slow OST operations on a nearly full target. Do you think my patch can solve this problem as a short-term solution? Thanks. |
| Comment by Alex Zhuravlev [ 25/Mar/19 ] |
|
it would be interesting to understand where most of the time is spent: checking (nearly) empty groups, searching for a better chunk, or perhaps waiting on I/O to fill the bitmaps? |
| Comment by Artem Blagodarenko (Inactive) [ 23/Apr/19 ] |
|
Hello bzzz, here is data from one of the stuck OSTs:
4.80% 0.00% ll_ost_io00_031 [ptlrpc] [k] ptlrpc_server_handle_request
|
---ptlrpc_server_handle_request
   |
   --4.80%--tgt_request_handle
      |
      --4.80%--tgt_brw_write
         |
         --4.80%--obd_commitrw.constprop.39
                  ofd_commitrw
                  ofd_commitrw_write.isra.32
            |
            --4.80%--osd_write_commit
               |
               --4.80%--osd_ldiskfs_map_inode_pages
                  |
                  --4.80%--ldiskfs_map_blocks
                     |
                     --4.80%--ldiskfs_ext_map_blocks
                        |
                        --4.80%--ldiskfs_mb_new_blocks
                           |
                           --4.43%--ldiskfs_mb_regular_allocator
                              |
                              --4.16%--ldiskfs_mb_good_group
Most of the time is spent in the ldiskfs_mb_regular_allocator() loops (4 loops over all groups). |
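The call chain above looks like perf call-graph output; a minimal sketch of how such a profile can be collected, where the 30-second window is arbitrary and the ll_ost_io00_031 comm filter is just the thread name taken from the trace above:
# Sample call graphs system-wide while the slow writes are in flight.
perf record -a -g -- sleep 30
# Fold the samples into a call tree, restricted to the OST I/O thread.
perf report --stdio --comms=ll_ost_io00_031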
| Comment by Andreas Dilger [ 04/May/19 ] |
|
Another possibility is to improve large OST allocation by using the bigalloc feature. This will reduce the number of block groups to search by a factor of the chunk size (in blocks), and increase the efficiency of block allocations. bigalloc has been in use by Google for many years, though there may be some issues to be fixed in osd-ldiskfs in order to convert block allocations to cluster allocations. |
| Comment by Andreas Dilger [ 09/May/19 ] |
|
The benefit of bigalloc is that it reduces metadata size and handling overhead by a significant factor. The number of bits to allocate per unit size is reduced linearly by the chunk factor. This will help mballoc significantly, since huge OSTs can have millions of block groups to search, and a bigalloc chunk size of, say, 128KB would reduce the allocation overhead and the number of block groups by a factor of 32. The main drawback of bigalloc is that it can waste space, because the chunk size is the minimum allocation unit of the filesystem (e.g. any file < chunk_size will consume a full chunk of space, even though only one 4KB block might be written). The space in a chunk cannot be shared between files. However, this is no worse than if the block size were actually increased to match the bigalloc chunk size, and better in several regards. The one drawback vs. a larger block size is that it does not increase the maximum extent size or maximum file size, since the blocksize and block addressing are unchanged; only the allocation size is changed. Has anyone tested bigalloc on an OST, and are there any known issues? |
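As a reference point, a minimal sketch of formatting a device with bigalloc and a 128 KiB cluster size (the device name is an example; a real OST is normally formatted through mkfs.lustre, which would need to pass these options through to mke2fs):
# 4 KiB blocks grouped into 128 KiB allocation clusters (32 blocks/cluster),
# so mballoc has roughly 32x fewer allocation bits and block groups to scan.
mke2fs -t ext4 -b 4096 -O bigalloc -C 131072 /dev/md0
# Confirm the cluster size that was selected.
dumpe2fs -h /dev/md0 | grep -i 'cluster size'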
| Comment by Andreas Dilger [ 09/May/19 ] |
|
Note that I'm not against improving mballoc to be more efficient, but I think bigalloc is a very easy way to improve allocation performance with minimum effort (mainly going through osd-ldiskfs and maybe LFSCK and mapping blocks to chunks during allocation), vs. significant work to rewrite the block allocation code, which would also touch lots of core code and need a long time to validate correctness and allocator behavior. |
| Comment by Artem Blagodarenko (Inactive) [ 24/May/19 ] |
|
Hello adilger, I agree that bigalloc can improve metadata operation performance and save space, but it looks like it cannot, by itself, solve the allocator problem. Here are testing results showing that the allocator performs ~1 million useless group scans; even if that number were reduced 4 times, nothing would change dramatically. During the test the filesystem was fragmented with the pattern "50 free blocks - 50 occupied blocks". Performance degraded from 1.2 GB/s to 10 MB/s.
1. dd on a non-fragmented fs: ~1.2 GB/s
[root@cslmo1704 ~]# df -T /mnt/ldiskfs
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md0 ldiskfs 121226819924 1260 120014147240 1% /mnt/ldiskfs
[root@cslmo1704 ~]# time dd if=/dev/zero of=/mnt/ldiskfs/foo bs=$((1024*1024)) count=$((32*10*1024)) &
[1] 74048
327680+0 records in
327680+0 records out
343597383680 bytes (344 GB) copied, 292.264 s, 1.2 GB/s
real 4m52.267s
user 0m0.287s
sys 4m51.010s
2. fragmented fs, mb_c[1-3]_threshold at defaults (24, 14, 4): write ~10 MB/s (93.89system, 90.98system): fake_fill_fs 50
1 Fri May 17 15:41:27 UTC 2019 ==================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
24
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
14
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ set +x
WRITE 1: ================================
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/17/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.84 0.00 0.41 0.13 0.00 97.61
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 78.09 239.67 7133.00 22778200 677929976
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 6530.55 s, 10.5 MB/s
0.23user 93.89system 1:48:50elapsed 1%CPU (0avgtext+0avgdata 1824maxresident)k
168inputs+134217728outputs (1major+501minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/17/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.79 0.00 0.80 0.24 0.00 97.18
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 80.86 224.30 7279.30 22782608 739372736
-rw-r--r-- 1 root root 68719476736 May 17 17:30 /mnt/ldiskfs/foo
READ 1: ========
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 56.2515 s, 1.2 GB/s
0.04user 24.33system 0:56.25elapsed 43%CPU (0avgtext+0avgdata 1828maxresident)k
134217784inputs+0outputs (1major+502minor)pagefaults 0swaps
RM 1: =====
0.00user 4.14system 0:04.89elapsed 84%CPU (0avgtext+0avgdata 684maxresident)k
1264inputs+0outputs (1major+214minor)pagefaults 0swaps
2 Fri May 17 17:43:21 UTC 2019 ==================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
24
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
14
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ set +x
WRITE 2: ================================
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/17/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.78 0.00 0.84 0.24 0.00 97.14
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 89.69 878.23 7278.99 89892436 745047808
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 6619.02 s, 10.4 MB/s
0.25user 90.98system 1:50:19elapsed 1%CPU (0avgtext+0avgdata 1828maxresident)k
64inputs+134217728outputs (1major+502minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/17/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.73 0.00 1.18 0.34 0.00 96.75
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 91.65 824.93 7400.80 89896736 806502028
-rw-r--r-- 1 root root 68719476736 May 17 19:33 /mnt/ldiskfs/foo
For the given fragmentation (50 free blocks - 50 occupied blocks), excluding the c1 loops does not give an improvement, but excluding the c2 loops gives ~500 MB/s write performance:
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md0 ldiskfs 121226819924 60622252404 59391896096 51% /mnt/ldiskfs
/dev/md0:
Timing buffered disk reads: 6256 MB in 3.00 seconds = 2084.39 MB/sec
1 Tue May 21 12:25:00 UTC 2019 ============================================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ echo 1
+ set +x
WRITE 1: ================================
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/21/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.11 0.00 0.99 0.29 0.00 97.61
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 51.63 1384.45 2903.89 593729636 1245346552
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 133.066 s, 516 MB/s
0.07user 84.88system 2:14.63elapsed 63%CPU (0avgtext+0avgdata 1828maxresident)k
2928inputs+134217728outputs (0major+502minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/21/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.11 0.00 0.99 0.30 0.00 97.60
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 53.68 1384.03 3059.43 593735516 1312462836
mballoc: 16777208 blocks 335827 reqs (325 success)
mballoc: 67425781 extents scanned, 177 goal hits, 0 2^N hits, 335442 breaks, 0 lost
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1083502, 0) skipped c(0,1,2) loops
-rw-r--r-- 1 root root 68719476736 May 21 12:27 /mnt/ldiskfs/foo
READ 1: ========
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 64.7674 s, 1.1 GB/s
0.05user 24.76system 1:04.77elapsed 38%CPU (0avgtext+0avgdata 1828maxresident)k
134217784inputs+0outputs (1major+501minor)pagefaults 0swaps
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md0 ldiskfs 121226819924 60680523488 59333625012 51% /mnt/ldiskfs
RM 1: =====
0.00user 4.02system 0:06.78elapsed 59%CPU (0avgtext+0avgdata 680maxresident)k
4784inputs+0outputs (1major+214minor)pagefaults 0swaps
2 Tue May 21 12:33:37 UTC 2019 ============================================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ echo 1
+ set +x
WRITE 2: ================================
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/21/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.11 0.00 1.00 0.30 0.00 97.60
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 55.76 1539.10 3056.71 660846708 1312464796
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 123.318 s, 557 MB/s
0.05user 78.67system 2:03.32elapsed 63%CPU (0avgtext+0avgdata 1828maxresident)k
56inputs+134217728outputs (1major+501minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 05/21/2019 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.11 0.00 1.00 0.30 0.00 97.59
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
md0 57.64 1538.67 3199.87 660851024 1374326816
mballoc: 15462744 blocks 309333 reqs (149 success)
mballoc: 62147084 extents scanned, 6 goal hits, 0 2^N hits, 309183 breaks, 0 lost
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1393743, 0) skipped c(0,1,2) loops
-rw-r--r-- 1 root root 68719476736 May 21 12:35 /mnt/ldiskfs/foo
READ 2: ========
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 65.9076 s, 1.0 GB/s
0.05user 24.05system 1:05.91elapsed 36%CPU (0avgtext+0avgdata 1824maxresident)k
134217784inputs+0outputs (1major+500minor)pagefaults 0swaps
Filesystem Type 1K-blocks Used Available Use% Mounted on
/dev/md0 ldiskfs 121226819924 60680523488 59333625012 51% /mnt/ldiskfs
RM 2: =====
0.00user 3.94system 0:06.80elapsed 57%CPU (0avgtext+0avgdata 680maxresident)k
4752inputs+0outputs (1major+214minor)pagefaults 0swaps
The reason why, in the "50 free blocks - 50 occupied blocks" case, the "60-0-0" setting does not help can be illustrated by the statistics:
mballoc: (7829, 1664192, 0) useless c(0,1,2) loops
mballoc: (981753, 0, 0) skipped c(0,1,2) loops
Yes, there are 7829 useless c1 loops, but there are 1,664,192 c2 loops, orders of magnitude more, so the influence of the c1 loops can be disregarded. In this case the "60-60-0" options are needed. The statistics with "60-60-0" show 1,393,743 c2 loops skipped, and this brings write performance back to ~500 MB/s:
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1393743, 0) skipped c(0,1,2) loops
Read performance hasn't changed - 1.0 GB/s. |
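For reference, a minimal sketch of the tuning used for the "60-60-0"-style runs above (the paths and the value 59 are taken from the logs; the mb_c[1-3]_threshold files are added by the proposed ldiskfs patch and do not exist in stock ext4):
# Raise the c1/c2 thresholds so those scanning passes are skipped for this
# fragmentation pattern, matching the logged runs; c3 is left at its default.
echo 59 > /sys/fs/ldiskfs/md0/mb_c1_threshold
echo 59 > /sys/fs/ldiskfs/md0/mb_c2_threshold
cat /sys/fs/ldiskfs/md0/mb_c1_threshold /sys/fs/ldiskfs/md0/mb_c2_threshold /sys/fs/ldiskfs/md0/mb_c3_threshold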
| Comment by Andreas Dilger [ 24/May/19 ] |
|
A summary of these statistics (a table showing total group scans and performance for unpatched and patched code for a few different configs) should be included in the commit message for the patch submitted upstream. That makes it clear the patch is providing a real benefit (improved performance, reduced CPU usage). I think that would make it much easier to get the patch accepted; otherwise, just a vague "improves performance" in the comment is not a compelling reason to land it. |
| Comment by Artem Blagodarenko (Inactive) [ 03/Jun/19 ] |
|
Artem, we discussed this patch on the ext4 concall today. A couple of items came up during discussion:
- the patch submission should include performance results to show that the patch is providing an improvement
- it would be preferable if the thresholds for the stages were found dynamically in the kernel, based on how many groups have been skipped and the free chunk size in each group
- there would need to be some way to dynamically reset the scanning level when lots of blocks have been freed
Hello adilger, what do you think about the idea of splitting this into two phases: the first as I have already sent, and a second with some auto-tune logic?
|
| Comment by Andreas Dilger [ 03/Jun/19 ] |
|
Definitely, including statistics for the performance improvement should be part of the first patch. I didn't see a copy of the patch in Gerrit. Have you submitted it yet? I think a very simple heuristic could be used for auto-tune, something like "skip groups with a number of free blocks less than 1/2 of the average", or possibly "skip groups with a free allocation order less than 1/2 of the average", and adjust which scanning stage this applies to when, say, 1/2, 3/4, ... of groups are below this level. |
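As an illustration of the "half of average" idea, a rough user-space sketch of the cut-off computation follows (the real heuristic would of course be computed inside mballoc; dumpe2fs, the variable names, and the /dev/md0 default are only examples):
#!/bin/sh
# Estimate the "skip groups with fewer free blocks than half the average"
# cut-off from the superblock summary. Illustrative only.
DEV=${1:-/dev/md0}
BLOCKS=$(dumpe2fs -h "$DEV" 2>/dev/null | awk -F: '/^Block count/ {gsub(/ /,"",$2); print $2}')
FREE=$(dumpe2fs -h "$DEV" 2>/dev/null | awk -F: '/^Free blocks/ {gsub(/ /,"",$2); print $2}')
PER_GROUP=$(dumpe2fs -h "$DEV" 2>/dev/null | awk -F: '/^Blocks per group/ {gsub(/ /,"",$2); print $2}')
NGROUPS=$(( (BLOCKS + PER_GROUP - 1) / PER_GROUP ))
AVG_FREE=$(( FREE / NGROUPS ))
echo "groups=$NGROUPS avg_free_per_group=$AVG_FREE skip_groups_below=$(( AVG_FREE / 2 ))"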
| Comment by Artem Blagodarenko (Inactive) [ 05/Jun/19 ] |
|
Hello adilger,
> I didn't see a copy of the patch in Gerrit. Have you submitted it yet?
Do we need ldiskfs patches here? The right way is to land it in ext4 directly. I am going to send a new patch series, with test results and a debugfs fake-fragmentation patch for testing.
|
| Comment by Andreas Dilger [ 05/Jun/19 ] |
|
I agree that it makes sense to get the patches reviewed and accepted upstream if possible, but after that it might take several years before the change is available in a vendor kernel, so it would make sense to have an ldiskfs patch as well. Also, in some cases, patches that improve performance and/or functionality still do not get accepted upstream because of various reasons, so in this case it would still make sense to carry this patch in the Lustre tree because it mostly affects very large OST filesystems. |
| Comment by Artem Blagodarenko (Inactive) [ 10/Jun/19 ] |
|
Here is a summary from the ext4 developers' call discussion last Thursday.
My next steps:
|
| Comment by Gerrit Updater [ 11/Jun/19 ] |
|
Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/35180 |
| Comment by Artem Blagodarenko (Inactive) [ 25/Sep/19 ] |
|
green, I have added an issue about porting to RHEL8 - |
| Comment by Gerrit Updater [ 25/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35180/ |
| Comment by Peter Jones [ 25/Sep/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 05/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36681 |
| Comment by Andreas Dilger [ 28/Nov/19 ] |
|
Hi Artem, I noticed that this patch was only added to the rhel7.6 series, but not the rhel7.7 and rhel8.0 series. Could you please submit a patch to add ext4-simple-blockalloc.patch to these newer series? |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36681/ |