[LU-12103] Improve block allocation for large partitions Created: 25/Mar/19  Updated: 16/Feb/21  Resolved: 25/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Improvement Priority: Critical
Reporter: Artem Blagodarenko (Inactive) Assignee: Artem Blagodarenko (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File 0002-LUS-6746-ldiskfs-block-allocator-tests.patch     Text File allocator-skip-loops-test-results.txt    
Issue Links:
Related
is related to LU-8365 Fix mballoc stream allocator to bette... Open
is related to LU-12345 backport - ext4: optimize ext4_find_d... Resolved
is related to LU-12801 Port "ldiskfs: don't search large blo... Resolved
is related to LU-12335 mb_prealloc_table table read/write c... Resolved
is related to LU-12970 improve mballoc for huge filesystems Open
is related to LU-12988 improve mount time on huge ldiskfs fi... Resolved
is related to LU-14305 add persistent tuning for mb_c3_thres... Resolved

 Description   

The block allocator uses heuristics while choosing a group to allocate new blocks from. This works well in most cases, but takes a long time on large, low-free-space partitions. The algorithm should be adjusted for this special case.



 Comments   
Comment by Artem Blagodarenko (Inactive) [ 25/Mar/19 ]

Hello adilger, what do you think about the optimisation idea from https://patchwork.ozlabs.org/patch/1054251/ ? Do you know of other optimisations to suggest? I have attached the test I used and the testing output to this issue. Thanks.

Comment by Andreas Dilger [ 25/Mar/19 ]

In the long run, I think a better approach would be a tree-based allocator using the extent status tree that already exists. Otherwise, searching through 3-4 million groups becomes too slow regardless of how the iteration is done.

Comment by Artem Blagodarenko (Inactive) [ 25/Mar/19 ]

adilger, thank you for the fast answer! I like this long-term idea.

We have faced very slow OST operations on a filled target. Do you think my patch could solve this problem as a short-term solution?

Thanks.

Comment by Alex Zhuravlev [ 25/Mar/19 ]

it would be interesting to understand where most of the time is spent: checking (nearly) empty groups, searching for a better chunk, or perhaps waiting on IO to fill the bitmaps?

Comment by Artem Blagodarenko (Inactive) [ 23/Apr/19 ]

Hello bzzz,

Here is data from one of the stuck OSTs:

 4.80%     0.00%  ll_ost_io00_031  [ptlrpc]                      [k] ptlrpc_server_handle_request
            |
            ---ptlrpc_server_handle_request
               |
                --4.80%--tgt_request_handle
                          |
                           --4.80%--tgt_brw_write
                                     |
                                      --4.80%--obd_commitrw.constprop.39
                                                ofd_commitrw
                                                ofd_commitrw_write.isra.32
                                                |
                                                 --4.80%--osd_write_commit
                                                           |
                                                            --4.80%--osd_ldiskfs_map_inode_pages
                                                                      |
                                                                       --4.80%--ldiskfs_map_blocks
                                                                                 |
                                                                                  --4.80%--ldiskfs_ext_map_blocks
                                                                                            |
                                                                                             --4.80%--ldiskfs_mb_new_blocks
                                                                                                       |
                                                                                                        --4.43%--ldiskfs_mb_regular_allocator
                                                                                                                  |
                                                                                                                   --4.16%--ldiskfs_mb_good_group

Most of the time is spent in the ldiskfs_mb_regular_allocator() loops (4 passes over all groups).
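
For context, here is a much-simplified user-space sketch of the loop structure in question (the struct fields and the good_group() checks are illustrative stand-ins, not the real kernel code): each criterion pass walks all block groups and asks whether a group is worth scanning, so with millions of mostly-unsuitable groups the allocator spends its time in exactly these loops.

/*
 * Much-simplified user-space model of the multi-pass group scan;
 * struct group_info and good_group() are stand-ins for the real
 * kernel structures, only the loop shape matters here.
 */
#include <stdbool.h>
#include <stddef.h>

struct group_info {
	unsigned int free_blocks;   /* free blocks in this group */
	unsigned int largest_order; /* log2 of the largest free extent */
};

/* Stand-in for ldiskfs_mb_good_group(): stricter checks in earlier passes. */
static bool good_group(const struct group_info *gi, int cr, unsigned int need)
{
	if (gi->free_blocks < need)
		return false;
	if (cr == 0) /* pass 0: want a power-of-two extent big enough */
		return (1u << gi->largest_order) >= need;
	return true; /* later passes accept anything with enough free blocks */
}

/* Returns the index of the chosen group, or -1 if nothing fits. */
static int regular_allocator(const struct group_info *groups, size_t ngroups,
			     size_t goal, unsigned int need)
{
	/* up to 4 criterion passes, each one scanning all groups */
	for (int cr = 0; cr < 4; cr++) {
		for (size_t i = 0; i < ngroups; i++) {
			size_t g = (goal + i) % ngroups;

			if (!good_group(&groups[g], cr, need))
				continue;
			/* the real code would load the bitmap and scan it here */
			return (int)g;
		}
	}
	return -1;
}

int main(void)
{
	struct group_info groups[] = {
		{ .free_blocks = 10,   .largest_order = 3 },
		{ .free_blocks = 4096, .largest_order = 12 },
	};

	/* ask for 1024 blocks starting from group 0; group 1 should win */
	return regular_allocator(groups, 2, 0, 1024) == 1 ? 0 : 1;
}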

Comment by Andreas Dilger [ 04/May/19 ]

Another possibility is to improve large OST allocation by using the bigalloc feature. This will reduce the number of block groups to search by a factor of the chunk size (in blocks), and increase the efficiency of block allocations.

bigalloc has been in use by Google for many years, though there may be some issues to be fixed with osd-ldiskfs in order to convert block allocations to cluster allocations.

Comment by Andreas Dilger [ 09/May/19 ]

The benefit of bigalloc is that it reduces metadata size and handling overhead by a significant factor. The number of bits to allocate per unit size is reduced linearly by the chunk factor. This will help mballoc significantly, since huge OSTs can have millions of block groups to search, and a bigalloc chunk size of, say, 128kB would reduce allocation overhead and the number of block groups by a factor of 32.

The main drawback of bigalloc is that it can waste space because the chunk size is the minimum allocation unit of the filesystem (e.g. any file < chunk_size will consume a full chunk of space, even though only one 4KB block might be written). The space in a chunk cannot be shared between files. However, this is not worse than if the block size were actually increased to match the bigalloc chunk size, and better in several regards. The one drawback vs. a larger block size is that it does not increase the maximum extent size or maximum file size, since the blocksize and block addressing are the same; only the allocation size is changed.

Has anyone tested bigalloc on an OST, and are there any known issues?
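
To make the "factor of 32" figure concrete, here is a small back-of-the-envelope calculation in plain C (the 100 TiB OST size is only an illustrative assumption; the layout assumed is the standard one where a single block-bitmap block of 8 * blocksize bits defines one group):

/*
 * Rough arithmetic behind the "factor of 32" claim: with bigalloc the
 * block bitmap tracks clusters instead of blocks, so each group covers
 * cluster_size/block_size times more space and the group count shrinks
 * by the same factor.  The 100 TiB OST size is just an example.
 */
#include <stdio.h>

int main(void)
{
	unsigned long long ost_bytes  = 100ULL << 40; /* example: 100 TiB OST */
	unsigned long long block_size = 4096;         /* 4 KiB blocks */
	unsigned long long cluster    = 128 * 1024;   /* 128 KiB bigalloc chunk */

	/* one group is covered by one bitmap block: 8 * block_size bits */
	unsigned long long blocks_per_group   = 8 * block_size;
	unsigned long long clusters_per_group = 8 * block_size;

	unsigned long long groups_plain    = ost_bytes / (blocks_per_group * block_size);
	unsigned long long groups_bigalloc = ost_bytes / (clusters_per_group * cluster);

	printf("plain 4K blocks : %llu groups\n", groups_plain);
	printf("bigalloc 128K   : %llu groups (%llux fewer)\n",
	       groups_bigalloc, groups_plain / groups_bigalloc);
	return 0;
}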

Comment by Andreas Dilger [ 09/May/19 ]

Note that I'm not against improving mballoc to be more efficient, but I think bigalloc is a very easy way to improve allocation performance with minimum effort (mainly going through osd-ldiskfs and maybe LFSCK and mapping blocks to chunks during allocation), vs. significant work to rewrite the block allocation code, which would also touch lots of core code and need a long time to validate correctness and allocator behavior.

Comment by Artem Blagodarenko (Inactive) [ 24/May/19 ]

Hello adilger,

I agree that bigalloc can improve metadata operation performance and save space. But it does not look like it can help with the allocator problems. Here are test results showing that the allocator makes ~1 million useless group scans. Even if this number became 4 times smaller, nothing would change dramatically.

During the test, the filesystem was fragmented with the pattern "50 free blocks - 50 occupied blocks". Performance degraded from 1.2 GB/sec to 10 MB/sec.

 
1. dd on a non-fragmented fs: ~1.2 GB/sec
[root@cslmo1704 ~]# df -T /mnt/ldiskfs
Filesystem     Type       1K-blocks  Used    Available Use% Mounted on
/dev/md0       ldiskfs 121226819924  1260 120014147240   1% /mnt/ldiskfs
[root@cslmo1704 ~]#
 
cslmo1704 ~]# time dd if=/dev/zero of=/mnt/ldiskfs/foo bs=$((1024*1024)) count=$((32*10*1024)) &
[1] 74048
[root@cslmo1704 ~]# 327680+0 records in
327680+0 records out
343597383680 bytes (344 GB) copied, 292.264 s, 1.2 GB/s
 
real    4m52.267s
user    0m0.287s
sys     4m51.010s
 
2. fragmented fs, mb_c[1-3]_threshold at defaults (24, 14, 4): write ~10 MB/sec (93.89 s and 90.98 s system time):
fake_fill_fs 50
1 Fri May 17 15:41:27 UTC 2019 ==================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
24
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
14
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ set +x
WRITE 1: =======-===
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704)        05/17/2019      _x86_64_        (20 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.84    0.00    0.41    0.13    0.00   97.61
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              78.09       239.67      7133.00   22778200  677929976
 
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 6530.55 s, 10.5 MB/s
0.23user 93.89system 1:48:50elapsed 1%CPU (0avgtext+0avgdata 1824maxresident)k
168inputs+134217728outputs (1major+501minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704)        05/17/2019      _x86_64_        (20 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.79    0.00    0.80    0.24    0.00   97.18
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              80.86       224.30      7279.30   22782608  739372736
 
-rw-r--r-- 1 root root 68719476736 May 17 17:30 /mnt/ldiskfs/foo
READ 1: ========
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 56.2515 s, 1.2 GB/s
0.04user 24.33system 0:56.25elapsed 43%CPU (0avgtext+0avgdata 1828maxresident)k
134217784inputs+0outputs (1major+502minor)pagefaults 0swaps
RM 1: =====
0.00user 4.14system 0:04.89elapsed 84%CPU (0avgtext+0avgdata 684maxresident)k
1264inputs+0outputs (1major+214minor)pagefaults 0swaps
2 Fri May 17 17:43:21 UTC 2019 ==================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
24
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
14
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ set +x
2 Fri May 17 17:43:21 UTC 2019 ==================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
24
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
14
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ set +x
WRITE 2: =======-===
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704)        05/17/2019      _x86_64_        (20 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.78    0.00    0.84    0.24    0.00   97.14
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              89.69       878.23      7278.99   89892436  745047808
 
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 6619.02 s, 10.4 MB/s
0.25user 90.98system 1:50:19elapsed 1%CPU (0avgtext+0avgdata 1828maxresident)k
64inputs+134217728outputs (1major+502minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704)        05/17/2019      _x86_64_        (20 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.73    0.00    1.18    0.34    0.00   96.75
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              91.65       824.93      7400.80   89896736  806502028
 
-rw-r--r-- 1 root root 68719476736 May 17 19:33 /mnt/ldiskfs/foo
 
For the given fragmentation (50 free blocks - 50 occupied blocks), excluding the c1 loops gives no improvement, but excluding the c2 loops gives ~500 MB/s write performance:
 
Filesystem     Type       1K-blocks        Used   Available Use% Mounted on
/dev/md0       ldiskfs 121226819924 60622252404 59391896096  51% /mnt/ldiskfs
 
 
/dev/md0:
 Timing buffered disk reads: 6256 MB in  3.00 seconds = 2084.39 MB/sec
1 Tue May 21 12:25:00 UTC 2019 ============================================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ echo 1
+ set +x
WRITE 1: ================================
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 	05/21/2019 	_x86_64_	(20 CPU)
 
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    0.99    0.29    0.00   97.61
 
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              51.63      1384.45      2903.89  593729636 1245346552
 
 
	65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 133.066 s, 516 MB/s
0.07user 84.88system 2:14.63elapsed 63%CPU (0avgtext+0avgdata 1828maxresident)k
2928inputs+134217728outputs (0major+502minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 	05/21/2019 	_x86_64_	(20 CPU)
 
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    0.99    0.30    0.00   97.60
 
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              53.68      1384.03      3059.43  593735516 1312462836
 
 
mballoc: 16777208 blocks 335827 reqs (325 success)
mballoc: 67425781 extents scanned, 177 goal hits, 0 2^N hits, 335442 breaks, 0 lost
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1083502, 0) skipped c(0,1,2) loops
-rw-r--r-- 1 root root 68719476736 May 21 12:27 /mnt/ldiskfs/foo
READ 1: ========
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 64.7674 s, 1.1 GB/s
0.05user 24.76system 1:04.77elapsed 38%CPU (0avgtext+0avgdata 1828maxresident)k
134217784inputs+0outputs (1major+501minor)pagefaults 0swaps
Filesystem     Type       1K-blocks        Used   Available Use% Mounted on
/dev/md0       ldiskfs 121226819924 60680523488 59333625012  51% /mnt/ldiskfs
RM 1: =====
0.00user 4.02system 0:06.78elapsed 59%CPU (0avgtext+0avgdata 680maxresident)k
4784inputs+0outputs (1major+214minor)pagefaults 0swaps
2 Tue May 21 12:33:37 UTC 2019 ============================================
+ cat /sys/fs/ldiskfs/md0/mb_c1_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c2_threshold
59
+ cat /sys/fs/ldiskfs/md0/mb_c3_threshold
4
+ echo 1
+ set +x
WRITE 2: ================================
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 	05/21/2019 	_x86_64_	(20 CPU)
 
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    1.00    0.30    0.00   97.60
 
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              55.76      1539.10      3056.71  660846708 1312464796
 
 
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 123.318 s, 557 MB/s
0.05user 78.67system 2:03.32elapsed 63%CPU (0avgtext+0avgdata 1828maxresident)k
56inputs+134217728outputs (1major+501minor)pagefaults 0swaps
Linux 3.10.0-693.21.1.x3.2.12.x86_64 (cslmo1704) 	05/21/2019 	_x86_64_	(20 CPU)
 
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    1.00    0.30    0.00   97.59
 
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
md0              57.64      1538.67      3199.87  660851024 1374326816
 
 
mballoc: 15462744 blocks 309333 reqs (149 success)
mballoc: 62147084 extents scanned, 6 goal hits, 0 2^N hits, 309183 breaks, 0 lost
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1393743, 0) skipped c(0,1,2) loops
-rw-r--r-- 1 root root 68719476736 May 21 12:35 /mnt/ldiskfs/foo
READ 2: ========
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 65.9076 s, 1.0 GB/s
0.05user 24.05system 1:05.91elapsed 36%CPU (0avgtext+0avgdata 1824maxresident)k
134217784inputs+0outputs (1major+500minor)pagefaults 0swaps
Filesystem     Type       1K-blocks        Used   Available Use% Mounted on
/dev/md0       ldiskfs 121226819924 60680523488 59333625012  51% /mnt/ldiskfs
RM 2: =====
0.00user 3.94system 0:06.80elapsed 57%CPU (0avgtext+0avgdata 680maxresident)k
4752inputs+0outputs (1major+214minor)pagefaults 0swaps 
The reason why, in the "50 free blocks - 50 occupied blocks" case, the "60-0-0" setting does not help can be illustrated by the statistics:
mballoc: (7829, 1664192, 0) useless c(0,1,2) loops
mballoc: (981753, 0, 0) skipped c(0,1,2) loops
Yes, there are 7829 useless c1 loops, but 1664192 useless c2 loops, over 200 times as many, so the influence of the c1 loops can be ignored. In this case we need to set the "60-60-0" options. The statistics with "60-60-0" show 1393743 c2 loops skipped, and this brings write performance back to ~500 MB/s:
mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1393743, 0) skipped c(0,1,2) loops
Read performance hasn't changed - 1.0 GB/s.
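
To spell out which knobs the "60-60-0" setting touches, here is a tiny C helper (in practice the values are simply echoed from a shell; the /sys/fs/ldiskfs/md0/mb_c*_threshold paths come from the test logs above, and the "md0" device name is only an example):

/*
 * Raise the c1/c2 thresholds so that, per the discussion above, the c1
 * and c2 scanning loops are skipped at this fill level.  mb_c3_threshold
 * is left at its default, as in the test logs.
 */
#include <stdio.h>

static int write_threshold(const char *name, int value)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/ldiskfs/md0/%s", name);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%d\n", value);
	return fclose(f);
}

int main(void)
{
	int rc = 0;

	rc |= write_threshold("mb_c1_threshold", 60);
	rc |= write_threshold("mb_c2_threshold", 60);
	return rc ? 1 : 0;
}
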
Comment by Andreas Dilger [ 24/May/19 ]

A summary of these statistics (a table showing total group scans and performance for unpatched and patched code for a few different configs) should be included in the commit message for the patch submitted upstream. That makes it clear the patch is providing a real benefit (improved performance, reduced CPU usage). I think that would make it much easier to get the patch accepted; otherwise a vague "improves performance" in the comment is not a compelling reason to land it.

Comment by Artem Blagodarenko (Inactive) [ 03/Jun/19 ]

 

Artem, we discussed this patch on the Ext4 concall today. A couple of items came up during discussion:

  • the patch submission should include performance results to show that the patch is providing an improvement
  • it would be preferable if the thresholds for the stages were found dynamically in the kernel, based on how many groups have been skipped and the free chunk size in each group
  • there would need to be some way to dynamically reset the scanning level when lots of blocks have been freed

 

Hello adilger, what do you think about the idea of splitting this into two phases: first the patch I have already sent, second some autotune logic?

 

Comment by Andreas Dilger [ 03/Jun/19 ]

Definitely, statistics for the performance improvement should be included with the first patch. I didn't see a copy of the patch in Gerrit. Have you submitted it yet?

I think a very simple heuristic could be used for auto-tuning, something like "skip groups with a free block count less than 1/2 of the average", or possibly "skip groups with a free allocation order less than 1/2 of the average", and adjust which scanning stage this applies to when, say, 1/2, 3/4, ... of the groups are below this level.
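
A minimal user-space sketch of how such an auto-tune heuristic might look (the struct, the fractions, and all function names are illustrative assumptions, not a proposed kernel change): groups whose free-block count is below half the filesystem average are skipped during the early criterion passes, and the latest pass at which skipping is still allowed is chosen from how many groups are currently below that level, so a nearly full filesystem does not skip everything forever.

/*
 * Illustrative sketch of the auto-tune idea: compute the average free
 * block count per group, treat groups below half of that average as
 * "poor", skip poor groups in the early criterion passes, and decide
 * from the fraction of poor groups up to which pass the skip still
 * applies.  All names and fractions here are made up for the example.
 */
#include <stdbool.h>
#include <stddef.h>

struct group_info {
	unsigned int free_blocks;
};

/* Average free blocks per group across the whole filesystem. */
static unsigned long avg_free_per_group(const struct group_info *groups,
					size_t ngroups)
{
	unsigned long long total = 0;

	for (size_t i = 0; i < ngroups; i++)
		total += groups[i].free_blocks;
	return ngroups ? (unsigned long)(total / ngroups) : 0;
}

/*
 * Latest criterion pass at which poor groups may still be skipped:
 * the more groups fall below half the average, the earlier we stop
 * skipping them, so a nearly full filesystem still scans everything
 * in the later passes.
 */
static int max_skip_cr(const struct group_info *groups, size_t ngroups,
		       unsigned long avg)
{
	size_t poor = 0;

	for (size_t i = 0; i < ngroups; i++)
		if (groups[i].free_blocks < avg / 2)
			poor++;

	if (poor * 2 < ngroups)     /* fewer than 1/2 of the groups are poor */
		return 2;           /* skip them in passes 0-2 */
	if (poor * 4 < ngroups * 3) /* fewer than 3/4 of the groups are poor */
		return 1;           /* skip them in passes 0-1 */
	return 0;                   /* mostly poor: only skip in pass 0 */
}

/* Would this group be skipped at criterion pass "cr"? */
static bool skip_group(const struct group_info *gi, int cr,
		       unsigned long avg, int skip_cr)
{
	return cr <= skip_cr && gi->free_blocks < avg / 2;
}

int main(void)
{
	struct group_info groups[] = { { 100 }, { 90 }, { 10 }, { 5 }, { 8000 } };
	size_t n = sizeof(groups) / sizeof(groups[0]);
	unsigned long avg = avg_free_per_group(groups, n);
	int skip_cr = max_skip_cr(groups, n, avg);

	/* group 3 has only 5 free blocks, far below half the average: skip early */
	return skip_group(&groups[3], 0, avg, skip_cr) ? 0 : 1;
}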

Comment by Artem Blagodarenko (Inactive) [ 05/Jun/19 ]

Hello adilger,

>I didn't see a copy of the patch in Gerrit. Have you submitted it yet?

Do we need ldiskfs patches here? The right way is to land this in ext4 directly. I am going to send a new patch series, with test results and a debugfs fake-fragmentation patch for testing.

 

Comment by Andreas Dilger [ 05/Jun/19 ]

I agree that it makes sense to get the patches reviewed and accepted upstream if possible, but after that it might take several years before the change is available in a vendor kernel, so it would make sense to have an ldiskfs patch as well.

Also, in some cases, patches that improve performance and/or functionality still do not get accepted upstream because of various reasons, so in this case it would still make sense to carry this patch in the Lustre tree because it mostly affects very large OST filesystems.

Comment by Artem Blagodarenko (Inactive) [ 10/Jun/19 ]

Here is a summary of the EXT4 developers call discussion last Thursday.

  • EXT4 users prefer automatic adjustment of the block allocator settings.
  • The current version of the skip-loops patch is not interesting, because it requires manual tuning.
  • A heuristic that changes the allocator behaviour automatically is preferred.

My next steps:

  • Continue the discussion on the EXT4 mailing list to find the best heuristic
  • Upload the current patch version to Gerrit

Comment by Gerrit Updater [ 11/Jun/19 ]

Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/35180
Subject: LU-12103 ldiskfs: don't search large block range if disk full
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5444c2b2d17a58f7b0d2d8aeb23b652ae8d6ecd4

Comment by Artem Blagodarenko (Inactive) [ 25/Sep/19 ]

green, I added an issue about porting this to RHEL8 - LU-12801

Comment by Gerrit Updater [ 25/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35180/
Subject: LU-12103 ldiskfs: don't search large block range if disk full
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 95f8ae5677491508ae7182b4f61ead3d413434ae

Comment by Peter Jones [ 25/Sep/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 05/Nov/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36681
Subject: LU-12103 ldiskfs: don't search large block range if disk full
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 0da3a44425f5ab1c0417663281f7cc626f99b675

Comment by Andreas Dilger [ 28/Nov/19 ]

Hi Artem, I noticed that this patch was only added to the rhel7.6 series, but not the rhel7.7 and rhel8.0 series. Could you please submit a patch to add ext4-simple-blockalloc.patch to these newer series?

Comment by Gerrit Updater [ 05/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36681/
Subject: LU-12103 ldiskfs: don't search large block range if disk full
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 810a952303969ca0ee01639a5408ff2f0e3456d9
