[LU-16162] ldiskfs: use low disk tracks for block allocation on empty or moderately full filesystems. Created: 15/Sep/22  Updated: 10/May/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Alexander Zarochentsev Assignee: Alexander Zarochentsev
Resolution: Unresolved Votes: 0
Labels: ldiskfs

Issue Links:
Related
is related to LU-14438 backport ldiskfs mballoc patches Open
is related to LU-15319 Weird mballoc behaviour Resolved
Rank (Obsolete): 9223372036854775807

 Description   

Disk performance degrades when new blocks get allocated near the end of the disk.

For example, below are obdfilter-survey results when mb_last_group is manually set to 0 / 75% / 90% of the maximum block group number before running the test:

Write bandwidth (MB/s): 3949.61 (0) vs 3677.15 (75%) vs 3133.43 (90%)

[root@cslmo2305 ~]# rpm -qi lustre_ib | grep Version | dshbak -c
----------------
Version
----------------
2.15.0.3_rc2_cray_165_g3355f27
[root@cslmo2305 ~]# echo 0 > /proc/fs/ldiskfs/md*/mb_last_group
[root@cslmo2305 ~]# cat /proc/fs/ldiskfs/md*/mb_last_group
0
[root@cslmo2305 ~]# nobjlo=2 nobjhi=2 thrlo=1024 thrhi=1024 size=393216 rszlo=4096 rszhi=4096 tests_str="write read" obdfilter-survey | egrep -v "^done" 2>/dev/null
Wed May 18 13:23:57 UTC 2022 Obdfilter-survey for case=disk from cslmo2305
ost  1 sz 402653184K rsz 4096K obj    2 thr 1024 write 3949.61 [1399.72, 4272.29] read 4451.99 [1679.81, 5986.40] 
/usr/bin/iokit-libecho: line 235: 69223 Killed                  remote_shell $host "vmstat 5 >> $host_vmstatf" &> /dev/null
[root@cslmo2305 ~]# cat /proc/fs/ldiskfs/md*/mb_last_group
3565
[root@cslmo2305 ~]# echo 0 > /proc/fs/ldiskfs/md*/mb_last_group
[root@cslmo2305 ~]# nobjlo=2 nobjhi=2 thrlo=1024 thrhi=1024 size=393216 rszlo=4096 rszhi=4096 tests_str="write read" obdfilter-survey | egrep -v "^done" 2>/dev/null
Wed May 18 13:27:24 UTC 2022 Obdfilter-survey for case=disk from cslmo2305
ost  1 sz 402653184K rsz 4096K obj    2 thr 1024 write 3975.33 [1207.70, 4299.46] read 4517.36 [1623.78, 5675.05] 
/usr/bin/iokit-libecho: line 235: 76282 Killed                  remote_shell $host "vmstat 5 >> $host_vmstatf" &> /dev/null
[root@cslmo2305 ~]# cat /proc/fs/ldiskfs/md*/mb_last_group
3590
[root@cslmo2305 ~]# echo 1040830 > /proc/fs/ldiskfs/md*/mb_last_group
[root@cslmo2305 ~]# cat /proc/fs/ldiskfs/md*/mb_last_group
1040830
[root@cslmo2305 ~]# nobjlo=2 nobjhi=2 thrlo=1024 thrhi=1024 size=393216 rszlo=4096 rszhi=4096 tests_str="write read" obdfilter-survey | egrep -v "^done" 2>/dev/null
Wed May 18 13:30:56 UTC 2022 Obdfilter-survey for case=disk from cslmo2305
ost  1 sz 402653184K rsz 4096K obj    2 thr 1024 write 3677.15 [2194.42, 3995.29] read 4819.83 [3596.32, 5391.15] 
/usr/bin/iokit-libecho: line 235: 82505 Killed                  remote_shell $host "vmstat 5 >> $host_vmstatf" &> /dev/null
[root@cslmo2305 ~]# cat /proc/fs/ldiskfs/md*/mb_last_group
1044021
[root@cslmo2305 ~]# echo 1040830 > /proc/fs/ldiskfs/md*/mb_last_group
[root@cslmo2305 ~]# cat /proc/fs/ldiskfs/md*/mb_last_group
1040830
[root@cslmo2305 ~]# nobjlo=2 nobjhi=2 thrlo=1024 thrhi=1024 size=393216 rszlo=4096 rszhi=4096 tests_str="write read" obdfilter-survey | egrep -v "^done" 2>/dev/null
Wed May 18 13:34:34 UTC 2022 Obdfilter-survey for case=disk from cslmo2305
ost  1 sz 402653184K rsz 4096K obj    2 thr 1024 write 3666.51 [3231.40, 4047.68] read 4798.87 [3963.07, 5255.19] 
/usr/bin/iokit-libecho: line 235: 88607 Killed                  remote_shell $host "vmstat 5 >> $host_vmstatf" &> /dev/null
[root@cslmo2305 ~]# cat /proc/fs/ldiskfs/md*/mb_last_group
1044030
[root@cslmo2305 ~]# echo 1248996 > /proc/fs/ldiskfs/md


 Comments   
Comment by Andreas Dilger [ 15/Sep/22 ]

I've been thinking about this issue for some time already, and I think it makes sense to use the LU-14438 patches as a starting point for this. That patch provides an array of allocation groups sorted by size of free extent, and the array is used when searching for a new group for allocations.

To provide the start/end segregation needed to isolate the slower tracks of the disk, a threshold could be set (e.g. 80% of groups, or a specific group number), and this could be used to efficiently split the groups into two arrays: the "fast" array for groups below the threshold, and the "slow" array for groups above the threshold. Allocations would prefer groups from the fast array if there are suitable free chunks, and only look for groups in the slow array if there were none in the fast array.

That would only be a small change to the mballoc code, as well as an O(1) change to the array insertion code to pick the correct array in which to insert each group.
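
For illustration, a minimal userspace sketch of the fast/slow split described above (the struct layout, the fast_boundary field, and the function names are made up for this sketch and do not match the actual LU-14438/mballoc code):

#include <stddef.h>

/* one entry per block group, kept sorted by largest free extent */
struct group_info {
	unsigned int group;		/* block group number */
	unsigned int largest_free;	/* largest free extent, in blocks */
};

struct alloc_ctx {
	struct group_info *fast;	/* groups <= fast_boundary */
	size_t nfast;
	struct group_info *slow;	/* groups > fast_boundary */
	size_t nslow;
	unsigned int fast_boundary;	/* e.g. 80% of the group count */
};

/* O(1) decision of which array a scanned group is inserted into */
static int is_fast_group(const struct alloc_ctx *ctx, unsigned int group)
{
	return group <= ctx->fast_boundary;
}

/* prefer fast groups; fall back to the slow array only if none fit */
static int find_group(const struct alloc_ctx *ctx, unsigned int needed,
		      unsigned int *result)
{
	for (size_t i = 0; i < ctx->nfast; i++)
		if (ctx->fast[i].largest_free >= needed) {
			*result = ctx->fast[i].group;
			return 0;
		}
	for (size_t i = 0; i < ctx->nslow; i++)
		if (ctx->slow[i].largest_free >= needed) {
			*result = ctx->slow[i].group;
			return 0;
		}
	return -1;	/* nothing suitable: caller does its normal fallback */
}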

Comment by Andreas Dilger [ 15/Sep/22 ]

If there was some way for clients to specify QOS for files ("lfs ladvise" or with "lfs migrate"?), it would even be possible to have osd-ldiskfs allocate objects into the slow groups directly.

Comment by Peter Jones [ 15/Sep/22 ]

"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48558
Subject: LU-16162 ldiskfs: keep low tracks allocated by mballoc
Project: fs/lustre-release
Branch: master
Current Patch Set: 3
Commit: cfdf70dc5b3aad26ff746f20ee030c389f9a7715

Comment by Andreas Dilger [ 16/Sep/22 ]

> If there was some way for clients to specify QOS for files ("lfs ladvise" or with "lfs migrate"?), it would even be possible to have osd-ldiskfs allocate objects into the slow groups directly.

Shuichi had a very good use case for being able to write into the "slow" part of the filesystem. When migrating data from an old filesystem to a new one, the old data will be copied onto the newly formatted OSTs and fill all of the low (high-bandwidth) groups, leaving the new data to use the slower parts of the disk. It would be useful to have some mechanism (ladvise process setting, environment variable, layout, or fcntl(F_SET_RW_HINT)?) to force object allocation to e.g. the last 30% of groups for cases like this, so that the beginning of the filesystem remains available for new usage.
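
For illustration only, one possible client-side shape for such a hint, assuming the existing Linux fcntl(F_SET_RW_HINT) write-lifetime hints were reused and osd-ldiskfs mapped RWH_WRITE_LIFE_EXTREME onto the slow groups (no such mapping exists today):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		1036	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME	5	/* from <linux/fcntl.h> */
#endif

int main(int argc, char **argv)
{
	uint64_t hint = RWH_WRITE_LIFE_EXTREME;	/* "cold" data */
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
		perror("F_SET_RW_HINT");
		return 1;
	}
	/* hypothetically, later writes to fd would be steered to slow groups */
	close(fd);
	return 0;
}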

Comment by Alexander Zarochentsev [ 19/Sep/22 ]

> if there was some way for clients to specify QOS for files ("lfs ladvise" or with "lfs migrate"?), it would even be possible to have osd-ldiskfs allocate objects into the slow groups directly.

No, the patch only addresses a simple but annoying case: mb_last_group points to some high block group number while the filesystem is still almost empty (due to a repeated write/delete usage pattern). Also, LU-15319 is about a weird mballoc optimization that causes already-initialized block groups to be skipped, i.e. ldiskfs starts writing to the uninitialized part of the filesystem; after some iterations, users would notice that an empty filesystem is slow.
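
For illustration, the effect being described could be sketched as a simple cap on the stored goal group when the filesystem is still mostly empty (this is only a sketch of the idea in the ticket title; the threshold and names are invented and it is not the code from https://review.whamcloud.com/48558):

/* "mostly empty" cutoff, percent of blocks used; value is hypothetical */
#define MB_EMPTY_FS_PCT		25

static unsigned int mb_goal_group(unsigned int last_group,
				  unsigned long long blocks_used,
				  unsigned long long blocks_total)
{
	unsigned int pct_used = blocks_total ?
		(unsigned int)(blocks_used * 100 / blocks_total) : 0;

	/* a stale high mb_last_group on a nearly empty fs: restart low */
	if (pct_used < MB_EMPTY_FS_PCT)
		return 0;
	return last_group;	/* otherwise keep the usual goal group */
}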
