[LU-16162] ldiskfs: use low disk tracks for block allocation on empty or moderately full filesystems. Created: 15/Sep/22 Updated: 10/May/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major |
| Reporter: | Alexander Zarochentsev | Assignee: | Alexander Zarochentsev |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ldiskfs |
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Disk performance degrades when new blocks are allocated near the end of the disk. For example, below are the obdsurvey results when mb_last_group is manually set to 0/75%/90% of the maximum block group number before running the test:
|
| Comments |
| Comment by Andreas Dilger [ 15/Sep/22 ] |
|
I've been thinking about this issue for some time already, and I think it makes sense to use the LU-14438 patches as a starting point for this. That patch provides an array of allocation groups sorted by size of free extent, and the array is used when searching for a new group for allocations. To provide the start/end segregation needed to isolate the slower tracks of the disk, a threshold could be set (e.g. 80% of groups, or a specific group number), and this could be used to efficiently split the groups into two arrays: the "fast" array for groups below the threshold, and the "slow" array for groups above the threshold. Allocations would prefer groups from the fast array if there are suitable free chunks, and only look for groups in the slow array if there were none in the fast array. That would only be a small change to the mballoc code, as well as an O(1) change to the array insertion code to pick the correct array in which to insert each group. |
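The fast/slow split described above can be modelled in a few lines. This is only a rough sketch of the policy, not mballoc or LU-14438 code: the names `Group`, `SplitGroupLists`, `insert` and `pick` are hypothetical, a group exactly at the threshold is assumed to count as "slow", and a plain linear scan stands in for the sorted-by-free-extent arrays.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Group:
    number: int        # block group number (low numbers = outer, faster tracks)
    free_extent: int   # size of the largest free extent, in blocks


@dataclass
class SplitGroupLists:
    threshold: int     # groups numbered below this are treated as "fast"
    fast: List[Group] = field(default_factory=list)
    slow: List[Group] = field(default_factory=list)

    def insert(self, group: Group) -> None:
        # O(1) decision: one comparison picks the array to insert into.
        # Assumption: a group equal to the threshold goes to "slow".
        target = self.fast if group.number < self.threshold else self.slow
        target.append(group)

    def pick(self, wanted: int) -> Optional[Group]:
        # Prefer the fast array; fall back to the slow array only when no
        # fast group has a large enough free extent.
        for candidates in (self.fast, self.slow):
            fits = [g for g in candidates if g.free_extent >= wanted]
            if fits:
                # Lowest group number first (assumed tie-break).
                return min(fits, key=lambda g: g.number)
        return None


# Example: 10 groups with the threshold at 80% of them (group 8).
lists = SplitGroupLists(threshold=8)
for number, free in [(0, 4), (1, 16), (8, 64), (9, 128)]:
    lists.insert(Group(number, free))
```

With these made-up values, a request for 8 blocks is served from fast group 1, while a request for 32 blocks falls through to slow group 8 because no fast group has a large enough extent.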
| Comment by Andreas Dilger [ 15/Sep/22 ] |
|
If there was some way for clients to specify QOS for files ("lfs ladvise" or with "lfs migrate"?), it would even be possible to have osd-ldiskfs allocate objects into the slow groups directly. |
| Comment by Peter Jones [ 15/Sep/22 ] |
|
"Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/48558 |
| Comment by Andreas Dilger [ 16/Sep/22 ] |
Shuichi had a very good use case for being able to write into the "slow" part of the filesystem. When migrating data from an old filesystem to a new filesystem, the old data will be copied into the newly-formatted OSTs, and fill all of the low groups (high bandwidth), leaving the new data to use slower parts of the disk. It would be useful to have some mechanism (ladvise process setting, environment variable, layout, or fcntl(F_SET_RW_HINT)?) to force object allocation to e.g. the last 30% of groups for cases like this, so that the beginning of the filesystem remains available for new usage. |
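As a hedged illustration of the "last 30% of groups" idea, the sketch below maps a placement hint to a range of block groups to search. Everything here is an assumption made for illustration: the function `group_range`, the `"fast"`/`"slow"` hint values and the `slow_percent` parameter do not correspond to any existing Lustre, ldiskfs or fcntl interface.

```python
def group_range(total_groups: int, placement: str, slow_percent: int = 30) -> range:
    """Return the half-open range of block groups to search for a hint.

    Hypothetical policy: "slow" restricts allocation to the last
    `slow_percent` percent of groups, "fast" to everything before that.
    """
    if total_groups <= 0:
        raise ValueError("total_groups must be positive")
    if not 0 < slow_percent < 100:
        raise ValueError("slow_percent must be between 1 and 99")

    # Integer arithmetic keeps the boundary exact for any group count.
    boundary = total_groups * (100 - slow_percent) // 100

    if placement == "slow":
        return range(boundary, total_groups)
    if placement == "fast":
        return range(0, boundary)
    raise ValueError(f"unknown placement hint: {placement!r}")


# Migration data is steered to the tail; new data keeps the fast head.
migration_groups = group_range(1000, "slow")
new_data_groups = group_range(1000, "fast")
```

In this model, a migration tool tagged as "slow" on a 1000-group OST would search groups 700 to 999 only, leaving groups 0 to 699 free for new usage.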
| Comment by Alexander Zarochentsev [ 19/Sep/22 ] |
|
> if there was some way for clients to specify QOS for files ("lfs ladvise" or with "lfs migrate"?), it would even be possible to have osd-ldiskfs allocate objects into the slow groups directly.

No, the patch only addresses a simple but annoying case: mb_last_group points to some high block group number while the fs is still almost empty (due to a repeated write/delete usage pattern) also |