[LU-2377] Provide a mechanism to reset the ldiskfs extents allocation position to near the beginning of a drive Created: 22/Nov/12  Updated: 16/Sep/16  Resolved: 13/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Alexander Zarochentsev Assignee: Keith Mannthey (Inactive)
Resolution: Won't Fix Votes: 0
Labels: patch

Issue Links:
Duplicate
duplicates LU-8365 Fix mballoc stream allocator to bette... Open
Related
Rank (Obsolete): 5643

 Description   

Apparently, the current allocator may end up allocating extents near "the end" of an OST over time. Restarting the OSS seems to cause the allocator to again choose extents near "the beginning".

Since extents near the beginning of the OST drive have much better write performance, it would be useful for benchmark/performance testing to be able to "reset" allocations without an OST restart.



 Comments   
Comment by Alexander Zarochentsev [ 22/Nov/12 ]

In ext4, allocation of relatively large chunks of blocks is done by a special "stream" block allocator. It is used when the request size is greater than sbi->s_mb_stream_request (16 blocks by default). Once the stream allocator is used, the last allocated block becomes the start of subsequent stream allocations, which explains why new files get larger and larger block numbers.
The patch I am going to attach exports the stream allocator's internal state through procfs and allows it to be reset.
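
A simplified, self-contained sketch of the stream-goal behaviour and the proposed reset (illustrative only; the sb_info struct, pick_goal() and reset_stream_goal() names are invented for this sketch, while the real state lives in struct ext4_sb_info as s_mb_stream_request, s_mb_last_group and s_mb_last_start):

{code:c}
/*
 * Not the actual ext4/ldiskfs code or the attached patch -- a minimal
 * illustration of the "stream" goal behaviour described above.
 */
#include <stdio.h>

struct sb_info {
	unsigned int stream_request;     /* threshold in blocks, 16 by default */
	unsigned long last_stream_block; /* goal carried between allocations */
};

/* Pick the goal block for a new allocation request of 'len' blocks. */
static unsigned long pick_goal(struct sb_info *sbi, unsigned long hint,
			       unsigned int len)
{
	if (len < sbi->stream_request)
		return hint;               /* small file: use the per-inode hint */
	return sbi->last_stream_block;     /* large file: continue the stream */
}

/* What the proposed procfs knob would do: forget the stream position. */
static void reset_stream_goal(struct sb_info *sbi)
{
	sbi->last_stream_block = 0;        /* next stream starts at the front */
}

int main(void)
{
	struct sb_info sbi = { .stream_request = 16, .last_stream_block = 0 };
	unsigned long goal;

	/* Two large allocations: the second one starts where the first ended. */
	goal = pick_goal(&sbi, 0, 64);
	sbi.last_stream_block = goal + 64;  /* allocator records the end */
	printf("second stream goal: %lu\n", pick_goal(&sbi, 0, 64));

	/* The proposed reset brings the goal back to the start of the device. */
	reset_stream_goal(&sbi);
	printf("goal after reset: %lu\n", pick_goal(&sbi, 0, 64));
	return 0;
}
{code}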

Comment by Andreas Dilger [ 22/Nov/12 ]

This can also lead to performance loss on real systems, if the beginning of the disk is full and/or has fragmented free space and the allocator has to search through many groups that have no space. Be careful that you are not optimizing for your benchmark, and instead focus on solutions that will actually improve performance under real world usage with disks that are not completely empty.

One suggestion would be to keep the group descriptors in a sorted list by the number of free blocks or the number of free extents larger than 1MB. This would likely also solve your problem, and also make allocations much faster for disks that are full.

Comment by Alexander Zarochentsev [ 23/Nov/12 ]

Patch uploaded: http://review.whamcloud.com/#change,4664

Comment by Alexander Zarochentsev [ 03/Dec/12 ]

Andreas, this patch is not for ordinary users but for testers, so they can "reset" the fs to its "after-mount" state. In general I think there should be no mechanism that affects performance but is totally hidden from ext4 testers/users.

About performance loss: the ext4 allocator skips 100% full groups without an actual bitmap scan (through the ext4_mb_good_group() check), so it should jump to the first non-full group quickly.
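
A rough sketch of that fast-path check (not ext4_mb_good_group() itself; the group_info struct and field names are invented for illustration, and the real check also considers fragmentation and the requested allocation order):

{code:c}
struct group_info {
	unsigned int free_blocks;        /* cached free-block count for the group */
	unsigned int largest_free_order; /* largest power-of-two free extent */
};

/* Cheap pre-check: reject groups that obviously cannot satisfy the request,
 * without reading the block bitmap from disk. */
static int group_looks_usable(const struct group_info *grp,
			      unsigned int request_order)
{
	if (grp->free_blocks == 0)
		return 0;   /* 100% full: skipped without a bitmap scan */
	return grp->largest_free_order >= request_order;
}
{code}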

"One suggestion would be to keep the group descriptors in a sorted list by the number of free blocks or the number of free extents larger than 1MB. This would likely also solve your problem, and also make allocations much faster for disks that are full."

While it may be a good allocator optimization, our problem is different: the performance comes back when the fs gets remounted. How can sorted group descriptors solve that?

Comment by Keith Mannthey (Inactive) [ 04/Dec/12 ]

Carrying kernel proper patches in the Lustre tree should really be reserved for critical issues.

What change in performance are you seeing with this code and your micro benchmark?

The state of the device is always changing unless you reformat between runs. A reformat step, or restoring an exact disk image to a known state, is the only way to get exactly repeatable performance. Will simply remounting, without reformatting, get you the performance change you want for your specialized run?

Also, in general, what performance tests are you working with? Are they open source?

Comment by Andreas Dilger [ 06/Dec/12 ]

Zam, I agree that you see improved performance after a "reset" due to using outer disk tracks (with higher linear velocity) rather than inner disk tracks (with lower linear velocity). That is to be expected (up to a 50% difference in the papers I have read).

My real issue is that this mechanism is only useful for benchmarking. It won't help with disks that are used, and it won't help even for disks that are empty but have been in use for some time (without users doing a manual reset).

My suggestions are possible ways that this could be fixed for real-world usage of the filesystem. Doing a reset of the allocator to the first group will not help under normal usage, since it will have to scan the existing groups first. As you noted, mballoc will skip groups that are totally full, but it will still scan groups that are partly full. At the same time, it will not allocate in these groups if the free space is fragmented (as one would expect in a real filesystem after it has been used for some time).

Conversely, the "elevator" style algorithm used today will only revisit each group after scanning each of the previous groups once. This gives the maximum time to free blocks in those other groups before trying to allocate from them again.
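
A minimal sketch of that elevator-style scan (not the actual mballoc code; the helper below is invented for illustration):

{code:c}
/* Start at the remembered group and wrap around, so a group is only
 * revisited after every other group has been considered once. */
static unsigned int next_group_to_try(unsigned int start_group,
				      unsigned int attempt,
				      unsigned int ngroups)
{
	return (start_group + attempt) % ngroups;  /* attempt = 0 .. ngroups-1 */
}
{code}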

Of course, a better (but more complex) mechanism is to keep an ordered list of groups with large free extents. That allows the allocator to quickly find a group with enough free space without having to scan fragmented groups. Also, if the groups were weighted/ordered so that lower-numbered ones were preferred over higher-numbered groups with an equal amount of free space then you will get good behaviour for both your benchmarks and real-world usage.

For empty filesystems (or filesystems that are filled and emptied between test runs) the allocator would prefer the lower-numbered groups. For in-use filesystems the allocator would also be able to quickly find groups with lots of free space, and while it could be biased to the faster parts of the disk it won't be wasting time re-scanning groups that have fragmented free space.
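
A hedged sketch of such an ordering (illustrative only; the struct, the names and the qsort() usage are not from ext4/ldiskfs, which would more likely maintain the index incrementally, e.g. in an rbtree, rather than re-sorting): groups are sorted primarily by the size of their largest free extent, with ties broken in favour of lower-numbered groups.

{code:c}
#include <stdio.h>
#include <stdlib.h>

struct group {
	unsigned int group_no;        /* position on the disk */
	unsigned int largest_free_kb; /* largest contiguous free extent */
};

static int cmp_groups(const void *a, const void *b)
{
	const struct group *ga = a, *gb = b;

	/* Prefer groups with more contiguous free space... */
	if (ga->largest_free_kb != gb->largest_free_kb)
		return ga->largest_free_kb > gb->largest_free_kb ? -1 : 1;
	/* ...and among equals, prefer lower-numbered (faster) groups. */
	return ga->group_no < gb->group_no ? -1 :
	       ga->group_no > gb->group_no ?  1 : 0;
}

int main(void)
{
	struct group groups[] = {
		{ 0, 256 }, { 1, 2048 }, { 2, 2048 }, { 3, 64 },
	};
	size_t n = sizeof(groups) / sizeof(groups[0]);

	qsort(groups, n, sizeof(groups[0]), cmp_groups);

	/* The allocator would try groups in this order. */
	for (size_t i = 0; i < n; i++)
		printf("group %u (largest free %u KB)\n",
		       groups[i].group_no, groups[i].largest_free_kb);
	return 0;
}
{code}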

This has been discussed with the ext4 developers in the past, and I think they would be willing to take this kind of patch upstream as well.

Comment by Nathan Rutman [ 13/Dec/12 ]

This was a request from one of our resellers; our feeling was that it would be useful to others as well. But if it's not desired that's fine; we can close this.

Comment by Keith Mannthey (Inactive) [ 13/Dec/12 ]

Please don't be discouraged with this work. As you say, more than one person might want this code; it just does not seem like the Lustre code base is the correct project to hold it. linux-ext4 is an active ext4 development list and I truly encourage an RFC submission to it.

If you can get code into mainline the path to other places becomes much easier.

Comment by Andreas Dilger [ 13/Dec/12 ]

I'm not against improving the performance of ldiskfs. I even think the workload exposes a bug in the mballoc allocator if it isn't at least somewhat biased to the front of the disk (unless the block device is not "rotational", i.e. an SSD).

I just think the current tunable hides a real problem and simultaneously doesn't fix any real-world behaviour. It is only useful to hide a real performance issue during benchmarking, and I don't think that is a step forward.
