Zam, I agree that you see improved performance after a "reset" because the allocator starts using the outer disk tracks (with higher linear velocity) rather than the inner tracks (with lower linear velocity). That is to be expected; the papers I have read report up to a 50% difference.
My real issue is that this mechanism is only useful for benchmarking. It won't help with disks that are in active use, and it won't help even with disks that are currently empty but have been in use for some time (without users doing a manual reset).
My suggestions are possible ways that this could be fixed for real-world usage of the filesystem. Resetting the allocator to the beginning of the group list will not help under normal usage, since it will have to scan the existing groups first. As you noted, mballoc will skip groups that are totally full, but it will still scan groups that are partly full. At the same time, it will not allocate in those groups if the free space is fragmented (as one would expect in a real filesystem after it has been used for some time).
Conversely, the "elevator" style algorithm used today will only revisit a group after scanning each of the other groups once. This gives the maximum time for blocks to be freed in those other groups before trying to allocate from them again.
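To make the contrast concrete, here is a minimal sketch (illustrative only, not the actual mballoc code) of an elevator-style scan: the cursor resumes from the last position and wraps around, so a group is only revisited after every other group has been considered once.

```python
def elevator_scan(ngroups, start, group_ok):
    """Return the first suitable group at or after `start`, wrapping once.

    `group_ok(g)` stands in for mballoc's per-group suitability check
    (enough contiguous free space, not too fragmented, etc.) -- an
    assumption for illustration, not a real kernel interface.
    """
    for i in range(ngroups):
        g = (start + i) % ngroups
        if group_ok(g):
            return g
    return None  # no group can satisfy the request

# Example: 8 groups, cursor at group 5; groups 5-7 are unsuitable, so the
# scan wraps around and lands at the front of the disk.
pos = elevator_scan(8, 5, lambda g: g not in {5, 6, 7})
```

Note that after the wrap the cursor sits at the front again, which is exactly why a "reset" looks good in a benchmark: it skips straight to the fast outer tracks without waiting for the wrap.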
Of course, a better (but more complex) mechanism is to keep an ordered list of groups with large free extents. That allows the allocator to quickly find a group with enough free space without having to scan fragmented groups. Also, if the groups were weighted/ordered so that lower-numbered ones were preferred over higher-numbered groups with an equal amount of free space, then you would get good behaviour for both your benchmarks and real-world usage.
For empty filesystems (or filesystems that are filled and emptied between test runs) the allocator would prefer the lower-numbered groups. For in-use filesystems the allocator would also be able to quickly find groups with lots of free space, and while it could still be biased toward the faster parts of the disk, it won't be wasting time re-scanning groups whose free space is fragmented.
This has been discussed with the ext4 developers in the past, and I think they would be willing to take this kind of patch upstream as well.
I'm not against improving the performance of ldiskfs. I even think the workload exposes a bug in the mballoc allocator if it isn't at least somewhat biased to the front of the disk (unless the block device is not "rotational", i.e. an SSD).
I just think the current tunable hides a real problem and at the same time doesn't fix any real-world behaviour. Its only use is to mask a genuine performance issue during benchmarking, and I don't think that is a step forward.