Lustre / LU-2377

Provide a mechanism to reset the ldiskfs extents allocation position to near the beginning of a drive

Details

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Minor

    Description

      Apparently, the current allocator may end up allocating extents near the end of an OST at any point. Restarting the OSS seems to reset the allocator so that it again chooses extents near the beginning of the device.

      Since extents near the beginning of the OST drive have much better write performance, it would be useful for benchmark/performance testing to be able to "reset" allocations without an OST restart.

    Activity

            adilger Andreas Dilger added a comment:

            I'm not against improving the performance of ldiskfs. I even think the workload exposes a bug in the mballoc allocator if it isn't at least somewhat biased to the front of the disk (unless the block device is not "rotational", i.e. an SSD).

            I just think the current tunable hides a real problem and simultaneously doesn't fix any real-world behaviour. It is only useful to hide a real performance issue during benchmarking, and I don't think that is a step forward.

            keith Keith Mannthey (Inactive) added a comment:

            Please don't be discouraged with this work. As you say, more than one person might want this code; it just does not seem like the Lustre code base is the correct project to hold the code. linux-ext4 is an active ext4 development list and I truly encourage an RFC submission to it.

            If you can get code into mainline, the path to other places becomes much easier.

            nrutman Nathan Rutman added a comment:

            This was a request from one of our resellers; our feeling was that it would be useful to others as well. But if it's not desired that's fine; we can close this.

            adilger Andreas Dilger added a comment:

            Zam, I agree that you see improved performance after a "reset" due to using outer disk tracks (with higher linear velocity) rather than inner disk tracks (with lower linear velocity). That is to be expected (up to 50% difference in the papers that I have read).

            My real issue is that this mechanism is only useful for benchmarking. It won't help with disks that are used, and it won't help even for disks that are empty but have been in use for some time (without users doing a manual reset).

            My suggestions are possible ways that this could be fixed for real-world usage of the filesystem. Doing a reset of the allocator to the first group will not help under normal usage, since it will have to scan the existing groups first. As you noted, mballoc will skip groups that are totally full, but it will still scan groups that are partly full. At the same time, it will not allocate in these groups if the free space is fragmented (as one would expect in a real filesystem after it has been used for some time).

            Conversely, the "elevator" style algorithm used today will only revisit a group after scanning each of the other groups once. This gives the maximum time to free blocks in those other groups before trying to allocate from them again.

            Of course, a better (but more complex) mechanism is to keep an ordered list of groups with large free extents. That allows the allocator to quickly find a group with enough free space without having to scan fragmented groups. Also, if the groups were weighted/ordered so that lower-numbered ones were preferred over higher-numbered groups with an equal amount of free space, then you would get good behaviour for both your benchmarks and real-world usage.

            For empty filesystems (or filesystems that are filled and emptied between test runs) the allocator would prefer the lower-numbered groups. For in-use filesystems the allocator would also be able to quickly find groups with lots of free space, and while it could be biased to the faster parts of the disk it won't be wasting time re-scanning groups that have fragmented free space.

            This has been discussed with the ext4 developers in the past, and I think they would be willing to take this kind of patch upstream as well.
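
            To make the ordered-list idea above concrete, here is a minimal userspace sketch (this is not ext4 code; the struct, field names and sample numbers are invented for illustration). It orders groups by their largest free extent and, for equal sizes, prefers lower-numbered groups, so an empty filesystem naturally fills from the fast end of the disk:

            /* Sketch only: model of an ordered group list, not kernel code. */
            #include <stdio.h>
            #include <stdlib.h>

            struct group_info {
                unsigned int group;        /* group number; 0 is the start of the disk */
                unsigned int largest_free; /* largest free extent in this group, in blocks */
            };

            /* Largest free extent first; among equals, the lower-numbered group wins. */
            static int cmp_groups(const void *a, const void *b)
            {
                const struct group_info *ga = a, *gb = b;

                if (ga->largest_free != gb->largest_free)
                    return gb->largest_free > ga->largest_free ? 1 : -1;
                return ga->group > gb->group ? 1 : -1;
            }

            int main(void)
            {
                struct group_info groups[] = {
                    { 0, 64 }, { 1, 2048 }, { 2, 2048 }, { 3, 512 },
                };
                size_t n = sizeof(groups) / sizeof(groups[0]);
                unsigned int want = 256; /* request size in blocks, e.g. a 1MB write with 4KB blocks */

                qsort(groups, n, sizeof(groups[0]), cmp_groups);

                /* The first entry that is big enough is also the closest suitable group
                 * to the start of the disk, so fragmented groups are never scanned. */
                for (size_t i = 0; i < n; i++) {
                    if (groups[i].largest_free >= want) {
                        printf("allocate from group %u\n", groups[i].group);
                        return 0;
                    }
                }
                printf("no group has a free extent of %u blocks\n", want);
                return 1;
            }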

            keith Keith Mannthey (Inactive) added a comment:

            Carrying patches against the kernel proper in the Lustre tree should really be reserved for critical issues.

            What change in performance are you seeing with this code and your micro benchmark?

            The state of the device is always changing unless you reformat between runs. A reformat step, or an exact disk copy to a known state, is the only way to get exactly stable performance. Would simply remounting, without reformatting, get you the performance change you want for your specialized run?

            Also, in general, what performance tests are you working with? Are they open source?

            zam Alexander Zarochentsev added a comment:

            Andreas, this patch is not for ordinary users but for a tester, to be able to "reset" the fs to its "after-mount" state. In general I think there should be no mechanism which may affect performance but is totally hidden from ext4 testers/users.

            About the performance loss: the ext4 allocator skips 100% full groups without an actual bitmap scan (through the ext4_mb_good_group() check), so it should jump to the first non-full group quickly.

            "One suggestion would be to keep the group descriptors in a sorted list by the number of free blocks or the number of free extents larger than 1MB. This would likely also solve your problem, and also make allocations much faster for disks that are full."

            While it may be a good allocator optimization, our problem is different: the performance comes back when the fs gets remounted. How would sorted group descriptors solve that?
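
            As a rough illustration of the cheap skip mentioned above (this is not the actual ext4_mb_good_group() implementation; the types, names and numbers are invented), a cached per-group free-block count is enough to reject completely full groups without ever reading their bitmaps:

            /* Sketch only: skip groups using an in-memory summary, no bitmap I/O. */
            #include <stdbool.h>
            #include <stdio.h>

            struct group_summary {
                unsigned int free_blocks; /* cached count, updated on every alloc/free */
            };

            /* Cheap check before the expensive bitmap scan. */
            static bool group_worth_scanning(const struct group_summary *g, unsigned int want)
            {
                return g->free_blocks >= want;
            }

            int main(void)
            {
                struct group_summary groups[] = { { 0 }, { 0 }, { 1500 }, { 30 } };
                unsigned int want = 256;

                for (unsigned int i = 0; i < sizeof(groups) / sizeof(groups[0]); i++) {
                    if (!group_worth_scanning(&groups[i], want)) {
                        printf("group %u: skipped without reading its bitmap\n", i);
                        continue;
                    }
                    printf("group %u: scan the bitmap for a %u-block extent\n", i, want);
                }
                return 0;
            }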
            zam Alexander Zarochentsev added a comment:

            Patch uploaded: http://review.whamcloud.com/#change,4664

            adilger Andreas Dilger added a comment:

            This can also lead to performance loss on real systems, if the beginning of the disk is full and/or has fragmented free space and the allocator has to search through many groups that have no space. Be careful that you are not optimizing for your benchmark, and instead focus on solutions that will actually improve performance under real world usage with disks that are not completely empty.

            One suggestion would be to keep the group descriptors in a sorted list by the number of free blocks or the number of free extents larger than 1MB. This would likely also solve your problem, and also make allocations much faster for disks that are full.

            zam Alexander Zarochentsev added a comment:

            In ext4, allocation of relatively large chunks of blocks is done by a special "stream" block allocator. It is used when the request size is greater than sbi->s_mb_stream_request, 16 blocks by default. Once the stream allocator is used, the most recently allocated block becomes the start of subsequent stream allocations. This explains why new files get larger and larger block numbers.

            The patch I am going to attach exports the stream allocator's internal state through procfs and allows it to be reset.
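
            A minimal userspace sketch of the stream-goal behaviour described above (it only models the idea; this is not the LU-2377 patch or the ext4 source, and every name here except the 16-block default taken from s_mb_stream_request is invented):

            /* Sketch only: model of the stream allocation goal and its reset. */
            #include <stdio.h>

            #define MB_STREAM_REQUEST 16 /* blocks; mirrors the default s_mb_stream_request */

            struct stream_state {
                unsigned long last_block; /* where the next stream allocation starts searching */
            };

            /* Pretend allocation: large requests start at the stream goal, then advance it. */
            static unsigned long alloc_blocks(struct stream_state *s, unsigned int count)
            {
                unsigned long start = (count >= MB_STREAM_REQUEST) ? s->last_block : 0;
                unsigned long found = start; /* assume the extent is found right at the goal */

                if (count >= MB_STREAM_REQUEST)
                    s->last_block = found + count; /* each stream creeps toward the end of the disk */
                return found;
            }

            /* What a remount does implicitly, and what the proposed tunable would do on demand. */
            static void reset_stream_goal(struct stream_state *s)
            {
                s->last_block = 0;
            }

            int main(void)
            {
                struct stream_state s = { 0 };

                for (int i = 0; i < 3; i++)
                    printf("file %d starts at block %lu\n", i, alloc_blocks(&s, 1024));

                reset_stream_goal(&s);
                printf("after reset, the next file starts at block %lu\n", alloc_blocks(&s, 1024));
                return 0;
            }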

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: zam Alexander Zarochentsev
              Votes: 0
              Watchers: 8
