Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor

    Description

      The Lustre Manual has a section with suggested tuning for "testing" under "Tuning Linux Storage Devices". All of the settings have suggested values except for /sys/block/sdN/queue/scheduler. It would be nice to have a suggestion there.

      I think we are probably still using the Linux default (probably CFQ) everywhere at LLNL, and that may be a problem. I remember a recent discussion at LUG suggesting that this was bad. ZFS certainly attempts to change the scheduler from CFQ to noop (if ZFS believes that it owns the entire disk).

      For ldiskfs, perhaps the deadline scheduler is what we should recommend?
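      For reference, a minimal sketch of checking and changing the scheduler through sysfs (sdb is just a placeholder device name, not taken from this ticket):

          # show the available schedulers; the active one is in brackets, e.g. "noop [cfq] deadline"
          cat /sys/block/sdb/queue/scheduler
          # switch to the deadline scheduler (takes effect immediately, but does not persist across reboots)
          echo deadline > /sys/block/sdb/queue/scheduler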

      Attachments

        Activity

          [LUDOC-109] Missing block scheduler tuning suggestion

          linda Linda Bebernes (Inactive) added a comment:

          Change has been approved and merged. Resolving ticket.

          linda Linda Bebernes (Inactive) added a comment (edited):

          Added a note about the scheduler default (deadline) and the recommendations (deadline, noop). The patch is ready for review at http://review.whamcloud.com/#change,6486.

          adilger Andreas Dilger added a comment:

          Note that LU-2498 has a patch (http://review.whamcloud.com/4853) to automatically change the default scheduler for Lustre block devices from CFQ to deadline, unless it is already set to noop. This behavior should also be documented.
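          For illustration only, a hand-run sketch of the behavior described above (this is not the LU-2498 patch itself; sdb is a placeholder device name):

              # if the active scheduler is cfq, switch to deadline; leave noop (or anything else) alone
              cur=$(cat /sys/block/sdb/queue/scheduler)
              case "$cur" in
                  *"[cfq]"*) echo deadline > /sys/block/sdb/queue/scheduler ;;
              esac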

          jlevi Jodi Levi (Inactive) added a comment:

          Brett,
          Would you mind taking a look at this and seeing whether it is something you could work on with Linda as part of the Lustre Manual project?

          adilger Andreas Dilger added a comment:

          IIRC, while ZFS allocates the IO in order, there is some jitter in the processing times of the IO requests between threads, and this causes slightly out-of-order IO submission to the queue. At least I recall Brian (or someone) commenting about the slightly non-linear IO ordering from ZFS at the disk level. That's why I suggest deadline over noop, since it isn't guaranteed that front/back merging alone is enough.

          morrone Christopher Morrone (Inactive) added a comment:

          > I thought it was old and established knowledge that CFQ sucks for high-performance workloads

          Well, that common knowledge appears to have been missed both in the documentation and at LLNL as a whole.

          > but I think it needs to be done internally by ZFS for its constituent block devices

          That was the intent with ZFS, but apparently Brian was worried about setting the device's scheduler unilaterally, since the drive might be shared with other filesystems in other partitions. But they are talking that out right now in the hallway.

          Brian tells me that even the noop scheduler does front/back merging. He might have said that the merging happens at a layer before the scheduling, or something along those lines. That isn't to say that deadline might not help too, but at least we should get merging even with noop. And in theory ZFS's scheduler will make things easy to merge. We need to verify that theory with block traces, though.
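          As a minimal sketch of that kind of check (sdb is a placeholder device name; run it while a streaming write load is active):

              # trace block-layer events and look for merge (M) records and large dispatch (D) sizes
              blktrace -d /dev/sdb -o - | blkparse -i -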

          adilger Andreas Dilger added a comment:

          Chris, I thought it was old and established knowledge that CFQ sucks for high-performance workloads, and that the deadline or noop schedulers are far more appropriate for RAID hardware (since such hardware typically does its own IO batching and reordering internally). My preference is deadline, even for ZFS, since it allows request aggregation in the common sequential IO cases and does this with low overhead (unlike CFQ, which will intentionally delay requests to try and merge them). Using deadline allows 1MB ldiskfs writes to commonly be merged up to 2-8MB, and it will also likely improve the journal IO.

          I suspect that for ZFS it would also help the 128kB writes to VDEVs be re-merged into something efficient again for the underlying RAID. This is important regardless of whether h/w RAID-6 or RAID-Z is used, since both want large requests for each disk, not just 16kB chunks.

          I thought we were doing this tuning in mount.lustre, but digging some more it appears that it is done in the server kernel config via CONFIG_DEFAULT_DEADLINE=y, which should affect all devices.

          Given we are trying to move to a patchless kernel, this could possibly be done in the OSD startup for ldiskfs, since it uses simple block devices, but I think it needs to be done internally by ZFS for its constituent block devices. I'll file a separate bug for this under LU-20.
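          For what it's worth, a quick way to check whether a running server kernel was built with that default (the config file location may differ by distribution):

              # shows the built-in default IO scheduler, e.g. CONFIG_DEFAULT_DEADLINE=y and CONFIG_DEFAULT_IOSCHED="deadline"
              grep -E 'CONFIG_DEFAULT_(IOSCHED|DEADLINE)' /boot/config-$(uname -r)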

          People

            Assignee: cliffw Cliff White (Inactive)
            Reporter: morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 5
