[LUDOC-109] Missing block scheduler tuning suggestion
Created: 12/Dec/12  Updated: 05/Jun/13  Resolved: 05/Jun/13

Status: Resolved
Project: Lustre Documentation
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Christopher Morrone
Assignee: Cliff White (Inactive)
Resolution: Fixed
Votes: 0
Labels: QContent

Business Value: 6
Severity: 3
Rank (Obsolete): 5832

 Description   

The Lustre Manual includes a section, "Tuning Linux Storage Devices", that suggests tuning for "testing". All of the settings have suggested values except for /sys/block/sdN/queue/scheduler. It would be nice to have a suggestion there.

I think we are still using the Linux default (probably CFQ) everywhere at LLNL, and that may be a problem. I remember a recent discussion at LUG suggesting that CFQ was a bad choice. ZFS certainly attempts to change the scheduler from CFQ to noop (if ZFS believes that it owns the entire disk).

For ldiskfs it might be the deadline scheduler that we should recommend?
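
For reference, a minimal sketch of how the setting in question can be inspected and changed from Python. It assumes the usual sysfs format, in which the file lists the available schedulers and marks the active one in brackets (for example "noop deadline [cfq]"). The function names are hypothetical helpers written for this illustration, not part of Lustre or any other tool.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Split sysfs scheduler text such as 'noop deadline [cfq]'.

    Returns (active, available): the bracketed name (or None if nothing
    is bracketed) and the list of all scheduler names on the line.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device):
    """Read the scheduler state for a block device name such as 'sda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())


def set_scheduler(device, name):
    """Select a scheduler by writing its name to sysfs (requires root)."""
    _, available = read_scheduler(device)
    if name not in available:
        raise ValueError(f"{name!r} is not offered for {device}: {available}")
    path = Path("/sys/block") / device / "queue" / "scheduler"
    path.write_text(name)
```

Calling read_scheduler("sda") on a server would show whether the device is still on CFQ; set_scheduler("sda", "deadline") would switch it until the next reboot, since the sysfs value is not persistent.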



 Comments   
Comment by Andreas Dilger [ 14/Dec/12 ]

Chris, I thought it was old and established knowledge that CFQ sucks for high-performance workloads, and that the deadline or noop schedulers were far more appropriate for RAID hardware (since it typically does its own IO batching and reordering internally). My preference is deadline, even for ZFS, since it allows request aggregation in the common sequential IO cases, and does this with low overhead (unlike CFQ, which will intentionally delay requests to try to merge them). With deadline, 1MB ldiskfs writes are commonly merged into 2-8MB requests, and it will likely improve the journal IO as well.

I suspect that for ZFS it would also help the 128kB writes to VDEVs get re-merged into requests that are efficient for the underlying RAID. This is important regardless of whether h/w RAID-6 or RAID-Z is used, since both want large requests for each disk, not just 16kB chunks.

I thought we were doing this tuning in mount.lustre, but digging some more it appears that this is done in the server kernel config via CONFIG_DEFAULT_DEADLINE=y, which should affect all devices.

Given we are trying to move to a patchless kernel, this could possibly be done in the OSD startup for ldiskfs since it is using simple block devices, but I think it needs to be done internally by ZFS for its constituent block devices. I'll file a separate bug for this under LU-20.

Comment by Christopher Morrone [ 14/Dec/12 ]

"I thought it was old and established knowledge that CFQ sucks for high-performance workloads"

Well, that common knowledge appears to have been missed both in the documentation and at LLNL as a whole.

"but I think it needs to be done internally by ZFS for its constituent block devices"

That was the intent with ZFS, but apparently Brian was worried about setting the device's scheduler unilaterally, since the drive might be shared with other filesystems in other partitions. But they are talking that out right now in the hallway.

Brian tells me that even the noop scheduler does front/back merging. He might have said that the merging happens at a layer before the scheduling, or something along those lines. That isn't to say that deadline wouldn't help too, but at least we should get merging even with noop. And in theory ZFS's scheduler will make things easy to merge. We need to verify that theory with block traces, though.

Comment by Andreas Dilger [ 14/Dec/12 ]

IIRC, while ZFS allocates the IO in order, there is some jitter in the processing times of the IO requests between threads, and this causes slightly out-of-order IO submission to the queue. At least I recall Brian (or someone) commenting about the slightly non-linear IO ordering from ZFS at the disk level. That's why I suggest deadline over noop, since front/back merging alone isn't guaranteed to be enough.
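
The ordering argument can be illustrated with a toy model. This is only a sketch and not the kernel's merge logic: it assumes a queue that tries to merge an incoming request only with the most recently queued one, and compares that with coalescing after sorting by offset. Requests are hypothetical (offset, length) pairs in arbitrary units.

```python
def merge_with_last_only(requests):
    """Toy queue: merge each new request only with the last queued one."""
    queue = []
    for start, length in requests:
        if queue:
            last_start, last_len = queue[-1]
            if last_start + last_len == start:
                # Back merge: new request begins where the last one ends.
                queue[-1] = (last_start, last_len + length)
                continue
            if start + length == last_start:
                # Front merge: new request ends where the last one begins.
                queue[-1] = (start, last_len + length)
                continue
        queue.append((start, length))
    return queue


def merge_after_sorting(requests):
    """Toy queue: sort by offset first, then coalesce contiguous requests."""
    merged = []
    for start, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, prev_len + length)
        else:
            merged.append((start, length))
    return merged
```

In this model, four in-order requests [(0, 1), (1, 1), (2, 1), (3, 1)] collapse to a single request either way. With two of them swapped, [(0, 1), (2, 1), (1, 1), (3, 1)], the last-only queue ends up with two requests while the sorted variant still produces one. Whether real devices behave this way would need to be checked with block traces, as noted above.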

Comment by Jodi Levi (Inactive) [ 18/Mar/13 ]

Brett,
Would you mind taking a look at this and seeing whether it is something you could work on with Linda as part of the Lustre Manual project?

Comment by Andreas Dilger [ 18/Mar/13 ]

Note that LU-2498 has a patch (http://review.whamcloud.com/4853) to automatically change the default scheduler for Lustre block devices from CFQ to deadline, unless it is already set to noop. This behavior should also be documented.
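
The behavior described here can be summarized as a small decision function. This is a sketch of the policy as stated in this comment, not the code from the LU-2498 patch; the function name is hypothetical, and leaving schedulers other than CFQ untouched is an assumption.

```python
def pick_scheduler(active, available):
    """Return the scheduler to switch to, or None to leave the device alone.

    active:    name of the currently selected scheduler, e.g. "cfq"
    available: names of the schedulers the device offers
    """
    if active == "noop":
        # An existing noop setting is respected and left unchanged.
        return None
    if active == "cfq" and "deadline" in available:
        # Replace the CFQ default with deadline when it is offered.
        return "deadline"
    # Assumption: any other scheduler is treated as a deliberate choice.
    return None
```

For a device reporting "noop deadline [cfq]", pick_scheduler("cfq", ["noop", "deadline", "cfq"]) returns "deadline"; for a device already on noop it returns None.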

Comment by Linda Bebernes (Inactive) [ 29/May/13 ]

Added note about scheduler default (deadline) and recommendations (deadline, noop). Patch is ready for review at http://review.whamcloud.com/#change,6486.

Comment by Linda Bebernes (Inactive) [ 05/Jun/13 ]

Change has been approved and merged. Resolving ticket.
