[LUDOC-109] Missing block scheduler tuning suggestion Created: 12/Dec/12 Updated: 05/Jun/13 Resolved: 05/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre Documentation |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | QContent | ||
| Business Value: | 6 |
| Severity: | 3 |
| Rank (Obsolete): | 5832 |
| Description |
|
The Lustre Manual has a section, "Tuning Linux Storage Devices", that suggests tunings for "testing". Every setting there has a suggested value except /sys/block/sdN/queue/scheduler. It would be nice to have a suggestion there. I think we are probably still using the Linux default (probably CFQ) everywhere at LLNL, and that may be a problem. I remember a recent discussion at LUG that suggested that was bad. ZFS certainly attempts to change the scheduler from CFQ to noop (if ZFS believes that it owns the entire disk). For ldiskfs, might deadline be the scheduler that we should recommend? |
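For anyone checking which scheduler a device is currently using: the active one is the bracketed entry in /sys/block/sdN/queue/scheduler. Below is a minimal Python sketch, assuming the usual sysfs format (for example `noop deadline [cfq]`). The function names and the `sysfs_root` parameter are hypothetical and exist only for illustration.

```python
import re
from pathlib import Path


def parse_active_scheduler(text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def active_scheduler(device, sysfs_root="/sys"):
    """Read the active scheduler for a block device such as 'sda'.

    sysfs_root is a hypothetical parameter so the lookup can be pointed
    at a test directory instead of the real /sys.
    """
    path = Path(sysfs_root) / "block" / device / "queue" / "scheduler"
    return parse_active_scheduler(path.read_text())
```

For example, `parse_active_scheduler("noop deadline [cfq]")` returns `"cfq"`.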
| Comments |
| Comment by Andreas Dilger [ 14/Dec/12 ] |
|
Chris, I thought it was old and established knowledge that CFQ sucks for high-performance workloads, and deadline or noop schedulers were far more appropriate for RAID hardware (since they typically do their own IO batching and reordering internally).
My preference is deadline, even for ZFS, since it allows request aggregation in the common sequential IO cases, and does this with low overhead (unlike CFQ which will intentionally delay requests to try and merge them). Using deadline allows 1MB ldiskfs writes to be commonly merged up to 2-8MB, and it will also likely improve the journal IO. I suspect for ZFS it would also help the 128kB writes to VDEVs to be re-merged to be efficient again for the underlying RAID. This is important regardless of whether h/w RAID-6 or RAID-Z is used, since they want large requests for each disk, not just 16kB chunks.
I thought we were doing this tuning in mount.lustre, but digging some more it appears that this is done in the server kernel config via CONFIG_DEFAULT_DEADLINE=y, which should affect all devices. Given we are trying to move to a patchless kernel, this could possibly be done in the OSD startup for ldiskfs since it is using simple block devices, but I think it needs to be done internally by ZFS for its constituent block devices. I'll file a separate bug for this under |
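On a patchless kernel the scheduler can also be changed per device at runtime by writing its name to the same sysfs file. The following is a rough sketch of that, not what mount.lustre or any OSD actually does. The helper names and the `sysfs_root` parameter are hypothetical, and writing to the real /sys requires root.

```python
from pathlib import Path


def available_schedulers(text):
    """List scheduler names from sysfs text such as 'noop deadline [cfq]'."""
    return text.replace("[", "").replace("]", "").split()


def set_scheduler(device, name, sysfs_root="/sys"):
    """Select an I/O scheduler for one block device, e.g. ('sdb', 'deadline').

    sysfs_root is a hypothetical parameter that allows a dry run against
    a scratch directory instead of the real /sys.
    """
    path = Path(sysfs_root) / "block" / device / "queue" / "scheduler"
    choices = available_schedulers(path.read_text())
    if name not in choices:
        raise ValueError(f"{name!r} not offered for {device}: {choices}")
    # The kernel switches schedulers when the name is written to this file.
    path.write_text(name)
```

A setting made this way does not persist across reboots, so it would have to be reapplied at startup (for example from an init script or a udev rule).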
| Comment by Christopher Morrone [ 14/Dec/12 ] |
Well, that common knowledge appears to have been missed both in the documentation and at LLNL as a whole.
That was the intent with ZFS, but apparently Brian was worried about setting the device's scheduler unilaterally, since the drive might be shared with other filesystems in other partitions. But they are talking that out right now in the hallway. Brian tells me that even the noop scheduler does front/back merging. He might have said that the merging happens at a layer before the scheduling, or something along those lines. That isn't to say that deadline might not help too, but at least we should get merging even with noop. And in theory ZFS's scheduler will make things easy to merge. We need to verify that theory with block traces though. |
| Comment by Andreas Dilger [ 14/Dec/12 ] |
|
IIRC, while ZFS allocates the IO in order, there is some jitter in the processing times of the IO requests between threads, and this causes slightly out-of-order IO submission to the queue. At least I recall Brian (or someone) commenting about the slightly non-linear IO ordering from ZFS at the disk level. That's why I suggest deadline over noop, since it isn't guaranteed that only front/back merging is enough. |
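To illustrate the argument above, here is a toy Python model of why slightly out-of-order submission can defeat merging that only looks at the most recently queued request, while sorting first recovers one large request. This is a deliberate simplification and not the kernel's actual noop or deadline logic. The function names and the (start, length) request tuples are assumptions made for the example.

```python
def merge_with_last_only(requests):
    """Merge a request only if it touches the front or back of the most
    recently queued one (toy stand-in for front/back merging)."""
    queue = []
    for start, length in requests:
        if queue:
            last_start, last_len = queue[-1]
            if start == last_start + last_len:   # back merge
                queue[-1] = (last_start, last_len + length)
                continue
            if start + length == last_start:     # front merge
                queue[-1] = (start, last_len + length)
                continue
        queue.append((start, length))
    return queue


def merge_after_sorting(requests):
    """Sort by start offset, then coalesce contiguous requests
    (toy stand-in for a scheduler that keeps requests sorted)."""
    merged = []
    for start, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((start, length))
    return merged


# Four contiguous 128 kB writes, submitted slightly out of order.
jittered = [(0, 128), (256, 128), (128, 128), (384, 128)]
print(merge_with_last_only(jittered))  # [(0, 128), (128, 384)]
print(merge_after_sorting(jittered))   # [(0, 512)]
```

In this toy case the last-only merge leaves two requests where the sorted merge produces one. Whether real workloads behave this way still needs to be confirmed with block traces, as noted earlier in the thread.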
| Comment by Jodi Levi (Inactive) [ 18/Mar/13 ] |
|
Brett, |
| Comment by Andreas Dilger [ 18/Mar/13 ] |
|
Note that |
| Comment by Linda Bebernes (Inactive) [ 29/May/13 ] |
|
Added note about scheduler default (deadline) and recommendations (deadline, noop). Patch is ready for review at http://review.whamcloud.com/#change,6486. |
| Comment by Linda Bebernes (Inactive) [ 05/Jun/13 ] |
|
Change has been approved and merged. Resolving ticket. |