[LU-13802] New i/o path: Buffered i/o as DIO Created: 18/Jul/20  Updated: 16/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Patrick Farrell Assignee: Patrick Farrell
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File test_malloc_align.c    
Issue Links:
Duplicate
is duplicated by LU-16964 I/O Path: Auto switch from BIO to DIO Closed
Related
is related to LU-13805 i/o path: Unaligned direct i/o Open
is related to LU-13814 DIO performance: cl_page struct remov... Open
is related to LU-13799 DIO/AIO efficiency improvements Resolved
is related to LU-15092 Fix logic for unaligned transfer with... Resolved
is related to LU-14969 Fall back to buffered I/O for unalign... Resolved
is related to LU-12550 automatic lockahead Open
is related to LU-17422 unaligned DIO: use page pools Open
is related to LU-17433 async hybrid writes Open
is related to LU-13798 Improve direct i/o performance with m... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

As Andreas noted in LU-13798, the faster DIO path makes it interesting to switch from buffered i/o to direct i/o at larger sizes.

This is actually pretty easy:
If the buffered i/o meets the alignment requirements for DIO (the buffer is page aligned and the i/o size is a multiple of page size), you can simply set the DIO flag internally in Lustre, and the kernel will direct the i/o to the direct i/o code.  (In newer kernels, this does not require manipulating the O_DIRECT flag on the file, which is good because that's likely unsafe.)

If the buffered i/o is not valid as direct i/o, the usual "fall back to buffered i/o" mechanism (implemented as part of LU-4198) happens automatically (just return 0 instead of -EINVAL).
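
To make that concrete, here is a hedged sketch of the check and switch (iov_iter_alignment() and IOCB_DIRECT are existing kernel interfaces; the helper names and call site are illustrative, not actual Lustre code):

#include <linux/fs.h>
#include <linux/uio.h>

/* sketch only: helper names are illustrative, not actual Lustre symbols */
static bool bio_can_be_dio(struct iov_iter *iter, loff_t pos)
{
	/* iov_iter_alignment() ORs together the buffer addresses and segment
	 * lengths, so one test covers "buffer is page aligned" and "size is
	 * a multiple of page size"; OR in the file offset as well */
	return ((iov_iter_alignment(iter) | pos) & (PAGE_SIZE - 1)) == 0;
}

static void maybe_switch_to_dio(struct kiocb *iocb, struct iov_iter *iter)
{
	if (bio_can_be_dio(iter, iocb->ki_pos))
		/* per-i/o flag, no need to touch O_DIRECT on the file */
		iocb->ki_flags |= IOCB_DIRECT;
}

If the check fails, nothing is set and the i/o simply stays on the normal buffered path.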

 

The question, then, is how to decide when to switch from buffered i/o to DIO.  I have a proposed solution that I haven't implemented yet*, which I'll describe here.
*(I have done BIO (buffered i/o) as DIO, but I used a simple "Try all BIO as DIO" patch, not intelligent switching.)

Essentially, direct i/o performance is a function of how much parallelism we can get by splitting the i/o, and the sync time of the back end storage.

For example, on my flash back end, I see a benefit from switching 1 MiB BIO to 4x256 KiB DIO (1.9 GiB/s instead of 1.3 GiB/s).  But a spinning disk back end would require a much larger size for this change to make sense.

 

So the basic question to answer is: what size of i/o do we submit?  How small, and into how many chunks, do we split up the i/o?

Note that if our submitted i/o size at the higher levels is larger than the stripe size or RPC size, it is automatically split on those boundaries, so if we start submitting at very large sizes, we split there instead.

Here's my thinking.

We have two basic tunables, one of which has a version for rotational and non-rotational backends.

The tunables are "preferred minimum i/o size" and "desired submission concurrency" (I'm not proud of the name of the second one, open to suggestions...).

 

So, consider a situation where we have a preferred minimum size of 256 KiB and a desired submission concurrency of 8.

If we do a 256 KiB BIO, that is done as buffered i/o.  If we do a 400 KiB BIO, it is still buffered.  But if we do a 512 KiB BIO, we split it into two 256 KiB DIOs.  A 700 KiB BIO becomes 2x256 KiB + 188 KiB DIOs.  (These thresholds may be too small.)

Now, consider larger sizes.  1 MiB becomes 4x256 KiB.  Then 2 MiB becomes 8x256 KiB submissions.

But at larger sizes, the desired submission concurrency comes into play.  Consider 4 MiB: 4 MiB/8 = 512 KiB, so we split 4 MiB into 8x512 KiB.  This model prevents us from submitting many tiny i/os once the i/o size is large enough.
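
As a sketch of that policy (the function and tunable names are purely illustrative, not actual Lustre symbols):

/* sketch of the proposed split policy; returns 0 to mean "stay buffered" */
static size_t hybrid_chunk_size(size_t io_size, size_t preferred_min,
				unsigned int desired_concurrency)
{
	size_t chunk;

	/* below twice the preferred minimum, splitting isn't worthwhile */
	if (io_size < 2 * preferred_min)
		return 0;

	/* aim for 'desired_concurrency' pieces, but never go below the
	 * preferred minimum size for this backend type */
	chunk = io_size / desired_concurrency;
	if (chunk < preferred_min)
		chunk = preferred_min;

	return chunk;
}

With preferred_min = 256 KiB and desired_concurrency = 8, this reproduces the examples above: 400 KiB stays buffered, 1 MiB splits into 4x256 KiB, and 4 MiB splits into 8x512 KiB.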

Note that I have not tested this much yet - I think 8 might be low for submission concurrency and 16 might be more desirable.  Basically, this is "try to cut the i/o into this many RPCs", so perhaps concurrency is the wrong word...?

Also, as I noted earlier, the preferred i/o size will be very different for spinning disk vs non-rotational media.  So we will need two values for this (I am thinking we default the rotational value to some multiple of the non-rotational one and let people override), and we will also need to make this info available on the client.

I'll ask about that in a comment.  I've also got some benchmark info I can share later - But, basically, buffered i/o through this path performs exactly like DIO through this path.



 Comments   
Comment by Patrick Farrell [ 18/Jul/20 ]

The question I had for Andreas - wish I could @ tag - is about how to make the OST storage type information 'ambiently available' on the client, so the client can tell during an i/o if it's headed for flash or HDD and slice it up accordingly.

We can get it via statfs, but we can't put a statfs request in the client i/o path.  So, I was thinking we would perhaps stick it in the import somehow?  Or perhaps we should look at copying the mechanism used for communicating max ea_size instead?  The second seems more appropriate.

Comment by Wang Shilong (Inactive) [ 19/Jul/20 ]

I experimented with unaligned DIO a few days ago, and I was thinking along similar lines.

This is a good idea, but it might not help much for real applications or even benchmarking.  Getting the I/O size and file offset aligned to page size is not that hard, but getting user-space buffers that are always page aligned is not easy (applications, and fio, IOR, iozone, might use malloc() for memory allocation).  I don't know how likely it is that a 1 MiB malloc() buffer will be page aligned (unless we modify at least IOR to keep the benchmark happy).

Supporting DIO with unaligned user memory might be a bit complex (we need to do the memory mapping well, and unfortunately buffers that straddle pages make the problem even more complex).  Another approach I can think of is lockless i/o, but that needs extra page allocations for unaligned DIO.

Comment by Gerrit Updater [ 19/Jul/20 ]

Patrick Farrell (farr0186@gmail.com) uploaded a new patch: https://review.whamcloud.com/39450
Subject: LU-13802 llite: All i/o as DIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0b81fb97371a8a3b5b3bc5906844b244572ac81a

Comment by Patrick Farrell [ 19/Jul/20 ]

"I did some try on unaligned DIO some days before, i thought similar things of this."
Yes, I saw it.  I have some comments on that later - I have a way to do unaligned DIO, but it's basically via a bounce buffer.  I think that's what you're suggesting when say "extra pages allocation for unaligned DIO".  I'm actually going to open a separate ticket about that, but I'm trying to get the conversation started on this stuff first.  I promise - I have thoughts and a prototype patch.  I just can't juggle too many things.

Comment by Patrick Farrell [ 19/Jul/20 ]

"user space always aligned with page size is not easy.(considering application or fio, ior, iozone might use malloc() for memory allocation?) i don't know how much possibility it will be
to allocate 1MiB memory will be page aligned memory(Unless we modify at least IOR to make benchmark happy)."

This is actually incorrect, in practice, at least for IOR.

I submitted a quick patch to show that - you can try IOR and see that it passes the alignment requirements for DIO.  IOR buffers are almost always aligned.  In fact, when you use malloc() to ask for a large buffer, it should generally be page aligned - this is for performance- and fragmentation-related reasons.  If large buffers were not page aligned by the memory allocator, you would eventually end up in situations where different processes/threads have buffers which are on the same page of memory.  This would be very bad for performance, so the allocator avoids it.  (This is my thinking, anyway - it seems to be true, and I have run this by some other engineers here.)

So, my intention is that this ticket will just be for switching buffered (BIO) to DIO where it makes sense, and initially that will mean not doing unaligned BIO via DIO.

Comment by Wang Shilong (Inactive) [ 20/Jul/20 ]

OK, it is good to see that IOR buffers can be page aligned.

Comment by Oleg Drokin [ 20/Jul/20 ]

In fact, aligned IO is not a requirement; we implemented it in the past for 1.8.x for LLNL, and basically the lockless IO path allows you to do direct IO from the client regardless of alignment.

Comment by Patrick Farrell [ 20/Jul/20 ]

Yeah, I know it's not a fundamental requirement, but looking at the code, it seems very difficult to remove that requirement from Lustre today.  A lot of work to make a new i/o path.

I will open a ticket on this later today so we have a dedicated place to discuss that, because I think we should implement the path switching for aligned stuff (which is easy/fast/simple), and consider implementing unaligned DIO separately.

Comment by Patrick Farrell [ 20/Jul/20 ]

Opened LU-13805 to discuss unaligned direct i/o.

Comment by Andreas Dilger [ 20/Jul/20 ]

I agree there is a lot of upside for automatically switching large PAGE_SIZE-aligned read/write to avoid the data copy. If the application is doing such IOs for the whole file (whether one client or many), not only does it reduce data copy overhead on the client, it would also avoid unnecessary DLM lock traffic for the many-client workload, and avoids polluting the client page cache as well (which is the other half of the overhead besides the data copy). I wrote a simple test for memory allocation alignment, to see if large allocations should all be aligned on PAGE_SIZE boundaries. However posix_memalign() unfortunately exists for a reason, and it looks like default memory alignment is not the common case from what my test program shows. It is likely that IOR itself makes an effort to align the buffer allocation for this reason.

A sample list of random-sized allocations in a loop shows that even for large allocations the alignment is at best on a 64-bit boundary, at least with the normal malloc() call (output of test_malloc_align.c shown):

153713: 62/62=  100%: alloc   0x4000 addr 0x7fe6a5f2bfe0
154715: 63/63=  100%: alloc 0x800000 addr 0x392452ac90
155852: 64/64=  100%: alloc 0x63f000 addr 0x32c4a64c30
157539: 65/65=  100%: alloc 0x800000 addr 0x7f393640d010

Interestingly, on MacOS, this program only has a 15% positive rate, as it seems to at least allocate on 256-byte boundaries:

The allocated memory is aligned such that it can be used for any data type, including AltiVec- and SSE-related types.

On Linux it is only doing the minimum 16-byte alignment, which is definitely problematic for this change to be successful:

For calloc() and malloc(), return a pointer to the allocated memory, which is suitably aligned for any kind of variable.

I don't see any kind of option in mallopt() that might affect this behavior of malloc(). I don't think it would be unreasonable to have an option that could be set to tune this behavior (at the cost of some more memory usage), and I think the request to align PAGE_SIZE-multiple allocations on PAGE_SIZE address boundaries is reasonable. Maybe a trial patch for glibc would be in order? Unfortunately, without that I think this approach will be mostly ineffective, except with the installation of a custom glibc/malloc or an LD_PRELOAD library to force e.g. 4KB alignment for >= 1MB allocations (which might be easily written, and would consume at most only 4KB of additional RAM).
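
For reference, a minimal, independent sketch of such an alignment survey (this is not the attached test_malloc_align.c, just an approximation of the same idea: allocate page-multiple sizes and count how many come back page aligned):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	long page_mask = sysconf(_SC_PAGESIZE) - 1;
	int aligned = 0, i;

	srandom(getpid());
	for (i = 1; i <= 64; i++) {
		/* random page-multiple size between 4 KiB and 8 MiB */
		size_t size = ((random() % 2048) + 1) * 4096;
		void *addr = malloc(size);

		if (((long)addr & page_mask) == 0)
			aligned++;
		printf("%d/%d page aligned: alloc %#zx addr %p\n",
		       aligned, i, size, addr);
		free(addr);
	}
	return 0;
}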

Comment by Patrick Farrell [ 20/Jul/20 ]

Andreas, thanks for that test...  Yuck.

Well, the output of that test isn't much fun - It looks like IOR must be making special provision to get its buffers aligned, as are dd and the other test program I tried.

I suppose because the heap is (usually) per-process, it doesn't have any particular alignment requirements - It's the kernel memory allocator that probably wants to pay attention to page alignment more aggressively, because it's just one big address space.  But that's irrelevant to userspace.

OK, so we're looking at real applications mostly needing unaligned direct i/o.  So, this (LU-13802) won't be too much use without LU-13805.  That's not necessarily fatal - The fast buffering suggested in LU-13805 should still be much faster than normal buffered i/o.  (Or we could follow Oleg's route and implement a new i/o path for unaligned DIO.)

Comment by Patrick Farrell [ 20/Jul/20 ]

"but I think the request to align PAGE_SIZE-multiple allocations on PAGE_SIZE address boundaries is unreasonable" - Unreasonable as a default, right?  And we don't need 1 MiB alignment or anything - we just need page size alignment for the start of the buffer.

But, of course, that assumes it's also aligned to a size boundary in the file.  So, I think we'll be looking at LU-13805 for a lot of this.

Comment by Wang Shilong (Inactive) [ 21/Jul/20 ]

"but I think the request to align PAGE_SIZE-multiple allocations on PAGE_SIZE address boundaries is unreasonable" - Unreasonable as a default, right? And we don't need 1 MiB alignment or anything - we just need page size alignment for the start of the buffer."

Unfortunately, we need every user-space page to be page aligned, not just the start of the buffer - I mean for DIO as it currently stands.  That way the kernel can do the I/O mapping easily.

That is why I think extra page allocations will make unaligned DIO easier.

Comment by Patrick Farrell [ 21/Jul/20 ]

Shilong,

Because the buffer has to be at least virtually contiguous for i/o (this is a requirement or it's not one buffer, basically), I would think if the first page is aligned, the other pages in the buffer would usually be as well...  But I have been very wrong about memory alignment already today, soooo perhaps we should modify Andreas' test to check.

Comment by Andreas Dilger [ 21/Jul/20 ]

Sorry, I had a critical typo in my previous comment. I think it is OK to try the allocation alignment malloc() hack to see if that helps, since at least an LD_PRELOAD could easily enforce this for all allocations. While it would be way better to handle this in the glibc allocator itself, I think the below is OK for a POC:

#include <stddef.h>

/* assumption: glibc, where __libc_malloc()/__libc_free() reach the real
 * allocator so this LD_PRELOAD wrapper does not recurse into itself */
extern void *__libc_malloc(size_t size);
extern void __libc_free(void *ptr);

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)
#define PAGE_MASK (PAGE_SIZE - 1)
#define MIN_ALIGN 1048576UL

/* save bottom bits for offset for free.
 * hugepages should already be aligned, so we don't need a huge offset */
#define SHIFT_MAGIC (0x50bad5050badUL << PAGE_SHIFT)

void *malloc(size_t size)
{
        void *ptr, *tmp;

        if (size < MIN_ALIGN)
                return __libc_malloc(size);

        /* over-allocate so the returned pointer can be rounded up to a page */
        tmp = __libc_malloc(size + PAGE_MASK);
        if (tmp == NULL)
                return NULL;

        ptr = tmp;
        if ((long)tmp & PAGE_MASK) {
                unsigned long offset;
                unsigned long *magic;

                ptr = (void *)(((long)tmp + PAGE_MASK) & ~PAGE_MASK);
                offset = (long)ptr - (long)tmp;

                /* glibc's minimum 16-byte alignment leaves at least 8 usable
                 * bytes below the rounded-up pointer, so stash the offset
                 * there for free() to find */
                magic = (unsigned long *)ptr - 1;
                *magic = SHIFT_MAGIC + offset;
        }
        return ptr;
}

void free(void *ptr)
{
        unsigned long *magic = (unsigned long *)ptr - 1;

        /* it's bad to access outside allocated memory, but should be OK for testing */
        if (ptr != NULL && ((long)ptr & PAGE_MASK) == 0 &&
            (*magic & ~PAGE_MASK) == SHIFT_MAGIC)
                ptr = (char *)ptr - (*magic - SHIFT_MAGIC);

        __libc_free(ptr);
}
Comment by Nathan Rutman [ 03/Aug/20 ]

Let's please move continued alignment discussion to LU-13805.

Getting back to the ticket description about how to split the IO into the right size DIO chunks - I dislike the idea of adding yet more tunables to Lustre, and also don't think a simple flash/disk division is sufficient. It seems to me like the OSTs should actually measure the performance of the backend and report the number back to clients to be tracked in the OSC import.
That's obviously a lot of work, especially in kernel, so maybe a compromise is to create the tunables on a per-OSC basis, and then run some client-side benchmark test (included with Lustre distro) that runs a survey and sets the tunable permanently. Customers would have to run this for each different type of OST, and set the params for all OSTs of each type. Maybe base this on pool definitions, assuming pools are set up along performance lines.
Thinking more broadly about this, maybe Lustre needs some idea of "OST profile" that includes a set of parameters associated with an OST type - rpcs_in_flight, grant_space, etc. A configurator run at initial install could try to test and optimize. This might help bring Lustre to the masses.
But I realize I'm getting ahead of myself. Back to this ticket, is there a more algorithmic approach to this? Does an OST "know" that it is flash or disk somehow? Indirectly, it seems that a smaller stripe size should be associated with flash, so maybe these parameters aren't associated with an OST at all, but rather a file's striping parameters?

Comment by Andreas Dilger [ 04/Aug/20 ]

Yes, if properly configured by the kernel, the OSTs know their technology type from the /sys/block/sdX/queue/rotational parameter, and report this to the clients/MDS via OS_STATE_NONROT in statfs (lfs df -v on a recent Lustre release), see LU-11963.

As for OSTs reporting performance to clients, this has been proposed in LU-7880 for some time already. The kernel already collects these performance metrics, and just needs to report them to clients via OST_STATFS. IMHO, that would be far superior to having users run some benchmark on the disks at setup time for many reasons:

  • users are likely to get this wrong, or not do it at all
  • configuring parameters from userspace is fragile
  • performance of OSTs is likely to change over time as they fill, fragment, grow sector errors and are remapped, etc.
  • performance will change dynamically under load, so anything collected at setup time will be useless
Comment by Andreas Dilger [ 11/Aug/20 ]

nrutman I don't think we can separate the alignment question from the "buffered IO as DIO", since it is basically impossible to do this unless the buffered IO is submitted with page-aligned/sized pointers (in the malloc() implementation so that every application doesn't need to implement this itself), or some magic is done in the underlying O_DIRECT and/or RDMA code to transfer unaligned pointers from the client to the server. Until that is done, all of the discussion on how to optimize this functionality is irrelevant since it will not work.

Comment by Li Xi [ 08/Sep/21 ]

Simple implementation of malloc aligned memory:

#define DEBUG
#define _GNU_SOURCE
#include <stdlib.h>
#include <errno.h>
#ifdef DEBUG
#include <stdio.h>
#endif

#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)

/* assumption: glibc, where __libc_malloc() reaches the real allocator so
 * this wrapper does not recurse into itself; posix_memalign() does not
 * call the overridden malloc() either */
extern void *__libc_malloc(size_t size);

void *malloc(size_t size)
{
	int ret;
	void *memptr = NULL;

	if (size & (PAGE_SIZE - 1)) {
		/* not a multiple of PAGE_SIZE: plain allocation */
		memptr = __libc_malloc(size);
	} else {
		/* page-multiple sizes get page-aligned addresses */
		ret = posix_memalign(&memptr, PAGE_SIZE, size);
		if (ret) {
			errno = ret;
			memptr = NULL;
		}
	}

#ifdef DEBUG
	printf("allocated %p\n", memptr);
#endif
	return memptr;
}

#if 0
int main(void)
{
	char *ptr = malloc(100);

	free(ptr);
	return 0;
}
#endif
Comment by Andreas Dilger [ 08/Sep/21 ]

Simple implementation of malloc aligned memory:

I don't think it makes sense to do this for every allocation, only those that are a multiple of PAGE_SIZE themselves. Otherwise, that probably adds a lot of overhead to the application for allocations that are not going to be used for IO anyway.

Comment by Qian Yingjin [ 04/Aug/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51679
Subject: LU-16964 llite: auto switch from BIO to DIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3e07e129ca55585af5831fd25cc07fa96d5394d

Comment by Gerrit Updater [ 09/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52609
Subject: LU-13802 tests: add basic tests of hybrid IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 97dc18521af24b096dc19c4e5543f5c74a7bb2c8

Comment by Andreas Dilger [ 10/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52586
Subject: LU-13802 llite: trivial bio_dio switch check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5690c653fd3ea3cab65ef8cb956faf79fdbe6bd2

Comment by Andreas Dilger [ 10/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52587
Subject: LU-13802 llite: refactor ll_file_io_generic decs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a191cf6a8170a169e78e31ff9fff4ea47d2cc093

Comment by Andreas Dilger [ 10/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52592
Subject: LU-13802 llite: add hybrid IO SBI flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8b2dfdba4c0d149f988c49b173dcc11bfef5c0f4

Comment by Andreas Dilger [ 10/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52593
Subject: LU-13802 llite: add fail loc to force bio-dio switch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 14432e462bc22f5bff1c06517b4f2230c04b1085

Comment by Andreas Dilger [ 10/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52594
Subject: LU-13802 llite: add hybrid io switch threshold
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9c4c9408629b00a15fda23c5ebed095b16036362

Comment by Andreas Dilger [ 10/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52595
Subject: LU-13802 llite: add read & write switch thresholds
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8808647ec827c7c000f414787093c4c168cc5d30

Comment by Andreas Dilger [ 10/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52596
Subject: LU-13802 llite: add hybrid IO switch proc stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b80e468a5f98b433554ebd67610639ed70be8cf7

Comment by Patrick Farrell [ 10/Oct/23 ]

Thanks, Andreas.  (I had accidentally placed these on LU-13804.)

Comment by Gerrit Updater [ 15/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52703
Subject: LU-13802 llite: tag switched hybrid IOs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bb640da69daa8a65bd1c8fa3a986465ac8d327e3

Comment by Patrick Farrell [ 20/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52777
Subject: LU-13802 llite: hybrid IO HDD thresholds
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 83f416731d2d5455ce0255202b7aa3c1f872da13

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52778
Subject: LU-13802 tests: hybrid IO consistency test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1

Comment by Patrick Farrell [ 20/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52776
Subject: LU-13802 llite: add file nonrotational check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b88060157ab67c90d2724f1de86203b2a4708953

Comment by Patrick Farrell [ 20/Oct/23 ]

Making some notes on what remains to do here:

  • Add tests of tweaking the hybrid IO threshold for rotational and non-rotational media.
  • Make the existing threshold test more intelligent, using an automatically adjusted IO size for whichever threshold is in effect (rotational or non-rotational).
  • Add two types of racing test: multiple processes on one client, and multiple processes on two clients (sanityn).
  • Bring in the contention code.  The contention detection and management on the client side needs to be split out into a number of patches.

Comment by Patrick Farrell [ 20/Oct/23 ]

Add a test using the fail loc to force a switch (it's a good fail loc but doesn't have an obvious use right now).

Comment by Gerrit Updater [ 24/Oct/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52822
Subject: LU-13802 llite: add ZFS check for hybrid IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c5ada8cbb710019c7fe5671db06bb514d173f3ca
