[LU-4198] Improve IO performance when using DIRECT IO using libaio Created: 01/Nov/13 Updated: 26/Aug/20 Resolved: 10/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Brett Lee (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | clio |
| Environment: |
Seen in two environments: AWS cloud (Robert R.) and a dual-OSS setup (3 SSDs per OST) over 2x10 GbE. |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 11385 |
| Description |
|
Attached to this Jira are some numbers from the direct IO tests (write operations only). It was noticed that setting RPCs in flight to 256 in these tests gives poorer performance; max_rpcs_in_flight here is set to 32.
|
| Comments |
| Comment by Keith Mannthey (Inactive) [ 01/Nov/13 ] |
|
I am not quite sure how to read this output to know if it is good or bad. In general I expect Direct I/O to hurt performance: it gets the filesystem read/write caches out of the way of the app. It is commonly used for databases to minimize risk (some turn off hardware write caches as well). For 4K I/O I would not use wide striping. |
| Comment by Johann Lombardi (Inactive) [ 04/Nov/13 ] |
This is unfortunately expected, since we have to wait for I/O completion on the first stripe before firing RPCs to the next one (i.e. foreach(stripe) { lock(stripe); do_sync_io(stripe); unlock(stripe); }) in order to work around the cascading abort issue. On 1.8, some customers were using a patch to enable lockless direct I/O by default. |
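For illustration, the serialized pattern described above can be sketched in plain C against an ordinary file; the mount point, stripe size, and stripe count below are assumptions for the sketch, and in the real client the per-stripe locking and synchronous I/O happen internally rather than in application code:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t stripe_size  = 1 << 20;   /* assumed 1 MiB stripe */
        const int    stripe_count = 4;         /* assumed stripe count */
        char *buf = malloc(stripe_size);
        if (buf == NULL)
                return 1;
        memset(buf, 'a', stripe_size);

        /* O_SYNC makes each write wait for completion, as direct IO does */
        int fd = open("/mnt/lustre/striped_file",
                      O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
                return 1;

        /* each stripe-sized chunk completes before the next one is issued,
         * so no two stripes (i.e. no two OSTs) are ever busy concurrently;
         * this mirrors lock(stripe); do_sync_io(stripe); unlock(stripe) */
        for (int i = 0; i < stripe_count; i++)
                if (pwrite(fd, buf, stripe_size,
                           (off_t)i * stripe_size) < 0)
                        return 1;

        close(fd);
        free(buf);
        return 0;
}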
| Comment by Robert Read (Inactive) [ 04/Nov/13 ] |
|
johann: Ah, that is what I was afraid of. Is there a lockless direct IO patch for 2.x? That would probably be very helpful in this use case. |
| Comment by Robert Read (Inactive) [ 04/Nov/13 ] |
|
I see in |
| Comment by Keith Mannthey (Inactive) [ 04/Nov/13 ] |
|
What is the cascading abort issue? |
| Comment by Robert Read (Inactive) [ 05/Nov/13 ] |
|
I tried mounting the client with "nolock" and performance actually got about 4x worse than before. |
| Comment by Johann Lombardi (Inactive) [ 05/Nov/13 ] |
It seems that this patch enables lockless I/O not only for direct I/O, but also for buffered I/O, which is quite bad.
As for the cascading abort issue: the client holds a lock on a resource from server A and waits for RPC completion on server B, which introduces an implicit dependency between servers. If server B is not responsive (e.g. doing failover, or just slow because it is overloaded) and server A issues a blocking AST, the client will get evicted from server A since it cannot release the lock in a timely manner.
Strange ... we definitely got better results with 1.8. There is probably something wrong with CLIO. HTH |
| Comment by Jinshan Xiong (Inactive) [ 05/Nov/13 ] |
|
In 2.x, the only difference between direct IO and cached IO is whether dirty data is cached on the client; they actually share the same IO framework. Even so, it's really strange that the lockless version was 4x worse, since the server takes the lock for lockless IO - was anybody else operating on this file at the same time? |
| Comment by Robert Read (Inactive) [ 05/Nov/13 ] |
|
I was running a single-threaded benchmark (fio) and there was only a single client on the filesystem, so the file was definitely not shared. It seems there are other differences between direct and buffered IO, such as direct IO being synchronous. I've noticed while testing AIO with various IO depths that AIO appears to make no difference with direct IO. |
| Comment by Jinshan Xiong (Inactive) [ 05/Nov/13 ] |
|
AIO used to work with direct IO only. I don't know what the state is in the current kernel; I'll check it out. If we want to use direct IO, two problems have to be addressed: 2. universal direct IO support: in the current implementation, the address of the user buffer has to be page-aligned. Niu has a patch to address this problem but it uses obsolete interfaces. Both problems should not be difficult to solve. Robert, will you briefly describe the use case scenarios for direct IO? |
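As a small illustration of the alignment restriction mentioned above (the file path and the 4096-byte alignment below are assumptions for the sketch; the exact alignment a given filesystem accepts can differ), an unaligned buffer is typically rejected with EINVAL under O_DIRECT, while a posix_memalign() buffer is accepted:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 4096;
        int fd = open("/mnt/lustre/dio_align_test",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* a buffer deliberately shifted off page alignment */
        char *unaligned = (char *)malloc(len + 1) + 1;
        memset(unaligned, 'u', len);
        if (write(fd, unaligned, len) < 0)
                /* commonly fails with EINVAL when O_DIRECT is in effect */
                printf("unaligned write failed: %s\n", strerror(errno));

        /* a page-aligned buffer satisfies the O_DIRECT requirement */
        void *aligned;
        if (posix_memalign(&aligned, 4096, len))
                return 1;
        memset(aligned, 'x', len);
        if (write(fd, aligned, len) == (ssize_t)len)
                printf("aligned write succeeded\n");

        free(unaligned - 1);
        free(aligned);
        close(fd);
        return 0;
}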
| Comment by José Valerio [ 19/Nov/13 ] |
|
Hello, all. I have performed tests in one of the two environments where Brett worked (the dual-OSS setup, 3 SSDs per OST, over 2x10 GbE). I also ran tests writing directly to a local SSD and to another network block storage tool (NBD - Network Block Device), playing with the O_DIRECT and O_SYNC flags.
The results show that, in both of those setups (local SSD and NBD), performance follows this pattern: libaio alone >> faster than >> O_DIRECT + libaio >> faster than >> O_DIRECT + O_SYNC, whereas with Lustre, O_DIRECT + libaio and O_DIRECT + O_SYNC show the same performance. I exchanged a couple of emails with Brett and he confirmed that in Lustre, setting O_DIRECT always implies also setting O_SYNC.
In theory, according to the open(2) man page, O_DIRECT (since Linux 2.4.10) only makes an effort to transfer data synchronously, but does not give the guarantees of O_SYNC that data and necessary metadata are transferred; to guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. So, to my understanding, O_DIRECT "tries" to write synchronously but does not offer any guarantee. This is especially important when using O_DIRECT with libaio, a library that allows non-blocking parallel writes from single-threaded user-space applications.
According to the man pages and my tests with a local SSD and NBD, I would personally say that Lustre deviates from standard POSIX filesystems here. Not only that, this behavior slows Lustre down, apparently for no reason. If you agree with me on that, I would like to request a change in the code to correct this behavior, or at least some help on where to change it myself, so I can test again and maybe see a bump in performance. Thanks in advance |
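For reference, a minimal libaio + O_DIRECT sketch of the non-blocking submission pattern described above; the file path, queue depth, and request size are assumptions for the sketch (build with -laio):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD 8                  /* assumed queue depth (requests in flight) */
#define BS (1 << 20)          /* assumed 1 MiB per request */

int main(void)
{
        int fd = open("/mnt/lustre/aio_dio_test",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        io_context_t ctx = 0;
        if (io_setup(QD, &ctx) < 0) {
                fprintf(stderr, "io_setup failed\n");
                return 1;
        }

        struct iocb cbs[QD], *cbp[QD];
        void *bufs[QD];
        for (int i = 0; i < QD; i++) {
                /* O_DIRECT still needs page-aligned buffers */
                if (posix_memalign(&bufs[i], 4096, BS))
                        return 1;
                memset(bufs[i], 'a' + i, BS);
                io_prep_pwrite(&cbs[i], fd, bufs[i], BS, (long long)i * BS);
                cbp[i] = &cbs[i];
        }

        /* all QD writes are submitted together and can be serviced
         * concurrently; the submitting thread does not block on each one */
        if (io_submit(ctx, QD, cbp) != QD) {
                fprintf(stderr, "io_submit failed\n");
                return 1;
        }

        /* wait for every completion */
        struct io_event events[QD];
        long done = io_getevents(ctx, QD, QD, events, NULL);
        printf("%ld requests completed\n", done);

        io_destroy(ctx);
        close(fd);
        return 0;
}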
| Comment by Brett Lee (Inactive) [ 29/Jan/14 ] |
|
Using packages built for the "SLES 11 SP2" OS, I am seeing an LBUG when mounting a newly created storage target (an MGT in this case). This event is repeatable. Using packages from: Stack trace seen on the console is: sles11sp2-2:~/work # mount -t lustre /dev/vdb /sap/mgs I have confirmed that the same installed OS functions properly using SLES packages from: |
| Comment by Jinshan Xiong (Inactive) [ 30/Jan/14 ] |
|
Hi Brett, it seems not related, please file a new ticket for the problem. |
| Comment by Brett Lee (Inactive) [ 03/Feb/14 ] |
|
Thanks Jinshan - I have opened a different ticket for that issue.
In testing the build from 21279 on CentOS 6.4, I saw two similar issues. The configuration is a single node running an MGS, 1 MDT, 2 OSTs, and 1 client mount. I also have an identical "control" system. Both worked well (no issues seen, all tests completed without issue) with RHEL server bits from:
After reconfiguring the same system with the new bits: and running IO, two issues were seen. The first was a hung system that resulted in a /tmp/debug log of ~6 MB. The second was a system that appeared corrupted - it showed 61% capacity utilization on both OSTs, though no files were present from the client perspective - and also produced a debug log of ~300 KB. Both debug logs are available for further review but not uploaded.
The system hang failure appeared immediately on the first of 32 IO tests, using synchronous IO. The second failure occurred on only 2 of the 32 test cases, all of them using AIO. The two tests that failed were 1 GB write and random write, in 64 MB bursts. In those two cases, the IO hung but I was able to ctrl-c out of the IO job. 16 tests were run using 1 OST, 16 tests using 2 OSTs.
Note that in several of the test cases the performance benefit of these patches (vs. the control node) was very pronounced. I will be working to get more samples to increase the reliability of these data, and to further check/troubleshoot any stability issues. |
| Comment by Brett Lee (Inactive) [ 05/Feb/14 ] |
|
Update: No further hung-system issues. Oddly, the hung system occurred on the initial IO and has not been seen since. The "corrupted" event is reproducible, though I would no longer call it corrupted; rather, it has to do with stalled fio kernel threads. After killing off the fio user processes, two kernel threads remained. After rebooting to end those threads, the 61% utilization was cleared. Note that all fio writes using block size 64M are not completing (though they are on the 2.5 release, as well as the root All other reads/writes (sequential and random) are completing successfully and without incident. Performance data comparisons are upcoming. |
| Comment by Brett Lee (Inactive) [ 10/Feb/14 ] |
|
Data in the attached spreadsheet seems to make a good case for including the performance improvements. Also, I’ve not seen any further stability issues since the beginning of the test period. |
| Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ] |
|
Will you please increase iodepth to at least 32 and see if we can get any better results? |
| Comment by Brett Lee (Inactive) [ 12/Feb/14 ] |
|
Better? I thought those results were pretty good already. Will give it a try. |
| Comment by Brett Lee (Inactive) [ 15/Feb/14 ] |
|
Jinshan - an OST failed on me (each OST is one SATA-II or III disk) and have no other suitable disks. Have ordered a pair of WD 10K RPM Velociraptors (200 MB/s) that will support queue depth up to 32 (NCQ). On hold till then. |
| Comment by Richard Henwood (Inactive) [ 31/Oct/14 ] |
|
Jinshan, please update this ticket description to include the reason that this ticket is a dependency for |
| Comment by Richard Henwood (Inactive) [ 12/Dec/14 ] |
|
This ticket isn't directly related to CLIO Simplification work. The ticket relationships on Jira have been updated to reflect this. |
| Comment by Andreas Dilger [ 19/May/16 ] |
|
Patches in Gerrit for this issue: |
| Comment by Jinshan Xiong (Inactive) [ 13/Sep/16 ] |
|
Let's reopen this ticket after we have a more convincing solution for this issue. |
| Comment by Robert Read (Inactive) [ 13/Sep/16 ] |
|
|
| Comment by Patrick Farrell (Inactive) [ 30/May/17 ] |
|
Patch is still in flight. (Hope this is OK.) |
| Comment by Patrick Farrell (Inactive) [ 30/May/17 ] |
|
|
| Comment by Patrick Farrell (Inactive) [ 23/Jun/17 ] |
|
Jinshan, the attached patch is a suggestion for removing the need for size glimpsing for DIO reads. Not 100% sure it's safe, but some local testing suggests it's OK. (The diff was a little too big to drop in Gerrit.) |
| Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ] |
|
This work is still useful, so we should probably keep this ticket open. |
| Comment by Gerrit Updater [ 16/May/18 ] |
|
Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/32415 |
| Comment by Gerrit Updater [ 16/May/18 ] |
|
Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/32416 |
| Comment by Gerrit Updater [ 16/May/18 ] |
|
Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/32417 |
| Comment by Shuichi Ihara [ 26/Nov/18 ] |
|
Here are test results of patch https://review.whamcloud.com/32416.
Client
Without the patch, we only get 80K IOPS at 4K random read with DIO, even with an increased number of threads. Here are the fio parameters:
[randread]
ioengine=sync
;ioengine=libaio
rw=randread
blocksize=4096
iodepth=32
direct=1
size=1g
runtime=120
numjobs=128
group_reporting
directory=/cache0/fio.out
filename_format=f.$jobnum.$filenum
With the aio patch https://review.whamcloud.com/32416, it could reach more than 600K IOPS per client. |
| Comment by Gerrit Updater [ 29/Apr/19 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/34774 |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/8201/ |
| Comment by Gerrit Updater [ 08/Feb/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32416/ |
| Comment by Gerrit Updater [ 19/Feb/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37621 |
| Comment by Gerrit Updater [ 17/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37824/ |
| Comment by Nathan Rutman [ 16/Apr/20 ] |
|
Can someone please summarize what the state of this ticket is? The subject seems to have wandered from "Additional stripes on a file does not increase IO performance when using DIRECT IO" to lockless DIO to AIO. Johann's and Jinshan's comments seem to be at odds as to whether DIO stripes are parallelized or not. Rough testing (DIO, not AIO) seems to indicate they are not. |
| Comment by Andreas Dilger [ 10/Jun/20 ] |
|
Nathan, I think regardless of how this ticket started, it ended up being used to land the AIO/DIO support for 2.14. If there are still issues that need to be addressed, they should be done in the context of a new ticket. |
| Comment by Wang Shilong (Inactive) [ 13/Aug/20 ] |
|
I think |
| Comment by Gerrit Updater [ 26/Aug/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39733 |