[LU-4198] Improve IO performance when using DIRECT IO using libaio Created: 01/Nov/13  Updated: 26/Aug/20  Resolved: 10/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Minor
Reporter: Brett Lee (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: clio
Environment:

Seen in two environments. AWS cloud (Robert R.) and a dual-OSS setup (3 SSD per OST) over 2x10 GbE.


Attachments: Microsoft Word JinshanPatchesTesting.xlsx     PNG File LU-4198.png     Microsoft Word fio.direct.xls     File vvp_io.c.dio_i_size.patch    
Issue Links:
Duplicate
is duplicated by LU-13786 Take server-side locks for direct i/o Closed
Related
is related to LU-12687 Fast ENOSPC on direct I/O Resolved
is related to LU-13798 Improve direct i/o performance with m... Resolved
is related to LU-13900 don't call aio_complete() in lustre u... Resolved
is related to LU-247 Lustre client slow performance on BG/... Resolved
is related to LU-13697 short io for AIO Resolved
is related to LU-9409 Lustre small IO write performance imp... Resolved
is related to LU-10278 lfs migrate: make use of direct i/o o... Resolved
is related to LU-13801 Enable io_uring interface for Lustre ... Resolved
Severity: 3
Rank (Obsolete): 11385

 Description   

Attached to this Jira are some numbers from the direct IO tests. Write operations only.

It was noticed that setting max RPCs in flight to 256 in these tests gives poorer performance; max_rpcs_in_flight here is set to 32.

  • A sample FIO output:
    fio.4k.write.1.23499: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
    fio-2.1.2
    Starting 1 process
    fio.4k.write.1.23499: Laying out IO file(s) (1 file(s) / 10MB)
    
    fio.4k.write.1.23499: (groupid=0, jobs=1): err= 0: pid=10709: Fri Nov  1 11:47:29 2013
      write: io=10240KB, bw=2619.7KB/s, iops=654, runt=  3909msec
        clat (usec): min=579, max=5283, avg=1520.43, stdev=1216.20
         lat (usec): min=580, max=5299, avg=1521.37, stdev=1216.22
        clat percentiles (usec):
         |  1.00th=[  604],  5.00th=[  652], 10.00th=[  668], 20.00th=[  708],
         | 30.00th=[  732], 40.00th=[  756], 50.00th=[  796], 60.00th=[  844],
         | 70.00th=[ 1320], 80.00th=[ 3440], 90.00th=[ 3568], 95.00th=[ 3632],
         | 99.00th=[ 3824], 99.50th=[ 5024], 99.90th=[ 5216], 99.95th=[ 5280],
         | 99.99th=[ 5280]
        bw (KB  /s): min= 1224, max= 4366, per=97.64%, avg=2557.14, stdev=1375.64
        lat (usec) : 750=37.50%, 1000=30.12%
        lat (msec) : 2=5.00%, 4=26.76%, 10=0.62%
      cpu          : usr=0.92%, sys=8.70%, ctx=2562, majf=0, minf=25
      IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued    : total=r=0/w=2560/d=0, short=r=0/w=0/d=0
    
    Run status group 0 (all jobs):
      WRITE: io=10240KB, aggrb=2619KB/s, minb=2619KB/s, maxb=2619KB/s, mint=3909msec, maxt=3909msec
    


 Comments   
Comment by Keith Mannthey (Inactive) [ 01/Nov/13 ]

I am not quite sure how to read this output to know if it is good or bad.

In general I expect Direct I/O to hurt performance. It gets the filesystem read/write caches out of the way of the app. It is commonly used for databases to minimize the risk (some turn off hardware write caches as well).

For 4K I/O I would not use wide striping.

Comment by Johann Lombardi (Inactive) [ 04/Nov/13 ]

Additional stripes on a file does not increase IO performance when using DIRECT IO

This is unfortunately expected: to work around the cascading abort issue, we have to wait for I/O completion on the first stripe before firing RPCs to the next one, i.e.

foreach(stripe) { lock(stripe); do_sync_io(stripe); unlock(stripe); }

On 1.8, some customers were using a patch that made lockless direct I/O the default.

Comment by Robert Read (Inactive) [ 04/Nov/13 ]

johann Ah, that is what I was afraid of. Is there a lockless direct IO patch for 2.x? That would probably be very helpful in this use case.

Comment by Robert Read (Inactive) [ 04/Nov/13 ]

I see in LU-238 that there is a nolock mount option that we can use to enable lockless direct IO.

Comment by Keith Mannthey (Inactive) [ 04/Nov/13 ]

What is the cascading abort issue?

Comment by Robert Read (Inactive) [ 05/Nov/13 ]

I tried mounting the client with "nolock" and performance actually got about 4x worse than before.

Comment by Johann Lombardi (Inactive) [ 05/Nov/13 ]

I see in LU-238 that there is a nolock mount option that we can use to enable lockless direct IO.

It seems that this patch enables lockless I/O not only for direct I/O, but also for buffered I/O, which is quite bad.

What is the cascading abort issue?

The client holds a lock on a resource from server A and waits for RPC completion on server B. This introduces an implicit dependency between servers. If server B is not responsive (e.g. in failover, or just slow because it is overloaded) and server A issues a blocking AST, the client will be evicted from server A since it cannot release the lock in a timely manner.

I tried mounting the client with "nolock" and performance actually got about 4x worse than before.

Strange ... we definitely got better results with 1.8. There is probably something wrong with CLIO.
BTW, to be clear, you should only see a benefit if your direct writes cover multiple stripes, otherwise there won't be any parallelism.

HTH

Comment by Jinshan Xiong (Inactive) [ 05/Nov/13 ]

In 2.x, the only difference between direct IO and cached IO is whether dirty data is cached on the client. They actually share the same IO framework.

Even so, it's really strange that the nolock version was 4x worse; the server takes the lock for lockless IO. Was anybody else operating on this file at the same time?

Comment by Robert Read (Inactive) [ 05/Nov/13 ]

I was running a single-threaded benchmark (FIO) and there was only a single client on the filesystem, so the file was definitely not shared.

It seems there are other differences between direct and buffered IO, such as direct IO being synchronous. I've noticed while testing AIO with various IO depths that AIO appears to make no difference with direct IO.

Comment by Jinshan Xiong (Inactive) [ 05/Nov/13 ]

AIO used to work with Direct IO only. I don't know what the state is in the current kernel; I'll check it out.

If we want to use direct IO, two problems have to be addressed:
1. lock: if the file is being read or written with direct IO, it's unnecessary to take a lock from the server. Can we make the assumption that all direct IO should go lockless?

2. universal direct IO support: in the current implementation, the address of the user buffer has to be page-aligned (see the sketch below). Niu has a patch to address this problem, but it uses obsolete interfaces.

Both problems should not be difficult to solve.
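
For illustration, here is a minimal sketch (not from the ticket; the path and the 4096-byte size are illustrative) of the alignment constraint in problem 2: with O_DIRECT the user buffer, length, and file offset are normally block-aligned, which is typically arranged with posix_memalign():

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* hypothetical file on a Lustre mount */
    int fd = open("/mnt/lustre/dio_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    /* O_DIRECT requires an aligned user buffer; 4096 bytes assumed here */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 'x', 4096);

    /* length and file offset are kept multiples of 4096 as well */
    ssize_t rc = pwrite(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return rc == 4096 ? 0 : 1;
}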

Robert, will you briefly describe the use case scenarios for direct IO?

Comment by José Valerio [ 19/Nov/13 ]

Hello, all.

I have performed tests in one of the two environments where Brett worked (the dual-OSS setup - 3 SSD per OST - over 2x10 GE).

I made tests writing directly to a local SSD and with another network block storage tool (NBD, Network Block Device), playing with the O_DIRECT and O_SYNC flags:

  • no flags + native Async IO (libaio), 8 writes in flight
  • O_DIRECT + native Async IO (libaio), 8 writes in flight
  • O_DIRECT + O_SYNC

The results show that, in both setups (local SSD and NBD), performance follows this pattern:

only libaio >> faster than >> O_DIRECT + libaio >> faster than >> O_DIRECT + O_SYNC

whereas, with Lustre, O_DIRECT + libaio and O_DIRECT + O_SYNC show the same performance.

I exchanged a couple of emails with Brett and he confirmed that in Lustre, setting O_DIRECT always implies setting O_SYNC as well.

Also, in theory:

O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.

A semantically similar (but deprecated) interface for block
devices is described in raw(8).

This comes from open(2) man page.

So, to my understanding, O_DIRECT "tries" to write synchronously but does not offer any guarantee. This is especially important when using O_DIRECT + libaio, a library that lets single-threaded user-space applications issue non-blocking, parallel writes.
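
To make that concrete, below is a minimal sketch (not from the ticket; the path, sizes, and queue depth are illustrative) of the O_DIRECT + libaio case: io_submit() returns without waiting for the write, and the completion is reaped separately with io_getevents(), so nothing in the API itself imposes O_SYNC-style blocking:

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>          /* link with -laio */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/lustre/aio_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT needs an aligned buffer */
        return 1;
    memset(buf, 'x', 4096);

    io_context_t ctx = 0;
    if (io_setup(8, &ctx))                  /* allow up to 8 requests in flight */
        return 1;

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, 0);  /* describe one 4K write at offset 0 */

    if (io_submit(ctx, 1, cbs) != 1)        /* returns without blocking on the write */
        return 1;

    /* the application can keep submitting; completions are collected later */
    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}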

According to the theory (the man pages) and my tests with a local SSD and NBD, I would personally say that Lustre deviates from standard POSIX filesystems here.

Not only that, this behavior slows down Lustre, apparently for no reason.

If you agree with me on that, I would like to request a change in the code to correct this behavior, or at least some guidance on where I could change it myself, so I can test again and maybe see a bump in performance.

Thanks in advance

Comment by Brett Lee (Inactive) [ 29/Jan/14 ]

Using packages built for the "SLES 11 SP2" OS, I am seeing an LBUG when mounting a newly created storage target (an MGT in this case). This event is repeatable. Using packages from:

https://build.whamcloud.com/job/lustre-reviews/21279/arch=x86_64,build_type=server,distro=sles11sp2,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/

Stack trace seen on the console is:

sles11sp2-2:~/work # mount -t lustre /dev/vdb /sap/mgs
[ 159.338678] LustreError: 3532:0:(sec_ctx.c:80:pop_ctxt()) ASSERTION( segment_eq(get_fs(), get_ds()) ) failed: popping non-kernel context!
[ 159.340266] LustreError: 3532:0:(sec_ctx.c:80:pop_ctxt()) LBUG
[ 159.342723] Kernel panic - not syncing: LBUG
[ 159.343356] Pid: 3532, comm: mount.lustre Tainted: G N 3.0.93-0.5_lustre.ge80a1ca-default #1
[ 159.344239] Call Trace:
[ 159.344384] [<ffffffff810048b5>] dump_trace+0x75/0x310
[ 159.344692] [<ffffffff814473a3>] dump_stack+0x69/0x6f
[ 159.344983] [<ffffffff8144743c>] panic+0x93/0x201
[ 159.345255] [<ffffffffa02f0dc3>] lbug_with_loc+0xa3/0xb0 [libcfs]
[ 159.345621] [<ffffffffa076087c>] pop_ctxt+0x19c/0x1a0 [ptlrpc]
[ 159.345984] [<ffffffffa0b0667d>] osd_ost_init+0x23d/0x8d0 [osd_ldiskfs]
[ 159.346370] [<ffffffffa0b06d39>] osd_obj_map_init+0x29/0x120 [osd_ldiskfs]
[ 159.346767] [<ffffffffa0ae4151>] osd_device_init0+0x281/0x5c0 [osd_ldiskfs]
[ 159.347166] [<ffffffffa0ae47c6>] osd_device_alloc+0x166/0x2c0 [osd_ldiskfs]
[ 159.347574] [<ffffffffa04d642b>] class_setup+0x61b/0xad0 [obdclass]
[ 159.347957] [<ffffffffa04de5f5>] class_process_config+0xc95/0x18f0 [obdclass]
[ 159.348393] [<ffffffffa04e3652>] do_lcfg+0x142/0x460 [obdclass]
[ 159.348752] [<ffffffffa04e3a04>] lustre_start_simple+0x94/0x210 [obdclass]
[ 159.349168] [<ffffffffa051171a>] osd_start+0x4fa/0x7c0 [obdclass]
[ 159.349549] [<ffffffffa051b41d>] server_fill_super+0xfd/0xce0 [obdclass]
[ 159.349965] [<ffffffffa04e91e8>] lustre_fill_super+0x178/0x530 [obdclass]
[ 159.350362] [<ffffffff811556e3>] mount_nodev+0x83/0xc0
[ 159.350668] [<ffffffffa04e1080>] lustre_mount+0x20/0x30 [obdclass]
[ 159.351035] [<ffffffff811551ee>] mount_fs+0x4e/0x1a0
[ 159.351318] [<ffffffff811703f5>] vfs_kern_mount+0x65/0xd0
[ 159.351623] [<ffffffff811704e3>] do_kern_mount+0x53/0x110
[ 159.351930] [<ffffffff81171e2d>] do_mount+0x21d/0x260
[ 159.352246] [<ffffffff81171f30>] sys_mount+0xc0/0xf0
[ 159.352529] [<ffffffff81452112>] system_call_fastpath+0x16/0x1b
[ 159.352867] [<00007f99b62bd1ea>] 0x7f99b62bd1e9

Have confirmed that the same installed OS functions properly using SLES packages from:

http://build.whamcloud.com/job/lustre-b2_5/arch=x86_64,build_type=server,distro=sles11sp2,ib_stack=inkernel/

Comment by Jinshan Xiong (Inactive) [ 30/Jan/14 ]

Hi Brett, this seems unrelated; please file a new ticket for the problem.

Comment by Brett Lee (Inactive) [ 03/Feb/14 ]

Thanks Jinshan - have opened a different ticket for that issue.

In testing the build from 21279 on CentOS 6.4, saw two similar issues. Configuration is a single node running a MGS, 1 MDT, 2 OSTs, and 1 client mount. Also have an identical "control" system. Both worked well (no issues seen, all tests completed w/o issue) with RHEL server bits from:
http://downloads.whamcloud.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64/

After reconfiguring same system with the new bits:

http://build.whamcloud.com/job/lustre-reviews/21279/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/

and running IO, there were two issues seen. First was a hung system that resulted in a /tmp/debug log ~ 6MB. Second was a system that got corrupted - showed 61% capacity utilization on both OSTs, though no files were present from the client perspective - also produced a debug log ~300K. Both debug logs are available for further review but not uploaded.

The system hang appeared immediately upon the first of 32 IO tests, which used synchronous IO. The second failure occurred on only 2 of the 32 test cases, both of them using AIO. The two tests that failed were a 1 GB write and a random write, in 64MB bursts. In those two cases the IO hung, but I was able to ctrl-c out of the IO job.

16 tests were run using 1 OST, 16 tests using 2 OSTs. Note that in several of the test cases the performance benefit using these patches (vs. the control node) was very pronounced. Will be working to get more samples to increase the reliability of these data, and to further check/troubleshoot any issues with stability.

Comment by Brett Lee (Inactive) [ 05/Feb/14 ]

Update:
Continuing to run benchmarks against this build.

No further hung-system issues. Oddly, the hang occurred on the initial IO and has not been seen since.

The "corrupted" event is reproducible, though I would no longer call it corrupted. Rather, it has to do with stalled fio kernel threads. After killing off the fio user processes, two kernel threads remained. After rebooting to end those threads, the 61% was cleared.

Note that fio writes using block size 64M are not completing (though they do complete on the 2.5 release, as well as on the root ext4 file system).

All other reads/writes (sequential and random) are completing successfully and without incident. Performance data comparisons upcoming.

Comment by Brett Lee (Inactive) [ 10/Feb/14 ]

Data in the attached spreadsheet seems to make a good case for including the performance improvements. Also, I’ve not seen any further stability issues since the beginning of the test period.

Comment by Jinshan Xiong (Inactive) [ 10/Feb/14 ]

Will you please increase iodepth to at least 32 and see if we can get any better results?

Comment by Brett Lee (Inactive) [ 12/Feb/14 ]

Better? I thought those results were pretty good already. Will give it a try.

Comment by Brett Lee (Inactive) [ 15/Feb/14 ]

Jinshan - an OST failed on me (each OST is one SATA-II or -III disk) and I have no other suitable disks. I have ordered a pair of WD 10K RPM VelociRaptors (200 MB/s) that will support queue depths up to 32 (NCQ). On hold till then.

Comment by Richard Henwood (Inactive) [ 31/Oct/14 ]

Jinshan, please update this ticket description to include the reason that this ticket is a dependency for LU-3259.

Comment by Richard Henwood (Inactive) [ 12/Dec/14 ]

This ticket isn't directly related to CLIO Simplification work. The ticket relationships on Jira have been updated to reflect this.

Comment by Andreas Dilger [ 19/May/16 ]

Patches in Gerrit for this issue:
http://review.whamcloud.com/8201
http://review.whamcloud.com/8612

Comment by Jinshan Xiong (Inactive) [ 13/Sep/16 ]

Let's reopen this ticket once we have a more convincing solution for this issue.

Comment by Robert Read (Inactive) [ 13/Sep/16 ]

Comment by Patrick Farrell (Inactive) [ 30/May/17 ]

Patch is still in flight. (Hope this is OK.)

Comment by Patrick Farrell (Inactive) [ 30/May/17 ]

LU-247 is pretty old and probably no one has time to update it... But it could be very useful (with this) for improving < page size write performance.

Comment by Patrick Farrell (Inactive) [ 23/Jun/17 ]

Jinshan,

The attached patch is a suggestion for fixing the need for size glimpsing for DIO reads. I'm not 100% sure it's safe, but some local testing suggests it's OK. (The diff was a little too big to drop in Gerrit.)

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

This work is still useful, so we should probably keep this ticket open.

Comment by Gerrit Updater [ 16/May/18 ]

Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/32415
Subject: LU-4198 clio: turn on parallel mode for some kind of IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5afe155acdeaf5fefc230d43d23936f03e0e447b

Comment by Gerrit Updater [ 16/May/18 ]

Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/32416
Subject: LU-4198 clio: AIO support for direct IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8c480ce9300ad4e4f23f5fac4e5a3c2d038017c7

Comment by Gerrit Updater [ 16/May/18 ]

Jinshan Xiong (jinshan.xiong@gmail.com) uploaded a new patch: https://review.whamcloud.com/32417
Subject: LU-4198 llite: no lock match for lockess I/O
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e1c1366b1bd0129cd5276693da8cb1f895db3594

Comment by Shuichi Ihara [ 26/Nov/18 ]

Here are the test results of patch https://review.whamcloud.com/32416
LU-4198.png

Client
2 x E5-2650v4@2.20GHz, 128GB memory, 1 x EDR
OSS/MDS
DDN AI200(2xOSS/MDS, 20 x NVMe, 1 x EDR, master branch)

Without the patch, we only get 80K IOPS for 4K random read with DIO, even with an increased number of threads. Here are the fio parameters:

[randread]
ioengine=sync
;ioengine=libaio
rw=randread
blocksize=4096
iodepth=32
direct=1
size=1g
runtime=120
numjobs=128
group_reporting
directory=/cache0/fio.out
filename_format=f.$jobnum.$filenum

With the AIO patch https://review.whamcloud.com/32416, the client could reach more than 600K IOPS.
The patch not only adds libaio support on Lustre, it also helps benchmarking (e.g. only a very small number of clients is needed to saturate the storage IOPS) and applications that use libaio (e.g. databases, virtual machine environments).
One thing to note: the patch worked well only with a small max_rpcs_in_flight. For instance, max_rpcs_in_flight=1 scaled very well so far, but max_rpcs_in_flight=256 was problematic and didn't scale at all.

Comment by Gerrit Updater [ 29/Apr/19 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/34774
Subject: LU-4198 llite: transient page simplification
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 704390928ce55173d2b2fca0e0fe244907d750b2

Comment by Gerrit Updater [ 08/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/8201/
Subject: LU-4198 clio: turn on lockless for some kind of IO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6bce536725efd166d2772f13fe954f271f9c53b8

Comment by Gerrit Updater [ 08/Feb/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32416/
Subject: LU-4198 clio: AIO support for direct IO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d1dded6e28473d889a9b24b47cbc804f90dd2956

Comment by Gerrit Updater [ 19/Feb/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37621
Subject: LU-4198 clio: return error for short direct IO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9913a5411bb305d617ed3bba6bd5d000ffc11121

Comment by Gerrit Updater [ 17/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37824/
Subject: LU-4198 clio: Remove pl_owner
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e880a6fccbc57ae335949c1dd20335359b1cb220

Comment by Nathan Rutman [ 16/Apr/20 ]

Can someone please summarize what the state of this ticket is? The subject seems to have wandered from "Additional stripes on a file does not increase IO performance when using DIRECT IO" to lockless DIO to AIO. Johann's and Jinshan's comments seem to be at odds as to whether DIO stripes are parallelized or not.

Rough testing (DIO, not AIO) seems to indicate they are not.

Comment by Andreas Dilger [ 10/Jun/20 ]

Nathan, I think regardless of how this ticket started, it ended up being used to land the AIO/DIO support for 2.14. If there are still issues that need to be addressed, they should be done in the context of a new ticket.

Comment by Wang Shilong (Inactive) [ 13/Aug/20 ]

nrutman

I think LU-13798 will parallelize DIO when striping is enabled, and we see big improvements with that, so I guess you could take a look there.

Comment by Gerrit Updater [ 26/Aug/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39733
Subject: LU-4198 clio: turn on lockless for some kind of IO
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 2acd73b09c86ca7ee436152274f9d1beab1ad571
