
[LU-4198] Improve IO performance when using DIRECT IO with libaio

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.4.1
    • Environment: Seen in two environments: AWS cloud (Robert R.) and a dual-OSS setup (3 SSDs per OST) over 2x10 GbE.
    • 3
    • 11385

    Description

      Attached to this Jira are some numbers from the direct IO tests (write operations only).

      It was noticed that setting the number of RPCs in flight to 256 in these tests gives poorer performance; the maximum RPC count here is set to 32.
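
      For reference, the "RPCs in flight" setting mentioned above is normally the client-side osc max_rpcs_in_flight tunable; a minimal sketch of checking it and pinning it to 32 (the value used in these tests) would be:

        # Query the current per-OSC limit on concurrent RPCs (client side).
        lctl get_param osc.*.max_rpcs_in_flight
        # Pin it to 32 across all OSCs, matching the value used in these tests.
        lctl set_param osc.*.max_rpcs_in_flight=32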

      • A sample FIO output:
        fio.4k.write.1.23499: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
        fio-2.1.2
        Starting 1 process
        fio.4k.write.1.23499: Laying out IO file(s) (1 file(s) / 10MB)
        
        fio.4k.write.1.23499: (groupid=0, jobs=1): err= 0: pid=10709: Fri Nov  1 11:47:29 2013
          write: io=10240KB, bw=2619.7KB/s, iops=654, runt=  3909msec
            clat (usec): min=579, max=5283, avg=1520.43, stdev=1216.20
             lat (usec): min=580, max=5299, avg=1521.37, stdev=1216.22
            clat percentiles (usec):
             |  1.00th=[  604],  5.00th=[  652], 10.00th=[  668], 20.00th=[  708],
             | 30.00th=[  732], 40.00th=[  756], 50.00th=[  796], 60.00th=[  844],
             | 70.00th=[ 1320], 80.00th=[ 3440], 90.00th=[ 3568], 95.00th=[ 3632],
             | 99.00th=[ 3824], 99.50th=[ 5024], 99.90th=[ 5216], 99.95th=[ 5280],
             | 99.99th=[ 5280]
            bw (KB  /s): min= 1224, max= 4366, per=97.64%, avg=2557.14, stdev=1375.64
            lat (usec) : 750=37.50%, 1000=30.12%
            lat (msec) : 2=5.00%, 4=26.76%, 10=0.62%
          cpu          : usr=0.92%, sys=8.70%, ctx=2562, majf=0, minf=25
          IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
             submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
             complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
             issued    : total=r=0/w=2560/d=0, short=r=0/w=0/d=0
        
        Run status group 0 (all jobs):
          WRITE: io=10240KB, aggrb=2619KB/s, minb=2619KB/s, maxb=2619KB/s, mint=3909msec, maxt=3909msec
        

      Attachments

        1. fio.direct.xls (13 kB)
        2. JinshanPatchesTesting.xlsx (141 kB)
        3. LU-4198.png (129 kB)
        4. vvp_io.c.dio_i_size.patch (2 kB)

        Issue Links

          Activity

            [LU-4198] Improve IO performance when using DIRECT IO with libaio

            jay Jinshan Xiong (Inactive) added a comment:
            Let's reopen this ticket after we have a more convincing solution for this issue.
            adilger Andreas Dilger added a comment:
            Patches in Gerrit for this issue:
            http://review.whamcloud.com/8201
            http://review.whamcloud.com/8612

            rhenwood Richard Henwood (Inactive) added a comment:
            This ticket isn't directly related to CLIO Simplification work. The ticket relationships on Jira have been updated to reflect this.

            rhenwood Richard Henwood (Inactive) added a comment:
            Jinshan, please update this ticket description to include the reason that this ticket is a dependency for LU-3259.

            brett Brett Lee (Inactive) added a comment:
            Jinshan - an OST failed on me (each OST is one SATA-II or III disk) and I have no other suitable disks. I have ordered a pair of WD 10K RPM VelociRaptors (200 MB/s) that will support queue depths up to 32 (NCQ). Testing is on hold until then.

            brett Brett Lee (Inactive) added a comment:
            Better? I thought those results were pretty good already. Will give it a try.

            jay Jinshan Xiong (Inactive) added a comment:
            Will you please increase iodepth to at least 32 and see if we can get any better results?
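
            For reference, a direct-IO write job using libaio with iodepth=32, along the lines requested above, might be invoked as follows; the job name, file path, and size are illustrative only:

                # Hypothetical 4k sequential direct write through libaio with 32 requests in flight.
                fio --name=fio.4k.write.libaio --filename=/mnt/lustre/fio.4k.write.libaio \
                    --rw=write --bs=4k --size=10m \
                    --ioengine=libaio --direct=1 --iodepth=32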

            brett Brett Lee (Inactive) added a comment:
            Data in the attached spreadsheet seems to make a good case for including the performance improvements. Also, I've not seen any further stability issues since the beginning of the test period.

            brett Brett Lee (Inactive) added a comment:
            Update: Continuing to run benchmarks against this build.

            No further hung-system issues. Oddly, the hang occurred on the initial IO and has not been seen since.

            The "corrupted" event is reproducible, though I would no longer call it corruption. Rather, it has to do with stalled fio kernel threads. After killing off the fio user processes, two kernel threads remained. After rebooting to end those threads, the 61% utilization was cleared.

            Note that all fio writes using a 64M block size are not completing (though they do complete on the 2.5 release, as well as on the root ext4 file system).

            All other reads/writes (sequential and random) are completing successfully and without incident. Performance data comparisons are upcoming.

            brett Brett Lee (Inactive) added a comment:
            Thanks Jinshan - I have opened a different ticket for that issue.

            In testing the build from 21279 on CentOS 6.4, I saw two similar issues. The configuration is a single node running an MGS, 1 MDT, 2 OSTs, and 1 client mount. I also have an identical "control" system. Both worked well (no issues seen, all tests completed without issue) with the RHEL server bits from:
            http://downloads.whamcloud.com/public/lustre/latest-feature-release/el6/server/RPMS/x86_64/

            After reconfiguring the same system with the new bits:
            http://build.whamcloud.com/job/lustre-reviews/21279/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/artifact/artifacts/RPMS/x86_64/

            and running IO, two issues were seen. The first was a hung system that produced a /tmp/debug log of ~6 MB. The second was a system that appeared corrupted - it showed 61% capacity utilization on both OSTs, though no files were present from the client perspective - and it also produced a debug log of ~300 KB. Both debug logs are available for further review but have not been uploaded.

            The system hang occurred immediately upon the first of the 32 IO tests, which used synchronous IO. The second failure occurred on only 2 of the 32 test cases, both of which used AIO. The two tests that failed were a 1 GB write and a random write, in 64 MB bursts. In those two cases the IO hung, but I was able to ctrl-c out of the IO job.

            16 tests were run using 1 OST and 16 tests using 2 OSTs. Note that in several of the test cases the performance benefit of these patches (vs. the control node) was very pronounced. I will be working to get more samples to increase the reliability of these data, and to further check/troubleshoot any issues with stability.

            jay Jinshan Xiong (Inactive) added a comment:
            Hi Brett, this seems unrelated; please file a new ticket for the problem.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: brett Brett Lee (Inactive)
              Votes: 0
              Watchers: 25
