[LU-9574] Large file read performance degradation from multiple OSTs Created: 30/May/17  Updated: 08/Apr/18  Resolved: 21/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2

Type: Bug Priority: Major
Reporter: Darby Vicker Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

RHEL 7 servers, RHEL 6 and 7 clients.


Attachments: File debug.1node.2.9.0_patch25996.log.gz     File debug.1node_bs_1m.2.9.0_patch25996.log.gz     Text File debug.4node.2.10.0_patch27388_set2.log     File debug.4node.2.9.0_patch25996.log.gz     File debug.4node.2.9.0_patch27388.log.gz     File debug.4node_bs_1m.2.9.0_patch25996.log.gz     File debug.aftermount.2.9.0_patch25996.log.gz     File debug.aftermount.2.9.0_revert_d8467ab.log.gz     File debug.aftertest.2.9.0_patch25996.log.gz     File debug.aftertest.2.9.0_revert_d8467ab.log.gz     PDF File lustre_performance.pdf     File test_ss_sn.sh    
Issue Links:
Related
Epic/Theme: Performance
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We recently noticed that large file read performance on our 2.9 LFS is dramatically worse than it used to be. The attached plot is the result of a test script that uses dd to write a large file (50 GB) to disk, read that file back, and then copy it to a second file, measuring write, read, and read/write speeds for various stripe sizes and counts. The two sets of data in the plot were taken on the same server and client hardware. The LFS was originally built and formatted with 2.8.0, but we eventually upgraded the servers and clients to 2.9.0. The behavior we are used to seeing is performance increasing with stripe count, peaking around 4 or 6 OSTs and degrading after that as more OSTs are used. This is what we saw under 2.8 (red lines in the plots). With 2.9 we still get very good write performance (almost line rate on our 10 GbE clients), but for reads we see extremely good performance with a single OST and significantly degraded performance with multiple OSTs (black lines in the plots). Using a git bisect to compile and test different clients, we were able to isolate the regression to this commit:

commit d8467ab8a2ca15fbbd5be3429c9cf9ceb0fa78b8
LU-7990 clio: revise readahead to support 16MB IO
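
For reference, a bisection between 2.8-based and 2.9-based client code looks roughly like the following (tag names and the test step are illustrative placeholders, not the exact procedure we used):

# in a checkout of the lustre-release client source
git bisect start
git bisect bad 2.9.0          # client built from 2.9 shows the slow multi-OST reads
git bisect good 2.8.0         # client built from 2.8 reads at the expected rate
# at each step: rebuild and install the client, remount, rerun the dd read test,
# then mark the result until git reports the first bad commit
git bisect good               # or "git bisect bad", depending on the measured read speed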

There is slightly more info here:

http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-May/014509.html

Please let me know if you need any other data or info.   
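
For anyone trying to reproduce this, a rough sketch of what the test script does is below (directory, file names, and sizes are placeholders; the real test_ss_sn.sh is attached):

#!/bin/bash
# sweep stripe sizes and counts, timing write / read / copy with dd
DIR=/nobackup/.test                          # test directory on the Lustre mount
for SS in 1M 4M; do                          # stripe sizes
  for SC in 1 4; do                          # stripe counts
    F=$DIR/testfile-${SS}-${SC}
    lfs setstripe -S $SS -c $SC $F           # create the file with this layout
    dd if=/dev/zero of=$F bs=25M count=2000  # write test (~50 GB)
    echo 3 > /proc/sys/vm/drop_caches        # drop the client page cache
    dd if=$F of=/dev/null bs=25M             # read test
    dd if=$F of=${F}.copy bs=25M             # read/write (copy) test
    rm -f $F ${F}.copy
  done
done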



 Comments   
Comment by Jinshan Xiong (Inactive) [ 30/May/17 ]

Is this a reproduction of LU-9214? That is a side effect that appears when ra_per_file_mb is smaller than the optimal RPC size.

Please try patch https://review.whamcloud.com/25996.
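
For context, the per-file readahead window referred to here (exposed on the client as max_read_ahead_per_file_mb) and the overall readahead limit can be checked with:

lctl get_param llite.*.max_read_ahead_per_file_mb   # per-file readahead window
lctl get_param llite.*.max_read_ahead_mb            # total readahead limit for the mount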

Comment by Darby Vicker [ 30/May/17 ]

No, I don't think so (although we suspected the same for a while too).  I applied patch 25996 against 2.9.0 and compiled a client, but we still see slow read performance on multiple OSTs. 

On the other hand, we were able to revert commit d8467ab8a2ca15fbbd5be3429c9cf9ceb0fa78b8 from the 2.9.0 branch and recover the old behavior (fast reads and writes on multiple OSTs).  I can upload more plots of this if you'd like.  

Comment by Jinshan Xiong (Inactive) [ 30/May/17 ]

Can you collect a debug log with READA enabled so that I can take a look?

lctl set_param debug="reada vfstrace"
Comment by Jinshan Xiong (Inactive) [ 31/May/17 ]

Also please tell me the RPC size for each OSC and the stripe size of the test file.
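
Both can be gathered on the client along the lines of (file path is a placeholder):

lctl get_param osc.*.max_pages_per_rpc    # RPC size, in 4 KiB pages per RPC
lfs getstripe /path/to/test_file          # stripe count, stripe size, and OST layout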

Comment by Darby Vicker [ 31/May/17 ]

Sorry for the delay - I ran into some issues that took a little while to figure out.  

I took the debug logs using the following procedure:

  • mount -t lustre 192.52.98.30@tcp:192.52.98.31@tcp:/hpfs-fsl /tmp/lustre_test/
  • lctl set_param debug="reada vfstrace"
  • lctl dk debug.aftermount.log
  • ./test_ss_sn.sh
  • lctl dk debug.aftertest.log

The test_ss_sn.sh script is what drives the attached plots - it writes, reads, and copies a large file (~50 GB) with various stripe sizes and counts using dd.  I can provide more details if needed.  For this I'm running a short version - just 1m and 4m stripe sizes with 1 and 4 stripe counts.  

I did this for two different Lustre clients - the first with 2.9.0 plus the LU-9214 patch (25996), the second with 2.9.0 with commit d8467ab reverted. I'm attaching the debug logs and the plots of all of this.  The "lctl set_param" above affects the performance quite a bit (both read and write), so I've included results with and without debug logging turned on.   

Comment by Jinshan Xiong (Inactive) [ 01/Jun/17 ]

Yes, there is a significant performance loss in the read case.

Unfortunately the debug log didn't catch any useful information about readahead. It was overwritten by the last write in your test case. The readahead log can be captured by reading an existing file in the file system and then dumping the log, as follows:

lctl set_param debug="reada vfstrace"
dd if=<lustre_file> of=/dev/null bs=...
lctl dk > log
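
If the interesting entries keep getting overwritten before the dump, it may also help to enlarge the client debug buffer first (value in MB; 1024 here is only an example):

lctl set_param debug_mb=1024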

It would be helpful to post the test script here so that I can see the stripe options.

Comment by Darby Vicker [ 01/Jun/17 ]

OK, I'll redo the tests and dump the logs after reading existing files.  I've attached the script so you can see exactly what we're doing, including the stripe options.  

Comment by Jinshan Xiong (Inactive) [ 01/Jun/17 ]

Please also show me the output of:

lctl get_param osc.*.max_pages_per_rpc

from the client node.

Comment by Jinshan Xiong (Inactive) [ 01/Jun/17 ]

I can reproduce this problem by using a 25M block size in the dd test. Probably the current algorithm can't pipeline the readahead stream well with a large read size.

Can you try to use block size 1MB in your dd test and see if you can still reproduce it?

Comment by Darby Vicker [ 01/Jun/17 ]

Here are the results with the 2.9.0 + patch 25996 client.  

 

[root@dvicker ~]# lctl set_param debug="reada vfstrace"
debug=reada vfstrace
[root@dvicker ~]# dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M 
2000+0 records in
2000+0 records out
52428800000 bytes (52 GB) copied, 300.815 s, 174 MB/s
[root@dvicker ~]# lctl dk > debug.4node.2.9.0_patch25996.log
[root@dvicker ~]# dd if=/nobackup/.test/dvicker-1m-01nodes.keep of=/dev/null bs=25M 
2000+0 records in
2000+0 records out
52428800000 bytes (52 GB) copied, 56.6835 s, 925 MB/s
[root@dvicker ~]# lctl dk > debug.1node.2.9.0_patch25996.log
[root@dvicker ~]# dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=1M 
50000+0 records in
50000+0 records out
52428800000 bytes (52 GB) copied, 513.064 s, 102 MB/s
[root@dvicker ~]# lctl dk > debug.4node_bs_1m.2.9.0_patch25996.log
[root@dvicker ~]# dd if=/nobackup/.test/dvicker-1m-01nodes.keep of=/dev/null bs=1M 
50000+0 records in
50000+0 records out
52428800000 bytes (52 GB) copied, 45.1046 s, 1.2 GB/s
[root@dvicker ~]# lctl dk > debug.1node_bs_1m.2.9.0_patch25996.log

 

So it looks like the block size doesn't make a big difference in our case. Do you also want the debug logs for the 2.9.0 client with commit d8467ab reverted?

Comment by Darby Vicker [ 01/Jun/17 ]

And here is the other output you asked for.  The hpfs-fsl filesystem is our 2.9 LFS that we've been benchmarking here.  The hpfs2eg3 is an older (2.4) LFS we are about to retire.  

 

[root@dvicker ~]# lctl get_param osc.*.max_pages_per_rpc
osc.hpfs-fsl-OST0000-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0000-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0001-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0001-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0002-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0002-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0003-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0003-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0004-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0004-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0005-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0005-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0006-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0006-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0007-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0007-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0008-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0008-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST0009-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST0009-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST000a-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST000a-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs-fsl-OST000b-osc-ffff88085bd08000.max_pages_per_rpc=256
osc.hpfs-fsl-OST000b-osc-ffff8810566ba800.max_pages_per_rpc=256
osc.hpfs2eg3-OST0000-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0001-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0002-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0003-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0004-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0005-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0006-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0007-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0008-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0009-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST000a-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST000b-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST000c-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST000d-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST000e-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST000f-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0010-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0011-osc-ffff88082bfa9800.max_pages_per_rpc=1024
osc.hpfs2eg3-OST0012-osc-ffff88082bfa9800.max_pages_per_rpc=1024
[root@dvicker ~]# mount -t lustre
192.52.98.142@tcp:/hpfs2eg3 on /lustre2 type lustre (rw,flock,lazystatfs)
192.52.98.30@tcp:192.52.98.31@tcp:/hpfs-fsl/work on /nobackup type lustre (rw,flock,lazystatfs)
[root@dvicker ~]# 

Comment by Gerrit Updater [ 01/Jun/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/27388
Subject: LU-9574 llite: pipeline readahead better with large I/O
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a1ed8000d29f00a2933df125e3b8188c5da3d883

Comment by Jinshan Xiong (Inactive) [ 01/Jun/17 ]

The above patch helps on my side. However, I can see something weird going on in the log you attached. I'm looking into it, but in the meantime please try the patch.

Comment by Jinshan Xiong (Inactive) [ 01/Jun/17 ]

Have you applied any patches of your own? It seems like some extents of the file are skipped, so the read pattern is detected as a stride read.

Comment by Jinshan Xiong (Inactive) [ 02/Jun/17 ]

This is the log in question:

00000080:00400000:9.0:1496349067.442741:0:30939:0:(rw.c:490:ll_readahead()) lrp 3792639 cr 1 cp 512 ws 3792384 wl 0 nra 3792640 rpc 256 r 7408 ri 256 csr 0 sf 3791104 sp 0 sl 0
00000080:00400000:9.0:1496349067.442743:0:30939:0:(rw.c:1133:ll_io_read_page()) [0x2000120e5:0x44:0x0]0 pages read ahead at 3792639
00000080:00200000:9.0:1496349067.442753:0:30939:0:(vvp_io.c:308:vvp_io_fini()) [0x2000120e5:0x44:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00000080:00200000:9.0:1496349067.442756:0:30939:0:(file.c:1225:ll_file_io_generic()) iot: 0, result: 1048576
00000080:00200000:9.0:1496349067.443350:0:30939:0:(file.c:1124:ll_file_io_generic()) file: dvicker-1m-04nodes.keep, type: 0 ppos: 15536750592, count: 1048576
00000020:00200000:9.0:1496349067.443355:0:30939:0:(cl_io.c:235:cl_io_rw_init()) header@ffff880fb6655290[0x0, 7427329, [0x2000120e5:0x44:0x0] hash]
                                                                                
00000020:00200000:9.0:1496349067.443356:0:30939:0:(cl_io.c:235:cl_io_rw_init()) io range: 0 [15536750592, 15537799168) 0 0
00000080:00200000:9.0:1496349067.443358:0:30939:0:(vvp_io.c:1381:vvp_io_init()) [0x2000120e5:0x44:0x0] ignore/verify layout 0/0, layout version 0 restore needed 0
00020000:00200000:9.0:1496349067.443361:0:30939:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 14817 chunk: [15536750592, 15537799168) 15537799168
00020000:00200000:9.0:1496349067.443364:0:30939:0:(lov_io.c:413:lov_io_iter_init()) shrink: 1 [3883925504, 3884974080)
00000080:00200000:9.0:1496349067.443367:0:30939:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [3793152, 3793407]
00000080:00200000:9.0:1496349067.443376:0:30939:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [15536750592, 15537799168)
00000080:00400000:9.0:1496349067.443383:0:30939:0:(rw.c:741:ras_update()) [0x2000120e5:0x44:0x0] pages at 3793152 miss.
Comment by Darby Vicker [ 05/Jun/17 ]

The only changes we've made are either applying patch 25996 (i.e. https://review.whamcloud.com/#/c/25996/) or reverting commit d8467ab8a2ca15fbbd5be3429c9cf9ceb0fa78b8.  Beyond that, we haven't tried any other patches.  I will try patch 27388 and let you know what I see.  

Comment by Darby Vicker [ 05/Jun/17 ]

Patch 27388 looks good.  I applied this patch to 2.9.0 and our dd tests show good write and read performance again.  Let me know if you want debug logs or performance plots.  

Comment by Jinshan Xiong (Inactive) [ 05/Jun/17 ]

Thanks for confirming. Now that I can reproduce the issue on my side, I don't need the debug log.

Comment by Darby Vicker [ 05/Jun/17 ]

I'm sorry but it looks like I gave you some bad info.  Patch 27388 is inconsistent for us on the read performance.  Now that we've done some more extensive testing, it is occasionally fast and occasionally slow.  Reverting commit d8467ab is consistently fast for both read and write.  

Comment by Jinshan Xiong (Inactive) [ 05/Jun/17 ]

No problem. From the log you posted, I can see that some file extents are skipped, so the read was detected as a stride read.

Please collect some log with patch 27388 applied.

Comment by Darby Vicker [ 05/Jun/17 ]

OK, here is a debug log with 2.9.0 + patch 27388.  I stopped the read early (notice the ^C) so I wouldn't have to wait for the entire 52 GB file to read.  I don't think that matters but just wanted to mention it to make sure.  

[root@benkirk ~]# lctl set_param debug="reada vfstrace"
debug=reada vfstrace
[root@benkirk ~]# dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M
^C153+0 records in
152+0 records out
3984588800 bytes (4.0 GB) copied, 158.505 s, 25.1 MB/s

[root@benkirk ~]# lctl dk > debug.4node.2.9.0_patch27388.log 
[root@benkirk ~]#

Comment by Darby Vicker [ 08/Jun/17 ]

I see a new patch was uploaded to LU-9214.  Based on the debug logs you've seen, do you think this still might be related?  I can give the new patch a try if you think it might help.  

Comment by Darby Vicker [ 04/Aug/17 ]

Just checking in on this.  We've upgraded to 2.10 and we still had to revert d8467ab to keep from running into this issue.  

Comment by Jinshan Xiong (Inactive) [ 21/Aug/17 ]

I'm looking at the debug log - sorry for the delayed response.

Comment by Jinshan Xiong (Inactive) [ 21/Aug/17 ]

I saw something weird in the log messages:

00020000:00200000:43.0:1496701931.619586:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1908 chunk: [2000683008, 2001731584) 2018508800
00020000:00200000:43.0:1496701931.619592:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 0 [500170752, 501219328)
00000080:00200000:43.0:1496701931.619595:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488448, 488703]
00000080:00200000:43.0:1496701931.619610:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2000683008, 2001731584)
00020000:00200000:43.0:1496701931.620038:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1909 chunk: [2001731584, 2002780160) 2018508800
00020000:00200000:43.0:1496701931.620042:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 1 [500170752, 501219328)
00000080:00200000:43.0:1496701931.620044:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488704, 488959]
00000080:00200000:43.0:1496701931.620052:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2001731584, 2002780160)
00020000:00200000:43.0:1496701931.620456:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1910 chunk: [2002780160, 2003828736) 2018508800
00020000:00200000:43.0:1496701931.620459:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 2 [500170752, 501219328)
00000080:00200000:43.0:1496701931.620461:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488960, 489215]
00000080:00200000:43.0:1496701931.620470:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2002780160, 2003828736)
00000080:00400000:43.0:1496701931.620478:0:3314:0:(rw.c:733:ras_update()) [0x2000120e5:0x44:0x0] pages at 488960 miss.
00000080:00400000:43.0:1496701931.620480:0:3314:0:(rw.c:794:ras_update()) lrp 488447 cr 0 cp 0 ws 488192 wl 7424 nra 488447 rpc 256 r 22 ri 1024 csr 128 sf 359936 sp 512 sl 1024

When page 488960 was being read, the previous read requests to [488448, 488703] and [488704, 488959] were skipped, probably because cached pages were found in memory, which confused the readahead algorithm into detecting the pattern as a stride read.

This pattern actually occurred all the time, and I think the skipped read extents always belong to stripes 0 and 1. Was the application rereading the file, with the cached pages for stripes 2 and 3 having somehow been cleaned up?

Anyway, I still found a problem in stride read, so please try patch set 2 of https://review.whamcloud.com/27388 with the debug level set to "vfstrace reada", and please post the debug log here.
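
As a quick cross-check alongside the debug log, the client readahead statistics should also show whether the reads are being counted as misses; snapshotting the counters before and after a single dd read of the striped file is usually enough (file path is a placeholder):

lctl get_param llite.*.read_ahead_stats   # snapshot the hit/miss counters
dd if=/path/to/striped_test_file of=/dev/null bs=25M
lctl get_param llite.*.read_ahead_stats   # compare against the earlier snapshot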

Comment by Darby Vicker [ 22/Aug/17 ]

I just applied that patch on top of 2.10.0 and tested.  Our full suite of dd tests (test_ss_sn.sh) looked good - reads and writes were fast across the board.  The stock 2.10.0 client was slow for reads (as mentioned in the Aug 4 post), so I think this patch fixes our problem.  I've attached the debug log from this:

 

[root@dvicker .test]# lctl set_param debug="reada vfstrace"
debug=reada vfstrace
[root@dvicker .test]# dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M
2000+0 records in
2000+0 records out
52428800000 bytes (52 GB) copied, 54.5779 s, 961 MB/s
[root@dvicker .test]# lctl dk > debug.4node.2.10.0_patch27388_set2.log

Comment by Gerrit Updater [ 15/Sep/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29016
Subject: LU-9574 llite: pipeline readahead better with large I/O
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 83946d6f389a5c047cc958973c4b2cdf291a6429

Comment by Gerrit Updater [ 21/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27388/
Subject: LU-9574 llite: pipeline readahead better with large I/O
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 011742134e3152f3e389ec30c08ccfc28d7a91a7

Comment by Peter Jones [ 21/Sep/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 24/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29016/
Subject: LU-9574 llite: pipeline readahead better with large I/O
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 73a7af47a01979d5dd5c7e8dcaf3f62d47f11b0b
