Large file read performance degradation from multiple OSTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: RHEL 7 servers, RHEL 6 and 7 clients
    • Severity: 3

    Description

      We recently noticed that the large file read performance on our 2.9 LFS is dramatically worse than it used to be. The attached plot is the result of a test script that uses dd to write a large file (50 GB) to disk, read that file back, and then copy it to a second file, testing write, read, and read/write speeds for large files at various stripe sizes and counts. The two sets of data in the plot were taken on the same server and client hardware. The LFS was originally built and formatted with 2.8.0, but we eventually upgraded the servers and clients to 2.9.0.

      The behavior we are used to seeing is increasing performance as the stripe count increases, with a peak around 4 or 6 OSTs and a gradual drop-off as more OSTs are used. This is what we saw under 2.8 (red lines in the plots). With 2.9 we still get very good write performance (almost line rate on our 10 GbE clients), but for reads we see extremely good performance with a single OST and significantly degraded performance with multiple OSTs (black lines in the plots).

      Using a git bisect to compile and test different clients, we were able to isolate it to this commit:

      commit d8467ab8a2ca15fbbd5be3429c9cf9ceb0fa78b8
      LU-7990 clio: revise readahead to support 16MB IO

      There is slightly more info here:

      http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-May/014509.html

      Please let me know if you need any other data or info.   
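
      In case it helps, the core of the test is roughly the loop below. This is a simplified sketch rather than the actual test_ss_sn.sh (which sweeps more stripe counts/sizes and records timings); the paths and stripe counts here are just placeholders.

      for count in 1 2 4 6 8; do
          dir=/nobackup/.test/stripe_${count}
          mkdir -p ${dir}
          lfs setstripe -c ${count} ${dir}                    # new files in ${dir} striped across ${count} OSTs
          dd if=/dev/zero of=${dir}/file1 bs=25M count=2000   # write test, ~50 GB
          echo 3 > /proc/sys/vm/drop_caches                   # drop the client page cache between phases
          dd if=${dir}/file1 of=/dev/null bs=25M              # read test
          dd if=${dir}/file1 of=${dir}/file2 bs=25M           # read/write test (copy to a second file)
      done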

      Attachments

        Activity

          [LU-9574] Large file read performance degradation from multiple OSTs

          gerrit Gerrit Updater added a comment -

          Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29016
          Subject: LU-9574 llite: pipeline readahead better with large I/O
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set: 1
          Commit: 83946d6f389a5c047cc958973c4b2cdf291a6429

          dvicker Darby Vicker added a comment -

          I just applied that patch on top of 2.10.0 and tested.  Our full suite of dd tests (test_ss_sn.sh) looked good - reads and writes were fast across the board.  The stock 2.10.0 client was slow for reads (as mentioned in the Aug 4 post), so I think this patch fixes our problem.  I've attached the debug log from this:

           

          [root@dvicker .test]# lctl set_param debug="reada vfstrace"
          debug=reada vfstrace
          [root@dvicker .test]# dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M
          2000+0 records in
          2000+0 records out
          52428800000 bytes (52 GB) copied, 54.5779 s, 961 MB/s
          [root@dvicker .test]# lctl dk > debug.4node.2.10.0_patch27388_set2.log
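
          For reference, I pulled the patch set onto the client tree with the usual Gerrit cherry-pick flow, roughly as below; the exact ref is assumed from Gerrit's standard change/patch-set numbering, so double-check it on the review page.

          # in a checkout of fs/lustre-release at the client version being built
          git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/88/27388/2
          git cherry-pick FETCH_HEAD
          # then rebuild and reinstall the client packages as usual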
          
          

          jay Jinshan Xiong (Inactive) added a comment -

          I saw something weird in the log messages:

          00020000:00200000:43.0:1496701931.619586:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1908 chunk: [2000683008, 2001731584) 2018508800
          00020000:00200000:43.0:1496701931.619592:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 0 [500170752, 501219328)
          00000080:00200000:43.0:1496701931.619595:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488448, 488703]
          00000080:00200000:43.0:1496701931.619610:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2000683008, 2001731584)
          00020000:00200000:43.0:1496701931.620038:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1909 chunk: [2001731584, 2002780160) 2018508800
          00020000:00200000:43.0:1496701931.620042:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 1 [500170752, 501219328)
          00000080:00200000:43.0:1496701931.620044:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488704, 488959]
          00000080:00200000:43.0:1496701931.620052:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2001731584, 2002780160)
          00020000:00200000:43.0:1496701931.620456:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1910 chunk: [2002780160, 2003828736) 2018508800
          00020000:00200000:43.0:1496701931.620459:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 2 [500170752, 501219328)
          00000080:00200000:43.0:1496701931.620461:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488960, 489215]
          00000080:00200000:43.0:1496701931.620470:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2002780160, 2003828736)
          00000080:00400000:43.0:1496701931.620478:0:3314:0:(rw.c:733:ras_update()) [0x2000120e5:0x44:0x0] pages at 488960 miss.
          00000080:00400000:43.0:1496701931.620480:0:3314:0:(rw.c:794:ras_update()) lrp 488447 cr 0 cp 0 ws 488192 wl 7424 nra 488447 rpc 256 r 22 ri 1024 csr 128 sf 359936 sp 512 sl 1024
          

          When page 488960 was being read, the previous read requests for [488448, 488703] and [488704, 488959] were skipped, probably because cached pages were found in memory, which confused the readahead algorithm into detecting this as a stride read.

          This pattern actually occurred all the time, and I think the skipped read extents always belong to stripes 0 & 1. Was the application doing a reread, and were the cached pages for stripes 2 & 3 somehow cleaned up?

          Anyway, I still found a problem in the stride read code, so please try patch set 2 of https://review.whamcloud.com/27388 with the debug level set to "vfstrace reada", and please post the debug log here.
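
          If it helps, you could also sample the client readahead counters around a run; this assumes the usual llite read_ahead_stats interface (writing to the stats file clears it), and the file path is just the one from your earlier test:

          lctl set_param llite.*.read_ahead_stats=clear       # writing to the stats file resets the counters
          dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M
          lctl get_param llite.*.read_ahead_stats             # hits/misses and related readahead counters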


          jay Jinshan Xiong (Inactive) added a comment -

          I'm looking at the debug log - sorry for the delayed response.

          dvicker Darby Vicker added a comment -

          Just checking in on this.  We've upgraded to 2.10 and we still had to revert d8467ab to keep from running into this issue.  
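
          For the record, the revert is just backing that one commit out before building the client, something like the steps below (tag and build-target names are from memory, so adjust as needed):

          git clone git://git.whamcloud.com/fs/lustre-release.git
          cd lustre-release
          git checkout v2_10_0                                   # the client release tag we build from (adjust as needed)
          git revert d8467ab8a2ca15fbbd5be3429c9cf9ceb0fa78b8    # LU-7990 readahead commit identified by the bisect
          sh autogen.sh && ./configure && make rpms              # then install the rebuilt client RPMs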

          dvicker Darby Vicker added a comment -

          I see a new patch was uploaded to LU-9214.  Based on the debug logs you've seen, do you think this still might be related?  I can give the new patch a try if you think it might help.  


          dvicker Darby Vicker added a comment -

          OK, here is a debug log with 2.9.0 + patch 27388.  I stopped the read early (notice the ^C) so I wouldn't have to wait for the entire 52 GB file to read.  I don't think that matters, but I wanted to mention it to be sure.

          [root@benkirk ~]# lctl set_param debug="reada vfstrace"
          debug=reada vfstrace
          [root@benkirk ~]# dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M
          ^C153+0 records in
          152+0 records out
          3984588800 bytes (4.0 GB) copied, 158.505 s, 25.1 MB/s
          
          [root@benkirk ~]# lctl dk > debug.4node.2.9.0_patch27388.log 
          [root@benkirk ~]#
          
          

          jay Jinshan Xiong (Inactive) added a comment -

          No problem. From what I have seen in the log you posted, some file extents were skipped, so the read was detected as a stride read.

          Please collect some logs with patch 27388 applied.

          dvicker Darby Vicker added a comment -

          I'm sorry, but it looks like I gave you some bad info.  Patch 27388 is inconsistent for us on read performance.  Now that we've done some more extensive testing, it is occasionally fast and occasionally slow.  Reverting commit d8467ab is consistently fast for both reads and writes.


          jay Jinshan Xiong (Inactive) added a comment -

          Thanks for confirming. Now that I can reproduce the issue on my side, I don't need the debug log.


          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: dvicker Darby Vicker
            Votes: 0
            Watchers: 11
