Lustre / LU-9574

Large file read performance degradation from multiple OSTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Affects Version/s: Lustre 2.9.0
    • Labels: None
    • Environment: RHEL 7 servers, RHEL 6 and 7 clients
    • Severity: 3

    Description

      We recently noticed that large file read performance on our 2.9 Lustre file system is dramatically worse than it used to be. The attached plot is the result of a test script that uses dd to write a large file (50 GB) to disk, read that file back, and then copy it to a second file, testing write, read, and read/write speeds for large files at various stripe sizes and counts. The two sets of data on this plot were taken on the same server and client hardware. The file system was originally built and formatted with 2.8.0, but we eventually upgraded the servers and clients to 2.9.0. The behavior we are used to seeing is increasing performance as the stripe count increases, with a peak around 4 to 6 OSTs and a fall-off after that as more OSTs are used. This is what we saw under 2.8 (red lines in the plots).

      With 2.9 we still get very good write performance (almost line rate on our 10 GbE clients), but for reads we see extremely good performance with a single OST and significantly degraded performance with multiple OSTs (black lines in the plots). Using a git bisect to compile and test different clients, we were able to isolate it to this commit (a minimal sketch of this kind of dd test is included after the commit reference below):

      commit d8467ab8a2ca15fbbd5be3429c9cf9ceb0fa78b8
      LU-7990 clio: revise readahead to support 16MB IO
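      As an illustration only, here is a minimal sketch of this kind of dd-based stripe sweep, assuming a Lustre client mount under /nobackup/.test and root privileges for the cache drop; the actual test script (test_ss_sn.sh) may differ in file names, sizes, block sizes, and stripe settings.

      #!/bin/bash
      # Sketch of a per-stripe-count dd test: write, drop cache, read, then copy.
      # Mount point, sizes, and stripe counts here are assumptions, not the real test_ss_sn.sh.
      MNT=/nobackup/.test
      SIZE_MB=$((50 * 1024))                     # ~50 GB test file

      for count in 1 2 4 6 8 16; do
          f=$MNT/stripe_${count}.dat
          rm -f "$f" "$f.copy"
          lfs setstripe -c "$count" "$f"         # pre-create the file with this stripe count
          dd if=/dev/zero of="$f" bs=1M count=$SIZE_MB           # write test
          echo 3 > /proc/sys/vm/drop_caches      # drop the client page cache before reading (requires root)
          dd if="$f" of=/dev/null bs=25M                         # read test
          dd if="$f" of="$f.copy" bs=25M                         # read/write (copy) test
          rm -f "$f" "$f.copy"
      done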

      There is slightly more info here:

      http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-May/014509.html

      Please let me know if you need any other data or info.   

      Attachments

        Activity

          [LU-9574] Large file read performance degradation from multiple OSTs

          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/29016/
          Subject: LU-9574 llite: pipeline readahead better with large I/O
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set:
          Commit: 73a7af47a01979d5dd5c7e8dcaf3f62d47f11b0b

          pjones Peter Jones added a comment -

          Landed for 2.11


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27388/
          Subject: LU-9574 llite: pipeline readahead better with large I/O
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 011742134e3152f3e389ec30c08ccfc28d7a91a7


          gerrit Gerrit Updater added a comment -

          Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/29016
          Subject: LU-9574 llite: pipeline readahead better with large I/O
          Project: fs/lustre-release
          Branch: b2_10
          Current Patch Set: 1
          Commit: 83946d6f389a5c047cc958973c4b2cdf291a6429

          dvicker Darby Vicker added a comment -

          I just applied that patch on top of 2.10.0 (a rough sketch of the fetch/cherry-pick steps is included after this comment) and tested. Our full suite of dd tests (test_ss_sn.sh) looked good: reads and writes were fast across the board. The stock 2.10.0 client was slow for reads (as mentioned in the Aug 4 post), so I think this patch fixes our problem. I've attached the debug log from this:

           

          [root@dvicker .test]# lctl set_param debug="reada vfstrace"
          debug=reada vfstrace
          [root@dvicker .test]# dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M
          2000+0 records in
          2000+0 records out
          52428800000 bytes (52 GB) copied, 54.5779 s, 961 MB/s
          [root@dvicker .test]# lctl dk > debug.4node.2.10.0_patch27388_set2.log
          
          
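          For anyone wanting to repeat this, the usual way to pull a Gerrit change such as 27388 onto a local lustre-release tree is sketched below; the base tag name, patch set number, and build commands are assumptions, so check the download box on the review page for the exact refspec.

          # Sketch of applying Gerrit change 27388 (patch set 2 assumed) on top of 2.10.0.
          git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
          git checkout -b test-lu9574 v2_10_0        # 2.10.0 tag name assumed; use your actual base
          git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/88/27388/2
          git cherry-pick FETCH_HEAD
          # then rebuild and reinstall the client as usual, e.g.:
          sh autogen.sh && ./configure && make rpms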

          jay Jinshan Xiong (Inactive) added a comment -

          I saw something odd in the log messages:

          00020000:00200000:43.0:1496701931.619586:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1908 chunk: [2000683008, 2001731584) 2018508800
          00020000:00200000:43.0:1496701931.619592:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 0 [500170752, 501219328)
          00000080:00200000:43.0:1496701931.619595:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488448, 488703]
          00000080:00200000:43.0:1496701931.619610:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2000683008, 2001731584)
          00020000:00200000:43.0:1496701931.620038:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1909 chunk: [2001731584, 2002780160) 2018508800
          00020000:00200000:43.0:1496701931.620042:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 1 [500170752, 501219328)
          00000080:00200000:43.0:1496701931.620044:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488704, 488959]
          00000080:00200000:43.0:1496701931.620052:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2001731584, 2002780160)
          00020000:00200000:43.0:1496701931.620456:0:3314:0:(lov_io.c:448:lov_io_rw_iter_init()) stripe: 1910 chunk: [2002780160, 2003828736) 2018508800
          00020000:00200000:43.0:1496701931.620459:0:3314:0:(lov_io.c:413:lov_io_iter_init()) shrink: 2 [500170752, 501219328)
          00000080:00200000:43.0:1496701931.620461:0:3314:0:(vvp_io.c:230:vvp_io_one_lock_index()) lock: 0 [488960, 489215]
          00000080:00200000:43.0:1496701931.620470:0:3314:0:(vvp_io.c:725:vvp_io_read_start()) read: -> [2002780160, 2003828736)
          00000080:00400000:43.0:1496701931.620478:0:3314:0:(rw.c:733:ras_update()) [0x2000120e5:0x44:0x0] pages at 488960 miss.
          00000080:00400000:43.0:1496701931.620480:0:3314:0:(rw.c:794:ras_update()) lrp 488447 cr 0 cp 0 ws 488192 wl 7424 nra 488447 rpc 256 r 22 ri 1024 csr 128 sf 359936 sp 512 sl 1024
          

          When page 488960 was being read, the reads of the two preceding extents, [488448, 488703] and [488704, 488959], were apparently skipped, probably because cached pages were found in memory; this confused the readahead algorithm into detecting the access pattern as a stride read.

          This pattern occurred throughout the log, and I think the skipped read extents always belong to stripes 0 and 1. Was the application doing a reread, with the cached pages for stripes 2 and 3 somehow having been cleaned up?

          In any case, I still found a problem in stride read, so please try patch set 2 of https://review.whamcloud.com/27388 with the debug level set to "vfstrace reada", and post the resulting debug log here.
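          For reference, collecting the requested log would look roughly like the following; the debug buffer size, test file, and output path are placeholders rather than required values.

          lctl set_param debug_mb=512                 # enlarge the debug buffer so the trace is not lost (size is an assumption)
          lctl set_param debug="vfstrace reada"       # enable only the VFS trace and readahead flags
          lctl clear                                  # drop any old debug entries
          dd if=/nobackup/.test/dvicker-1m-04nodes.keep of=/dev/null bs=25M
          lctl dk > /tmp/debug.reada.vfstrace.log     # dump the kernel debug buffer for upload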


          jay Jinshan Xiong (Inactive) added a comment -

          I'm looking at the debug log - sorry for the delayed response.
          dvicker Darby Vicker added a comment -

          Just checking in on this.  We've upgraded to 2.10 and we still had to revert d8467ab to keep from running into this issue.  

          dvicker Darby Vicker added a comment -

          I see a new patch was uploaded to LU-9214.  Based on the debug logs you've seen, do you think this still might be related?  I can give the new patch a try if you think it might help.  


          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: dvicker Darby Vicker
            Votes: 0
            Watchers: 11
