
LU-9194: single stream read performance with ZFS OSTs


Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • None
    • Lustre 2.9.0
    • Environment: ZFS based OSTs; Lustre 2.9.0 or newer, EL 3.0, EL 3.1

    Description

      With ZFS OSTs the single stream read performance for files with a stripe count of 1 has dropped and depends very strongly on some tunables. I use OSTs that are capable of reading a file directly from ZFS at 1.8 GB/s and more, thanks to the good performance of the zfetch prefetcher. Through Lustre, read performance is often in the range of 300-500 MB/s, which is quite low compared to the solid 1 GB/s we see with ldiskfs on a hardware RAID controller storage unit.

      The explanation for the performance problem is that the read-ahead RPCs issued by the Lustre client (up to 256 in flight) are serviced on the OSS in more or less random order (by the ll_ost_io* kernel threads) and break the zfetch access pattern. The best performance is obtained with a low max_rpcs_in_flight and a large max_read_ahead_per_file_mb. Unfortunately, these settings are bad for loads with many streams per client.
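
      Both tunables are set on the client: max_rpcs_in_flight per OSC device and max_read_ahead_per_file_mb in the llite layer. A minimal sketch of how to inspect them with lctl (parameter paths as in Lustre 2.9):

      # per-OSC limit on concurrent RPCs to one OST
      lctl get_param osc.*.max_rpcs_in_flight
      # per-file client read-ahead budget, in MB
      lctl get_param llite.*.max_read_ahead_per_file_mb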

      The effort in LU-8964 actually makes read performance with ZFS OSTs worse, because the additional parallel tasks are again scheduled in random order.

      Measurements were taken with "dd bs=1M count=100000 ..."; between measurements the caches were dropped on both the OSS and the client.
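
      A sketch of one measurement iteration; the OSS hostname and the test file path are placeholders, and the ticket does not spell out how the OSS-side cache was emptied (note that drop_caches does not flush the ZFS ARC):

      # client: tunable combination under test (example values)
      lctl set_param osc.*.max_rpcs_in_flight=4
      lctl set_param llite.*.max_read_ahead_per_file_mb=64

      # drop caches on OSS and client before each run
      ssh oss-node 'echo 3 > /proc/sys/vm/drop_caches'
      echo 3 > /proc/sys/vm/drop_caches

      # single stream read; bandwidth as reported by dd
      dd if=/mnt/lustre/testfile of=/dev/null bs=1M count=100000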

      single stream dd bs=1M read bandwidth (MB/s), ZFS prefetch enabled
      ------------------------------------------------
      rpcs in     max_read_ahead_per_file_mb
      flight        1     16     32     64    256
      ------------------------------------------------
          1       335    597    800    817    910
          2       379    500    657    705    690
          4       335    444    516    558    615
          8       339    396    439    471    546
         16       378    359    385    404    507
         32       333    360    378    379    429
         64       332    346    377    377    398
        128       375    359    379    381    402
        256       339    351    380    378    409

      Disabling the ZFS prefetcher completely helps when max_read_ahead_per_file_mb is huge and max_rpcs_in_flight is large, because the many (randomly ordered) requests then produce a prefetching effect of their own. Unfortunately, multi-stream workloads perform very badly without the prefetcher, so disabling prefetch is not really a solution.
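
      For reference, with ZFS on Linux the prefetcher can be disabled and re-enabled through a module parameter on the OSS (the ticket does not state how it was toggled for these runs):

      # on the OSS: disable zfetch completely
      echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
      # restore the default (prefetch enabled)
      echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable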

      single stream dd bs=1M read bandwidth (MB/s), ZFS prefetch disabled
      ------------------------------------------------
      rpcs in     max_read_ahead_per_file_mb
      flight        1     16     32     64    256
      ------------------------------------------------
          1       155    247    286    288    283
          2       157    292    360    360    358
          4       157    346    461    465    450
          8       155    389    580    602    604
         16       158    384    614    782    791
         32       152    386    600    878    972
         64       158    386    597    858   1100
        128       155    390    603    863    948
        256       160    382    602    859    934

      There are probably two approaches to the problem:

      1. make ZFS zfetch smarter, such that it can cope with the pseudo-randomly ordered read requests from Lustre.
      2. change the Lustre client such that it has only one RPC in flight to a particular OST object. This would present an acceptable pattern to zfetch and lead to ~1GB/s for the single stream read.

      This ticket is about implementing option 2. I have prepared patches for tracking the number of read requests per osc_object, but I have difficulties limiting/enforcing that number in osc_cache.c. I hope for some hints...
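
      Until such a per-object limit exists in the osc layer, the closest approximation with the existing global tunables is the combination from the first table above, which helps a single stream but hurts multi-stream loads:

      # global workaround, not the per-OST-object limit this ticket asks for
      lctl set_param osc.*.max_rpcs_in_flight=1
      lctl set_param llite.*.max_read_ahead_per_file_mb=256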

          People

            Assignee: Joseph Gmitter (jgmitter, Inactive)
            Reporter: Erich Focht (efocht)