Lustre / LU-13312

Optimized RA for stride read under memory pressure

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0
    • Labels: None
    • Branch: master

    Description

      LU-12518 introduced a new readahead (RA) implementation that supports page-unaligned strided I/O and significantly improves performance (e.g. IO500 IOR_hard_read). However, it can still be optimized: the current code sometimes performs poorly under memory pressure, yet performance recovers if the page cache is dropped before reading. Here is a reproducer with results.

      4 x clients (1 x Xeon Gold 5218, 96GB RAM each)
      segment=400000 (~300GB per node)

      # mpirun -np 64 ior -w -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300
      
      # mpirun -np 64 ior -r -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300
       
      Max Read:  5087.32 MiB/sec (5334.44 MB/sec)
      
      RA stats from one of the clients:
      # lctl get_param llite.*.read_ahead_stats
      llite.fast-ffff99878133d000.read_ahead_stats=
      snapshot_time             1582946538.113259755 secs.nsecs
      hits                      72125088 samples [pages]
      misses                    1686810 samples [pages]
      readpage not consecutive  6400000 samples [pages]
      miss inside window        3011 samples [pages]
      failed grab_cache_page    2945424 samples [pages]
      read but discarded        35565 samples [pages]
      zero size window          100245 samples [pages]
      failed to reach end       73663094 samples [pages]
      failed to fast read       6396933 samples [pages]
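
      To make this snapshot and the one taken after dropping caches directly comparable, the counters can be reset between runs. This assumes read_ahead_stats follows the usual clear-on-write convention of Lustre stats files:

      # clush -a "lctl set_param llite.*.read_ahead_stats=clear"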
      

      After dropping the page cache on all clients before the read:

      # clush -a "echo 3 > /proc/sys/vm/drop_caches "
      # mpirun -np 64 ior -r -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300
      
      Max Read:  16244.62 MiB/sec (17033.72 MB/sec)
      
      The same client's RA stats:
      # lctl get_param llite.*.read_ahead_stats
      llite.fast-ffff99878133d000.read_ahead_stats=
      snapshot_time             1582947544.040550353 secs.nsecs
      hits                      73799940 samples [pages]
      misses                    63 samples [pages]
      readpage not consecutive  6400000 samples [pages]
      failed grab_cache_page    2654231 samples [pages]
      read but discarded        1 samples [pages]
      zero size window          500 samples [pages]
      failed to reach end       402367 samples [pages]
      failed to fast read       35075 samples [pages]
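
      A quick sanity check on the two snapshots: the hit ratio hits/(hits+misses) goes from roughly 97.7% before dropping caches to effectively 100% after. Using the numbers above:

      # awk 'BEGIN { printf "before: %.2f%%  after: %.5f%%\n", 100*72125088/(72125088+1686810), 100*73799940/(73799940+63) }'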
      

       

      Attachments

        Activity

          [LU-13312] Optimized RA for stride read under memory pressure

          wshilong Wang Shilong (Inactive) added a comment -

          spitzcor, I'll abandon that patch.
          spitzcor Cory Spitz added a comment -

          wshilong, you closed this, but https://review.whamcloud.com/#/c/37761/ is still pending for this LU. Do you intend to abandon or re-target that patch? Or, shall we re-open this ticket?


          wshilong Wang Shilong (Inactive) added a comment -

          This is not actually a memory problem.

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37762/
          Subject: LU-13312 ldlm: fix to stop iterating tree early in ldlm_kms_shift_cb()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: b28b3bd9094ee7be8e3c11a531383246a71d5dec

          sihara Shuichi Ihara added a comment -

          I've also confirmed that canceling all locks before the read always helps a lot, regardless of whether there is memory pressure.

          # mpirun -np 64 ior -w -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300

          # clush -w ec[01-04] lctl set_param ldlm.namespaces.*.lru_size=clear > /dev/null

          # mpirun -np 64 ior -r -s 400000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /fast/dir/file -O stoneWallingStatusFile=/fast/dir/stonewall -O stoneWallingWearOut=1 -D 300

          Max Read:  22606.54 MiB/sec (23704.67 MB/sec)

          Without canceling locks before the read:
          Max Read:  4241.10 MiB/sec (4447.12 MB/sec)
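
          For anyone reproducing this, a quick way to see how many DLM locks each client is actually caching before and after the clear; lock_count is a standard per-namespace parameter, though the exact output format varies by release:

          # clush -w ec[01-04] lctl get_param ldlm.namespaces.*.lock_count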
          sihara Shuichi Ihara added a comment -

          > -61 is ENODATA, which is returned by osc_io_read_ahead(); it means readahead could not grab locks ahead. This might be related to your "lru_max_age=100", sihara?

          Nope, I didn't change lru_max_age when I got this log.

          wshilong Wang Shilong (Inactive) added a comment -

          Regarding the lock cancel problem: I think we discussed it somewhere but never got a chance to file a known issue for it, so let's track it in this ticket.

          gerrit Gerrit Updater added a comment -

          Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37762
          Subject: LU-13312 ldlm: fix to stop iterating tree early in ldlm_kms_shift_cb()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: c3b211134e5021f52db80d00613de8699f805f7c
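
          To try the change locally, patch set 1 should be fetchable via Gerrit's usual refs/changes layout; the exact ref below is an assumption based on that convention (refs/changes/<last-two-digits>/<change>/<patchset>):

          # git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/62/37762/1 && git checkout FETCH_HEAD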

          wshilong Wang Shilong (Inactive) added a comment -

          I guess you set lru_max_age=100 because, after writing, lock cancellation can take quite a while when too many PW locks are cached in memory.
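
          For reference, the tunable mentioned here can be read and set per LDLM namespace with lctl, as sketched below; note that the unit of lru_max_age differs between Lustre releases, so check your version's documentation:

          # show the current LDLM LRU maximum lock age on a client
          # lctl get_param ldlm.namespaces.*.lru_max_age
          # shorten it, as in the setting discussed above
          # lctl set_param ldlm.namespaces.*.lru_max_age=100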
          wshilong Wang Shilong (Inactive) added a comment - edited

          After checking the debug logs, there are many error messages like:

          00020000:00400000:5.0:1582946316.433652:0:18406:0:(lov_io.c:1049:lov_io_read_ahead()) [0x200000404:0x8:0x0] cra_end = 0, stripes = 240, rc = -61

          -61 is ENODATA, which is returned by osc_io_read_ahead(); it means readahead could not grab locks ahead. This might be related to your "lru_max_age=100", sihara?

          So that explains why running ldlm.namespaces.*.lru_size=clear before the read test starts helps: it guarantees there are no PW locks from other clients, so PR locks can be grabbed very aggressively, which makes our readahead work very well.
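
          The trace above carries the readahead debug mask, so on a standard client something like the following should capture these messages; this is a sketch, and the mask name may vary by release:

          # enable readahead debug messages and clear the debug buffer
          # lctl set_param debug=+reada
          # lctl clear
          # (rerun the read test, then dump the kernel debug log for inspection)
          # lctl dk /tmp/lustre-debug.log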

          People

            Assignee: wshilong Wang Shilong (Inactive)
            Reporter: sihara Shuichi Ihara
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: