LU-12429: Single client buffered SSF write is slower than O_DIRECT

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.13.0

    Description

      Single client's SSF (single shared file) write doesn't scale with the number of processes:

      # mpirun --allow-run-as-root -np X /work/tools/bin/ior -w -t 16m -b $((32/X))g -e -o file
      
      NP     Write(MB/s)
        1     1594
        2     2525
        4     1892
        8     2032
       16     1812
      

      A flame graph captured during the IOR run with NP=16 showed a large amount of time spent in spin_lock from add_to_page_cache_lru() and set_page_dirty(). As a result, buffered SSF write on a single client is slower than SSF with O_DIRECT. Here are my quick test results of single-client SSF with and without O_DIRECT.

      # mpirun -np 16 --allow-run-as-root /work/tools/bin/ior -w -t 16m -b 4g -e -o /scratch0/stripe/file 
      Max Write: 1806.31 MiB/sec (1894.06 MB/sec)
      
      # mpirun -np 16 --allow-run-as-root /work/tools/bin/ior -w -t 16m -b 4g -e -o /scratch0/stripe/file -B
      Max Write: 5547.13 MiB/sec (5816.58 MB/sec)
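
      The ticket doesn't record how the flame graph was captured; a typical way to reproduce it is perf plus the FlameGraph scripts, run on the client while the 16-process IOR job is in flight (a sketch; the /work/tools/FlameGraph path is just a placeholder for wherever the scripts are checked out):

      # on the client, while the mpirun/ior job above is running
      perf record -a -g -- sleep 30
      perf script | /work/tools/FlameGraph/stackcollapse-perf.pl | \
          /work/tools/FlameGraph/flamegraph.pl > ior-np16.svg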
      

          Activity


            adilger Andreas Dilger added a comment -

            Does it make sense to just automatically bypass the page cache on the client for read() and/or write() calls that are large enough and aligned (essentially use O_DIRECT automatically)? For example, read/write over 16MB if single-threaded, or over 4MB if multi-threaded? That would totally avoid the overhead of the page cache for those syscalls.
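
            For a rough sense of the gap such a heuristic would close, a single large aligned write stream can be compared with and without O_DIRECT using dd (a minimal sketch; /scratch0/stripe/ddfile and the 16 MB transfer size are placeholders chosen to match the tests above, not a proposed implementation):

            # buffered 16 MB writes, flushed at the end
            dd if=/dev/zero of=/scratch0/stripe/ddfile bs=16M count=256 conv=fsync
            # the same transfer bypassing the page cache
            dd if=/dev/zero of=/scratch0/stripe/ddfile bs=16M count=256 oflag=direct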
            dongyang Dongyang Li added a comment -

            I agree, this ticket is more about the page cache overhead for multi-thread buffered write.
            sihara Shuichi Ihara added a comment - edited

            DY, attached is a flame graph of the Lustre client during a single-thread IOR write. It might be related, but it's a different workload (e.g. buffered IO vs O_DIRECT, single thread vs single client). I wonder if I should open a new ticket for it?

            sihara Shuichi Ihara added a comment -

            https://review.whamcloud.com/#/c/28711/ (latest patchset 8) doesn't help very much either.

            # mpirun -np 16 --allow-run-as-root /work/tools/bin/ior -w -t 16m -b 4g -e -o /cache1/stripe/file
            Max Write: 2109.99 MiB/sec (2212.49 MB/sec)
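
            For anyone re-testing this, the patchset can be pulled straight from Gerrit (a sketch assuming the standard Gerrit refs/changes layout and the fs/lustre-release project on review.whamcloud.com; patchset 8 is the one measured above):

            cd lustre-release
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/11/28711/8
            git checkout FETCH_HEAD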

            pfarrell Patrick Farrell (Inactive) added a comment -

            I'm glad the patch improves things by 25%. I'm pretty sure a new flame graph would basically show more time shifting to the contention on page allocation rather than page dirtying, but still those two hot spots. It would be interesting to see, though.

            Backing up: packed-node direct I/O with reasonable sizes is always going to be better than buffered I/O. We're not going to be able to fix that unless we were to convert buffered to direct in that scenario.

            I also don't have any other good ideas for improvements. The contention we're facing is in the page cache itself, and Lustre isn't contributing to it. Unless we want to do something radical like trying to convert from buffered to direct when we run into trouble, there will always be a gap. (FYI, I don't like the idea of switching when the node is busy, for a variety of reasons.)

            So I think we have to decide what the goal is for this ticket, as the implied goal of making them the same is, unfortunately, not realistic.

            pfarrell Patrick Farrell (Inactive) added a comment -

            That patch probably doesn't work with newer kernels - the mapping->tree_lock has been renamed. I need to fix that, and will do shortly... But you shouldn't expect much benefit; there have not been many changes in that area, just some reshuffling.
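
            For reference, mainline v4.17 removed mapping->tree_lock in favour of the xa_lock embedded in mapping->i_pages (formerly page_tree), which is likely the rename meant here. A quick way to see which variant a given kernel tree uses (a sketch, assuming a kernel source tree is at hand):

            # old kernels: spinlock_t tree_lock; newer: i_pages with its embedded xa_lock
            grep -n "tree_lock\|i_pages" include/linux/fs.h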

            People

              dongyang Dongyang Li
              sihara Shuichi Ihara

              Votes: 0
              Watchers: 6