Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14055

Write performance regression caused by an commit from LU-13344

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0
    • Lustre 2.14.0
    • None
    • master (commit: 56526a90ae)
    • 3
    • 9223372036854775807

    Description

      commit 76626d6c52 "LU-13344 all: Separate debugfs and procfs handling" caused write performance regression. Here is a reproducer and tested workload.

      Single Client(Ubuntu 18.04, 5.4.0-47-generic), 16MB O_DIRECT, FPP (128 processes)

      # mpirun --allow-run-as-root -np 128 --oversubscribe --mca btl_openib_warn_default_gid_prefix 0 --bind-to none ior -u -w -r
       -k -e -F -t 16384k -b 16384k -s 1000 -u -o /mnt/ai400x/ior.out/file --posix.odirect 
      

      "git bisect" indentified an commit where regression started.

      Here is test results.
      76626d6c52 LU-13344 all: Separate debugfs and procfs handling

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     21861      1366.33    60.78       16384      16384      0.091573   93.68      40.38      93.68      0   
      read      38547      2409.18    46.14       16384      16384      0.005706   53.13      8.26       53.13      0   
      

      5bc1fe092c LU-13196 llite: Remove mutex on dio read

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     32678      2042.40    58.96       16384      16384      0.105843   62.67      4.98       62.67      0   
      read      38588      2411.78    45.89       16384      16384      0.004074   53.07      8.11       53.07      0   
      

      master (commit 56526a90ae)

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     17046      1065.37    119.02      16384      16384      0.084449   120.15     67.76      120.15     0   
      read      38512      2407.00    45.04       16384      16384      0.006462   53.18      9.07       53.18      0   
      

      master still has this regression and when commit 76626d6c52 reverts from master, the performrance is back.

      master (commit 56526a90ae)+ revert commit 76626d6c52

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     32425      2026.59    59.88       16384      16384      0.095842   63.16      4.79       63.16      0   
      read      39601      2475.09    47.22       16384      16384      0.003637   51.72      5.73       51.72      0  
      

      Attachments

        1. master.svg
          245 kB
        2. master-revert.svg
          414 kB

        Issue Links

          Activity

            [LU-14055] Write performance regression caused by an commit from LU-13344

            My understanding is this issue is resolved, though we never fully understood it.  Let me knw if that's wrong and we can reopen or open a new ticket.

            paf0186 Patrick Farrell added a comment - My understanding is this issue is resolved, though we never fully understood it.  Let me knw if that's wrong and we can reopen or open a new ticket.
            pjones Peter Jones added a comment -

            sihara can this issue now be considered fixed?

            pjones Peter Jones added a comment - sihara can this issue now be considered fixed?

            Ihara,

            FYI I had to rebase that second patch above:
            https://review.whamcloud.com/#/c/39482/

            It was not compatible with current master, should be fixed now.

            paf0186 Patrick Farrell added a comment - Ihara, FYI I had to rebase that second patch above: https://review.whamcloud.com/#/c/39482/ It was not compatible with current master, should be fixed now.
            paf0186 Patrick Farrell added a comment - - edited

            tappro made this comment over on LU-14580, and I wanted to bring it here:

             
            "I don't see problems with patch itself. Increment in osc_consume_write_grant() was removed because it is done by atomic_long_add_return() now outside that call and it is done in both places where it is called. But maybe the patch "LU-12687 osc: consume grants for direct I/O" itself causes slowdown? Now grants are taken for Direct IO as well, so maybe that is related to not enough grants problem or similar. Are there any complains about grants on client during IOR run?"

            That patch definitely has performance implications.  Direct i/o will keep sending even when there are no grants - since it is already synchronous - but it significantly increases the load on the cl_loi_list_lock in some cases.  The patches I noted above are aimed at that.

            There's still very likely a memory layout issue here, but perhaps these will help...

            paf0186 Patrick Farrell added a comment - - edited tappro made this comment over on LU-14580 , and I wanted to bring it here:   "I don't see problems with patch itself. Increment in  osc_consume_write_grant()  was removed because it is done by  atomic_long_add_return()  now outside that call and it is done in both places where it is called. But maybe the patch " LU-12687  osc: consume grants for direct I/O" itself causes slowdown? Now grants are taken for Direct IO as well, so maybe that is related to not enough grants problem or similar. Are there any complains about grants on client during IOR run?" That patch definitely has performance implications.  Direct i/o will keep sending even when there are no grants - since it is already synchronous - but it significantly increases the load on the cl_loi_list_lock in some cases.  The patches I noted above are aimed at that. There's still very likely a memory layout issue here, but perhaps these will help...

            Ihara,

            Ignoring entirely the apparent memory layout issue, I have two suggestions that may reduce the pain on cl_loi_list_lock.

            This patch just landed to master:

            https://review.whamcloud.com/38214

            And this one which has not yet landed:
            https://review.whamcloud.com/#/c/39482/

            If you're able to give these a try, that would be very interesting.  Both should cut the contention on cl_loi_list_lock when doing dio.  I'm not sure precisely what effect they'll have here, but it seems worth trying.

            paf0186 Patrick Farrell added a comment - Ihara, Ignoring entirely the apparent memory layout issue, I have two suggestions that may reduce the pain on cl_loi_list_lock. This patch just landed to master: https://review.whamcloud.com/38214 And this one which has not yet landed: https://review.whamcloud.com/#/c/39482/ If you're able to give these a try, that would be very interesting.  Both should cut the contention on cl_loi_list_lock when doing dio.  I'm not sure precisely what effect they'll have here, but it seems worth trying.

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41162/
            Subject: LU-14055 lmv: reduce struct lmv_obd size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e11deeb1e6d114608eac4ee998d4cea22e30b0f5

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41162/ Subject: LU-14055 lmv: reduce struct lmv_obd size Project: fs/lustre-release Branch: master Current Patch Set: Commit: e11deeb1e6d114608eac4ee998d4cea22e30b0f5
            jhammond John Hammond added a comment -

            > Attached are two frame graphes on good (reverted whole LU-13344 patches) and bad (master) cases.

            How long did perf run for the two flame graphs? One has 5x the samples as the other. master-revert.svg shows 20% in swapper and 5% in open. Had ior completed startup when perf was run?

            jhammond John Hammond added a comment - > Attached are two frame graphes on good (reverted whole LU-13344 patches) and bad (master) cases. How long did perf run for the two flame graphs? One has 5x the samples as the other. master-revert.svg shows 20% in swapper and 5% in open. Had ior completed startup when perf was run?

            It looks like the contended lock in master.svg is cl_loi_list_lock in both osc_queue_sync_pages() and brw_interpret(). The cl_loi_list_lock is in struct client_obd. What is strange is that this field is before the added structure, so I can't see how that would be affected by the new field. I'll have to check if there is something later in obd_device that is being accessed under that lock.

            adilger Andreas Dilger added a comment - It looks like the contended lock in master.svg is cl_loi_list_lock in both osc_queue_sync_pages() and brw_interpret() . The cl_loi_list_lock is in struct client_obd . What is strange is that this field is before the added structure, so I can't see how that would be affected by the new field. I'll have to check if there is something later in obd_device that is being accessed under that lock.

            Attached are two frame graphes on good (reverted whole LU-13344 patches) and bad (master) cases.
            I see huge amount of spin_locks in osc_queue_sync_pages() under ll_direct_IO() in bad case. There are also several spin_lock impacts in the functions under brw_interpret().

            sihara Shuichi Ihara added a comment - Attached are two frame graphes on good (reverted whole LU-13344 patches) and bad (master) cases. I see huge amount of spin_locks in osc_queue_sync_pages() under ll_direct_IO() in bad case. There are also several spin_lock impacts in the functions under brw_interpret().

            Strange that the patch to reduce the structure to the previous size (-8 bytes) and the patch to move the new 8-byte field to the end didn't help performance and the patch to reduce the struct size -2000 bytes also didn't help. It is possible my "-8 byte" patch was too complex, and I moved too many fields around? I had tried to keep the related fields together.

            Do you have any idea where the performance is being lost? Is it high CPU usage or lock contention or something else?

            adilger Andreas Dilger added a comment - Strange that the patch to reduce the structure to the previous size (-8 bytes) and the patch to move the new 8-byte field to the end didn't help performance and the patch to reduce the struct size -2000 bytes also didn't help. It is possible my "-8 byte" patch was too complex, and I moved too many fields around? I had tried to keep the related fields together. Do you have any idea where the performance is being lost? Is it high CPU usage or lock contention or something else?

            People

              paf0186 Patrick Farrell
              sihara Shuichi Ihara
              Votes:
              0 Vote for this issue
              Watchers:
              20 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: