Lustre / LU-17680

performance regression in sanity test_123ac

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      Sanity test_123ac appears to have slowed down on master only, not on other branches. Before 2023-11-29 this test took about 500s to complete; since 2023-11-30 it has been taking about 900s.
      https://testing.whamcloud.com/reports?test_set_script_id=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=5b5f8f4c-154f-11ea-b934-52540065bddc&source=subtest_trend#redirect

      Sanity test_135 shows a similar slowdown starting at the same time, so the regression isn't specific to this subtest and more likely comes from a patch or environment change:
      https://testing.whamcloud.com/reports?test_set_script_id=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=10505ba6-1cdb-11ea-971c-52540065bddc&source=subtest_trend#redirect

      As well as parallel-scale test_mdtestssf:
      https://testing.whamcloud.com/reports?test_set_script_id=b10ed7ea-55b4-11e0-bb3d-52540025f9af&sub_test_script_id=af586da2-fab5-11e0-bbc0-52540025f9af&source=subtest_trend#redirect

      and performance-sanity test_1:
      https://testing.whamcloud.com/reports?test_set_script_id=a7db9478-5989-11e0-a272-52540025f9af&sub_test_script_id=5de0968e-61e5-11e0-a2b4-52540025f9af&source=subtest_trend#redirect

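      This looks like it is caused by a significant slowdown during file creation. The older sanity test_123ac run shows 90000 creates in about 84s (1073.69 ops/second), while the newer run takes about 404s (222.88 ops/second), roughly 4.8x slower. In the newer run the per-interval rate also drops steadily, from about 826 ops/second at the start to under 100 ops/second near the end: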
      BEFORE 2023-11-29:

       - open/close 10000 (time 1701261460.68 total 8.79 last 1137.72)
       - open/close 20000 (time 1701261470.62 total 18.73 last 1006.14)
       - open/close 30000 (time 1701261479.73 total 27.84 last 1097.80)
       - open/close 40000 (time 1701261489.31 total 37.42 last 1043.21)
       - open/close 50000 (time 1701261498.23 total 46.35 last 1120.73)
       - open/close 60000 (time 1701261507.50 total 55.62 last 1078.72)
       - open/close 70000 (time 1701261516.75 total 64.87 last 1081.18)
       - open/close 80000 (time 1701261526.16 total 74.27 last 1063.71)
      total: 90000 open/close in 83.82 seconds: 1073.69 ops/second
      

      AFTER 2023-11-30:

       - open/close 8266 (time 1701355946.26 total 10.00 last 826.49)
       - open/close 10000 (time 1701355948.92 total 12.66 last 651.12)
       - open/close 15838 (time 1701355958.92 total 22.66 last 583.80)
       - open/close 20000 (time 1701355967.19 total 30.93 last 503.66)
      :
       - open/close 86594 (time 1701356305.29 total 369.03 last 102.87)
       - open/close 87611 (time 1701356315.30 total 379.04 last 101.66)
       - open/close 88604 (time 1701356325.30 total 389.04 last 99.27)
       - open/close 89540 (time 1701356335.30 total 399.04 last 93.57)
      total: 90000 open/close in 403.81 seconds: 222.88 ops/second
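
      The comparison between the two runs can be reproduced directly from the log lines. The following is a minimal sketch, not part of the test suite: the function names are hypothetical, the regular expressions are written only against the lines quoted here, and the `last` field is assumed to be the rate over the most recent reporting interval.

```python
import re

# Summary lines copied verbatim from the two runs quoted above.
BEFORE = "total: 90000 open/close in 83.82 seconds: 1073.69 ops/second"
AFTER = "total: 90000 open/close in 403.81 seconds: 222.88 ops/second"

# A few progress lines from the slow run, copied verbatim.
AFTER_PROGRESS = """\
 - open/close 8266 (time 1701355946.26 total 10.00 last 826.49)
 - open/close 20000 (time 1701355967.19 total 30.93 last 503.66)
 - open/close 89540 (time 1701356335.30 total 399.04 last 93.57)
"""

SUMMARY_RE = re.compile(
    r"total: (\d+) open/close in ([\d.]+) seconds: ([\d.]+) ops/second"
)
PROGRESS_RE = re.compile(
    r"open/close (\d+) \(time [\d.]+ total ([\d.]+) last ([\d.]+)\)"
)


def parse_summary(line):
    """Return (file_count, elapsed_seconds, ops_per_second) from a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def slowdown(before_line, after_line):
    """Ratio of elapsed times (after / before) for two summary lines."""
    _, t_before, _ = parse_summary(before_line)
    _, t_after, _ = parse_summary(after_line)
    return t_after / t_before


def interval_rates(text):
    """Return (files_created_so_far, rate_over_last_interval) per progress line."""
    return [(int(n), float(last)) for n, _total, last in PROGRESS_RE.findall(text)]


if __name__ == "__main__":
    print(f"slowdown: {slowdown(BEFORE, AFTER):.2f}x")
    for count, rate in interval_rates(AFTER_PROGRESS):
        print(f"{count:>6} files: {rate:.2f} ops/second")
```

      Comparing the interval rates rather than only the totals makes it visible that the slow run degrades as more files are created, instead of being uniformly slower.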
      

      The patches that landed on 2023-11-29 are:

      de352465eb LU-17046 tests: fix write success in 1g
      0ef4e5b0c1 LU-16518 lnet: fix uninitialized variable in api-ni.c
      350dfbcfa8 LU-17293 kernel: update SLES15 SP5 [5.14.21-150500.55.36.1]
      57217b7e4e LU-17280 scrub: skip dir stripes with OI
      698498b563 LU-17275 kernel: RHEL 8.9 client and server support
      9eb87e7ef3 LU-17274 kernel: new kernel [RHEL 9.3 5.14.0-362.8.1.el9_3]
      f3b45a0547 LU-17278 ldlm: don't grant failed lock
      c5aa16db17 LU-17265 tests: allow margin for sanity/39r
      6897dbe67c LU-17230 socklnd: treat UNKNOWN netif operstate as UP
      e383791b1c LU-17216 ofd: make enable_health_write tunable
      3fcddf6dcd LU-17212 gss: survive improper obd or imp at ctx init
      a66daa9c1b LU-17097 osc: delete items in Xarray before its destroy
      78605b74a5 LU-8191 lustre: convert lmv,lod,lov functions to static
      17c1cdf40e LU-10391 socklnd: handle IPv6 for zero copy messages
      8276ade19d LU-10391 mgs: copy full nid string
      6b514c0bd0 LU-16827 obdfilter: Fix "emfperf obdfilter-survey" error
      286809eeb6 LU-17277 build: Distribute lutf.sh unconditionally
      774c3d2883 LU-10391 lnet: missing some peer functionality
      c0973b9fd6 LU-10729 tests: replay-dual/22d to wait
      ce404bd07c LU-17174 misc: fix hash functions
      

      It isn't obvious which patch might have affected the performance of this test. The most likely candidate is LU-17174, which changes the core LDLM hash functions. I will push one test patch based on the commit just before it and one based on the commit just after it to see whether this is the case.
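
      If testing around LU-17174 does not settle it, one alternative is a binary search over the 20 patches, which needs at most 5 test runs. The following is a minimal sketch of that idea, not the procedure described above. It assumes the commit list is ordered oldest to newest, that the newest commit is slow, and that every commit after the first slow one is also slow. `is_bad` is a hypothetical callback that would build a commit, run the test, and report whether it is slow.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() returns True.

    Assumes commits is ordered oldest to newest, the last commit is bad,
    and once a commit is bad every later commit is bad too.
    """
    if not commits:
        raise ValueError("commit list is empty")
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Placeholder commit names and a stand-in predicate, for illustration only.
    commits = [f"c{i}" for i in range(20)]
    print(first_bad(commits, lambda c: int(c[1:]) >= 13))
```

      In practice the callback would compare a measured runtime or create rate against a threshold between the fast and slow results, for example somewhere between the roughly 500s and 900s test times reported above.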

            People

              adilger Andreas Dilger