Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.16.0
Description
It looks like there is a performance slowdown in sanity test_123ac on master only, not on other branches. Before 2023-11-29 this test took about 500s to complete, and after 2023-11-30 it is taking about 900s to complete.
https://testing.whamcloud.com/reports?test_set_script_id=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=5b5f8f4c-154f-11ea-b934-52540065bddc&source=subtest_trend#redirect
Sanity test_135 shows a similar slowdown starting at the same time, so the problem isn't specific to this subtest; it points to a patch or environment change:
https://testing.whamcloud.com/reports?test_set_script_id=f9516376-32bc-11e0-aaee-52540025f9ae&sub_test_script_id=10505ba6-1cdb-11ea-971c-52540065bddc&source=subtest_trend#redirect
As well as parallel-scale test_mdtestssf:
https://testing.whamcloud.com/reports?test_set_script_id=b10ed7ea-55b4-11e0-bb3d-52540025f9af&sub_test_script_id=af586da2-fab5-11e0-bbc0-52540025f9af&source=subtest_trend#redirect
and performance-sanity test_1:
https://testing.whamcloud.com/reports?test_set_script_id=a7db9478-5989-11e0-a272-52540025f9af&sub_test_script_id=5de0968e-61e5-11e0-a2b4-52540025f9af&source=subtest_trend#redirect
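This looks like it is caused by a significant slowdown during file creation. The older sanity test_123ac runs show 90000 creates in about 84s (1073.69 ops/second), and the newer runs show 90000 creates in about 404s (222.88 ops/second). In the newer runs the rate also keeps dropping as the test proceeds, from about 826 ops/second at the start to under 100 ops/second at the end: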
BEFORE 2023-11-29:
- open/close 10000 (time 1701261460.68 total 8.79 last 1137.72)
- open/close 20000 (time 1701261470.62 total 18.73 last 1006.14)
- open/close 30000 (time 1701261479.73 total 27.84 last 1097.80)
- open/close 40000 (time 1701261489.31 total 37.42 last 1043.21)
- open/close 50000 (time 1701261498.23 total 46.35 last 1120.73)
- open/close 60000 (time 1701261507.50 total 55.62 last 1078.72)
- open/close 70000 (time 1701261516.75 total 64.87 last 1081.18)
- open/close 80000 (time 1701261526.16 total 74.27 last 1063.71)
total: 90000 open/close in 83.82 seconds: 1073.69 ops/second
AFTER 2023-11-30:
- open/close 8266 (time 1701355946.26 total 10.00 last 826.49)
- open/close 10000 (time 1701355948.92 total 12.66 last 651.12)
- open/close 15838 (time 1701355958.92 total 22.66 last 583.80)
- open/close 20000 (time 1701355967.19 total 30.93 last 503.66)
:
- open/close 86594 (time 1701356305.29 total 369.03 last 102.87)
- open/close 87611 (time 1701356315.30 total 379.04 last 101.66)
- open/close 88604 (time 1701356325.30 total 389.04 last 99.27)
- open/close 89540 (time 1701356335.30 total 399.04 last 93.57)
total: 90000 open/close in 403.81 seconds: 222.88 ops/second
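To compare runs like the two above without reading the numbers by eye, the progress and summary lines can be parsed mechanically. The following is a minimal sketch, not part of the test suite: the function names and the shortened sample logs are made up for illustration, and the regular expressions assume the log format is exactly as quoted in this ticket. For the two runs shown, the elapsed-time ratio works out to roughly 4.8x.

```python
import re

# Progress lines look like:
#   - open/close 10000 (time 1701261460.68 total 8.79 last 1137.72)
PROGRESS_RE = re.compile(
    r"open/close (\d+) \(time ([\d.]+) total ([\d.]+) last ([\d.]+)\)"
)
# The summary line looks like:
#   total: 90000 open/close in 83.82 seconds: 1073.69 ops/second
SUMMARY_RE = re.compile(
    r"total: (\d+) open/close in ([\d.]+) seconds: ([\d.]+) ops/second"
)


def parse_progress(log_text):
    """Return a list of (count, total_seconds, last_rate) per progress line."""
    return [
        (int(count), float(total), float(last))
        for count, _time, total, last in PROGRESS_RE.findall(log_text)
    ]


def parse_summary(log_text):
    """Return (count, seconds, ops_per_second) from the summary line."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        raise ValueError("no summary line found")
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def slowdown(before_log, after_log):
    """Ratio of elapsed times (after / before) for the same number of ops."""
    n_before, secs_before, _ = parse_summary(before_log)
    n_after, secs_after, _ = parse_summary(after_log)
    if n_before != n_after:
        raise ValueError("runs performed a different number of operations")
    return secs_after / secs_before


# Shortened samples copied from the logs in this ticket (illustration only).
BEFORE = """\
- open/close 10000 (time 1701261460.68 total 8.79 last 1137.72)
- open/close 80000 (time 1701261526.16 total 74.27 last 1063.71)
total: 90000 open/close in 83.82 seconds: 1073.69 ops/second
"""
AFTER = """\
- open/close 8266 (time 1701355946.26 total 10.00 last 826.49)
- open/close 10000 (time 1701355948.92 total 12.66 last 651.12)
- open/close 88604 (time 1701356325.30 total 389.04 last 99.27)
- open/close 89540 (time 1701356335.30 total 399.04 last 93.57)
total: 90000 open/close in 403.81 seconds: 222.88 ops/second
"""

ratio = slowdown(BEFORE, AFTER)
after_rates = [last for _count, _total, last in parse_progress(AFTER)]
print(f"elapsed-time slowdown: {ratio:.2f}x")
print(f"'last' rate in the slow run: {after_rates[0]} -> {after_rates[-1]} ops/s")
```

The per-interval "last" rate is worth tracking separately from the overall average, because a rate that keeps falling as more files are created looks different from a constant per-operation overhead.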
The patches that landed on 2023-11-29 are:
de352465eb LU-17046 tests: fix write success in 1g
0ef4e5b0c1 LU-16518 lnet: fix uninitialized variable in api-ni.c
350dfbcfa8 LU-17293 kernel: update SLES15 SP5 [5.14.21-150500.55.36.1]
57217b7e4e LU-17280 scrub: skip dir stripes with OI
698498b563 LU-17275 kernel: RHEL 8.9 client and server support
9eb87e7ef3 LU-17274 kernel: new kernel [RHEL 9.3 5.14.0-362.8.1.el9_3]
f3b45a0547 LU-17278 ldlm: don't grant failed lock
c5aa16db17 LU-17265 tests: allow margin for sanity/39r
6897dbe67c LU-17230 socklnd: treat UNKNOWN netif operstate as UP
e383791b1c LU-17216 ofd: make enable_health_write tunable
3fcddf6dcd LU-17212 gss: survive improper obd or imp at ctx init
a66daa9c1b LU-17097 osc: delete items in Xarray before its destroy
78605b74a5 LU-8191 lustre: convert lmv,lod,lov functions to static
17c1cdf40e LU-10391 socklnd: handle IPv6 for zero copy messages
8276ade19d LU-10391 mgs: copy full nid string
6b514c0bd0 LU-16827 obdfilter: Fix "emfperf obdfilter-survey" error
286809eeb6 LU-17277 build: Distribute lutf.sh unconditionally
774c3d2883 LU-10391 lnet: missing some peer functionality
c0973b9fd6 LU-10729 tests: replay-dual/22d to wait
ce404bd07c LU-17174 misc: fix hash functions
It isn't obvious which patch might have affected the performance of these tests. The most likely candidate is LU-17174, which changes the core LDLM hash functions. I will push two test patches, one based on the commit just before LU-17174 and one based on the commit just after it, to see if this is the case.
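If testing on either side of LU-17174 turns out to be inconclusive, the same idea generalizes to a binary search over the day's commits. The sketch below is only an illustration under stated assumptions: the commit list above is taken to be in newest-first order (as `git log` would print it), `is_slow` is a hypothetical callback standing in for "build this commit and time sanity test_123ac", and the demo predicate simply encodes the LU-17174 hypothesis rather than any measured result.

```python
def first_slow_commit(commits_oldest_first, is_slow):
    """Binary-search for the first commit at which the test is slow.

    Assumes that once the slowdown appears it persists in every later
    commit. Returns None if even the newest commit is not slow.
    """
    lo, hi = 0, len(commits_oldest_first) - 1
    if hi < 0 or not is_slow(commits_oldest_first[hi]):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_slow(commits_oldest_first[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    return commits_oldest_first[lo]


# Commit hashes from this ticket, assumed to be listed newest first.
COMMITS_NEWEST_FIRST = [
    "de352465eb", "0ef4e5b0c1", "350dfbcfa8", "57217b7e4e", "698498b563",
    "9eb87e7ef3", "f3b45a0547", "c5aa16db17", "6897dbe67c", "e383791b1c",
    "3fcddf6dcd", "a66daa9c1b", "78605b74a5", "17c1cdf40e", "8276ade19d",
    "6b514c0bd0", "286809eeb6", "774c3d2883", "c0973b9fd6", "ce404bd07c",
]
COMMITS_OLDEST_FIRST = list(reversed(COMMITS_NEWEST_FIRST))


def make_fake_is_slow(culprit):
    """Build a stand-in predicate: slow from `culprit` onward.

    Hypothetical: a real check would build the commit and time the test.
    """
    start = COMMITS_OLDEST_FIRST.index(culprit)
    return lambda commit: COMMITS_OLDEST_FIRST.index(commit) >= start


# Demo using the hypothesis from this ticket (LU-17174, ce404bd07c).
suspect = first_slow_commit(COMMITS_OLDEST_FIRST, make_fake_is_slow("ce404bd07c"))
print("first slow commit under the LU-17174 hypothesis:", suspect)
```

With 20 commits this needs at most five or six test runs after the initial check, compared with two for the targeted before/after test, so the targeted test is the cheaper first step when there is a strong suspect.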
Issue Links
- is related to LU-17174 lustre hashes broken now. (Resolved)