Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: Lustre 2.14.0
- Labels: None
- Environment: master (commit: 56526a90ae)
- Severity: 3
Description
commit 76626d6c52 "LU-13344 all: Separate debugfs and procfs handling" caused a write performance regression. Below are a reproducer and the tested workload.
Single client (Ubuntu 18.04, 5.4.0-47-generic), 16MB O_DIRECT, FPP (128 processes)
# mpirun --allow-run-as-root -np 128 --oversubscribe --mca btl_openib_warn_default_gid_prefix 0 --bind-to none ior -u -w -r -k -e -F -t 16384k -b 16384k -s 1000 -u -o /mnt/ai400x/ior.out/file --posix.odirect
"git bisect" identified the commit where the regression started.
Here are the test results.
76626d6c52 LU-13344 all: Separate debugfs and procfs handling
access  bw(MiB/s)  IOPS     Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------  ---------  ----     ----------  ----------  ---------  --------  --------  --------  --------  ----
write   21861      1366.33  60.78       16384       16384      0.091573  93.68     40.38     93.68     0
read    38547      2409.18  46.14       16384       16384      0.005706  53.13     8.26      53.13     0
5bc1fe092c LU-13196 llite: Remove mutex on dio read
access  bw(MiB/s)  IOPS     Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------  ---------  ----     ----------  ----------  ---------  --------  --------  --------  --------  ----
write   32678      2042.40  58.96       16384       16384      0.105843  62.67     4.98      62.67     0
read    38588      2411.78  45.89       16384       16384      0.004074  53.07     8.11      53.07     0
master (commit 56526a90ae)
access  bw(MiB/s)  IOPS     Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------  ---------  ----     ----------  ----------  ---------  --------  --------  --------  --------  ----
write   17046      1065.37  119.02      16384       16384      0.084449  120.15    67.76     120.15    0
read    38512      2407.00  45.04       16384       16384      0.006462  53.18     9.07      53.18     0
master still has this regression; when commit 76626d6c52 is reverted from master, the performance comes back.
master (commit 56526a90ae) + revert of commit 76626d6c52
access  bw(MiB/s)  IOPS     Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------  ---------  ----     ----------  ----------  ---------  --------  --------  --------  --------  ----
write   32425      2026.59  59.88       16384       16384      0.095842  63.16     4.79      63.16     0
read    39601      2475.09  47.22       16384       16384      0.003637  51.72     5.73      51.72     0
Shuichi, if shrinking struct obd_device does not solve the problem, then it seems the problem is caused by a misalignment of some data structure that follows the added obd_debugfs_vars field.
Can you please try another set of tests that move the "obd_debugfs_vars" line until we isolate the problematic field. The first test would be to move obd_debugfs_vars to the end of the struct:
to see if this solves the problem (without my other patches). I've pushed a patch to do this. If it fixes the problem, then this confirms that the problem is caused by the alignment of, or cacheline contention on, one of the fields between obd_evict_inprogress and obd_kobj_unregister. That would be enough to land for 2.14.0 to solve the problem, but I don't want to leave the reason for the problem unexplained, since the regression is likely to be accidentally reintroduced in the future (e.g. by landing my patches to shrink lu_tgt_desc, or anything else).
To isolate the reason for the problem you would need to "bisect" the 11 fields/366 bytes to see which one is causing the slowdown.
First, try moving obd_debugfs_vars after obd_kset to see if this brings the slowdown back. If not, then the problem is obd_kset or earlier, so try moving it immediately before obd_kset (this is the largest field, so it makes exact "bisection" difficult). If the problem is still not seen, move it after obd_evict_list, and so on. Essentially, when obd_debugfs_vars is immediately before the offending field the performance will be bad, and when it is immediately after that field the performance problem should go away. Once you find out which field it is, try moving that field to the start of struct obd_device, so there is no chance of it being misaligned, then after obd_lu_dev, and then after obd_recovery_expired. If these also show good performance, then this can be a permanent solution (I would prefer after obd_recovery_expired, since those bitfields are very commonly used).
Please run the "pahole" command on the obdclass.ko module to show the "good" and "bad" structures to see what the problem is, and attach the results here.
Neil, James, since the obd_kobj and obd_ktype fields are recent additions and the largest fields in this area, it seems likely that they are the culprit here. Is there anything "special" about them that would require their alignment, or to avoid cacheline contention? Are they "hot" and referenced/refcounted continuously during object access?