
[LU-14055] Write performance regression caused by a commit from LU-13344

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.14.0
    • Labels: None
    • Environment: master (commit: 56526a90ae)
    • 3
    • 9223372036854775807

    Description

      Commit 76626d6c52 "LU-13344 all: Separate debugfs and procfs handling" caused a write performance regression. Here are the reproducer and tested workload.

      Single client (Ubuntu 18.04, 5.4.0-47-generic), 16 MB O_DIRECT, FPP (128 processes)

      # mpirun --allow-run-as-root -np 128 --oversubscribe --mca btl_openib_warn_default_gid_prefix 0 --bind-to none ior -u -w -r -k -e -F -t 16384k -b 16384k -s 1000 -u -o /mnt/ai400x/ior.out/file --posix.odirect

      "git bisect" identified the commit where the regression started.

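      For reference, the bisect presumably looked something like the sketch below; the good/bad endpoints are taken from the results that follow, and the rebuild/remount step between iterations is assumed.

      git bisect start
      git bisect bad 56526a90ae      # current master, shows the regression
      git bisect good 5bc1fe092c     # "LU-13196 llite: Remove mutex on dio read", performs well
      # at each step: rebuild the client modules, remount, rerun the ior workload above,
      # then mark the result with "git bisect good" or "git bisect bad"
      git bisect reset               # when finished; 76626d6c52 is reported as the first bad commit
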
      Here are the test results.
      76626d6c52 LU-13344 all: Separate debugfs and procfs handling

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     21861      1366.33    60.78       16384      16384      0.091573   93.68      40.38      93.68      0   
      read      38547      2409.18    46.14       16384      16384      0.005706   53.13      8.26       53.13      0   
      

      5bc1fe092c LU-13196 llite: Remove mutex on dio read

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     32678      2042.40    58.96       16384      16384      0.105843   62.67      4.98       62.67      0   
      read      38588      2411.78    45.89       16384      16384      0.004074   53.07      8.11       53.07      0   
      

      master (commit 56526a90ae)

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     17046      1065.37    119.02      16384      16384      0.084449   120.15     67.76      120.15     0   
      read      38512      2407.00    45.04       16384      16384      0.006462   53.18      9.07       53.18      0   
      

      master still has this regression, and when commit 76626d6c52 is reverted from master, the performance comes back (a sketch of the revert step follows the results table below).

      master (commit 56526a90ae)+ revert commit 76626d6c52

      access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
      write     32425      2026.59    59.88       16384      16384      0.095842   63.16      4.79       63.16      0   
      read      39601      2475.09    47.22       16384      16384      0.003637   51.72      5.73       51.72      0  
      
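      For reference, the reverted build in the last test was presumably prepared roughly as sketched below; the rebuild and reinstall of the client modules is assumed and site-specific.

      git checkout 56526a90ae
      git revert --no-edit 76626d6c52      # drop the LU-13344 debugfs/procfs separation commit
      # rebuild and reinstall the Lustre client modules, remount, then rerun the ior workload above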

      Attachments

        1. master.svg
          245 kB
        2. master-revert.svg
          414 kB

        Issue Links

          Activity

            [LU-14055] Write performance regression caused by an commit from LU-13344
            paf0186 Patrick Farrell added a comment - - edited

            tappro made this comment over on LU-14580, and I wanted to bring it here:

             
            "I don't see problems with patch itself. Increment in osc_consume_write_grant() was removed because it is done by atomic_long_add_return() now outside that call and it is done in both places where it is called. But maybe the patch "LU-12687 osc: consume grants for direct I/O" itself causes slowdown? Now grants are taken for Direct IO as well, so maybe that is related to not enough grants problem or similar. Are there any complains about grants on client during IOR run?"

            That patch definitely has performance implications.  Direct i/o will keep sending even when there are no grants - since it is already synchronous - but it significantly increases the load on the cl_loi_list_lock in some cases.  The patches I noted above are aimed at that.

            There's still very likely a memory layout issue here, but perhaps these will help...
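
            For reference, one way to answer the grant question during the IOR run is to watch the client-side osc grant counters, roughly as sketched below; the 1-second interval is arbitrary and the parameter names are the standard osc tunables.

            # watch per-OSC grant and dirty-byte counters on the client while the workload runs
            watch -n 1 'lctl get_param osc.*.cur_grant_bytes osc.*.cur_dirty_bytes'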


            Ihara,

            Ignoring entirely the apparent memory layout issue, I have two suggestions that may reduce the pain on cl_loi_list_lock.

            This patch just landed to master:

            https://review.whamcloud.com/38214

            And this one which has not yet landed:
            https://review.whamcloud.com/#/c/39482/

            If you're able to give these a try, that would be very interesting.  Both should cut the contention on cl_loi_list_lock when doing dio.  I'm not sure precisely what effect they'll have here, but it seems worth trying.
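
            For reference, an unlanded Gerrit change can be pulled into a test tree roughly as sketched below; the trailing /1 patch-set number is an assumption, so check the current patch set on the review page.

            # fetch change 39482 from Whamcloud Gerrit and apply it on top of the test branch
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/82/39482/1
            git cherry-pick FETCH_HEAD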


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41162/
            Subject: LU-14055 lmv: reduce struct lmv_obd size
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e11deeb1e6d114608eac4ee998d4cea22e30b0f5

            jhammond John Hammond added a comment -

            > Attached are two flame graphs for the good (whole LU-13344 patches reverted) and bad (master) cases.

            How long did perf run for the two flame graphs? One has 5x the samples as the other. master-revert.svg shows 20% in swapper and 5% in open. Had ior completed startup when perf was run?
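
            For reference, keeping the capture window fixed makes the sample counts comparable between runs; something along the lines sketched below would do, where the 99 Hz rate, the 60-second window, and the FlameGraph scripts being in PATH are assumptions.

            # system-wide sampled call graphs for a fixed 60 s during the write phase
            perf record -F 99 -a -g -- sleep 60
            perf script | stackcollapse-perf.pl | flamegraph.pl > master.svg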


            adilger Andreas Dilger added a comment -

            It looks like the contended lock in master.svg is cl_loi_list_lock in both osc_queue_sync_pages() and brw_interpret(). The cl_loi_list_lock is in struct client_obd. What is strange is that this field is before the added structure, so I can't see how that would be affected by the new field. I'll have to check if there is something later in obd_device that is being accessed under that lock.

            sihara Shuichi Ihara added a comment -

            Attached are two flame graphs for the good (whole LU-13344 patches reverted) and bad (master) cases.
            I see a huge amount of spin_lock time in osc_queue_sync_pages() under ll_direct_IO() in the bad case. There are also several spin_lock hotspots in the functions under brw_interpret().

            adilger Andreas Dilger added a comment -

            It is strange that the patch to reduce the structure to the previous size (-8 bytes) and the patch to move the new 8-byte field to the end didn't help performance, and that the patch to reduce the struct size by ~2000 bytes also didn't help. Is it possible my "-8 byte" patch was too complex and I moved too many fields around? I had tried to keep the related fields together.

            Do you have any idea where the performance is being lost? Is it high CPU usage, lock contention, or something else?

            sihara Shuichi Ihara added a comment - - edited

            Patch https://review.whamcloud.com/41178 didn't help; it was still the same slowdown. Let me try your "bisect" ideas to see whether they show good (or still bad) performance.


            adilger Andreas Dilger added a comment -

            Shuichi, if shrinking struct obd_device does not solve the problem, then it seems the problem is caused by a misalignment of some data structure that follows the added obd_debugfs_vars field.

            Can you please try another set of tests that move the "obd_debugfs_vars" line until we isolate the problematic field? The first test would be to move obd_debugfs_vars to the end of the struct:

             /* --- cacheline 101 boundary (6464 bytes) --- */
            -       struct ldebugfs_vars *     obd_debugfs_vars;     /*  6464     8 */
                    atomic_t                   obd_evict_inprogress; /*  6472     4 */
                    wait_queue_head_t          obd_evict_inprogress_waitq; /*  6480    24 */
                    struct list_head           obd_evict_list;       /*  6504    16 */
                    rwlock_t                   obd_pool_lock;        /*  6520     8 */
                    /* --- cacheline 102 boundary (6528 bytes) --- */                
                    __u64                      obd_pool_slv;         /*  6528     8 */
                    int                        obd_pool_limit;       /*  6536     4 */
                    int                        obd_conn_inprogress;  /*  6540     4 */
                    struct lu_ref              obd_reference;        /*  6544     0 */
                    struct kset                obd_kset;             /*  6544   160 */
                    /* --- cacheline 104 boundary (6656 bytes) was 48 bytes ago --- */
                    struct kobj_type           obd_ktype;            /*  6704    72 */
                    /* --- cacheline 105 boundary (6720 bytes) was 56 bytes ago --- */
                    struct completion          obd_kobj_unregister;  /*  6776    32 */
            +       struct ldebugfs_vars *     obd_debugfs_vars;     /*  6464     8 */
            

            to see if this solves the problem (without my other patches). I've pushed a patch to do this. If it fixes the problem, then this confirms that the problem is caused by the alignment of, or cacheline contention on, one of the fields between obd_evict_inprogress and obd_kobj_unregister. This would be enough to land for 2.14.0 to solve the problem, but I don't want to leave the root cause of the problem unidentified, since it is likely to be accidentally reintroduced in the future (e.g. by landing my patches to shrink lu_tgt_desc or anything else).

            To isolate the reason for the problem you would need to "bisect" the 11 fields/366 bytes to see which one is causing the slowdown.

            First try moving obd_debugfs_vars after obd_kset to see if this causes the slowdown again. If not, then the problem is obd_kset or earlier, so try moving it immediately before obd_kset (this is the largest field so makes it difficult to "bisect" exactly). If the problem is still not seen, move it after obd_evict_list, etc. Essentially, when obd_debugfs_vars is immediately before the offending struct the performance will be bad, and when it is immediately after the struct then the performance problem should go away. Once you find out what the structure is, try moving that field to be at the start of struct obd_device so that there is no chance of it being misaligned, after obd_lu_dev and after obd_recovery_expired. If these also show good performance, then this can be a permanent solution (I would prefer after obd_recovery_expired since these bitfields are very commonly used).

            Please run the "pahole" command on the obdclass.ko module to show the "good" and "bad" structures to see what the problem is, and attach the results here.
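
            For reference, the layout dumps can be generated and compared roughly as sketched below; this assumes both builds were compiled with debug info so pahole can read the DWARF, and the module paths are placeholders.

            # dump struct obd_device from each build's obdclass.ko and diff the layouts
            pahole -C obd_device good-build/obdclass.ko > obd_device.good.txt
            pahole -C obd_device bad-build/obdclass.ko  > obd_device.bad.txt
            diff -u obd_device.good.txt obd_device.bad.txt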

            Neil, James, since the obd_kobj and obd_ktype fields are recent additions and the largest fields in this area, it seems likely that they are the culprit here. Is there anything "special" about them that would require their alignment, or to avoid cacheline contention? Are they "hot" and referenced/refcounted continuously during object access?


            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41178
            Subject: LU-14055 obdclass: move obd_debugfs_vars to end obd_device
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: dd0a0df383387e6455bbad565503883433516454


            sihara Shuichi Ihara added a comment -

            Andreas,
            I just tested the two patches, but the performance was even worse, as shown below.

            access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
            ------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
            write     14218      888.67     161.01      16384      16384      0.159725   172.84     27.90      172.85     0   
            write     14168      885.51     159.56      16384      16384      0.218456   173.46     23.85      173.46     0   
            write     14093      880.82     161.54      16384      16384      0.191401   174.38     25.50      174.38     0   
             

            People

              Assignee: paf0186 Patrick Farrell
              Reporter: sihara Shuichi Ihara
              Votes: 0
              Watchers: 20

              Dates

                Created:
                Updated:
                Resolved: