[LU-14055] Write performance regression caused by a commit from LU-13344 Created: 21/Oct/20 Updated: 05/Jan/22 Resolved: 05/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Shuichi Ihara | Assignee: | Patrick Farrell |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
master (commit: 56526a90ae) |
||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||||||||||
| Description |
|
Single Client (Ubuntu 18.04, 5.4.0-47-generic), 16MB O_DIRECT, FPP (128 processes)

# mpirun --allow-run-as-root -np 128 --oversubscribe --mca btl_openib_warn_default_gid_prefix 0 --bind-to none ior -u -w -r -k -e -F -t 16384k -b 16384k -s 1000 -u -o /mnt/ai400x/ior.out/file --posix.odirect

"git bisect" identified the commit where the regression started (76626d6c52). Here are the test results.

76626d6c52
access bw(MiB/s) IOPS    Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ----    ---------- ---------- --------- -------- -------- -------- -------- ----
write  21861     1366.33 60.78      16384      16384     0.091573 93.68    40.38    93.68    0
read   38547     2409.18 46.14      16384      16384     0.005706 53.13    8.26     53.13    0

5bc1fe092c
access bw(MiB/s) IOPS    Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ----    ---------- ---------- --------- -------- -------- -------- -------- ----
write  32678     2042.40 58.96      16384      16384     0.105843 62.67    4.98     62.67    0
read   38588     2411.78 45.89      16384      16384     0.004074 53.07    8.11     53.07    0

master (commit 56526a90ae)
access bw(MiB/s) IOPS    Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ----    ---------- ---------- --------- -------- -------- -------- -------- ----
write  17046     1065.37 119.02     16384      16384     0.084449 120.15   67.76    120.15   0
read   38512     2407.00 45.04      16384      16384     0.006462 53.18    9.07     53.18    0

master still has this regression, and when commit 76626d6c52 is reverted from master, the performance comes back.

master (commit 56526a90ae) + revert of commit 76626d6c52
access bw(MiB/s) IOPS    Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ----    ---------- ---------- --------- -------- -------- -------- -------- ----
write  32425     2026.59 59.88      16384      16384     0.095842 63.16    4.79     63.16    0
read   39601     2475.09 47.22      16384      16384     0.003637 51.72    5.73     51.72    0
|
| Comments |
| Comment by Andreas Dilger [ 21/Oct/20 ] | ||||||||||||
|
I looked through that patch and didn't see anything obvious that would introduce a performance regression, since none of those proc/seq files should be anywhere close to the IO path. I think the likely source of the problem is that the change from /proc to /sys/kernel/debug has resulted in some tunable parameter being lost completely (a random example: readcache_max_io_mb), or becoming somehow "disconnected" from the actual parameter (e.g. the parameter can be set, but it doesn't actually change the behavior of the kernel). There weren't any parameters that I could easily see, beyond readcache_max_io_mb and the related osd-ldiskfs parameters, that could be the culprit. | ||||||||||||
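For illustration, a tiny made-up C sketch (not Lustre code) of what a "disconnected" tunable looks like: the setter accepts the new value, but the hot path keeps reading a different variable, so writes appear to succeed while changing nothing.

/* Hypothetical sketch of a "disconnected" tunable; all names invented. */
#include <stdio.h>

static int read_cache_enable = 1;      /* what the I/O path actually checks */
static int read_cache_enable_shadow;   /* what the tunable file now updates */

/* After a bad conversion, the parameter write lands in a shadow copy... */
static void set_read_cache_enable(int val)
{
        read_cache_enable_shadow = val;
}

/* ...while the hot path still consults the original variable. */
static int use_read_cache(void)
{
        return read_cache_enable;
}

int main(void)
{
        set_read_cache_enable(0);
        printf("tunable set to 0, I/O path still sees %d\n", use_read_cache());
        return 0;
}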
| Comment by Shuichi Ihara [ 22/Oct/20 ] | ||||||||||||
|
It might be related to the CPU being quite busy rather than to the IO itself. The test case was 128 IOR processes against 128 CPU cores; the client's CPU was quite busy, and I saw some ior threads running much slower than others in the bad case. Let me dig into the details. | ||||||||||||
| Comment by Andreas Dilger [ 22/Oct/20 ] | ||||||||||||
|
I don't think any of the code changed by the patch would actually be used during operation. Are any tunable parameters that were modified by this patch set on the client or server? I think the most likely cause is that one of the modified parameter settings is not working properly. It might be possible to revert part (ideally half) of the parameter changes to see if that avoids or reintroduces the problem, to isolate it down to a specific part of the patch. | ||||||||||||
| Comment by Shuichi Ihara [ 22/Oct/20 ] | ||||||||||||
|
I only reverted the patch on the client and the performance came back. The tunables set are:
obdfilter.*.brw_size: 16
osd-ldiskfs.*.writethrough_cache_enable: 0
osd-ldiskfs.*.read_cache_enable: 0
| ||||||||||||
| Comment by Peter Jones [ 20/Nov/20 ] | ||||||||||||
|
Any thoughts on this one? Should we just revert this patch while we are working out the details? | ||||||||||||
| Comment by Shaun Tancheff [ 24/Nov/20 ] | ||||||||||||
|
I looked through the patch again and I do not see anything obvious. | ||||||||||||
| Comment by James A Simmons [ 24/Nov/20 ] | ||||||||||||
|
I can't reproduce this problem. Is it only an Ubuntu issue? Can you reproduce it, Shaun? Does it break mpirun or mdtest/IOR? Lustre is only used by mdtest to call llapi_dir_set_default_lmv_stripe() for the -g option. What setup do you have? Everything on one node, or all VMs? | ||||||||||||
| Comment by Shaun Tancheff [ 24/Nov/20 ] | ||||||||||||
|
No, I don't have a reproducer here. My Ubuntu testing is just a couple of VMs. | ||||||||||||
| Comment by Andreas Dilger [ 25/Nov/20 ] | ||||||||||||
|
Shaun, would you be able to post a handful of different patches that revert different parts of patch https://review.whamcloud.com/37834 to see if Ihara can isolate which part is slowing things down? | ||||||||||||
| Comment by Shaun Tancheff [ 25/Nov/20 ] | ||||||||||||
|
Sounds good. I will post a series of reverts. | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40833 | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40835 | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40836 | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40837 | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40838 | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40839 | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40840 | ||||||||||||
| Comment by Gerrit Updater [ 02/Dec/20 ] | ||||||||||||
|
Shaun Tancheff (shaun.tancheff@hpe.com) uploaded a new patch: https://review.whamcloud.com/40841 | ||||||||||||
| Comment by Andreas Dilger [ 02/Dec/20 ] | ||||||||||||
|
Shuichi, could you please give these patches a try to help isolate the source of the regression. | ||||||||||||
| Comment by James A Simmons [ 18/Dec/20 ] | ||||||||||||
|
Shuichi, I pushed a fixed patch set that Shaun started. Please give it a try. Some of the RHEL 8.2 builds failed due to an unrelated bug; those patches do build, as you can see from the later patches building properly. | ||||||||||||
| Comment by Shuichi Ihara [ 18/Dec/20 ] | ||||||||||||
|
Thank you James! I will test new patches on my test node and let you know. | ||||||||||||
| Comment by Shuichi Ihara [ 25/Dec/20 ] | ||||||||||||
|
Here are the test results after the revert patches (only "write" performance was measured). It still needs all of the patches reverted to get the performance back.
7 of 7 patches  29,406 MiB/s
6 of 7 patches  18,791 MiB/s
5 of 7 patches  18,924 MiB/s
4 of 7 patches  18,715 MiB/s
I didn't test 3, 2, and 1 of 7 patches, since the regression had already started with only 6 of 7 patches reverted. Should we split the "7 of 7" patch a bit further to narrow it down? | ||||||||||||
| Comment by Andreas Dilger [ 25/Dec/20 ] | ||||||||||||
|
Shuichi, just to confirm, the "good" number beside "7 of 7" is with all 7 of the patches applied (i.e. the entire original patch reverted), and if the "7 of 7" patch is removed then the performance is bad? What is strange is that the "7 of 7" patch is not really doing much except renaming a few structs and adding one variable near the end of struct obd_device. I had previously wondered if the increase in struct obd_device size was causing something bad with cache lines on SMP systems, or whether the struct became too large for some reason (it is just over 4KB now). You could test this by reverting all of the patches, then only adding a "void *dummy;" pointer where "struct ldebugfs_vars *obd_debugfs_vars" is located in the "7 of 7" patch in https://review.whamcloud.com/#/c/40841/2/lustre/include/obd.h Other than that, I can't see anything in the "7 of 7" patch that would cause any problems. The /proc files changed are for the "fid" and "fld" services, which I don't think are being tuned at all, so it shouldn't matter even if those tunables were broken. | ||||||||||||
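As a concrete illustration of why a single 8-byte insertion can matter, the userspace sketch below (not Lustre code; the struct, field names, and sizes are made up) uses offsetof() to show a hypothetical hot field pair that shares one 64-byte cache line before the insertion and gets split across two lines after it.

/* Build: cc -O2 offset_shift.c -o offset_shift */
#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64

struct obd_like_without {
        char    head[6448];             /* stand-in for ~100 cache lines of earlier fields */
        int     evict_inprogress;       /* hypothetical "hot" field pair ... */
        long    pool_slv;               /* ... that we want on one cache line */
};

struct obd_like_with {
        char    head[6448];
        void   *dummy;                  /* the extra 8-byte pointer being tested */
        int     evict_inprogress;
        long    pool_slv;
};

static void report(const char *name, size_t off_a, size_t off_b, size_t size)
{
        printf("%s: size=%zu evict@%zu (line %zu) pool_slv@%zu (line %zu)\n",
               name, size, off_a, off_a / CACHELINE, off_b, off_b / CACHELINE);
}

int main(void)
{
        report("without dummy",
               offsetof(struct obd_like_without, evict_inprogress),
               offsetof(struct obd_like_without, pool_slv),
               sizeof(struct obd_like_without));
        report("with dummy",
               offsetof(struct obd_like_with, evict_inprogress),
               offsetof(struct obd_like_with, pool_slv),
               sizeof(struct obd_like_with));
        return 0;
}

With these made-up sizes the pair sits together on cache line 100 without the dummy pointer, and is split across lines 100 and 101 with it; pahole on the real obdclass.ko remains the authoritative way to check the actual obd_device layout.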
| Comment by Shuichi Ihara [ 26/Dec/20 ] | ||||||||||||
That's correct. Once patch "7 of 7" is applied (i.e. the entire original patch is reverted), the performance is back.
Okay, I will test this idea. | ||||||||||||
| Comment by Andreas Dilger [ 26/Dec/20 ] | ||||||||||||
|
I noticed that even before the problematic patch is applied, there are already some places in the code using "ldebugfs" that are affected by the changes in the patch. Most places are fid, fld, and seq, which I don't think can be harmful. There is a bit of usage in ldlm, but it doesn't look harmful either. One important parameter that may be affected is "llite.*.max_cached_mb". Shuichi, the other thing you could try with and without the patch applied is "lctl get_param -R '*'" to see if the available parameters change between the two versions. | ||||||||||||
| Comment by John Hammond [ 04/Jan/21 ] | ||||||||||||
|
For the test runs, was there a default striping set on the filesystem? | ||||||||||||
| Comment by Shuichi Ihara [ 06/Jan/21 ] | ||||||||||||
Yes, the stripe count was 1 for all tests. | ||||||||||||
| Comment by Shuichi Ihara [ 07/Jan/21 ] | ||||||||||||
Andreas, here are the test results for your suggestion, and it seems the results point to exactly the root cause you suggested. It was tested 3 times to verify.
| ||||||||||||
| Comment by Neil Brown [ 07/Jan/21 ] | ||||||||||||
|
If it isn't too much trouble (and if you didn't already), can you try adding that field at the end of the structure, rather than where it currently is? If that makes a difference, then my guess is that there is a hot cache line in that struct which is being split. If it doesn't make a difference, then my guess is that the struct size crosses a boundary, resulting in obd_device_cache being shared with some other cache, which maybe does something strange. I doubt that, though. | ||||||||||||
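To make the "hot cache line being split" hypothesis concrete, here is a small self-contained userspace sketch (nothing Lustre-specific, and the effect size is very machine-dependent): several threads take a toy spinlock and update a counter that lives either on the same 64-byte line as the lock or on the next one. When the pair is split, each critical section bounces two cache lines between cores instead of one.

/* Build: cc -O2 -pthread cacheline_split.c -o cacheline_split */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define ITERS    1000000L

struct shared {
        atomic_flag *lock;
        long        *counter;
};

static void *worker(void *arg)
{
        struct shared *s = arg;

        for (long i = 0; i < ITERS; i++) {
                while (atomic_flag_test_and_set_explicit(s->lock, memory_order_acquire))
                        ;       /* spin until the lock is free */
                (*s->counter)++;
                atomic_flag_clear_explicit(s->lock, memory_order_release);
        }
        return NULL;
}

static double run(atomic_flag *lock, long *counter)
{
        pthread_t tid[NTHREADS];
        struct shared s = { .lock = lock, .counter = counter };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, &s);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Lock and counter within one 64-byte cache line. */
static _Alignas(64) struct { atomic_flag lock; long counter; } together;
/* Padding pushes the counter onto the following cache line. */
static _Alignas(64) struct { atomic_flag lock; char pad[60]; long counter; } split;

int main(void)
{
        atomic_flag_clear(&together.lock);
        atomic_flag_clear(&split.lock);
        printf("lock+data on one line: %.3f s\n", run(&together.lock, &together.counter));
        printf("lock+data split:       %.3f s\n", run(&split.lock, &split.counter));
        return 0;
}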
| Comment by Shuichi Ihara [ 07/Jan/21 ] | ||||||||||||
|
I should mention that adding a "void *dummy;" on top of the "7 of 7" revert patch also caused the same performance problem; I then tested adding "struct ldebugfs_vars *obd_debugfs_vars" after that. | ||||||||||||
| Comment by Andreas Dilger [ 07/Jan/21 ] | ||||||||||||
|
Very strange that adding an 8-byte pointer to the already-large struct obd_device causes such a huge performance problem. According to pahole:
/* size: 6808, cachelines: 107, members: 98 */
/* sum members: 6792, holes: 2, sum holes: 8 */
/* sum bitfield members: 20 bits, bit holes: 1, sum bit holes: 44 bits */
/* paddings: 2, sum paddings: 8 */
/* forced alignments: 1 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
Maybe it pushes the struct from 106 cachelines to 107, but I wouldn't think that would make a huge difference in performance. We could pack the two 4-byte holes in the struct to get back 8 bytes to solve this problem quickly before 2.14. That means the next patch that touches this struct is going to cause problems again, but would also give us some breathing room to resolve this issue more completely in 2.15. I will make a patch for this. pahole shows a number of other major offenders in the struct that could be cleaned up:
struct rhashtable obd_uuid_hash; /* 200 176 */
struct rhltable obd_nid_hash; /* 376 176 */
struct obd_llog_group obd_olg; /* 936 176 */
struct hrtimer obd_recovery_timer __attribute__((__aligned__
(8))); /* 1272 80 */
union {
struct obd_device_target obt; /* 1520 96 */
struct filter_obd filter; /* 1520 96 */
struct ost_obd ost; /* 1520 72 */
struct echo_obd echo; /* 1520 336 */
struct client_obd cli; /* 1520 2664 */
struct echo_client_obd echo_client; /* 1520 56 */
struct lov_obd lov; /* 1520 680 */
struct lmv_obd lmv; /* 1520 4872 */
} u; /* 1520 4872 */
There seems to have been a lot of cruft added to obd_device over the years that is server-specific and could be moved into struct obd_device_target instead of being kept in the common struct:
The worst offender is the device-specific union u, with client_obd and lmv_obd being the largest members (though obd_device_target may increase in size in the future). One option is to dynamically allocate this member depending on the type used, since there is typically only one lmv_obd on a client, though client_obd is used for most of the devices on the client so this will not help much. The (almost only) offender in lmv_obd is struct lu_tgt_descs lmv_mdt_descs, which is a static array for all of the potential MDT devices the LMV may have (struct lu_tgt_desc_idx *ltd_tgt_idx[TGT_PTRS] being the major contributor). This is really unnecessary, and it could be dynamically allocated for the maximum current MDT count. I have included that in patch https://review.whamcloud.com/40901 The worst offenders in client_obd are the seven obd_histogram fields, consuming 1848 of 2664 bytes, with 34 bytes of holes. I had a patch to dynamically allocate these structures on an as-needed basis, but it was complex and never landed. Maybe I need to revive that patch again. | ||||||||||||
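A minimal sketch of the "dynamically allocate the per-type state" idea, using invented type names rather than the real obd_device/lmv_obd/client_obd definitions: the common descriptor keeps only a pointer, and each device type allocates just the state it actually needs.

/* Illustrative only; every name here is made up for this sketch. */
#include <stdlib.h>

enum dev_type { DEV_CLIENT, DEV_LMV };

struct dev_lmv_state    { char target_table[4872]; };   /* large, one per client */
struct dev_client_state { char rpc_stats[2664]; };      /* one per OSC/MDC device */

struct dev_common {
        enum dev_type    type;
        void            *type_state;    /* sized for the actual device type */
};

static int dev_setup(struct dev_common *dev, enum dev_type type)
{
        size_t size = (type == DEV_LMV) ? sizeof(struct dev_lmv_state)
                                        : sizeof(struct dev_client_state);

        dev->type = type;
        dev->type_state = calloc(1, size);
        return dev->type_state ? 0 : -1;
}

static void dev_cleanup(struct dev_common *dev)
{
        free(dev->type_state);
        dev->type_state = NULL;
}

int main(void)
{
        struct dev_common dev;

        if (dev_setup(&dev, DEV_LMV) == 0)
                dev_cleanup(&dev);
        return 0;
}

The trade-off is one extra pointer dereference on paths that touch the per-type state, in exchange for a much smaller common struct and no giant union sized by its largest member.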
| Comment by Andreas Dilger [ 07/Jan/21 ] | ||||||||||||
|
As for fields that might be affected by the addition of obd_debugfs_vars there are only a few that are after this new field, and of those only a subset are used on the client:
/* --- cacheline 101 boundary (6464 bytes) --- */
struct ldebugfs_vars * obd_debugfs_vars; /* 6464 8 */
atomic_t obd_evict_inprogress; /* 6472 4 */
/* XXX 4 bytes hole, try to pack */
wait_queue_head_t obd_evict_inprogress_waitq; /* 6480 24 */
struct list_head obd_evict_list; /* 6504 16 */
rwlock_t obd_pool_lock; /* 6520 8 */
/* --- cacheline 102 boundary (6528 bytes) --- */
__u64 obd_pool_slv; /* 6528 8 */
int obd_pool_limit; /* 6536 4 */
int obd_conn_inprogress; /* 6540 4 */
struct lu_ref obd_reference; /* 6544 0 */
struct kset obd_kset; /* 6544 160 */
/* --- cacheline 104 boundary (6656 bytes) was 48 bytes ago --- */
struct kobj_type obd_ktype; /* 6704 72 */
/* --- cacheline 105 boundary (6720 bytes) was 56 bytes ago --- */
struct completion obd_kobj_unregister; /* 6776 32 */
Likely candidates might be obd_pool_slv being split from obd_pool_lock, or something in the recently-added obd_kset or obd_ktype fields. I'm hoping that shrinking obd_device will resolve the problem, since random field alignment issues causing such huge performance swings is a nightmare. | ||||||||||||
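If one of these fields does turn out to be a hot lock/data pair, a conventional mitigation is to group the pair and give it its own cache line, so that later insertions elsewhere in the struct cannot split it; in kernel code this is typically done with the ____cacheline_aligned_in_smp annotation. Below is a hypothetical userspace sketch of the same idea (none of these are the real obd_device fields):

/* Build: cc -O2 pin_hot_pair.c -o pin_hot_pair */
#include <stddef.h>
#include <stdio.h>

#define CACHELINE 64

/* Group the hypothetical hot pair and pin it to its own cache line.
 * GCC/Clang attribute; kernel code would use ____cacheline_aligned_in_smp. */
struct pool_hot {
        long            pool_slv;       /* value read/updated under the lock */
        int             pool_limit;
        unsigned int    pool_lock;      /* placeholder for rwlock_t */
} __attribute__((aligned(CACHELINE)));

struct obd_like {
        char             other_fields[6400];    /* earlier (made-up) fields */
        struct pool_hot  pool;                  /* always starts a fresh cache line */
        void            *debugfs_vars;          /* later additions land after the pair */
};

int main(void)
{
        printf("pool at offset %zu (cache line %zu), occupies %zu bytes\n",
               offsetof(struct obd_like, pool),
               offsetof(struct obd_like, pool) / CACHELINE,
               sizeof(struct pool_hot));
        return 0;
}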
| Comment by Gerrit Updater [ 07/Jan/21 ] | ||||||||||||
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41161 | ||||||||||||
| Comment by Gerrit Updater [ 07/Jan/21 ] | ||||||||||||
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41162 | ||||||||||||
| Comment by Andreas Dilger [ 07/Jan/21 ] | ||||||||||||
|
Shuichi, could you please try with these two patches. The first one just decreases the struct size by 4 bytes, but I'm not sure if that will be enough. The second reduces it by over 2KB, which hopefully is enough. Otherwise, it may be that there is an alignment issue with some struct (likely the new "obd_k*" ones at the end) that will need some specific alignment requests. | ||||||||||||
| Comment by John Hammond [ 07/Jan/21 ] | ||||||||||||
|
I cannot tell if this is already understood but it would be useful to know if the change in performance is due to the debugfs changes being applied to the client, to the server, or both. | ||||||||||||
| Comment by Andreas Dilger [ 07/Jan/21 ] | ||||||||||||
|
John, it is on the client only, AFAIK. Shuichi, could you please run "pahole" on the obdclass.ko module with and without the problem and attach the output here. That program is part of the "dwarves" RPM. | ||||||||||||
| Comment by Shuichi Ihara [ 08/Jan/21 ] | ||||||||||||
|
Andreas,
access bw(MiB/s) IOPS   Latency(s) block(KiB) xfer(KiB) open(s)  wr/rd(s) close(s) total(s) iter
------ --------- ----   ---------- ---------- --------- -------- -------- -------- -------- ----
write  14218     888.67 161.01     16384      16384     0.159725 172.84   27.90    172.85   0
write  14168     885.51 159.56     16384      16384     0.218456 173.46   23.85    173.46   0
write  14093     880.82 161.54     16384      16384     0.191401 174.38   25.50    174.38   0
| ||||||||||||
| Comment by Gerrit Updater [ 08/Jan/21 ] | ||||||||||||
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41178 | ||||||||||||
| Comment by Andreas Dilger [ 08/Jan/21 ] | ||||||||||||
|
Shuichi, if shrinking struct obd_device does not solve the problem, then it seems the problem is caused by a misalignment of some data structure that follows the added obd_debugfs_vars field. Can you please try another set of tests that move the "obd_debugfs_vars" line until we isolate the problematic field? The first test would be to move obd_debugfs_vars to the end of the struct:

        /* --- cacheline 101 boundary (6464 bytes) --- */
-       struct ldebugfs_vars *     obd_debugfs_vars;             /*  6464     8 */
        atomic_t                   obd_evict_inprogress;         /*  6472     4 */
        wait_queue_head_t          obd_evict_inprogress_waitq;   /*  6480    24 */
        struct list_head           obd_evict_list;               /*  6504    16 */
        rwlock_t                   obd_pool_lock;                /*  6520     8 */
        /* --- cacheline 102 boundary (6528 bytes) --- */
        __u64                      obd_pool_slv;                 /*  6528     8 */
        int                        obd_pool_limit;               /*  6536     4 */
        int                        obd_conn_inprogress;          /*  6540     4 */
        struct lu_ref              obd_reference;                /*  6544     0 */
        struct kset                obd_kset;                     /*  6544   160 */
        /* --- cacheline 104 boundary (6656 bytes) was 48 bytes ago --- */
        struct kobj_type           obd_ktype;                    /*  6704    72 */
        /* --- cacheline 105 boundary (6720 bytes) was 56 bytes ago --- */
        struct completion          obd_kobj_unregister;          /*  6776    32 */
+       struct ldebugfs_vars *     obd_debugfs_vars;             /*  6464     8 */

to see if this solves the problem (without my other patches). I've pushed a patch to do this. If it fixes the problem, then this confirms that the problem is caused by the alignment of, or cacheline contention on, one of the fields between obd_evict_inprogress and obd_kobj_unregister. This would be enough to land for 2.14.0 to solve the problem, but I don't want to leave the reason for the problem unsolved, since it is likely to be accidentally reintroduced in the future (e.g. by landing my patches to shrink lu_tgt_desc or anything else).

To isolate the reason for the problem you would need to "bisect" the 11 fields/366 bytes to see which one is causing the slowdown. First try moving obd_debugfs_vars after obd_kset to see if this causes the slowdown again. If not, then the problem is obd_kset or earlier, so try moving it immediately before obd_kset (this is the largest field, which makes it difficult to "bisect" exactly). If the problem is still not seen, move it after obd_evict_list, etc. Essentially, when obd_debugfs_vars is immediately before the offending struct the performance will be bad, and when it is immediately after that struct the performance problem should go away.

Once you find out which structure it is, try moving that field to the start of struct obd_device (so that there is no chance of it being misaligned), then after obd_lu_dev, and then after obd_recovery_expired. If these also show good performance, then this can be a permanent solution (I would prefer after obd_recovery_expired, since those bitfields are very commonly used). Please run the "pahole" command on the obdclass.ko module to show the "good" and "bad" structures to see what the problem is, and attach the results here.

Neil, James, since the obd_kobj and obd_ktype fields are recent additions and the largest fields in this area, it seems likely that they are the culprit here. Is there anything "special" about them that would require particular alignment, or that would need to avoid cacheline contention? Are they "hot" and referenced/refcounted continuously during object access? | ||||||||||||
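Once the offending field or pair is identified, one way to keep a future patch from silently reintroducing the misalignment is a compile-time layout check. Below is a hypothetical userspace-style sketch using C11 static_assert (in the kernel the equivalent would be a BUILD_BUG_ON on the same offsetof expressions); the struct and field names are invented, not the real obd_device layout.

/* Fail the build if the hot pair ever straddles a cache-line boundary again. */
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64

struct obd_like {
        char                earlier_fields[6528];   /* made-up size */
        unsigned long       pool_lock;              /* stand-in for rwlock_t */
        unsigned long long  pool_slv;               /* value commonly read with the lock */
};

/* Both members must start on the same 64-byte cache line. */
static_assert(offsetof(struct obd_like, pool_lock) / CACHELINE ==
              offsetof(struct obd_like, pool_slv) / CACHELINE,
              "obd pool lock and slv split across cache lines");

int main(void) { return 0; }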
| Comment by Shuichi Ihara [ 10/Jan/21 ] | ||||||||||||
|
Patch https://review.whamcloud.com/41178 didn't help; it was still the same slowdown. Let me try your "bisect" ideas to find where it shows good (or still bad) performance. | ||||||||||||
| Comment by Andreas Dilger [ 10/Jan/21 ] | ||||||||||||
|
Strange that the patch to reduce the structure to the previous size (-8 bytes) and the patch to move the new 8-byte field to the end didn't help performance, and that the patch to reduce the struct size by over 2000 bytes also didn't help. It is possible my "-8 byte" patch was too complex and I moved too many fields around? I had tried to keep the related fields together. Do you have any idea where the performance is being lost? Is it high CPU usage, lock contention, or something else? | ||||||||||||
| Comment by Shuichi Ihara [ 12/Jan/21 ] | ||||||||||||
|
Attached are two flame graphs, for the good case (whole patch reverted, master-revert.svg) and the bad case (master, master.svg). | ||||||||||||
| Comment by Andreas Dilger [ 12/Jan/21 ] | ||||||||||||
|
It looks like the contended lock in master.svg is cl_loi_list_lock. | ||||||||||||
| Comment by John Hammond [ 14/Jan/21 ] | ||||||||||||
|
> Attached are two flame graphs on good (reverted whole ...)

How long did perf run for the two flame graphs? One has 5x the samples of the other. master-revert.svg shows 20% in swapper and 5% in open. Had ior completed startup when perf was run? | ||||||||||||
| Comment by Gerrit Updater [ 10/Apr/21 ] | ||||||||||||
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41162/ | ||||||||||||
| Comment by Patrick Farrell [ 14/Jun/21 ] | ||||||||||||
|
Ihara,

Ignoring entirely the apparent memory layout issue, I have two suggestions that may reduce the pain on cl_loi_list_lock. This patch just landed to master: https://review.whamcloud.com/38214
And this one, which has not yet landed:
If you're able to give these a try, that would be very interesting. Both should cut the contention on cl_loi_list_lock when doing DIO. I'm not sure precisely what effect they'll have here, but it seems worth trying. | ||||||||||||
| Comment by Patrick Farrell [ 14/Jun/21 ] | ||||||||||||
|
tappro made this comment over on LU-14580, and I wanted to bring it here:

> That patch definitely has performance implications. Direct I/O will keep sending even when there are no grants - since it is already synchronous - but it significantly increases the load on the cl_loi_list_lock in some cases.

The patches I noted above are aimed at that. There's still very likely a memory layout issue here, but perhaps these will help... | ||||||||||||
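As a purely generic illustration of the kind of change that cuts traffic on a hot lock like cl_loi_list_lock (this is not what the Lustre patches above actually do; all names below are invented), the sketch contrasts taking a shared lock once per item with accumulating work locally and publishing it under one lock acquisition per batch.

/* Build: cc -O2 -pthread batch_lock.c -o batch_lock */
#include <pthread.h>
#include <stdio.h>

#define BATCH 64

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static long queued_items;

/* Heavily contended pattern: one lock round-trip per item. */
static void submit_per_item(int nitems)
{
        for (int i = 0; i < nitems; i++) {
                pthread_mutex_lock(&queue_lock);
                queued_items++;
                pthread_mutex_unlock(&queue_lock);
        }
}

/* Cheaper pattern: accumulate locally, publish under one acquisition per batch. */
static void submit_batched(int nitems)
{
        long local = 0;

        for (int i = 0; i < nitems; i++) {
                local++;
                if (local == BATCH || i == nitems - 1) {
                        pthread_mutex_lock(&queue_lock);
                        queued_items += local;
                        pthread_mutex_unlock(&queue_lock);
                        local = 0;
                }
        }
}

int main(void)
{
        submit_per_item(1000);
        submit_batched(1000);
        printf("queued_items = %ld\n", queued_items);
        return 0;
}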
| Comment by Patrick Farrell [ 15/Jun/21 ] | ||||||||||||
|
Ihara, FYI I had to rebase that second patch above: it was not compatible with current master; it should be fixed now. | ||||||||||||
| Comment by Peter Jones [ 26/Nov/21 ] | ||||||||||||
|
sihara, can this issue now be considered fixed? | ||||||||||||
| Comment by Patrick Farrell [ 05/Jan/22 ] | ||||||||||||
|
My understanding is that this issue is resolved, though we never fully understood it. Let me know if that's wrong and we can reopen this or open a new ticket. |