[LU-15546] Shared Directory File Creates regression seen in 2.15 when comparing to 2.12.6 Created: 10/Feb/22 Updated: 01/Jul/22 Resolved: 18/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Petros Koutoupis | Assignee: | Etienne Aujames |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When testing mdtest on 2.15 (2.14.57) and comparing to 2.12.6, I see a large (25%) regression in Shared Directory File Creates. Perf traces (attached) show a lot of extra ldlm overhead.

#!/bin/bash
NODES=21
PPN=16
PROCS=$(( $NODES * $PPN ))
MDT_COUNT=1
PAUSED=120

srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -E -T -r -n $(( $MDT_COUNT * 1048576 / $PROCS )) -d /mnt/kjlmo2/pkoutoupis/mdt0/test.`date +"%Y%m%d.%H%M%S"` 2>&1 |& tee f_mdt0_0k_ost_shared.out
srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -w 32768 -E -e 32768 -T -r -n $(( $MDT_COUNT * 1048576 / $PROCS )) -d /mnt/kjlmo2/pkoutoupis/mdt0/test.`date +"%Y%m%d.%H%M%S"` 2>&1 |& tee f_mdt0_32k_ost_shared.out |
| Comments |
| Comment by Patrick Farrell [ 10/Feb/22 ] |
|
Petros, you've provided a bunch of different perf traces, but it looks like there's a flame graph summary only for 2.15 and not for 2.12, so I can't compare them. Can you be more specific about which perf traces show LDLM overhead, exactly how they show it, and where they were gathered from? Based on the node name, it looks like these were gathered on an MDT?

For what it's worth, the on-CPU perf traces here can wave in the general direction, but unless we're CPU bound, they're not going to tell us much. Generally a DNE issue like this is related to LDLM locking behavior, but that shows up mostly in sleeping time as nodes wait for locks to be passed around, not in on-CPU time. Most likely we're going to need comparative LDLM stats and, ideally, dlmtrace logs from the server or clients. |
| Comment by Petros Koutoupis [ 14/Feb/22 ] |
|
Let me delete the one tarball, as it is not really needed and may add confusion (LUS-10749-perf-results.tar.gz). The remaining tarball, LUS-10749-perf-traces.tar.gz, shows a comparison between a 2.12.6 baseline, which performs fairly well in our environment, and a 2.15 build, which does not. In the 2.15 flame graphs I see a significant amount of ldlm_reprocess_all occurring that is not reflected in the 2.12.6 traces. And yes, this is a trace on the MDT I am writing to. We are also not pegging the CPU. |
| Comment by Patrick Farrell [ 14/Feb/22 ] |
|
OK, that makes sense, and I see we have both SVGs now. The greater LDLM activity suggests worse locking behavior, e.g. a conflict where one did not exist before. It might be possible to guess why that's occurring from the greater LDLM thread activity and by looking at patches, but it's a huge gap in changes (~2 full releases). So what's really needed is at least LDLM stats (ldlm_cbd_stats, I believe it's called) or, ideally, a snippet of debug logs from a client in both cases. (A client is better than a server because hopefully we can see which operations are leading to conflicts.) |
| Comment by Petros Koutoupis [ 14/Feb/22 ] |
|
Patrick, you are correct, it is a very large gap, which is quite overwhelming. I will work on grabbing client traces and gathering the requested debug data shortly. Thank you. |
| Comment by Patrick Farrell [ 14/Feb/22 ] |
|
Note on a mistake in my previous comment: ldlm_cbd_stats are client-side stats; I don't have the server-side stats name handy. |
| Comment by Peter Jones [ 17/Feb/22 ] |
|
From a community release point of view, shouldn't the comparison be between 2.14 and 2.15 (with no patches applied)? If a drop is seen, can you do a git bisect to identify which change(s) introduced the performance regression? |
| Comment by Petros Koutoupis [ 25/Feb/22 ] |
|
Git bisect identified commit 33dc40d58ef6eb8b384fce1da9f8d21cad4ef6d8:
Author: Dominique Martinet <dominique.martinet@cea.fr>
Date: Fri Aug 31 18:03:36 2018 +0900
LU-10262 mdt: mdt_reint_open: check EEXIST without lock
Many applications blindly open files with O_CREAT, and the mds gets a
write lock to the parent directory for these even if the file already
exists.
Checking for file existence first lets us take a PR lock if file
already existed even if O_CREAT was specified.
This opens up multiple races between the first lookup and the actual
locking, in each of them drop the resources we acquired and retry from
scratch to keep things as far from complicated as possible, with mixed
success.
|
| Comment by Cory Spitz [ 25/Feb/22 ] |
|
Yes, of course. I think that assumes that 2.14.0 was deemed 'good', and perhaps it was, but vitaly_fertman has indicated that it would be best to revert the LU-10262 patch. |
| Comment by Peter Jones [ 25/Feb/22 ] |
|
So it looks like this patch was in 2.14 and also the last several 2.12.x releases. Perhaps, as with the performance issue we found late in the 2.14 release cycle, this performance issue is more apparent on some systems than others.... |
| Comment by Dominique Martinet [ 25/Feb/22 ] |
|
The original rationale for this patch was indeed something we saw on FORTRAN code, but any sloppy program opening files with O_CREAT on all ranks with a large number of ranks would trigger this – basically, without this patch we are doing a lock storm where each client has to revoke the locks of the other clients for each open, and we pushed hard for the patch because on very large jobs (12-32k ranks) we have seen MDS hangs from it. Had it been just sloppy programs we'd fix the programs, but FORTRAN open always passes O_CREAT, so there is no fix (we also temporarily had an LD_PRELOAD lib that traps opens with O_CREAT, does access() and strips the flag if the file exists, but that's just ugly...)

AFAIR it's been known from day 1 that patterns like mdtest (each rank creating new files in the directory) would be a slowdown, while cases where a shared file is opened get immensely faster (the O_CREAT+precreate line of Etienne's benchmark in the other LU), and that also fixed the MDS hang we had.

The LDLM overhead is not so much a conflict as the fact that we are taking the lock one more time, IIRC. Gerrit doesn't load for me right now so I didn't re-read the patch, but we first take the lock for read, check file existence, then upgrade it to write if the file didn't exist – that will be slower in the pure create case. IIRC the final patch for file creation vs. directory creation was different, and directory might have been checking without taking the read lock, in which case the impact would be lower? Perhaps a similar optimization could be done for file creation... we want to take care of catching the MPI-style "wall of open(O_CREAT) on the same file", so I was thinking locking is required, but I recall a comment on the later patch saying that this check is on the MDT so it is actually not required. |
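For context, the access pattern Dominique describes can be shown in a few lines of user-space C. This is an editorial sketch only (the path is hypothetical, not from the ticket): every rank of a large MPI job opening the same pre-existing file with O_CREAT forces the MDS, before LU-10262, to take a PW lock on the parent directory for each open.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Imagine every rank of a large job running this on the same
	 * path. The file already exists, but because O_CREAT is set the
	 * MDS (pre-LU-10262) takes a PW lock on the parent directory, so
	 * each open revokes the other clients' locks: the "lock storm". */
	int fd = open("/mnt/lustre/shared/output.dat",
		      O_WRONLY | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* ... each rank would write its private data here ... */
	close(fd);
	return 0;
}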
| Comment by Etienne Aujames [ 01/Mar/22 ] |
|
But yes, the current PR-then-upgrade behavior does add overhead in the pure create case. To prevent this overhead, we could determine the lock_mode with a lookup without a DLM lock, and then we have to redo the lookup with the lock. If the file has been removed in the meantime, we have to re-lock the parent with PW, but hopefully this will be rare. What do you think about this? |
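For illustration, here is a rough sketch of the flow Etienne proposes, in the style of the MDT code but with hypothetical helper names (mdo_lookup_nolock, mdt_parent_lock, mdt_parent_unlock); this is not the actual change 46679, just the shape of the proposed retry logic:

	/* Pick the parent lock mode from a lock-less lookup: PR if the
	 * child already exists, PW if it would need to be created. */
	mode = mdo_lookup_nolock(parent, name) == 0 ? LCK_PR : LCK_PW;
again:
	mdt_parent_lock(parent, mode);
	/* redo the lookup now that we hold the DLM lock */
	rc = mdt_lookup(parent, name, &fid);
	if (rc == -ENOENT && mode == LCK_PR) {
		/* the file was unlinked between the two lookups: a rare
		 * race, so drop the lock, upgrade to PW and restart so
		 * the create can proceed */
		mdt_parent_unlock(parent, mode);
		mode = LCK_PW;
		goto again;
	}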
| Comment by Gerrit Updater [ 02/Mar/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46679 |
| Comment by Andreas Dilger [ 03/Mar/22 ] |
|
koutoupis, would you be able to test this patch in your testbed to see if it resolves the performance problem?

eaujames, thanks for the patch. One further option that might avoid the extra lookup is to track a history of recent open modes (e.g. the most recent 256 opens), and then use that to decide whether to do the pre-lookup or not. For workloads like mdtest that never try to recreate existing files, the pre-lookup is purely overhead, so defaulting to PW after a number of such opens would avoid this. For many FORTRAN threads repeatedly opening an existing file with O_CREAT, this would default over time to PR. A mixed workload would always do the pre-lookup, which is still a win compared to getting the wrong DLM lock. In the rare case that the decision is wrong (e.g. a workload change), it can retry the same as if the lookup was racy. Something like:
struct mdt_device {
	unsigned int mdt_open_lock_history;
};

#define MDT_OPEN_LOCK_COUNT 256
/* the history settles at mode * MDT_OPEN_LOCK_COUNT: 512 for all-LCK_PW(2),
 * 1024 for all-LCK_PR(4); bias toward LCK_PW since it does not need a full
 * retry loop */
#define MDT_OPEN_LOCK_THRESH_PW (LCK_PW * (MDT_OPEN_LOCK_COUNT + 32))
#define MDT_OPEN_LOCK_THRESH_PR (LCK_PR * (MDT_OPEN_LOCK_COUNT - 8))

static int mdt_init0(const struct lu_env *env, struct mdt_device *m,
		     struct lu_device_type *ldt, struct lustre_cfg *cfg)
{
	/* start midway between the two modes so neither is predicted yet */
	m->mdt_open_lock_history = (LCK_PR + LCK_PW) * MDT_OPEN_LOCK_COUNT / 2;
}

void mdt_open_lock_history_add(struct mdt_device *mdt, enum ldlm_mode mode)
{
	/* decaying average over roughly the last MDT_OPEN_LOCK_COUNT opens */
	mdt->mdt_open_lock_history = mdt->mdt_open_lock_history *
		(MDT_OPEN_LOCK_COUNT - 1) / MDT_OPEN_LOCK_COUNT + mode;
}

static inline enum ldlm_mode mdt_open_lock_mode(struct mdt_thread_info *info,
						struct mdt_object *p,
						struct lu_name *name,
						u64 open_flags)
{
	struct lu_fid fid;
	enum ldlm_mode mode;

	/* We don't need to take the DLM lock for a volatile */
	if (open_flags & MDS_OPEN_VOLATILE)
		return LCK_NL;
	/* if most recent opens used the same mode, assume the next one will
	 * also (LCK_PW == 2 is the low end of the range, LCK_PR == 4 the
	 * high end) */
	if (info->mti_mdt->mdt_open_lock_history < MDT_OPEN_LOCK_THRESH_PW)
		return LCK_PW;
	if (info->mti_mdt->mdt_open_lock_history > MDT_OPEN_LOCK_THRESH_PR)
		return LCK_PR;
	/* If the file exists we only need a read lock on the parent */
	mode = mdo_lookup(info->mti_env, mdt_object_child(p), name, &fid,
			  &info->mti_spec) == 0 ? LCK_PR : LCK_PW;
	mdt_open_lock_history_add(info->mti_mdt, mode);
	return mode;
}

int mdt_reint_open(struct mdt_thread_info *info, struct mdt_lock_handle *lhc)
{
	if (result == -ENOENT) {
		if (lh != NULL && lock_mode == LCK_PR) {
			/* first pass: get write lock and restart */
			mdt_object_unlock(info, parent, lh, 1);
			mdt_clear_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
			mdt_lock_handle_init(lh);
			lock_mode = LCK_PW;
			mdt_open_lock_history_add(info->mti_mdt, lock_mode);
			goto again_pw;
		}
	}
	:
	if (created) {
		:
	} else if (lock_mode == LCK_PW) {
		/* LCK_PW was not needed, the child already existed, so
		 * record that LCK_PR would have been enough */
		mdt_open_lock_history_add(info->mti_mdt, LCK_PR);
	}
}
|
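To see how that decaying history behaves, the toy program below is a self-contained user-space simulation (an editorial illustration only, reusing the sketch's constants; in the Lustre DLM enum, LCK_PW is 2 and LCK_PR is 4, so an all-create history settles near 2*256 = 512 and an all-open-existing history near 4*256 = 1024):

#include <stdio.h>

/* same constants and update rule as the sketch above */
#define LCK_PW	2
#define LCK_PR	4
#define COUNT	256

static unsigned int history_add(unsigned int history, unsigned int mode)
{
	/* decaying average: the steady state is mode * COUNT */
	return history * (COUNT - 1) / COUNT + mode;
}

int main(void)
{
	/* start in the middle, as mdt_init0() does in the sketch */
	unsigned int history = (LCK_PR + LCK_PW) * COUNT / 2;	/* 768 */
	int i;

	for (i = 0; i < 2048; i++)	/* a long run of pure creates */
		history = history_add(history, LCK_PW);
	printf("all-create history:        %u (settles at 512)\n", history);

	for (i = 0; i < 2048; i++)	/* then a run of open-existing */
		history = history_add(history, LCK_PR);
	printf("all-open-existing history: %u (settles at 1024)\n", history);
	return 0;
}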
| Comment by Petros Koutoupis [ 03/Mar/22 ] |
|
@Andreas Dilger. Of course. I can do so over the next few days. |
| Comment by Gerrit Updater [ 04/Mar/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46696 |
| Comment by Petros Koutoupis [ 04/Mar/22 ] |
|
I tested change 46679 (mdtest shared directory file creates) against my original 2.15 baseline and it is about a 20-22% improvement, but still a 5% regression compared to a revert of the LU-10262 patch. |
| Comment by Andreas Dilger [ 05/Mar/22 ] |
|
koutoupis, feel free to play with my patch https://review.whamcloud.com/46696 - it isn't quite ready, just something I whipped together and have not tested, but it is close to something that would work. |
| Comment by Gerrit Updater [ 08/Mar/22 ] |
|
"Dominique Martinet <qhufhnrynczannqp.f@noclue.notk.org>" uploaded a new patch: https://review.whamcloud.com/46738 |
| Comment by Petros Koutoupis [ 16/Mar/22 ] |
|
I tested change 46696 and it performs a bit better than 46679 alone (on shared directory file creates). It is within 5% of my original baseline. Actually, it is around 3.5-4% less, which I consider great. |
| Comment by Andreas Dilger [ 16/Mar/22 ] |
|
Petros, did you make fixes to 46696 before testing it? There are a couple of bugs in the current version of the patch (though clearly described there); they could be fixed relatively easily, but I have no fast system to benchmark it myself. If you have made fixes, please push a new version; otherwise I will take another crack at it. |
| Comment by Petros Koutoupis [ 16/Mar/22 ] |
|
Andreas, I have not made the fixes. I see the bugs that you are referring to now. |
| Comment by Gerrit Updater [ 18/Mar/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46679/ |
| Comment by Peter Jones [ 18/Mar/22 ] |
|
As per the discussion on the LWG call, Etienne's patch (which just landed) is how we will address this for 2.15. I suggest that the other patches in flight get moved to a new JIRA ticket for possible inclusion in a future release. |
| Comment by Andreas Dilger [ 19/Mar/22 ] |
|
Petros and Shuichi, the latest version of my patch is at https://review.whamcloud.com/46696.

The patch has been updated to have a per-directory history counter. In my local testing it takes about 128 open+creates (with pre-lookup, like Etienne's just-landed patch) before it gets "into the zone" and speculatively predicts the PW lock mode and skips the pre-lookup. It takes about 16 "bad" lookups in the same directory before it reverts to doing the pre-lookup again, and 256 open-existing opens before it swings to the opposite end to predict PR locks and skip the pre-lookup. Mixed workloads within a single directory will behave essentially the same as the current code, so it will always do a pre-lookup in the directory if the open-mode history doesn't give enough info. |
| Comment by Gerrit Updater [ 30/May/22 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47487 |
| Comment by Andreas Dilger [ 28/Jun/22 ] |
|
Shuichi tested my patch but didn't find that it changed the performance significantly.
|