[LU-15546] Shared Directory File Creates regression seen in 2.15 when comparing to 2.12.6 Created: 10/Feb/22  Updated: 01/Jul/22  Resolved: 18/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Petros Koutoupis Assignee: Etienne Aujames
Resolution: Fixed Votes: 0
Labels: None

Attachments: File LU-15546-reverted-LU-10262-performance.tar.gz     File LUS-10749-perf-traces.tar.gz    
Issue Links:
Related
is related to LU-10262 Lock contention when doing creates fo... Resolved
is related to LU-15692 performance regressions for files in ... Resolved
is related to LU-15720 imbalanced file creation in 'crush' s... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When testing 2.15 (2.14.57) with mdtest and comparing to 2.12.6, I see a large (25%) regression in Shared Directory File Creates. Perf traces (attached) show a lot of extra LDLM overhead.

#!/bin/bash

NODES=21
PPN=16                              # processes per node
PROCS=$(( $NODES * $PPN ))
MDT_COUNT=1
PAUSED=120                          # pre-iteration delay in seconds (-p)

# 0-byte files: create (-C), read (-E), stat (-T) and remove (-r) phases,
# all ranks working in a single shared directory
srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -E -T -r -n $(( $MDT_COUNT * 1048576 / $PROCS )) -d /mnt/kjlmo2/pkoutoupis/mdt0/test.`date +"%Y%m%d.%H%M%S"` |& tee f_mdt0_0k_ost_shared.out

# same run, but writing (-w) and reading (-e) 32KiB per file
srun -N $NODES --ntasks-per-node $PPN ~bloewe/benchmarks/ior-3.3.0-CentOS-8.2/install/bin/mdtest -v -i 5 -p $PAUSED -C -w 32768 -E -e 32768 -T -r -n $(( $MDT_COUNT * 1048576 / $PROCS )) -d /mnt/kjlmo2/pkoutoupis/mdt0/test.`date +"%Y%m%d.%H%M%S"` |& tee f_mdt0_32k_ost_shared.out


 Comments   
Comment by Patrick Farrell [ 10/Feb/22 ]

Petros,

You've provided a bunch of different perf traces, but it looks like there's a flame graph summary only for 2.15 and not for 2.12, so I can't compare them.

Can you be more specific about which perf traces show LDLM overhead, exactly how they're showing it, and where they were gathered from? Based on the node names, it looks like these were gathered on an MDT?

For what it's worth, the on-CPU perf traces here can wave in the general direction, but unless we're CPU bound, they're not going to tell us much.

Generally a DNE issue like this is related to LDLM locking behavior, but that shows up mostly in sleeping time as nodes wait for locks to be passed around, not on-CPU time.  Most likely we're going to need comparative ldlm stats and, ideally, dlmtrace logs from the server or clients.
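
(For reference, a minimal sketch of how such data is typically gathered with lctl; the exact stat file names vary between Lustre versions and between client and server:)

# enable DLM tracing in the debug mask, clear the buffer, run the workload
lctl set_param debug=+dlmtrace
lctl clear
# ... run mdtest ...
lctl dk > /tmp/dlmtrace.$(hostname).log    # dump the kernel debug buffer

# LDLM service stats (e.g. the callback service)
lctl get_param ldlm.services.*.stats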

Comment by Petros Koutoupis [ 14/Feb/22 ]

Let me delete the one tarball, as it is not really needed and may add confusion (LUS-10749-perf-results.tar.gz). The remaining tarball, LUS-10749-perf-traces.tar.gz, shows a comparison between a 2.12.6 baseline, which performs fairly well in our environment, and a 2.15 build, which does not. In the 2.15 flame graphs I see a significant amount of ldlm_reprocess_all occurring that is not reflected in the 2.12.6 traces. And yes, this is a trace on the MDT I am writing to. We are also not pegging the CPU.

Comment by Patrick Farrell [ 14/Feb/22 ]

OK, that makes sense, and I see we have both SVGs now.

So the greater LDLM activity suggests worse locking behavior, e.g., a conflict where one did not exist before.

It might be possible to guess why that's occurring from the greater LDLM thread activity and looking at patches, but it's a huge gap in changes (~two full releases). So what's really needed is at least ldlm stats (ldlm_cbd_stats, I believe it's called) or, ideally, a snippet of debug logs from a client in both cases. (A client is better than a server because hopefully we can see which operations are leading to conflicts.)

Comment by Petros Koutoupis [ 14/Feb/22 ]

Patrick, you are correct, it is a very large gap, and quite overwhelming. I will work on grabbing client traces and gathering the requested debug data shortly. Thank you.

Comment by Patrick Farrell [ 14/Feb/22 ]

Note on a mistake in my previous comment: ldlm_cbd_stats are client-side stats; I don't have the server-side stats name handy.

Comment by Peter Jones [ 17/Feb/22 ]

From a community release point of view, shouldn't the comparison be between 2.14 and 2.15 (with no patches applied)? If a drop is seen then can you do a git bisect to identify which change(s) has introduced the performance regression?
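
(For reference, a typical bisect flow on the Lustre tree might look like the following; the tag names here assume Lustre's usual vX_Y_Z convention and are illustrative:)

git bisect start
git bisect bad v2_15_0          # or the 2.14.57 development tag being tested
git bisect good v2_12_6
# at each step: build, run the mdtest shared-directory create workload,
# then mark the result
git bisect good                 # or: git bisect bad
git bisect reset                # when done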

Comment by Petros Koutoupis [ 25/Feb/22 ]

Uploaded LU-15546-reverted-LU-10262-performance.tar.gz, which contains the shared directory results for our 2.12, 2.15, and 2.15 with LU-10262 reverted. A bisect of the master branch revealed the patch for LU-10262 to be the culprit; once the patch was reverted, performance for shared directory file creates was restored and more in line with what we were seeing in our 2.12.

commit 33dc40d58ef6eb8b384fce1da9f8d21cad4ef6d8
Author: Dominique Martinet <dominique.martinet@cea.fr>
Date:   Fri Aug 31 18:03:36 2018 +0900

    LU-10262 mdt: mdt_reint_open: check EEXIST without lock
    
    Many applications blindly open files with O_CREAT, and the mds gets a
    write lock to the parent directory for these even if the file already
    exists.
    Checking for file existence first lets us take a PR lock if file
    already existed even if O_CREAT was specified.
    
    This opens up multiple races between the first lookup and the actual
    locking, in each of them drop the resources we aquired and retry from
    scratch to keep things as far from complicated as possible, with mixed
    success.
Comment by Cory Spitz [ 25/Feb/22 ]

"From a community release point of view, shouldn't the comparison be between 2.14 and 2.15 (with no patches applied)?"

Yes, of course. I think that assumes that 2.14.0 was deemed 'good', and perhaps it was, but with LU-10262 identified as the culprit, we now know that 2.14.0 may not have been as good as thought. No matter, what's important now is to understand how to move forward.

vitaly_fertman has indicated that it would be best to revert LU-10262, because reverting it would not reintroduce any defect; LU-10262 landed as an 'improvement'. In fact, there appear to be plenty of performance test results provided as landing collateral indicating that it was 'good'. I'd like to try and understand how to reconcile that info with the shared-dir mdtest results that demonstrate a regression with the patch. I know that LU-10262 was made as an optimization targeting FORTRAN applications, but perhaps it was made at the expense of the general case? Is this a simple tradeoff or is there more to it?

Comment by Peter Jones [ 25/Feb/22 ]

So it looks like this patch was in 2.14 and also the last several 2.12.x releases. Perhaps, as with the performance issue we found late in the 2.14 release cycle, this performance issue is more apparent on some systems than others....

Comment by Peter Jones [ 25/Feb/22 ]

eaujames, asmadeus, any thoughts on this reported issue?

Comment by Dominique Martinet [ 25/Feb/22 ]

The original rationale for this patch was indeed something we saw on FORTRAN code, but any sloppy program opening files with O_CREAT on all ranks with a large number of ranks would trigger this. Basically, without this patch we get a lock storm where each client has to revoke the other clients' locks for every open, and we pushed hard for the patch because on very large jobs (12-32k ranks) we have seen MDS hangs from it. Had it been just sloppy programs we would fix the programs, but FORTRAN open always passes O_CREAT, so there is no fix (we also temporarily had an LD_PRELOAD lib that traps opens with O_CREAT, does access(), and strips the flag if the file exists, but that's just ugly...)
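
(For illustration, such an interposer is only a few lines. A minimal sketch, assuming glibc and ignoring openat()/open64() and fopen(); the file and library names are made up, and the access()/open() race is inherent to the approach, hence "ugly":)

/* nocreat.c: sketch of the LD_PRELOAD workaround described above.
 * Build: gcc -shared -fPIC -o libnocreat.so nocreat.c -ldl
 * Use:   LD_PRELOAD=./libnocreat.so ./fortran_app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <unistd.h>

int open(const char *path, int flags, ...)
{
        static int (*real_open)(const char *, int, ...);
        mode_t mode = 0;

        if (!real_open)
                real_open = (int (*)(const char *, int, ...))
                        dlsym(RTLD_NEXT, "open");

        if (flags & O_CREAT) {
                va_list ap;

                va_start(ap, flags);
                mode = va_arg(ap, mode_t);
                va_end(ap);
                /* leave O_CREAT|O_EXCL alone to preserve EEXIST semantics;
                 * otherwise strip O_CREAT if the file already exists (racy:
                 * the file can appear or vanish before the real open) */
                if (!(flags & O_EXCL) && access(path, F_OK) == 0)
                        flags &= ~O_CREAT;
        }
        return real_open(path, flags, mode);
}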

AFAIR it has been known from day 1 that patterns like mdtest (each rank creating new files in the directory) would see a slowdown, while cases where a shared file is opened get immensely faster (the O_CREAT+precreate line of Etienne's benchmark in the other LU), and the patch also fixed the MDS hang we had.

The LDLM overhead is not so much a conflict as the fact that we take the lock one more time, IIRC. Gerrit doesn't load for me right now, so I didn't re-read the patch, but we first take the lock for read, check file existence, then upgrade it to write if the file didn't exist; that will be slower in the pure-create case. IIRC the final patch handled file creation and directory creation differently, and directory creation might have been checking without taking the read lock, in which case the impact would be lower? Perhaps a similar optimization could be done for file creation. We want to take care of catching the MPI-style "wall of open(O_CREAT) on same file", so I was thinking locking is required, but I recall a comment on the later patch saying that this check is on the MDT, so it is actually not required.

Comment by Etienne Aujames [ 01/Mar/22 ]

The LU-10262 test results are not from mdtest; they represent a common production case where all the nodes of a job attempt to create the same files at the same time in the same parent directory ("wall of open(O_CREAT) on same file"). In that case, without LU-10262, for each O_CREAT request the MDT will take a LCK_PW on the parent (blocking further operations on the directory) while the first request creates the file.

But yes, LU-10262 degrades performance for the case of pure file creation: for each file to create, we take a LCK_PR on the parent to check whether the child exists and then a LCK_PW.

To avoid this overhead, we could determine the lock mode with a lookup without a DLM lock, then redo the lookup under the lock (roughly the flow sketched below). If the file has been removed in the meantime, we have to re-lock the parent with PW, but hopefully this will be rare.
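
(A minimal sketch of that flow; mdt_open_choose_and_lock() and mdt_parent_lock() are made-up placeholder names, not the actual patch:)

static int mdt_open_choose_and_lock(struct mdt_thread_info *info,
                                    struct mdt_object *parent,
                                    const struct lu_name *name,
                                    struct mdt_lock_handle *lh)
{
        struct lu_fid fid;
        enum ldlm_mode mode;
        int rc;

        /* lookup without a DLM lock: child exists -> PR is enough,
         * otherwise PW is needed to create it */
        rc = mdo_lookup(info->mti_env, mdt_object_child(parent), name,
                        &fid, &info->mti_spec);
        mode = rc == 0 ? LCK_PR : LCK_PW;
again:
        rc = mdt_parent_lock(info, parent, lh, mode);
        if (rc)
                return rc;

        /* redo the lookup under the lock to close the race */
        rc = mdo_lookup(info->mti_env, mdt_object_child(parent), name,
                        &fid, &info->mti_spec);
        if (rc == -ENOENT && mode == LCK_PR) {
                /* child removed between the two lookups: re-lock with PW;
                 * hopefully this is rare */
                mdt_object_unlock(info, parent, lh, 1);
                mode = LCK_PW;
                goto again;
        }
        /* with LCK_PW held, -ENOENT means the caller proceeds to create */
        return rc;
}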

What do you think about this?

Comment by Gerrit Updater [ 02/Mar/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46679
Subject: LU-15546 mdt: mdt_reint_open lookup before locking
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2df91b9c5aa46c5e3cfbe7206f7e2644e304f81d

Comment by Andreas Dilger [ 03/Mar/22 ]

koutoupis, would you be able to test this patch in your testbed to see if it resolves the performance problem?

eaujames, thanks for the patch. One further option that might avoid the extra lookup is to track a history of recent open modes (e.g. most recent 256 opens), and then use that to decide whether to do the pre-lookup or not.

For workloads like mdtest that never try to re-create existing files, the pre-lookup is pure overhead, so defaulting to PW after a number of such opens would avoid it. For many FORTRAN threads repeatedly opening an existing file with O_CREAT, this would default over time to PR. A mixed workload would always do the pre-lookup, which is still a win compared to getting the wrong DLM lock. In the rare case that the decision is wrong (e.g., the workload changes), it can retry the same as if the lookup was racy. Something like:

struct mdt_device {
        /* weighted history of recent open lock modes, ~mode * COUNT */
        unsigned int mdt_open_lock_history;
};

#define MDT_OPEN_LOCK_COUNT 256
/* note LCK_PW (2) < LCK_PR (4) in enum ldlm_mode, so a low history means
 * mostly creates; bias toward LCK_PW since a wrong PW guess does not need
 * a full retry loop */
#define MDT_OPEN_LOCK_THRESH_PW (LCK_PW * (MDT_OPEN_LOCK_COUNT + 32))
#define MDT_OPEN_LOCK_THRESH_PR (LCK_PR * (MDT_OPEN_LOCK_COUNT - 8))

static int mdt_init0(const struct lu_env *env, struct mdt_device *m,
                     struct lu_device_type *ldt, struct lustre_cfg *cfg)
{
        /* start at the midpoint so neither mode is predicted initially */
        m->mdt_open_lock_history = (LCK_PR + LCK_PW) * MDT_OPEN_LOCK_COUNT / 2;
}

void mdt_open_lock_history_add(struct mdt_device *mdt, enum ldlm_mode mode)
{
        /* moving average over roughly the last MDT_OPEN_LOCK_COUNT opens,
         * kept scaled by MDT_OPEN_LOCK_COUNT to avoid losing precision */
        mdt->mdt_open_lock_history = mdt->mdt_open_lock_history *
                (MDT_OPEN_LOCK_COUNT - 1) / MDT_OPEN_LOCK_COUNT + mode;
}

static inline enum ldlm_mode mdt_open_lock_mode(struct mdt_thread_info *info,
                                                struct mdt_object *p,
                                                struct lu_name *name,
                                                u64 open_flags)
{
        struct lu_fid fid;
        enum ldlm_mode mode;

        /* We don't need to take the DLM lock for a volatile */
        if (open_flags & MDS_OPEN_VOLATILE)
                return LCK_NL;

        /* if most recent opens used the same mode, assume next one will also */
        if (info->mti_mdt->mdt_open_lock_history > MDT_OPEN_LOCK_THRESH_PR)
                return LCK_PR;
        if (info->mti_mdt->mdt_open_lock_history < MDT_OPEN_LOCK_THRESH_PW)
                return LCK_PW;

        /* If the file exists we only need a read lock on the parent */
        mode = mdo_lookup(info->mti_env, mdt_object_child(p), name, &fid,
                          &info->mti_spec) == 0 ? LCK_PR : LCK_PW;

        mdt_open_lock_history_add(info->mti_mdt, mode);

        return mode;
}

int mdt_reint_open(struct mdt_thread_info *info, struct mdt_lock_handle *lhc)
{
        if (result == -ENOENT) {
                if (lh != NULL && lock_mode == LCK_PR) {
                        /* first pass: get write lock and restart */
                        mdt_object_unlock(info, parent, lh, 1);
                        mdt_clear_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                        mdt_lock_handle_init(lh);
                        lock_mode = LCK_PW;
                        mdt_open_lock_history_add(info->mti_mdt, lock_mode);
                        goto again_pw;
                }
        }
        :
                if (created) {
                        :
                } else if (lock_mode == LCK_PW) {
                        /* LCK_PW was not needed, child already existed, so
                         * record that LCK_PR would have been sufficient */
                        mdt_open_lock_history_add(info->mti_mdt, LCK_PR);
                }
}
Comment by Petros Koutoupis [ 03/Mar/22 ]

@Andreas Dilger. Of course. I can do so over the next few days.

Comment by Gerrit Updater [ 04/Mar/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46696
Subject: LU-15546 mdt: keep history of mdt_reint_open() lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 47cf66750b44a34be3543e0f3e629cbf5103f8da

Comment by Petros Koutoupis [ 04/Mar/22 ]

I tested change 46679 (mdtest shared directory file creates) against my original 2.15 baseline: it is about a 20-22% improvement, but still a 5% regression compared to a revert of LU-10262.

Comment by Andreas Dilger [ 05/Mar/22 ]

koutoupis, feel free to play with my patch https://review.whamcloud.com/46696 - it isn't quite ready; it's just something I whipped together and have not tested, but it's close to something that would work.

Comment by Gerrit Updater [ 08/Mar/22 ]

"Dominique Martinet <qhufhnrynczannqp.f@noclue.notk.org>" uploaded a new patch: https://review.whamcloud.com/46738
Subject: LU-15546 mdt: optimistically trust non-locked lookup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 55162c4d0fc798b5070d08a670f1e36b575672c6

Comment by Petros Koutoupis [ 16/Mar/22 ]

I tested change 46696 and it performs a bit better than 46679 alone (on shared directory file creates). It is within 5% of my original baseline; actually, around 3.5-4% less, which I consider great.

Comment by Andreas Dilger [ 16/Mar/22 ]

Petros, did you make fixes to 46696 before testing it? There are a couple of bugs in the current version of the patch (though they are clearly described there); they could be fixed relatively easily, but I have no fast system to benchmark on myself. If you have made fixes, please push a new version; otherwise I will take another crack at it.

Comment by Petros Koutoupis [ 16/Mar/22 ]

Andreas, I have not made the fixes. I see the bugs that you are referring to now.

Comment by Gerrit Updater [ 18/Mar/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46679/
Subject: LU-15546 mdt: mdt_reint_open lookup before locking
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f14090e56c9d94e3cfaa6f13f357173d6d570547

Comment by Peter Jones [ 18/Mar/22 ]

As per the discussion on the LWG call, Etienne's patch (which just landed) is how this will be addressed for 2.15. I suggest that the other patches in flight be moved to a new JIRA ticket for possible inclusion in a future release.

Comment by Andreas Dilger [ 19/Mar/22 ]

Petros and Shuichi, the latest version of my patch: https://review.whamcloud.com/46696 "LU-15546 mdt: keep history of mdt_reint_open() lock" looks like it is working properly in my local testing, but needs some benchmarking on real hardware to see whether it provides a performance improvement.

The patch has been updated to have a per-directory history counter. In my local testing it takes about 128 open+creates (with pre-lookup, like Etienne's just-landed patch) before it gets "into the zone" and speculatively predicts the PW lock mode, skipping the pre-lookup. It takes about 16 "bad" opens in the same directory before it reverts to doing the pre-lookup again, and about 256 open-existing operations before it swings to the opposite end to predict PR locks and skip the pre-lookup.

Mixed workloads within a single directory will behave essentially the same as the current code: the pre-lookup is always done in the directory if the open history doesn't give enough information.

Comment by Gerrit Updater [ 30/May/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/47487
Subject: LU-15546 mdt: mdt_reint_open lookup before locking
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: cfb5c55e9550a04aa22a5849ac8e86a2dc36eada

Comment by Andreas Dilger [ 28/Jun/22 ]

Shuichi tested my patch but didn't find it changed the performance significantly:

I tested the patch and compared against lustre-2.15.0-RC3.

  • 1 x MDS(1xMDT, 12 CPU cores, 142GB RAM)
  • 4 x OSS(2xOST/OSS)
  • 40 x client(16 CPU cores, 96GB RAM)
  • IB-HDR100 network

The workload is many processes (640) writing a huge number of files (19.2M) into a single shared directory.

mpirun -np 640 mdtest -n 30000 -F -v -d /exafs/d0/d1/d2/mdtest.out -C -r -p 30 -i 3
lustre-2.15.0-RC3
SUMMARY rate: (of 3 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation               64713.164      53431.547      60835.215       6414.183
   File stat                       0.000          0.000          0.000          0.000
   File read                       0.000          0.000          0.000          0.000
   File removal                46277.792      44080.164      45406.512       1167.351
   Tree creation                4629.475       3495.253       4131.971        579.784
   Tree removal                    2.302          2.019          2.137          0.147

lustre-2.15.0-RC3 + patch46696
SUMMARY rate: (of 3 iterations)
   Operation                     Max            Min           Mean        Std Dev
   ---------                     ---            ---           ----        -------
   File creation               67544.538      52056.964      61429.964       8241.920
   File stat                       0.000          0.000          0.000          0.000
   File read                       0.000          0.000          0.000          0.000
   File removal                45532.402      41724.110      43966.753       1992.363
   Tree creation                4132.319       3472.106       3721.652        358.386
   Tree removal                    2.251          1.837          2.030          0.209

In my test environment, I didn't see huge improvements from the patch (it might be limited by the MDS's CPU resources), but I didn't find regressions either.
