[LU-12273] DNE3: Metadata overstriping Created: 08/May/19  Updated: 22/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Minor
Reporter: Patrick Farrell (Inactive) Assignee: Patrick Farrell
Resolution: Unresolved Votes: 0
Labels: dne3

Issue Links:
Related
is related to LU-9846 Overstriping - more than stripe per O... Resolved
is related to LU-12926 lfs mkdir -c can cause repeated MDT i... Open

 Description   

LU-9846 describes the overstriping feature for data, which allows placing more than one stripe per OST.  The same can be done with MDTs, for somewhat similar reasons:

"it allows more concurrency on the MDT, exceeding single-directory size limitations, directory migration/compaction, etc." (per Andreas)

This exists in limited form today, accessible with a fail loc:
OBD_FAIL_LARGE_STRIPE (0x1703)

This fail_loc is used in sanity test 300k to put a large number of stripes on MDT0:

        #define OBD_FAIL_LARGE_STRIPE   0x1703
        $LCTL set_param fail_loc=0x1703
        $LFS setdirstripe -i 0 -c192 $DIR/$tdir/striped_dir ||
                error "set striped dir err"

Turning this into a real feature requires various other enabling changes, but this test shows it should be possible.  The method used in this test can also create temporary setups for benchmarking the idea, to confirm it's worth pursuing.



 Comments   
Comment by Gerrit Updater [ 02/Jun/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35034
Subject: LU-12273 lod: metadata overstriping
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c3bfede6854263915c967f53602d0a1e46d6d1bf

Comment by Gerrit Updater [ 27/Aug/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35939
Subject: LU-12273 lod: Trivial metadata overstriping
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6e97b13d5dd0ba3d49b5c5191f4733e95663e573

Comment by Andreas Dilger [ 21/Oct/22 ]

"If we compare 2 x MDT per MDS vs. MDT overstriping=2, is it 2 x jbd2 threads vs. still a single journal thread, etc.?"

Correct - with overstriping there is still only a single journal/device and some shared filesystem locks, while two separate MDTs have totally separate infrastructure (but each one is 1/2 the size, needs more space balancing, and doubles the journal memory usage). If we can get "close" performance with 2x or 4x overstriping vs. 2x or 4x MDTs then using directory overstriping would be better overall.

Comment by Gerrit Updater [ 19/Jan/23 ]

"Patrick Farrell <farr0186@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49707
Subject: LU-12273 obd: Reserve metadata overstriping flags
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 91f41501f508166e33881d00ac795f9296f8978c

Comment by Patrick Farrell [ 23/Jan/23 ]

So this is fascinating.

An MDT does not know about itself in the pool code, because it is the local device and is handled differently.
The list of targets (in the QOS/RR pool code) on an MDT contains only the other MDTs in the system.

The practical result is that the allocation code on an MDT never selects that MDT itself, so the local MDT
(which holds the first stripe) only gets one stripe.

E.g., with 2 MDTs:

lmv_stripe_count: 8 lmv_stripe_offset: 1 lmv_hash_type: crush,overstriped
mdtidx           FID[seq:oid:ver]
     1           [0x240000400:0x2:0x0]
     0           [0x200000401:0x2:0x0]
     0           [0x200000401:0x3:0x0]
     0           [0x200000401:0x4:0x0]
     0           [0x200000401:0x5:0x0]
     0           [0x200000401:0x6:0x0]
     0           [0x200000401:0x7:0x0]
     0           [0x200000401:0x8:0x0]

Or, with 4 MDTs, it can look like this: 
lmv_stripe_count: 16 lmv_stripe_offset: 3 lmv_hash_type: crush,overstriped
mdtidx           FID[seq:oid:ver]
     3           [0x2c0000400:0x6:0x0]
     0           [0x200000403:0x10:0x0]
     1           [0x240000402:0x11:0x0]
     2           [0x280000401:0x11:0x0]
     0           [0x200000403:0x11:0x0]
     1           [0x240000402:0x12:0x0]
     2           [0x280000401:0x12:0x0]
     0           [0x200000403:0x12:0x0]
     1           [0x240000402:0x13:0x0]
     2           [0x280000401:0x13:0x0]
     0           [0x200000403:0x13:0x0]
     1           [0x240000402:0x14:0x0]
     2           [0x280000401:0x14:0x0]
     0           [0x200000403:0x14:0x0]
     1           [0x240000402:0x15:0x0]
     2           [0x280000401:0x15:0x0]

Notice that MDT 3 is only used once.
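
The two listings are consistent with a very simple model: stripe 0 goes to the local MDT, and every later stripe is taken round-robin from a list containing only the other MDTs. The sketch below is a standalone illustration of that model, not Lustre code. The names `model_current_alloc`, `count_on_mdt_current` and `MODEL_CUR_MAX_MDTS` are hypothetical, and starting the round-robin at the lowest remote index is an assumption chosen to match the listings above.

```c
#include <assert.h>

#define MODEL_CUR_MAX_MDTS 64	/* hypothetical limit for this sketch only */

/*
 * Hypothetical model of the behaviour observed above (not Lustre code).
 * Fills out[0..stripe_count-1] with the MDT index chosen for each stripe.
 */
void model_current_alloc(int mdt_count, int local_idx, int stripe_count,
			 int *out)
{
	int remote[MODEL_CUR_MAX_MDTS];
	int nr_remote = 0;
	int i;

	assert(mdt_count >= 2 && mdt_count <= MODEL_CUR_MAX_MDTS);
	assert(local_idx >= 0 && local_idx < mdt_count);
	assert(stripe_count >= 1);

	/* The pool seen by an MDT lists only the other MDTs. */
	for (i = 0; i < mdt_count; i++)
		if (i != local_idx)
			remote[nr_remote++] = i;

	/* Stripe 0 is allocated locally, without reference to the pool. */
	out[0] = local_idx;

	/* All remaining stripes cycle over the remote targets only. */
	for (i = 1; i < stripe_count; i++)
		out[i] = remote[(i - 1) % nr_remote];
}

/* Count how many stripes in the layout landed on the given MDT. */
int count_on_mdt_current(const int *stripes, int stripe_count, int mdt)
{
	int i, n = 0;

	for (i = 0; i < stripe_count; i++)
		if (stripes[i] == mdt)
			n++;
	return n;
}
```

Under these assumptions, calling `model_current_alloc(4, 3, 16, out)` gives the index sequence 3, 0, 1, 2, 0, 1, 2, ... as in the 4-MDT listing, and `model_current_alloc(2, 1, 8, out)` gives one stripe on MDT 1 followed by seven on MDT 0.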

Allocation of the first stripe is handled like this, without reference to the pool:
        /* Allocate the first stripe locally */
        rc = dt_fid_alloc(env, lod->lod_child, &fid, NULL, NULL);
        if (rc < 0)
                GOTO(out, rc);

        stripes[0] = dt_locate_at(env, lod->lod_child, &fid,
                                  dt->do_lu.lo_dev->ld_site->ls_top_dev, &conf);

Then the QOS/RR allocation code is called to allocate the rest of the stripes.

I'm not sure what to do about this - the device init process doesn't really seem like something to mess with.
I'm thinking the right thing to do is to special-case this for RR + overstriping.

Basically, add one more index to the range that can be selected during RR, and if that extra index is selected,
do a local allocation.  It complicates the code slightly, but it's the only solution that seems sane.
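
As a rough illustration of that idea, the standalone sketch below extends the round-robin range by one slot and maps the extra slot to a local allocation. It is a model only, not the Lustre implementation: the names `model_proposed_alloc`, `count_on_mdt_proposed` and `MODEL_NEW_MAX_MDTS` are hypothetical, and placing the extra slot after the last remote target is an assumption (the proposal above only says that one more index is added to the range).

```c
#include <assert.h>

#define MODEL_NEW_MAX_MDTS 64	/* hypothetical limit for this sketch only */

/*
 * Hypothetical model of the proposed special case (not Lustre code).
 * Round-robin runs over nr_remote + 1 slots; the extra slot stands for
 * "allocate this stripe on the local MDT".
 */
void model_proposed_alloc(int mdt_count, int local_idx, int stripe_count,
			  int *out)
{
	int remote[MODEL_NEW_MAX_MDTS];
	int nr_remote = 0;
	int nr_slots;
	int i;

	assert(mdt_count >= 2 && mdt_count <= MODEL_NEW_MAX_MDTS);
	assert(local_idx >= 0 && local_idx < mdt_count);
	assert(stripe_count >= 1);

	/* As before, the pool itself only lists the other MDTs. */
	for (i = 0; i < mdt_count; i++)
		if (i != local_idx)
			remote[nr_remote++] = i;

	/* One extra selectable index beyond the remote targets. */
	nr_slots = nr_remote + 1;

	/* Stripe 0 is still allocated locally, outside the pool. */
	out[0] = local_idx;

	for (i = 1; i < stripe_count; i++) {
		int slot = (i - 1) % nr_slots;

		/* The extra slot triggers a local allocation. */
		out[i] = slot < nr_remote ? remote[slot] : local_idx;
	}
}

/* Count how many stripes in the layout landed on the given MDT. */
int count_on_mdt_proposed(const int *stripes, int stripe_count, int mdt)
{
	int i, n = 0;

	for (i = 0; i < stripe_count; i++)
		if (stripes[i] == mdt)
			n++;
	return n;
}
```

In this model, 16 stripes created on MDT 3 of 4 end up with four stripes on each MDT, and 8 stripes created on MDT 1 of 2 end up with four on each, instead of the local MDT holding a single stripe.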

I'll do that if there's not an objection.

Comment by Gerrit Updater [ 19/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49707/
Subject: LU-12273 obd: Reserve metadata overstriping flags
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8b62a8ca9c2e824d94fbf8bc266b4028e9b5fa63
