Details

    • New Feature
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0

    Description

      LU-9846 describes the overstriping feature for data, which allows placing more than one stripe per OST.  The same can be done with MDTs, for somewhat similar reasons:

      "it allows more concurrency on the MDT, exceeding single-directory size limitations, directory migration/compaction, etc." (per Andreas)

      This exists in limited form today, accessible through a fail_loc, OBD_FAIL_LARGE_STRIPE (0x1703), which sanity test 300k uses to put 192 stripes on MDT0:

              #define OBD_FAIL_LARGE_STRIPE   0x1703
              $LCTL set_param fail_loc=0x1703
              $LFS setdirstripe -i 0 -c192 $DIR/$tdir/striped_dir ||
                      error "set striped dir error"

      Actually doing this as a feature requires various other enabling changes, but this test shows it should be possible.  It's also possible to use the method in this test to create temporary setups for benchmarking this idea to confirm it's worth pursuing.

          Activity

            [LU-12273] DNE3: Metadata overstriping
            pjones Peter Jones added a comment -

            Landed for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/35034/
            Subject: LU-12273 lod: metadata overstriping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 81ac7c0c989dd862e2215a4635c77e5123289658

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49707/
            Subject: LU-12273 obd: Reserve metadata overstriping flags
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8b62a8ca9c2e824d94fbf8bc266b4028e9b5fa63

            So this is fascinating.

            Each MDT does not know about itself in the pool code, because it is the local device and is handled differently.
            The list of targets (in the QOS/RR pool code) on an MDT contains only the other MDTs in the system.

            The practical result is that the local MDT, which holds the first stripe, is never selected by the allocation code
            running on that MDT, so it gets only one stripe.

            E.g., with 2 MDTs:

            lmv_stripe_count: 8 lmv_stripe_offset: 1 lmv_hash_type: crush,overstriped
            mdtidx           FID[seq:oid:ver]
                 1           [0x240000400:0x2:0x0]
                 0           [0x200000401:0x2:0x0]
                 0           [0x200000401:0x3:0x0]
                 0           [0x200000401:0x4:0x0]
                 0           [0x200000401:0x5:0x0]
                 0           [0x200000401:0x6:0x0]
                 0           [0x200000401:0x7:0x0]
                 0           [0x200000401:0x8:0x0]

            Or, with 4 MDTs, it can look like this: 
            lmv_stripe_count: 16 lmv_stripe_offset: 3 lmv_hash_type: crush,overstriped
            mdtidx           FID[seq:oid:ver]
                 3           [0x2c0000400:0x6:0x0]
                 0           [0x200000403:0x10:0x0]
                 1           [0x240000402:0x11:0x0]
                 2           [0x280000401:0x11:0x0]
                 0           [0x200000403:0x11:0x0]
                 1           [0x240000402:0x12:0x0]
                 2           [0x280000401:0x12:0x0]
                 0           [0x200000403:0x12:0x0]
                 1           [0x240000402:0x13:0x0]
                 2           [0x280000401:0x13:0x0]
                 0           [0x200000403:0x13:0x0]
                 1           [0x240000402:0x14:0x0]
                 2           [0x280000401:0x14:0x0]
                 0           [0x200000403:0x14:0x0]
                 1           [0x240000402:0x15:0x0]
                 2           [0x280000401:0x15:0x0]

            Notice that MDT 3 is used only once.

            Allocation of the first stripe is handled like this, without reference to the pool:
                    /* Allocate the first stripe locally */
                    rc = dt_fid_alloc(env, lod->lod_child, &fid, NULL, NULL);
                    if (rc < 0)
                            GOTO(out, rc);

                    stripes[0] = dt_locate_at(env, lod->lod_child, &fid,
                                              dt->do_lu.lo_dev->ld_site->ls_top_dev, &conf);

            Then the QOS/RR allocation code is called to allocate the rest of the stripes.
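As an illustration only, the behaviour described above can be reproduced with a toy model. This is a sketch, not Lustre code: the function name `sim_alloc_excluding_local` and its parameters are hypothetical, and it assumes the round-robin pass starts at the lowest-numbered remote MDT, which is consistent with the listings above but not confirmed by them.

```c
#include <assert.h>

/*
 * Toy model (not Lustre code): place stripe_count directory stripes
 * across mdt_count MDTs. Stripe 0 goes on the local MDT; the remaining
 * stripes are assigned round-robin over a target list that excludes
 * the local MDT. On return, per_mdt[i] holds the number of stripes
 * placed on MDT i.
 */
static void sim_alloc_excluding_local(unsigned int mdt_count,
				      unsigned int local_mdt,
				      unsigned int stripe_count,
				      unsigned int *per_mdt)
{
	unsigned int remote_count;
	unsigned int i;

	assert(mdt_count >= 2 && local_mdt < mdt_count && stripe_count >= 1);
	remote_count = mdt_count - 1;

	for (i = 0; i < mdt_count; i++)
		per_mdt[i] = 0;

	/* First stripe: always allocated locally, without the pool. */
	per_mdt[local_mdt]++;

	/* Remaining stripes: round-robin over the remote MDTs only. */
	for (i = 0; i < stripe_count - 1; i++) {
		unsigned int slot = i % remote_count;
		/* Map the slot to an MDT index, skipping the local MDT. */
		unsigned int mdt = slot < local_mdt ? slot : slot + 1;

		per_mdt[mdt]++;
	}
}
```

Under these assumptions, 4 MDTs with local MDT 3 and 16 stripes give 5 stripes each on MDTs 0 to 2 and 1 stripe on MDT 3; 2 MDTs with local MDT 1 and 8 stripes give 7 stripes on MDT 0 and 1 on MDT 1. Both match the per-MDT counts in the listings above.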

            I'm not sure what to do about this. The device init process doesn't really seem like something to mess with.
            I'm thinking the right thing to do is to special-case this for RR + overstriping.

            Basically, add one more to the range of indices that can be selected during RR, and if that extra index is
            selected, do a local allocation.  It complicates the code slightly, but it's the only solution that seems sane.

            I'll do that if there's not an objection.

            paf0186 Patrick Farrell added the comment above.
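The round-robin change proposed in the comment above (one extra selectable index that stands for the local MDT) can be sketched with a toy model. This is not taken from the landed patch: `sim_alloc_with_local_slot` and its parameters are hypothetical, and placing the extra slot last in the rotation is an assumption made for illustration.

```c
#include <assert.h>

/*
 * Toy model (not Lustre code): stripe 0 is still allocated locally,
 * but the round-robin range is widened by one slot. Slots
 * 0..remote_count-1 map to the remote MDTs; the extra slot
 * (index remote_count) triggers a local allocation. On return,
 * per_mdt[i] holds the number of stripes placed on MDT i.
 */
static void sim_alloc_with_local_slot(unsigned int mdt_count,
				      unsigned int local_mdt,
				      unsigned int stripe_count,
				      unsigned int *per_mdt)
{
	unsigned int remote_count;
	unsigned int slot_count;
	unsigned int i;

	assert(mdt_count >= 1 && local_mdt < mdt_count && stripe_count >= 1);
	remote_count = mdt_count - 1;
	slot_count = remote_count + 1;	/* one more index than before */

	for (i = 0; i < mdt_count; i++)
		per_mdt[i] = 0;

	/* First stripe: always allocated locally, as today. */
	per_mdt[local_mdt]++;

	for (i = 0; i < stripe_count - 1; i++) {
		unsigned int slot = i % slot_count;
		unsigned int mdt;

		if (slot == remote_count)
			mdt = local_mdt;	/* extra slot: allocate locally */
		else
			mdt = slot < local_mdt ? slot : slot + 1;

		per_mdt[mdt]++;
	}
}
```

With these assumptions, 16 stripes over 4 MDTs (local MDT 3) come out as 4 stripes per MDT, and 8 stripes over 2 MDTs (local MDT 1) come out as 4 per MDT, instead of the local MDT holding a single stripe.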

            gerrit Gerrit Updater added a comment -

            "Patrick Farrell <farr0186@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49707
            Subject: LU-12273 obd: Reserve metadata overstriping flags
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 91f41501f508166e33881d00ac795f9296f8978c

            adilger Andreas Dilger added a comment -

            "if we compare 2 x MDT per MDS vs MDT Overstriping=2, it's 2 x jbd2 thread vs still single journal thread, etc?"

            Correct. With overstriping there is still only a single journal/device and some filesystem locks, while two separate MDTs have totally separate infrastructure (but each one is half the size, needs more space balancing, and doubles journal memory usage). If we can get "close" performance with 2x or 4x overstriping vs. 2x or 4x MDTs, then using directory overstriping would be better overall.

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35939
            Subject: LU-12273 lod: Trivial metadata overstriping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6e97b13d5dd0ba3d49b5c5191f4733e95663e573

            gerrit Gerrit Updater added a comment -

            Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35034
            Subject: LU-12273 lod: metadata overstriping
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c3bfede6854263915c967f53602d0a1e46d6d1bf

            People

              paf0186 Patrick Farrell
              pfarrell Patrick Farrell (Inactive)