[LU-17334] Client should handle dir/file/object created on newly added MDT/OST Created: 04/Dec/23  Updated: 30/Jan/24

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Andreas Dilger Assignee: Lai Siyao
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Cloners
Clones LU-17300 Avoid creating new dir/file/object on... Open
Related
is related to LU-12036 Add option to create new OSTs inactive Resolved
is related to LU-12998 DNE3: tunable to disable directory cr... Resolved
is related to LU-13417 DNE3: mkdir() automatically create re... Resolved
is related to LU-10784 DNE3: mkdir() automatically create re... Resolved
is related to LU-17327 Write conf-sanity test case for onli... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When a new MDT or OST is added to a filesystem without no_create set, a new subdirectory or file could be created on the new MDT, or a new object created on the new OST, relatively quickly after it is added, in particular because the new MDT/OST would be preferred by QOS space balancing due to its large amount of free space. However, it might take a few seconds for the addition of the new MDT/OST to be propagated to all of the clients, so there is a risk that the MDS creates a file object on an OST that a client is not yet aware of. There is a much smaller risk that a subdirectory or file is created on an MDT that a client is not yet aware of (depending on workload, e.g. if multiple clients are working in the same directory tree in parallel).

This ticket tracks the case where a new MDT or OST that is not yet in the client's configuration is referenced by a new subdirectory, file, or object: the client should wait and retry for some short time, possibly actively pulling the config from the MGS to see if the target was newly added, instead of immediately returning an error to the application. LU-17300 tracks the issue of not creating new subdirs/files/objects on newly-added targets in the first place.

It is still possible that the file layout itself is corrupted for whatever reason and references an OST or MDT index that will never exist in the filesystem, so the client should not retry this operation indefinitely. But an (up to) ~30-second application delay while the configuration is distributed across the cluster is far preferable to the application getting an error.
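
As a rough illustration of the intended client behaviour (target_known_locally() is a hypothetical placeholder for the real LOV/LMV target check, not an existing Lustre function; the timeout is simply the MGC registration limit discussed in the comments below):

/* Hypothetical sketch: bound the wait for the client configuration to catch
 * up before failing the operation back to the application. */
static int wait_for_new_target(unsigned int tgt_idx)
{
        time64_t deadline = ktime_get_seconds() + 2 * MGC_TARGET_REG_LIMIT_MAX;

        while (!target_known_locally(tgt_idx)) {        /* hypothetical check */
                if (ktime_get_seconds() > deadline)
                        return -EINVAL; /* layout is likely bad, give up */

                /* sleep ~1s and re-check; the MGS config update should add
                 * the new target in the meantime */
                schedule_timeout_interruptible(cfs_time_seconds(1));
        }

        return 0;
}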



 Comments   
Comment by Andreas Dilger [ 04/Dec/23 ]

The new conf-sanity test_46b added in patch: https://review.whamcloud.com/53300 "LU-17327 tests: add test case for online MDT and OST addition" shows this case nicely:

[  183.117948] Lustre: DEBUG MARKER: == conf-sanity test 46b: online OST and MDT addition ===== 16:48:36 (1701380916)
[  206.830361] Lustre: Mounted lustre-client
[  214.067224] LustreError: 14230:0:(lov_ea.c:279:lsme_unpack()) lustre-clilov_UUID: OST index 1 more than OST count 1
[  214.070240] Lustre: 14230:0:(lov_pack.c:57:lov_dump_lmm_common()) objid 0x2ab:1025, magic 0x0bd10bd0, pattern 0x1
[  214.072822] Lustre: 14230:0:(lov_pack.c:61:lov_dump_lmm_common()) stripe_size 4194304, stripe_count 1, layout_gen 0
[  214.075459] Lustre: 14230:0:(lov_pack.c:81:lov_dump_lmm_objects()) stripe 0 idx 1 subobj 0x2c0000401:2
[  214.078104] LustreError: 14230:0:(lcommon_cl.c:196:cl_file_inode_init()) lustre: failed to initialize cl_object [0x200000401:0x2ab:0x0]: rc = -22
[  214.081653] LustreError: 14230:0:(llite_lib.c:3613:ll_prep_inode()) new_inode -fatal: rc -22
[  460.900709] Lustre: DEBUG MARKER: conf-sanity test_46b: @@@@@@ FAIL: rsync failed

Somewhere along this codepath, preferably before actually printing a console message, the client should retry this operation periodically (once per second assuming no RPC is being sent, up to 2 * MGC_TARGET_REG_LIMIT_MAX seconds in total) before printing the error about the bad layout and returning an error to the application.

Since this only applies to new targets that the client isn't even aware of, there shouldn't be a need to delay indefinitely due to recovery. If the OST/MDT registers with the MGS and is then taken offline again immediately, the client will still be notified of the new target by the MGS, at which point the RPC will be blocked in normal RPC resend and will wait there indefinitely.

The number of retries on the client might be shortened slightly by immediately triggering the MGC to refetch the configuration from the MGS, but I'm not sure that is easily done in this part of the code, and it is a relatively small optimization for a very uncommon situation, so I don't think it is needed for the initial implementation.

Comment by Andreas Dilger [ 04/Dec/23 ]

It looks like a reasonable approach to fixing this might be to add a loop in lsme_unpack() (which is called in various places for different layout types) that waits and retries (not in a busy loop) for some number of seconds for the filesystem layout to be updated if either the "loi->loi_ost_idx >= lov->desc.ld_tgt_count" or "!ltd" condition is hit. Since this situation is expected to become more "normal" in the future, it would be useful if the message is only printed once per 600s as a D_WARNING until the retry time has elapsed, and only then printed as a CERROR(). Something like:

        static time64_t next_print;
        unsigned int level;
        time64_t retry_limit = 0;

        :
        :
retry:
                if (unlikely(loi->loi_ost_idx >= lov->desc.ld_tgt_count &&
                             !lov2obd(lov)->obd_process_conf)) {
                        time64_t now = ktime_get_seconds();

                        /* warn on the first hit, error only if giving up */
                        if (retry_limit == 0) {
                                level = now > next_print ? D_WARNING : D_INFO;
                                retry_limit = now + MGC_TARGET_REG_LIMIT_MAX;
                        } else if (now >= retry_limit) {
                                level = D_ERROR;
                        } else {
                                level = D_INFO;
                        }

                        /* log in debug log every loop, just to see it is trying */
                        CDEBUG_LIMIT(level, "%s: OST index %d more than OST count %d\n",
                                     lov->desc.ld_uuid.uuid, loi->loi_ost_idx,
                                     lov->desc.ld_tgt_count);
                        if (now > next_print) {
                                LCONSOLE_INFO("%s: wait %llds while client connects to new OST\n",
                                              lov->desc.ld_uuid.uuid, retry_limit - now);
                                next_print = retry_limit + 600;
                        }
                        if (now < retry_limit) {
                                /* sleep ~1s; returns 0 if the full timeout elapsed */
                                rc = schedule_timeout_interruptible(cfs_time_seconds(1));
                                if (rc == 0)
                                        goto retry;
                        }
                        lov_dump_lmm_v1(D_WARNING, lmm);
                        GOTO(out_lsme, rc = -EINVAL);
                }
and the same for the "!ltd" case, using the same global next_print and retry_limit so that the messages aren't printed multiple times for different threads/files or even OSTs within the same file.
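
For illustration, the "!ltd" branch might look roughly like the following sketch; the ltd = lov->lov_tgts[loi->loi_ost_idx] lookup is written from memory and may not match the current lsme_unpack() code exactly:

retry_missing:
                ltd = lov->lov_tgts[loi->loi_ost_idx]; /* existing lookup, from memory */
                if (unlikely(!ltd)) {
                        time64_t now = ktime_get_seconds();

                        /* reuse the same next_print/retry_limit state as above so
                         * the console is not spammed per thread, file, or OST */
                        if (retry_limit == 0)
                                retry_limit = now + MGC_TARGET_REG_LIMIT_MAX;

                        CDEBUG_LIMIT(now >= retry_limit ? D_ERROR : D_INFO,
                                     "%s: OST index %d missing\n",
                                     lov->desc.ld_uuid.uuid, loi->loi_ost_idx);
                        if (now < retry_limit &&
                            schedule_timeout_interruptible(cfs_time_seconds(1)) == 0)
                                goto retry_missing;

                        lov_dump_lmm_v1(D_WARNING, lmm);
                        GOTO(out_lsme, rc = -EINVAL);
                }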

Comment by Andreas Dilger [ 04/Dec/23 ]

Something similar would be needed somewhere in the LMV code on the client to handle the case of a new inode FID on an MDT that the client hasn't connected to yet.  That is less critical than the OST case (client accessing file it created itself, vs. client accessing a file/directory created by another client on a different MDT) but should still be fixed.  Having a test case for that would need something like mdtest in a loop creating small files and then verifying them on a different client/mountpoint.
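
Purely as a sketch of the shape that LMV change might take (lmv_tgt_by_index() below is a hypothetical placeholder for however the LMV code maps an MDT index to a connected target, and the message text and error code are only illustrative):

        /* Hypothetical sketch: wait a bounded time for the client to connect
         * to the MDT referenced by a new FID/LMV EA before giving up. */
        time64_t deadline = ktime_get_seconds() + 2 * MGC_TARGET_REG_LIMIT_MAX;

        while (!lmv_tgt_by_index(lmv, mdt_idx)) {       /* hypothetical helper */
                if (ktime_get_seconds() >= deadline) {
                        CDEBUG(D_ERROR, "%s: MDT index %u not connected\n",
                               obd->obd_name, mdt_idx);
                        RETURN(-ENODEV);
                }
                /* sleep ~1s while the MGS config update adds the new MDT */
                schedule_timeout_interruptible(cfs_time_seconds(1));
        }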

Comment by Andreas Dilger [ 05/Dec/23 ]

laisiyao, would you be able to look at the LMV side of this code while Jian is working on the LOV side? We would ideally want something working in the next couple of days, as there will be a customer system expanding from 20->40 MDTs, so having a similar retry loop for the "trusted.lmv references an unknown MDT index" case would be very useful.

Comment by Jian Yu [ 05/Dec/23 ]

On the LOV side, while building and testing the patch locally, I hit some build errors. I'm still resolving the failures.

Comment by Lai Siyao [ 05/Dec/23 ]

Yujian, where is your patch?

Comment by Jian Yu [ 05/Dec/23 ]

Hi laisiyao, I've not pushed it to Gerrit yet. The patch needs to pass local building and testing first.
Basically, the changes are in #comment-395401.

Comment by Andreas Dilger [ 05/Dec/23 ]

Jian, I was thinking that what might be a lot simpler than wait_event_interruptible_timeout() is something like schedule_timeout_interruptible(cfs_time_seconds(1)), which just goes through the loop more times. That is fine because the loop doesn't do anything except check ld_tgt_count and the layout, both of which have very low overhead.

This should basically never happen, so these checks could also be marked unlikely() so they don't add any overhead.
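
For reference, a rough sketch of the two shapes being compared (lov_tgt_waitq and layout_covers_idx() are placeholders, not existing Lustre symbols):

        /* Option A: wait_event_interruptible_timeout() would need a waitqueue
         * that is woken whenever the target count changes (placeholders here) */
        rc = wait_event_interruptible_timeout(lov_tgt_waitq,
                        layout_covers_idx(lov, loi->loi_ost_idx),
                        cfs_time_seconds(MGC_TARGET_REG_LIMIT_MAX));

        /* Option B (suggested): just sleep ~1s per loop and re-check; the check
         * itself is cheap and this path should almost never be taken */
        while (!layout_covers_idx(lov, loi->loi_ost_idx) &&
               ktime_get_seconds() < retry_limit)
                schedule_timeout_interruptible(cfs_time_seconds(1));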

Comment by Jian Yu [ 05/Dec/23 ]

Sure, Andreas, I'm testing the new changes.

Comment by Gerrit Updater [ 05/Dec/23 ]

"Jian Yu <yujian@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53335
Subject: LU-17334 lov: handle object created on newly added OST
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 117bf170337301512d0b2deaeb19b64b4f58baba

Comment by Andreas Dilger [ 06/Dec/23 ]

laisiyao, it is looking increasingly likely that the test failures remaining after Jian's LOV patch are related to MDT addition:

1701844910: rsync: stat "/mnt/lustre/d46b.conf-sanity/etc/NetworkManager" failed: No such file or directory (2)
1701844910: rsync: recv_generator: mkdir "/mnt/lustre/d46b.conf-sanity/etc/NetworkManager/conf.d" failed: No such file or directory (2)
:
:
(Default) /mnt/lustre/d46b.conf-sanity/etc/NetworkManager
lmm_fid:           [0x280000401:0x1:0x0]
stripe_count:  1 stripe_size:   4194304 pattern:       0 stripe_offset: -1

This is the first file to report an error, and it looks like the first file to be allocated on the newly-added MDT0001 (based on FID):

lustre: cli-ctl-lustre-MDT0001: Allocated super-sequence [0x0000000240000400-0x0000000280000400):1:mdt]

Comment by Lai Siyao [ 06/Dec/23 ]

Mmm, Yujian, I'll update your patch with the LMV change.

Comment by Andreas Dilger [ 06/Dec/23 ]

It looks like the 53335 patch is working reasonably well; the testing logs show the new retry messages being triggered:

 Lustre: 79053:0:(lov_ea.c:299:lsme_unpack()) lustre-clilov_UUID: OST index 1 more than OST count 1
 Lustre: lustre-clilov_UUID: wait 30s while client connects to new OST
 Lustre: 35469:0:(lov_ea.c:299:lsme_unpack()) lustre-clilov_UUID: OST index 2 more than OST count 2
 Lustre: lustre-clilov_UUID: wait 30s while client connects to new OST 

There are still some failures of the subtest, but these appear to be MDT issues rather than OST issues, since they relate to filename lookup (though the "No such file or directory (2)" = -ENOENT error is ambiguous as to whether an OST or MDT object could not be found).

Using the OSS console log timestamps, the failure looks like it happened just after MDT0003 was mounted, while the first OST was still finishing mounting:

MDS2 [ 2488.656376] Lustre: DEBUG MARKER: e2label /dev/mapper/mds4_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
MDS2 [ 2488.990425] Lustre: DEBUG MARKER: sync; sleep 1; sync
MDS2 [ 2490.646167] Lustre: DEBUG MARKER: e2label /dev/mapper/mds4_flakey 2>/dev/null

OSS [ 2493.319919] LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
OSS [ 2494.373579] LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
OSS [ 2494.931873] Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null
OSS [ 2495.391620] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param 				seq.cli-lustre-OST0001-super.width=18724
OSS [ 2495.728824] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check

LOG rsync: stat "/mnt/lustre/d46b.conf-sanity/python3.6/site-packages" failed: No such file or directory (2)

OSS [ 2497.208663] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
OSS [ 2497.426045] Lustre: DEBUG MARKER: trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
OSS [ 2497.448788] Lustre: DEBUG MARKER: trevis-33vm3.trevis.whamcloud.com: executing set_default_debug -1 all 4
OSS [ 2497.628831] Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 				2>/dev/null | grep -E ':[a-zA-Z]{3}[0-9]{4}'
OSS [ 2498.028865] Lustre: DEBUG MARKER: sync; sleep 1; sync
OSS [ 2500.674301] Lustre: DEBUG MARKER: e2label /dev/mapper/ost2_flakey 2>/dev/null
LOG Started lustre-OST0001

Comment by Andreas Dilger [ 07/Dec/23 ]

Lai, could you please make a separate patch for the LMV changes, so that it does not disrupt testing/landing of the LOV patch?

Please check out Jian's patch ("git fetch ssh://adilger@review.whamcloud.com:29418/fs/lustre-release refs/changes/35/53335/7 && git checkout FETCH_HEAD"), base your patch on top of it, and then rebase the test patch https://review.whamcloud.com/53300 onto yours, so that it will test both patches together. The test is sometimes failing in Gerrit Janitor with ENOSPC, but I'm not sure whether that is a problem with the test script or some issue with Lustre file/object allocation, because the test should only be copying 1/2 of the initial OST size. I've reduced this to 1/4 of the OST size to see if that makes a difference.

Comment by Gerrit Updater [ 07/Dec/23 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53363
Subject: LU-17334 lmv: handle object created on newly added MDT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b4405f028cb03c3780f30f9efa88ccb33f3ee621

Comment by Gerrit Updater [ 20/Dec/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53335/
Subject: LU-17334 lov: handle object created on newly added OST
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f35f897ec8ec0752ea4d4830e72f5193375a474b

Comment by Gerrit Updater [ 30/Jan/24 ]

"Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53860
Subject: LU-17334 lmv: exclude newly added MDT in mkdir
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4de7657c74729530485ffa31d7ca179e9dadfe5d
