
[LU-14692] deprecate use of OST FID SEQ 0 for MDT0000

Details


    Description

      Since Lustre 2.4.0 and DNE1, it has been possible to create OST objects using a different FID SEQ range for each MDT, to avoid contention during MDT object precreation.

      Objects that are created by MDT0000 are put into FID SEQ 0 (O/0/d*) on all OSTs and have a filename that is the decimal FID OID in ASCII. However, SEQ=0 objects are remapped to IDIF FID SEQ (0x100000000 | (ost_idx << 16)) so that they are unique across all OSTs.

      Objects that are created by other MDTs (or MDT0000 after 2^48 objects are created in SEQ 0) use a unique SEQ in the FID_SEQ_NORMAL range (> 0x200000400), and use a filename that is the hexadecimal FID OID in ASCII.

      For compatibility with pre-DNE MDTs and OSTs, the use of SEQ=0 by MDT0000 has been kept until now, but there is no longer a reason to keep this compatibility for new filesystems. It would be better to assign MDT0000 a "regular" FID SEQ range at startup, so that the SEQ=0 compatibility can eventually be removed. That would ensure OST objects have "proper and unique" FIDs, and avoid the complexity of mapping between the old SEQ=0 48-bit OID values and the IDIF FIDs.

      Older filesystems using SEQ=0 would eventually delete their old objects in this range, and if necessary could be forced to migrate to new objects to clean up any remaining usage.
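
      For reference, the SEQ an MDT is currently handing out to each OST can be seen from the OSP precreate state on the MDS; a minimal sketch, assuming the osp.*.prealloc_last_seq parameter name (from memory), with illustrative values:

      # on the MDS: show the SEQ currently used to precreate objects on each OST
      # an IDIF/compat value looks like 0x100010000 (SEQ 0 remapped per OST index);
      # a "regular" SEQ assigned at startup looks like 0x2c0000401
      lctl get_param osp.*.prealloc_last_seq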

      Attachments

        1. serial.txt
          778 kB
        2. stdout.txt
          484 kB

          Activity


            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50477
            Subject: LU-14692 tests: wait for osp in conf-sanity/84
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2816476614a92ba675418c7434001d946c8ec81e

            dongyang Dongyang Li added a comment -

            I will update conf-sanity/84.
            Alex, the new crash is a different issue, mostly because of the landing of https://review.whamcloud.com/c/fs/lustre-release/+/38424/
            That patch introduces a SEQ width of 16384 in Maloo, so the SEQ change will happen more frequently and at random points.
            To make sure a SEQ change doesn't happen after replay_barrier, the patch from 38424 actually has force_new_seq, which switches to a new SEQ before test suites like replay-single start. From the log it did change the SEQ,
            but I think a SEQ width of 16384 is not enough for the whole of replay-single: given we have only 2 OSTs, more objects will be created on each OST.

            I think there are 2 things we could do: use force_new_seq for every replay_barrier, which I think is a bit too heavy, or enlarge the default 16384 SEQ width according to the number of OSTs.

            Note we don't really need force_new_seq for conf-sanity/84: the change from the IDIF SEQ to a normal SEQ happens as soon as the OSP connects, so we just need to wait for that before using replay_barrier.
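
            A minimal sketch of that wait (not the landed patch), assuming the OSP exposes its current precreate SEQ as osp.*.prealloc_last_seq (parameter name from memory) and using the usual test-framework.sh helpers; the 60s timeout is arbitrary:

            # wait until every MDT0000 OSP has moved off the IDIF SEQ
            # (values starting with 0x1) before taking the replay barrier
            wait_update_facet mds1 \
                    "$LCTL get_param -n osp.*.prealloc_last_seq | grep -c '^0x1'" \
                    "0" 60 || error "some OSP is still in the IDIF SEQ range"
            replay_barrier mds1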


            bzzz Alex Zhuravlev added a comment -

            this time in Maloo: https://testing.whamcloud.com/test_sets/b36df675-87ec-4fb5-9c8b-57add55397ec

            [11176.501594] LustreError: 567675:0:(osp_internal.h:538:osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x280000bd1:0x2c6d:0x0], fid2:[0x280000bd0:0x2c6c:0x0]


            adilger Andreas Dilger added a comment -

            The test already has something similar:

            # make sure new superblock labels are sync'd before disabling writes
            sync_all_data
            sleep 5
            

            so adding a file create on all OSTs is reasonable.
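
            One way to do that create before the barrier; a sketch only, where the -c -1 stripe count and temporary file name are my own choices rather than the landed change:

            # create a fully striped file so every OST instantiates an object
            # from its current SEQ, then clean it up before the replay barrier
            $LFS setstripe -c -1 $DIR/$tfile.tmp
            rm -f $DIR/$tfile.tmp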


            adilger Andreas Dilger added a comment -

            Can the test be updated to do something simple like "lfs setstripe -i -1 $DIR/$tfile.tmp" to force the sequence update before the replay barrier?

            dongyang Dongyang Li added a comment -

            From the console log:

            [ 6684.760425] Lustre: Mounted lustre-client
            [ 6686.111594] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x00000002c0000400-0x0000000300000400]:1:ost
            [ 6686.127442] Lustre: Skipped 2 previous similar messages
            [ 6686.127744] Lustre: cli-lustre-OST0001-super: Allocated super-sequence [0x00000002c0000400-0x0000000300000400]:1:ost]
            [ 6686.127895] Lustre: Skipped 1 previous similar message
            [ 6691.011345] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
            [ 6691.028634] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
            [ 6691.211977] Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x100010000 to 0x2c0000401
            [ 6692.967973] systemd[1]: mnt-lustre\x2dmds1.mount: Succeeded.
            [ 6693.003490] Lustre: Failing over lustre-MDT0000
            

            The sequence update from 0x100010000 to 0x2c0000401 was lost after replay_barrier.


            bzzz Alex Zhuravlev added a comment -

            it's always conf-sanity/84, stdout/console are attached.


            adilger Andreas Dilger added a comment -

            Alex, what test was running when the failure was hit? There was some discussion about this issue with Dongyang, basically that replay_barrier is discarding the SEQ update (which is sync on the server and otherwise atomic) because the underlying storage was marked read-only.

            The open question was whether this LASSERT() should be relaxed to handle the case of write loss (e.g. due to controller cache failure) at the same time as a SEQ rollover. The SEQ rollover will definitely happen more often now (once per 32M OST objects vs. once per 4B objects), but if the storage is losing sync writes then there are a lot of things that will go badly.

            dongyang Dongyang Li added a comment -

            Alex, could you share the vmcore-dmesg from the crash?
            I wonder if the change to the "normal SEQ" happened after replay_barrier; when the MDT starts again for recovery, it will see the old IDIF SEQ from disk.


            bzzz Alex Zhuravlev added a comment -

            not sure, but I haven't seen the following problem before the last wave of landings which include LU-14692:

            LustreError: 343158:0:(osp_internal.h:530:osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x2c0000401:0x2:0x0], fid2:[0x100010000:0x1:0x0] in conf-sanity / 84
            ...
            PID: 343158  TASK: ffff8b7824a605c0  CPU: 1   COMMAND: "tgt_recover_0"
             #0 [ffff8b783d273578] panic at ffffffff8f0b9786
                /tmp/kernel/kernel/panic.c: 299
             #1 [ffff8b783d2735f8] osp_create at ffffffffc1198895 [osp]
                /home/lustre/master-mine/lustre/osp/osp_internal.h: 529
             #2 [ffff8b783d273680] lod_sub_create at ffffffffc113534e [lod]
                /home/lustre/master-mine/lustre/include/dt_object.h: 2333
             #3 [ffff8b783d2736f0] lod_striped_create at ffffffffc112076b [lod]
                /home/lustre/master-mine/lustre/lod/lod_object.c: 6338
             #4 [ffff8b783d273760] lod_xattr_set at ffffffffc1128200 [lod]
                /home/lustre/master-mine/lustre/lod/lod_object.c: 5068
             #5 [ffff8b783d273810] mdd_create_object at ffffffffc0f76a93 [mdd]
                /home/lustre/master-mine/lustre/include/dt_object.h: 2832
             #6 [ffff8b783d273940] mdd_create at ffffffffc0f81f98 [mdd]
                /home/lustre/master-mine/lustre/mdd/mdd_dir.c: 2827
             #7 [ffff8b783d273a40] mdt_reint_open at ffffffffc1038328 [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_open.c: 1574
             #8 [ffff8b783d273bf8] mdt_reint_rec at ffffffffc102731f [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_reint.c: 3240
             #9 [ffff8b783d273c20] mdt_reint_internal at ffffffffc0ff6ef6 [mdt]
                /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
            #10 [ffff8b783d273c58] mdt_intent_open at ffffffffc1002982 [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4826
            #11 [ffff8b783d273c98] mdt_intent_policy at ffffffffc0fffe79 [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4971
            #12 [ffff8b783d273cf8] ldlm_lock_enqueue at ffffffffc08bdbdf [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c: 1794
            #13 [ffff8b783d273d60] ldlm_handle_enqueue0 at ffffffffc08e5046 [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lockd.c: 1441
            #14 [ffff8b783d273dd8] tgt_enqueue at ffffffffc091fd1f [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/target/tgt_handler.c: 1446
            #15 [ffff8b783d273df0] tgt_request_handle at ffffffffc0926147 [ptlrpc]
                /home/lustre/master-mine/lustre/include/lu_target.h: 645
            #16 [ffff8b783d273e68] handle_recovery_req at ffffffffc08c8c3c [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2418
            #17 [ffff8b783d273e98] target_recovery_thread at ffffffffc08d1300 [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2677
            #18 [ffff8b783d273f10] kthread at ffffffff8f0d5199
                /tmp/kernel/kernel/kthread.c: 340
            
            pjones Peter Jones added a comment -

            Landed for 2.16


            People

              dongyang Dongyang Li
              adilger Andreas Dilger
              Votes: 0
              Watchers: 8
