[LU-14692] deprecate use of OST FID SEQ 0 for MDT0000 Created: 18/May/21  Updated: 08/Jan/24  Resolved: 21/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Dongyang Li
Resolution: Fixed Votes: 0
Labels: medium

Attachments: Text File serial.txt     Text File stdout.txt    
Issue Links:
Related
is related to LU-10487 ostid_set_{seq,id}() badness Open
is related to LU-11912 reduce number of OST objects created ... Resolved
is related to LU-16692 replay-single: test_70c osp_fid_diff(... Open
is related to LU-9054 sanity test_312: FAIL: blksz error: ,... Reopened

 Description   

Since Lustre 2.4.0 and DNE1, it has been possible to create OST objects using a different FID SEQ range for each MDT, to avoid contention during MDT object precreation.

Objects that are created by MDT0000 are put into FID SEQ 0 (O/0/d*) on all OSTs and have a filename that is the decimal FID OID in ASCII. However, SEQ=0 objects are remapped to IDIF FID SEQ (0x100000000 | (ost_idx << 16)) so that they are unique across all OSTs.

Objects that are created by other MDTs (or MDT0000 after 2^48 objects are created in SEQ 0) use a unique SEQ in the FID_SEQ_NORMAL range (> 0x200000400), and use a filename that is the hexadecimal FID OID in ASCII.
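
To make the mapping concrete, here is a minimal, illustrative C sketch of the SEQ=0 to IDIF remapping described above (struct lu_fid_sketch and seq0_to_idif() are hypothetical names used only for this example; they roughly correspond to what fid_idif_seq()/ostid_to_fid() do in the Lustre headers):

#include <stdint.h>

/* Illustrative constants, matching the values quoted above. */
#define FID_SEQ_IDIF    0x100000000ULL  /* base of the IDIF SEQ range */
#define FID_SEQ_NORMAL  0x200000400ULL  /* start of the "normal" SEQ range */

struct lu_fid_sketch {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/*
 * Map a legacy SEQ=0 object (48-bit on-disk object ID plus the OST index)
 * to a unique IDIF FID: the OST index and the top 16 bits of the object ID
 * are folded into the SEQ, and the low 32 bits become the OID.
 */
static struct lu_fid_sketch seq0_to_idif(uint64_t objid, uint32_t ost_idx)
{
        struct lu_fid_sketch fid;

        fid.f_seq = FID_SEQ_IDIF | ((uint64_t)ost_idx << 16) |
                    ((objid >> 32) & 0xffff);
        fid.f_oid = (uint32_t)objid;
        fid.f_ver = 0;
        return fid;
}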

For compatibility with pre-DNE MDTs and OSTs, the use of SEQ=0 by MDT0000 was kept until now, but there is no longer a reason to keep this compatibility for new filesystems. It would be better to assign MDT0000 a "regular" FID SEQ range at startup, so that the SEQ=0 compatibility can eventually be removed. That would ensure OST objects have "proper and unique" FIDs, and avoid the complexity of mapping between the old SEQ=0 48-bit OID values and the IDIF FIDs.

Older filesystems using SEQ=0 would eventually delete old objects in this range and/or could be forced to migrate to using new objects to clean up the remaining usage, if necessary.



 Comments   
Comment by Andreas Dilger [ 18/May/21 ]

This is somewhat related to implementing LU-11912, which would also speed up the move to a new SEQ range for MDT0000. However, that patch doesn't avoid the initial use of SEQ=0 on a new filesystem (which is what this ticket is about); it only accelerates the move away from SEQ=0 after a few million files have been created in the filesystem.

Comment by Gerrit Updater [ 10/Dec/21 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45822
Subject: LU-14692 osp: deprecate IDIF sequence for MDT0000
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ee4af1e009ed7535d271e118040a83e57674cfbf

Comment by Gerrit Updater [ 25/Jan/22 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46293
Subject: LU-14692 tests: allow FID_SEQ_NORMAL for MDT0000
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fd4b785de75608c0652500625e82e3668f8a9495

Comment by Gerrit Updater [ 19/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46293/
Subject: LU-14692 tests: allow FID_SEQ_NORMAL for MDT0000
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eaae4655567b16260237764dadb7ab57df8b0edd

Comment by Alex Zhuravlev [ 20/Jan/23 ]

with this patch landed, 100% of my local tests fail:

== sanity test 312: make sure ZFS adjusts its block size by write pattern ========================================================== 05:05:02 (1674191102)
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.0159779 s, 256 kB/s
1+0 records in
1+0 records out
16384 bytes (16 kB, 16 KiB) copied, 0.0165552 s, 990 kB/s
1+0 records in
1+0 records out
65536 bytes (66 kB, 64 KiB) copied, 0.0225513 s, 2.9 MB/s
1+0 records in
1+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.0189756 s, 13.8 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227387 s, 46.1 MB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.0551142 s, 74.3 kB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.029839 s, 137 kB/s
 sanity test_312: @@@@@@ FAIL: blksz error, actual 4096,  expected: 2 * 1 * 4096 
  Trace dump:
  = ./../tests/test-framework.sh:6549:error()
  = sanity.sh:24840:test_312()
  = ./../tests/test-framework.sh:6887:run_one()
  = ./../tests/test-framework.sh:6937:run_one_logged()
  = ./../tests/test-framework.sh:6773:run_test()
  = sanity.sh:24863:main()
Comment by Gerrit Updater [ 20/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49720
Subject: LU-14692 tests: restore sanity/312 to always_except
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 73b154bd53f36da8907701077f2182c933364c62

Comment by Andreas Dilger [ 20/Jan/23 ]

Alex, the "allow FID_SEQ_NORMAL for MDT0000" patch removes "always_except LU-9054 312", but I'm not sure why, since it doesn't look related to the FID SEQ at all. It should be added back.

Comment by Gerrit Updater [ 25/Jan/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49754
Subject: LU-14692 tests: allow FID_SEQ_NORMAL for MDT0000
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 6b69a998e14917656556e62c6a4e4f33f80e2b4b

Comment by Gerrit Updater [ 31/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49720/
Subject: LU-14692 tests: restore sanity/312 to always_except
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8767d2e44110fc19e624e963d5ebc788409339d3

Comment by Gerrit Updater [ 21/Mar/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45822/
Subject: LU-14692 osp: deprecate IDIF sequence for MDT0000
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6d2e7d191a7b27cde62b605dbed14488cfd4d410

Comment by Peter Jones [ 21/Mar/23 ]

Landed for 2.16

Comment by Alex Zhuravlev [ 27/Mar/23 ]

not sure, but I hadn't seen the following problem before the last wave of landings, which included LU-14692:

LustreError: 343158:0:(osp_internal.h:530:osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x2c0000401:0x2:0x0], fid2:[0x100010000:0x1:0x0] in conf-sanity / 84
...
PID: 343158  TASK: ffff8b7824a605c0  CPU: 1   COMMAND: "tgt_recover_0"
 #0 [ffff8b783d273578] panic at ffffffff8f0b9786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffff8b783d2735f8] osp_create at ffffffffc1198895 [osp]
    /home/lustre/master-mine/lustre/osp/osp_internal.h: 529
 #2 [ffff8b783d273680] lod_sub_create at ffffffffc113534e [lod]
    /home/lustre/master-mine/lustre/include/dt_object.h: 2333
 #3 [ffff8b783d2736f0] lod_striped_create at ffffffffc112076b [lod]
    /home/lustre/master-mine/lustre/lod/lod_object.c: 6338
 #4 [ffff8b783d273760] lod_xattr_set at ffffffffc1128200 [lod]
    /home/lustre/master-mine/lustre/lod/lod_object.c: 5068
 #5 [ffff8b783d273810] mdd_create_object at ffffffffc0f76a93 [mdd]
    /home/lustre/master-mine/lustre/include/dt_object.h: 2832
 #6 [ffff8b783d273940] mdd_create at ffffffffc0f81f98 [mdd]
    /home/lustre/master-mine/lustre/mdd/mdd_dir.c: 2827
 #7 [ffff8b783d273a40] mdt_reint_open at ffffffffc1038328 [mdt]
    /home/lustre/master-mine/lustre/mdt/mdt_open.c: 1574
 #8 [ffff8b783d273bf8] mdt_reint_rec at ffffffffc102731f [mdt]
    /home/lustre/master-mine/lustre/mdt/mdt_reint.c: 3240
 #9 [ffff8b783d273c20] mdt_reint_internal at ffffffffc0ff6ef6 [mdt]
    /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
#10 [ffff8b783d273c58] mdt_intent_open at ffffffffc1002982 [mdt]
    /home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4826
#11 [ffff8b783d273c98] mdt_intent_policy at ffffffffc0fffe79 [mdt]
    /home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4971
#12 [ffff8b783d273cf8] ldlm_lock_enqueue at ffffffffc08bdbdf [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c: 1794
#13 [ffff8b783d273d60] ldlm_handle_enqueue0 at ffffffffc08e5046 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lockd.c: 1441
#14 [ffff8b783d273dd8] tgt_enqueue at ffffffffc091fd1f [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/target/tgt_handler.c: 1446
#15 [ffff8b783d273df0] tgt_request_handle at ffffffffc0926147 [ptlrpc]
    /home/lustre/master-mine/lustre/include/lu_target.h: 645
#16 [ffff8b783d273e68] handle_recovery_req at ffffffffc08c8c3c [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2418
#17 [ffff8b783d273e98] target_recovery_thread at ffffffffc08d1300 [ptlrpc]
    /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2677
#18 [ffff8b783d273f10] kthread at ffffffff8f0d5199
    /tmp/kernel/kernel/kthread.c: 340
Comment by Dongyang Li [ 27/Mar/23 ]

Alex, could you share the vmcore-dmesg from the crash?
I wonder if the change to the "normal SEQ" happened after replay_barrier; when the MDT starts again for recovery, it will see the old IDIF SEQ from disk.

Comment by Andreas Dilger [ 27/Mar/23 ]

Alex, what test was running when the failure was hit? There was some discussion of this issue with Dongyang; basically, replay_barrier is discarding the SEQ update (which is sync on the server and otherwise atomic) because the underlying storage was marked read-only.

The open question was whether this LASSERT() should be relaxed to handle the case of write loss (e.g. due to controller cache failure) at the same time as a SEQ rollover. The SEQ rollover is definitely going to happen more often now (once per 32M OST objects vs. once per 4B objects), but if the storage is losing sync writes then there are a lot of things that will go badly.
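
For context, a rough sketch of the assumption that fires here, assuming a simplified osp_fid_diff() (the actual helper in lustre/osp/osp_internal.h also has special handling for IDIF FIDs and differs in detail):

/*
 * Simplified sketch: the OSP precreate logic computes the distance between
 * two object FIDs by comparing only their OIDs, and asserts that both FIDs
 * are in the same SEQ.  If a SEQ update is lost (or a rollover lands between
 * the two FIDs being compared), fid_seq(fid1) != fid_seq(fid2) and the
 * LASSERT fires, producing the crash shown above.
 */
static inline int osp_fid_diff_sketch(const struct lu_fid *fid1,
                                      const struct lu_fid *fid2)
{
        LASSERTF(fid_seq(fid1) == fid_seq(fid2),
                 "fid1: "DFID", fid2: "DFID"\n", PFID(fid1), PFID(fid2));

        return fid_oid(fid1) - fid_oid(fid2);
}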

Comment by Alex Zhuravlev [ 28/Mar/23 ]

it's always conf-sanity/84; the stdout/console logs are attached.

Comment by Dongyang Li [ 28/Mar/23 ]

From the console log:

[ 6684.760425] Lustre: Mounted lustre-client
[ 6686.111594] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x00000002c0000400-0x0000000300000400]:1:ost
[ 6686.127442] Lustre: Skipped 2 previous similar messages
[ 6686.127744] Lustre: cli-lustre-OST0001-super: Allocated super-sequence [0x00000002c0000400-0x0000000300000400]:1:ost]
[ 6686.127895] Lustre: Skipped 1 previous similar message
[ 6691.011345] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 6691.028634] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[ 6691.211977] Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x100010000 to 0x2c0000401
[ 6692.967973] systemd[1]: mnt-lustre\x2dmds1.mount: Succeeded.
[ 6693.003490] Lustre: Failing over lustre-MDT0000

The sequence update from 0x100010000 to 0x2c0000401 was lost after replay_barrier.

Comment by Andreas Dilger [ 28/Mar/23 ]

Can the test be updated to do something simple like "lfs setstripe -i -1 $DIR/$tfile.tmp" to force the sequence update before the replay barrier?

Comment by Andreas Dilger [ 28/Mar/23 ]

The test already has something similar:

# make sure new superblock labels are sync'd before disabling writes
sync_all_data
sleep 5

so adding a file create on all OSTs is reasonable.

Comment by Alex Zhuravlev [ 29/Mar/23 ]

this time in Maloo: https://testing.whamcloud.com/test_sets/b36df675-87ec-4fb5-9c8b-57add55397ec

[11176.501594] LustreError: 567675:0:(osp_internal.h:538:osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x280000bd1:0x2c6d:0x0], fid2:[0x280000bd0:0x2c6c:0x0]

Comment by Dongyang Li [ 29/Mar/23 ]

I will update conf-sanity/84.
Alex, the new crash is a different issue, mostly because of the landing of https://review.whamcloud.com/c/fs/lustre-release/+/38424/
That patch introduces a SEQ width of 16384 in Maloo, so the SEQ change will happen more frequently and at less predictable points.
To make sure the SEQ change doesn't happen after replay_barrier, the patch from 38424 actually has force_new_seq, which changes the SEQ before test suites like replay-single start. The log shows it did change the SEQ,
but I think a SEQ width of 16384 is not enough for the whole replay-single run: given we have only 2 OSTs, more objects will be created on each OST.

I think there are 2 things we could do: use force_new_seq for every replay_barrier, which I think is a bit too heavy, or enlarge the default SEQ width of 16384 according to the number of OSTs.

Note we don't really need force_new_seq for conf-sanity/84: the change from the IDIF SEQ to a normal SEQ happens as soon as the OSP connects, so we just need to wait for that before using replay_barrier.

Comment by Gerrit Updater [ 30/Mar/23 ]

"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50477
Subject: LU-14692 tests: wait for osp in conf-sanity/84
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2816476614a92ba675418c7434001d946c8ec81e

Comment by Etienne Aujames [ 31/Mar/23 ]

I hit the same issue as bzzz (test replay-single 70c):
https://testing.whamcloud.com/test_sets/cbcbb9b2-656c-44bd-b324-31c9dc39539e

I have opened a new ticket for this: LU-16692

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49754/
Subject: LU-14692 tests: allow FID_SEQ_NORMAL for MDT0000
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 1a337b4a5b138eb2846ed12b25f5e1725a647670

Comment by Gerrit Updater [ 01/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50477/
Subject: LU-14692 tests: wait for osp in conf-sanity/84
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a9b7d73964b8b655c6c628820464342309f11356
