[LU-14692] deprecate use of OST FID SEQ 0 for MDT0000 Created: 18/May/21 Updated: 08/Jan/24 Resolved: 21/Mar/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.3 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Dongyang Li |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | medium |
| Attachments: | |
| Issue Links: | |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Since Lustre 2.4.0 and DNE1, it has been possible to create OST objects using a different FID SEQ range for each MDT, to avoid contention during MDT object precreation.

Objects created by MDT0000 are put into FID SEQ 0 (O/0/d*) on all OSTs, with a filename that is the decimal FID OID in ASCII. However, SEQ=0 objects are remapped to the IDIF FID SEQ (0x100000000 | (ost_idx << 16)) so that they are unique across all OSTs. Objects created by other MDTs (or by MDT0000 after 2^48 objects have been created in SEQ 0) use a unique SEQ in the FID_SEQ_NORMAL range (> 0x200000400), with a filename that is the hexadecimal FID OID in ASCII.

For compatibility with pre-DNE MDTs and OSTs, the use of SEQ=0 by MDT0000 has been kept until now, but there is no reason to keep this compatibility for new filesystems. It would be better to assign MDT0000 a "regular" FID SEQ range at startup, so that the SEQ=0 compatibility can eventually be removed. That would ensure OST objects have "proper and unique" FIDs, and would avoid the complexity of mapping between the old SEQ=0 48-bit OID values and the IDIF FIDs.

Older filesystems using SEQ=0 would eventually delete the old objects in this range, and/or could be forced to migrate to new objects to clean up the remaining usage, if necessary.
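For illustration, a minimal self-contained sketch of the SEQ=0 to IDIF remapping described above. It paraphrases the logic of fid_idif_seq()/ostid_to_fid() in the Lustre tree; the struct layout is simplified and the helper names here are illustrative, not the in-tree ones:

#include <stdint.h>

#define FID_SEQ_IDIF 0x100000000ULL

struct fid_sketch { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

/* The OST index occupies bits 16..31 of the SEQ, and the top 16 bits of
 * the 48-bit SEQ=0 object ID spill into bits 0..15, so every OST gets a
 * distinct IDIF sequence range and the resulting FID is globally unique. */
static inline uint64_t idif_seq(uint64_t id, uint32_t ost_idx)
{
	return FID_SEQ_IDIF | ((uint64_t)ost_idx << 16) | ((id >> 32) & 0xffff);
}

static inline void ostid_to_idif(struct fid_sketch *fid, uint64_t id, uint32_t ost_idx)
{
	fid->f_seq = idif_seq(id, ost_idx);
	fid->f_oid = (uint32_t)id;	/* low 32 bits of the object ID */
	fid->f_ver = 0;
}

Assigning MDT0000 a native FID_SEQ_NORMAL range makes this remapping (and the reverse mapping on the OST side) unnecessary.
|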
| Comments |
| Comment by Andreas Dilger [ 18/May/21 ] |
|
This is somewhat related to implementing |
| Comment by Gerrit Updater [ 10/Dec/21 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45822 |
| Comment by Gerrit Updater [ 25/Jan/22 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/46293 |
| Comment by Gerrit Updater [ 19/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/46293/ |
| Comment by Alex Zhuravlev [ 20/Jan/23 ] |
|
with this patch landed 100% of my local tests fail:

== sanity test 312: make sure ZFS adjusts its block size by write pattern ========================================================== 05:05:02 (1674191102)
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.0159779 s, 256 kB/s
1+0 records in
1+0 records out
16384 bytes (16 kB, 16 KiB) copied, 0.0165552 s, 990 kB/s
1+0 records in
1+0 records out
65536 bytes (66 kB, 64 KiB) copied, 0.0225513 s, 2.9 MB/s
1+0 records in
1+0 records out
262144 bytes (262 kB, 256 KiB) copied, 0.0189756 s, 13.8 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0227387 s, 46.1 MB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.0551142 s, 74.3 kB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.029839 s, 137 kB/s
sanity test_312: @@@@@@ FAIL: blksz error, actual 4096, expected: 2 * 1 * 4096
Trace dump:
= ./../tests/test-framework.sh:6549:error()
= sanity.sh:24840:test_312()
= ./../tests/test-framework.sh:6887:run_one()
= ./../tests/test-framework.sh:6937:run_one_logged()
= ./../tests/test-framework.sh:6773:run_test()
= sanity.sh:24863:main()
|
| Comment by Gerrit Updater [ 20/Jan/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49720 |
| Comment by Andreas Dilger [ 20/Jan/23 ] |
|
Alex, the "allow FID_SEQ_NORMAL for MDT0000" patch removes "always_except LU-9054 312", but I'm not sure why, since it doesn't look related to the FID SEQ at all. It should be added back. |
| Comment by Gerrit Updater [ 25/Jan/23 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49754 |
| Comment by Gerrit Updater [ 31/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49720/ |
| Comment by Gerrit Updater [ 21/Mar/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/45822/ |
| Comment by Peter Jones [ 21/Mar/23 ] |
|
Landed for 2.16 |
| Comment by Alex Zhuravlev [ 27/Mar/23 ] |
|
not sure, but I haven't seen the following problem before the last wave of landings, which includes this patch:
LustreError: 343158:0:(osp_internal.h:530:osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x2c0000401:0x2:0x0], fid2:[0x100010000:0x1:0x0] in conf-sanity / 84
...
PID: 343158 TASK: ffff8b7824a605c0 CPU: 1 COMMAND: "tgt_recover_0"
#0 [ffff8b783d273578] panic at ffffffff8f0b9786
/tmp/kernel/kernel/panic.c: 299
#1 [ffff8b783d2735f8] osp_create at ffffffffc1198895 [osp]
/home/lustre/master-mine/lustre/osp/osp_internal.h: 529
#2 [ffff8b783d273680] lod_sub_create at ffffffffc113534e [lod]
/home/lustre/master-mine/lustre/include/dt_object.h: 2333
#3 [ffff8b783d2736f0] lod_striped_create at ffffffffc112076b [lod]
/home/lustre/master-mine/lustre/lod/lod_object.c: 6338
#4 [ffff8b783d273760] lod_xattr_set at ffffffffc1128200 [lod]
/home/lustre/master-mine/lustre/lod/lod_object.c: 5068
#5 [ffff8b783d273810] mdd_create_object at ffffffffc0f76a93 [mdd]
/home/lustre/master-mine/lustre/include/dt_object.h: 2832
#6 [ffff8b783d273940] mdd_create at ffffffffc0f81f98 [mdd]
/home/lustre/master-mine/lustre/mdd/mdd_dir.c: 2827
#7 [ffff8b783d273a40] mdt_reint_open at ffffffffc1038328 [mdt]
/home/lustre/master-mine/lustre/mdt/mdt_open.c: 1574
#8 [ffff8b783d273bf8] mdt_reint_rec at ffffffffc102731f [mdt]
/home/lustre/master-mine/lustre/mdt/mdt_reint.c: 3240
#9 [ffff8b783d273c20] mdt_reint_internal at ffffffffc0ff6ef6 [mdt]
/home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
#10 [ffff8b783d273c58] mdt_intent_open at ffffffffc1002982 [mdt]
/home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4826
#11 [ffff8b783d273c98] mdt_intent_policy at ffffffffc0fffe79 [mdt]
/home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4971
#12 [ffff8b783d273cf8] ldlm_lock_enqueue at ffffffffc08bdbdf [ptlrpc]
/home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c: 1794
#13 [ffff8b783d273d60] ldlm_handle_enqueue0 at ffffffffc08e5046 [ptlrpc]
/home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lockd.c: 1441
#14 [ffff8b783d273dd8] tgt_enqueue at ffffffffc091fd1f [ptlrpc]
/home/lustre/master-mine/lustre/ptlrpc/../../lustre/target/tgt_handler.c: 1446
#15 [ffff8b783d273df0] tgt_request_handle at ffffffffc0926147 [ptlrpc]
/home/lustre/master-mine/lustre/include/lu_target.h: 645
#16 [ffff8b783d273e68] handle_recovery_req at ffffffffc08c8c3c [ptlrpc]
/home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2418
#17 [ffff8b783d273e98] target_recovery_thread at ffffffffc08d1300 [ptlrpc]
/home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2677
#18 [ffff8b783d273f10] kthread at ffffffff8f0d5199
/tmp/kernel/kernel/kthread.c: 340
|
| Comment by Dongyang Li [ 27/Mar/23 ] |
|
Alex, could you share the vmcore-dmesg from the crash? |
| Comment by Andreas Dilger [ 27/Mar/23 ] |
|
Alex, what test was running when the failure was hit? There was some discussion about this issue with Dongyang: basically, replay_barrier is discarding the SEQ update (which is sync on the server and otherwise atomic) because the underlying storage was marked read-only. The open question is whether this LASSERT() should be relaxed to handle write loss (e.g. due to controller cache failure) coinciding with a SEQ rollover. SEQ rollover will definitely happen more often now (once per 32M OST objects vs. once per 4B objects), but if the storage is losing sync writes then a lot of other things will go badly too.
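For reference, a self-contained paraphrase of the assertion in question (the in-tree osp_fid_diff() in lustre/osp/osp_internal.h uses LASSERTF and struct lu_fid; the layout here is simplified):

#include <assert.h>
#include <stdint.h>

struct fid_sketch { uint64_t f_seq; uint32_t f_oid; uint32_t f_ver; };

/* The OID distance between two FIDs is only meaningful within a single
 * sequence, hence the assertion.  In the crash above, fid1 is the next
 * FID to allocate ([0x2c0000401:0x2:0x0], a normal SEQ) while fid2 is
 * the last committed one ([0x100010000:0x1:0x0], still IDIF), because
 * the sequence update was rolled back by the replay barrier. */
static inline int osp_fid_diff(const struct fid_sketch *fid1,
                               const struct fid_sketch *fid2)
{
	assert(fid1->f_seq == fid2->f_seq);	/* the LASSERT that fires */
	return (int)(fid1->f_oid - fid2->f_oid);
}
|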
| Comment by Alex Zhuravlev [ 28/Mar/23 ] |
|
it's always conf-sanity/84; stdout/console are attached. |
| Comment by Dongyang Li [ 28/Mar/23 ] |
|
From the console log:

[ 6684.760425] Lustre: Mounted lustre-client
[ 6686.111594] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x00000002c0000400-0x0000000300000400]:1:ost
[ 6686.127442] Lustre: Skipped 2 previous similar messages
[ 6686.127744] Lustre: cli-lustre-OST0001-super: Allocated super-sequence [0x00000002c0000400-0x0000000300000400]:1:ost]
[ 6686.127895] Lustre: Skipped 1 previous similar message
[ 6691.011345] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 6691.028634] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
[ 6691.211977] Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x100010000 to 0x2c0000401
[ 6692.967973] systemd[1]: mnt-lustre\x2dmds1.mount: Succeeded.
[ 6693.003490] Lustre: Failing over lustre-MDT0000

The sequence update from 0x100010000 to 0x2c0000401 was lost after replay_barrier. |
| Comment by Andreas Dilger [ 28/Mar/23 ] |
|
Can the test be updated to do something simple like "lfs setstripe -i -1 $DIR/$tfile.tmp" to force the sequence update before the replay barrier? |
| Comment by Andreas Dilger [ 28/Mar/23 ] |
|
The test already has something similar:
# make sure new superblock labels are sync'd before disabling writes
sync_all_data
sleep 5
so adding a file create on all OSTs is reasonable. |
| Comment by Alex Zhuravlev [ 29/Mar/23 ] |
|
this time in Maloo: https://testing.whamcloud.com/test_sets/b36df675-87ec-4fb5-9c8b-57add55397ec
|
| Comment by Dongyang Li [ 29/Mar/23 ] |
|
I will update conf-sanity/84. I think there are two things we could do: use force_new_seq for every replay_barrier, which I think is a bit too heavy, or enlarge the default SEQ width of 16384 according to the number of OSTs. Note that we don't really need force_new_seq for conf-sanity/84: the change from the IDIF SEQ to a normal SEQ happens as soon as the OSP connects, so we just need to wait for that before calling replay_barrier.
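A rough, purely illustrative sketch of the second option (not the actual patch; the real tunable and its limits live in the FID/SEQ code, and the names here are made up):

#include <stdint.h>

/* Scale the default per-OSP sequence width of 16384 with the OST count,
 * so SEQ rollovers (and the sync sequence updates they trigger) happen
 * less often on large filesystems; clamp to the allowed maximum. */
static uint64_t scaled_seq_width(uint32_t ost_count, uint64_t max_width)
{
	uint64_t width = 16384ULL * (ost_count ? ost_count : 1);

	return width > max_width ? max_width : width;
}
|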
| Comment by Gerrit Updater [ 30/Mar/23 ] |
|
"Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50477 |
| Comment by Etienne Aujames [ 31/Mar/23 ] |
|
I hit the same issue as bzzz (in test replay-single/70c). I have opened a new ticket for this: LU-16692 |
| Comment by Gerrit Updater [ 11/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49754/ |
| Comment by Gerrit Updater [ 01/May/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50477/ |