
[LU-14692] deprecate use of OST FID SEQ 0 for MDT0000

Details


    Description

      Since Lustre 2.4.0 and DNE1, it has been possible to create OST objects using a different FID SEQ range for each MDT, to avoid contention during MDT object precreation.

      Objects that are created by MDT0000 are put into FID SEQ 0 (O/0/d*) on all OSTs and have a filename that is the decimal FID OID in ASCII. However, SEQ=0 objects are remapped to IDIF FID SEQ (0x100000000 | (ost_idx << 16)) so that they are unique across all OSTs.

      Objects that are created by other MDTs (or MDT0000 after 2^48 objects are created in SEQ 0) use a unique SEQ in the FID_SEQ_NORMAL range (> 0x200000400), and use a filename that is the hexadecimal FID OID in ASCII.

      For compatibility with pre-DNE MDTs and OSTs, the use of SEQ=0 by MDT0000 has been kept until now, but there is no longer a reason to keep this compatibility for new filesystems. It would be better to assign MDT0000 a "regular" FID SEQ range at startup, so that the SEQ=0 compatibility can eventually be removed. That would ensure OST objects have "proper and unique" FIDs, and avoid the complexity of mapping between the old SEQ=0 48-bit OID values and the IDIF FIDs.

      Older filesystems using SEQ=0 would eventually delete their old objects in this range, and if necessary could be forced to migrate to new objects to clean up any remaining usage.
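
      For reference, the SEQ an MDT is currently handing out to each OST can be seen from the OSP precreate state on the MDS; a minimal sketch, assuming the osp.*.prealloc_last_seq parameter name (from memory), with illustrative values:

      # on the MDS: show the SEQ currently used to precreate objects on each OST
      # an IDIF/compat value looks like 0x100010000 (SEQ 0 remapped per OST index);
      # a "regular" SEQ assigned at startup looks like 0x2c0000401
      lctl get_param osp.*.prealloc_last_seq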

      Attachments

        1. serial.txt
          778 kB
        2. stdout.txt
          484 kB

          Activity


            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50477
            Subject: LU-14692 tests: wait for osp in conf-sanity/84
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2816476614a92ba675418c7434001d946c8ec81e

            dongyang Dongyang Li added a comment -

            I will update conf-sanity/84.
            Alex, the new crash is a different issue, mostly because of the landing of https://review.whamcloud.com/c/fs/lustre-release/+/38424/
            That patch introduces a SEQ width of 16384 in Maloo, so the SEQ change will happen more frequently and at random points.
            To make sure a SEQ change doesn't happen after replay_barrier, the patch from 38424 actually has force_new_seq, which switches to a new SEQ before test suites like replay-single start. From the log it did change the SEQ,
            but I think a SEQ width of 16384 is not enough for the whole of replay-single: given we have only 2 OSTs, more objects will be created on each OST.

            I think there are 2 things we could do: use force_new_seq for every replay_barrier, which I think is a bit too heavy, or enlarge the default 16384 SEQ width according to the number of OSTs.

            Note we don't really need force_new_seq for conf-sanity/84: the change from the IDIF SEQ to a normal SEQ happens as soon as the OSP connects, so we just need to wait for that before using replay_barrier.
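
            A minimal sketch of that wait (not the landed patch), assuming the OSP exposes its current precreate SEQ as osp.*.prealloc_last_seq (parameter name from memory) and using the usual test-framework.sh helpers; the 60s timeout is arbitrary:

            # wait until every MDT0000 OSP has moved off the IDIF SEQ
            # (values starting with 0x1) before taking the replay barrier
            wait_update_facet mds1 \
                    "$LCTL get_param -n osp.*.prealloc_last_seq | grep -c '^0x1'" \
                    "0" 60 || error "some OSP is still in the IDIF SEQ range"
            replay_barrier mds1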


            bzzz Alex Zhuravlev added a comment -

            this time in Maloo: https://testing.whamcloud.com/test_sets/b36df675-87ec-4fb5-9c8b-57add55397ec

            [11176.501594] LustreError: 567675:0:(osp_internal.h:538:osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x280000bd1:0x2c6d:0x0], fid2:[0x280000bd0:0x2c6c:0x0]


            adilger Andreas Dilger added a comment -

            The test already has something similar:

            # make sure new superblock labels are sync'd before disabling writes
            sync_all_data
            sleep 5
            

            so adding a file create on all OSTs is reasonable.
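
            One way to do that create before the barrier; a sketch only, where the -c -1 stripe count and temporary file name are my own choices rather than the landed change:

            # create a fully striped file so every OST instantiates an object
            # from its current SEQ, then clean it up before the replay barrier
            $LFS setstripe -c -1 $DIR/$tfile.tmp
            rm -f $DIR/$tfile.tmp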


            adilger Andreas Dilger added a comment -

            Can the test be updated to do something simple like "lfs setstripe -i -1 $DIR/$tfile.tmp" to force the sequence update before the replay barrier?

            dongyang Dongyang Li added a comment -

            From the console log:

            [ 6684.760425] Lustre: Mounted lustre-client
            [ 6686.111594] Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x00000002c0000400-0x0000000300000400]:1:ost
            [ 6686.127442] Lustre: Skipped 2 previous similar messages
            [ 6686.127744] Lustre: cli-lustre-OST0001-super: Allocated super-sequence [0x00000002c0000400-0x0000000300000400]:1:ost]
            [ 6686.127895] Lustre: Skipped 1 previous similar message
            [ 6691.011345] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
            [ 6691.028634] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
            [ 6691.211977] Lustre: lustre-OST0001-osc-MDT0000: update sequence from 0x100010000 to 0x2c0000401
            [ 6692.967973] systemd[1]: mnt-lustre\x2dmds1.mount: Succeeded.
            [ 6693.003490] Lustre: Failing over lustre-MDT0000
            

            The sequence update from 0x100010000 to 0x2c0000401 was lost after replay_barrier.


            bzzz Alex Zhuravlev added a comment -

            it's always conf-sanity/84, stdout/console are attached.


            adilger Andreas Dilger added a comment -

            Alex, what test was running when the failure was hit? There was some discussion about this issue with Dongyang, basically that replay_barrier is discarding the SEQ update (which is sync on the server and otherwise atomic) because the underlying storage was marked read-only.

            The open question was whether this LASSERT() should be relaxed to handle the case of write loss (e.g. due to controller cache failure) at the same time as a SEQ rollover. The SEQ rollover will definitely happen more often now (once per 32M OST objects vs. once per 4B objects), but if the storage is losing sync writes then there are a lot of things that will go badly.

            dongyang Dongyang Li added a comment -

            Alex, could you share the vmcore-dmesg from the crash?
            I wonder if the change to the "normal SEQ" happened after replay_barrier; when the MDT starts again for recovery, it will see the old IDIF SEQ from disk.


            bzzz Alex Zhuravlev added a comment -

            not sure, but I haven't seen the following problem before the last wave of landings which include LU-14692:

            LustreError: 343158:0:(osp_internal.h:530:osp_fid_diff()) ASSERTION( fid_seq(fid1) == fid_seq(fid2) ) failed: fid1:[0x2c0000401:0x2:0x0], fid2:[0x100010000:0x1:0x0] in conf-sanity / 84
            ...
            PID: 343158  TASK: ffff8b7824a605c0  CPU: 1   COMMAND: "tgt_recover_0"
             #0 [ffff8b783d273578] panic at ffffffff8f0b9786
                /tmp/kernel/kernel/panic.c: 299
             #1 [ffff8b783d2735f8] osp_create at ffffffffc1198895 [osp]
                /home/lustre/master-mine/lustre/osp/osp_internal.h: 529
             #2 [ffff8b783d273680] lod_sub_create at ffffffffc113534e [lod]
                /home/lustre/master-mine/lustre/include/dt_object.h: 2333
             #3 [ffff8b783d2736f0] lod_striped_create at ffffffffc112076b [lod]
                /home/lustre/master-mine/lustre/lod/lod_object.c: 6338
             #4 [ffff8b783d273760] lod_xattr_set at ffffffffc1128200 [lod]
                /home/lustre/master-mine/lustre/lod/lod_object.c: 5068
             #5 [ffff8b783d273810] mdd_create_object at ffffffffc0f76a93 [mdd]
                /home/lustre/master-mine/lustre/include/dt_object.h: 2832
             #6 [ffff8b783d273940] mdd_create at ffffffffc0f81f98 [mdd]
                /home/lustre/master-mine/lustre/mdd/mdd_dir.c: 2827
             #7 [ffff8b783d273a40] mdt_reint_open at ffffffffc1038328 [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_open.c: 1574
             #8 [ffff8b783d273bf8] mdt_reint_rec at ffffffffc102731f [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_reint.c: 3240
             #9 [ffff8b783d273c20] mdt_reint_internal at ffffffffc0ff6ef6 [mdt]
                /home/lustre/master-mine/libcfs/include/libcfs/libcfs_debug.h: 155
            #10 [ffff8b783d273c58] mdt_intent_open at ffffffffc1002982 [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4826
            #11 [ffff8b783d273c98] mdt_intent_policy at ffffffffc0fffe79 [mdt]
                /home/lustre/master-mine/lustre/mdt/mdt_handler.c: 4971
            #12 [ffff8b783d273cf8] ldlm_lock_enqueue at ffffffffc08bdbdf [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c: 1794
            #13 [ffff8b783d273d60] ldlm_handle_enqueue0 at ffffffffc08e5046 [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lockd.c: 1441
            #14 [ffff8b783d273dd8] tgt_enqueue at ffffffffc091fd1f [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/target/tgt_handler.c: 1446
            #15 [ffff8b783d273df0] tgt_request_handle at ffffffffc0926147 [ptlrpc]
                /home/lustre/master-mine/lustre/include/lu_target.h: 645
            #16 [ffff8b783d273e68] handle_recovery_req at ffffffffc08c8c3c [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2418
            #17 [ffff8b783d273e98] target_recovery_thread at ffffffffc08d1300 [ptlrpc]
                /home/lustre/master-mine/lustre/ptlrpc/../../lustre/ldlm/ldlm_lib.c: 2677
            #18 [ffff8b783d273f10] kthread at ffffffff8f0d5199
                /tmp/kernel/kernel/kthread.c: 340
            
            pjones Peter Jones added a comment -

            Landed for 2.16


            People

              dongyang Dongyang Li
              adilger Andreas Dilger
              Votes: 0
              Watchers: 8
