
cdt_agent_record_hash_add() ASSERTION( carl0->carl_cat_idx == carl1->carl_cat_idx ) failed

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.17.0

    Description

      When mdt_agent_record_add() adds a new record to the HSM actions llog, cdt_state may still be CDT_INIT. In that state the HSM actions llog may not have been fully processed yet, so cdt_last_cookie may not have been raised to a sufficiently large value. Cookie values can then be reused, which triggers the assertion in cdt_agent_record_hash_add(). LU-13689 attempted to fix this, but there might be a simpler solution.
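The cookie reuse described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the Lustre implementation: `struct action_rec`, `struct coordinator`, `replay_llog()`, `next_cookie()`, `cookie_conflicts()` and `add_conflicts()` are hypothetical names, and the counter is assumed to be a simple incrementing value that start-up raises past every cookie already stored in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one record in the HSM actions llog. */
struct action_rec {
	uint64_t cookie;   /* unique id of the request */
	int      cat_idx;  /* catalog index of the llog record */
};

/* Hypothetical stand-in for the coordinator's cookie counter. */
struct coordinator {
	uint64_t last_cookie;
};

/* Model of the start-up scan: raise last_cookie so that it is at
 * least as large as every cookie already present in the log. */
static void replay_llog(struct coordinator *cdt,
			const struct action_rec *recs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (recs[i].cookie > cdt->last_cookie)
			cdt->last_cookie = recs[i].cookie;
}

/* Hand out the next cookie (assumed to be a plain increment). */
static uint64_t next_cookie(struct coordinator *cdt)
{
	assert(cdt->last_cookie != UINT64_MAX); /* no wrap-around in this model */
	return ++cdt->last_cookie;
}

/* Return 1 if 'cookie' is already used by a record with a different
 * catalog index, i.e. the situation the assertion rejects. */
static int cookie_conflicts(const struct action_rec *recs, size_t n,
			    uint64_t cookie, int cat_idx)
{
	for (size_t i = 0; i < n; i++)
		if (recs[i].cookie == cookie && recs[i].cat_idx != cat_idx)
			return 1;
	return 0;
}

/* Add one record either before (0) or after (1) the log replay and
 * report whether the new cookie collides with an existing record. */
static int add_conflicts(int replay_first)
{
	static const struct action_rec on_disk[] = {
		{ .cookie = 1, .cat_idx = 1 },
		{ .cookie = 2, .cat_idx = 1 },
		{ .cookie = 3, .cat_idx = 2 },
	};
	size_t n = sizeof(on_disk) / sizeof(on_disk[0]);
	struct coordinator cdt = { .last_cookie = 0 };

	if (replay_first)
		replay_llog(&cdt, on_disk, n);

	/* The new record is assumed to land in a later catalog (index 3). */
	return cookie_conflicts(on_disk, n, next_cookie(&cdt), 3);
}
```

In this model, `add_conflicts(0)` corresponds to adding a record while the coordinator is still initialising, and `add_conflicts(1)` to adding it once the existing log has been scanned.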

          Activity


            eaujames Etienne Aujames added a comment -

            This doesn't explain anywhere what "RAoLU" stands for.

            RAoLU: Remove Action on Last Unlink

            adilger Andreas Dilger added a comment -

            hsm: get a valid cookie for RAoLU request

            This doesn't explain anywhere what "RAoLU" stands for.
            pjones Peter Jones added a comment -

            Merged for 2.17


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51850/
            Subject: LU-16235 hsm: get a valid cookie for RAoLU request
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 241cf3c6d08277c4a401ec8bd109274123bf9cdf

            gerrit Gerrit Updater added a comment -

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53660
            Subject: LU-16235 hsm: check CDT state before adding actions llog
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 2217ee7e5b18aee5fe0c9d6135194487afa38db5

            eaujames Etienne Aujames added a comment -

            I have abandoned the revert and merged it into https://review.whamcloud.com/51256 ("LU-16356 hsm: add running ref to the coordinator").
            nangelinas Nikitas Angelinas added a comment - - edited

            eaujames, could you please look at the questions about whether we need the revert patch in https://review.whamcloud.com/#/c/fs/lustre-release/+/52222?
            gerrit Gerrit Updater added a comment - - edited

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52222
            Subject: Revert "LU-16235 hsm: check CDT state before adding actions llog"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: da1da220fe5176331145bc967d0b1a86f8f26433


            bzzz Alex Zhuravlev added a comment -

            This helped:

            diff --git a/lustre/mdt/mdt_coordinator.c b/lustre/mdt/mdt_coordinator.c
            index 439e0cc130..90f2c270df 100644
            --- a/lustre/mdt/mdt_coordinator.c
            +++ b/lustre/mdt/mdt_coordinator.c
            @@ -605,6 +605,7 @@ static int mdt_coordinator(void *data)
             
                    cdt_start_pending_restore(mdt, cdt);
                    set_cdt_state(cdt, CDT_RUNNING);
            +       wake_up(&cdt->cdt_waitq);
             
                    while (1) {
                            int i;
            @@ -1227,6 +1228,7 @@ int mdt_hsm_cdt_stop(struct mdt_device *mdt)
                    int rc;
             
                    ENTRY;
            +       wait_event(cdt->cdt_waitq, cdt->cdt_state != CDT_INIT);
                    /* stop coordinator thread */
                    rc = set_cdt_state(cdt, CDT_STOPPING);
                    if (rc == 0) {
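The diff makes the stop path wait until the coordinator thread has left CDT_INIT. A userspace analogue of that ordering, using a POSIX condition variable in place of the kernel wait queue, might look like the sketch below. It is an illustration only: `struct coordinator`, `coordinator_main()`, `coordinator_stop()` and `start_then_stop()` are hypothetical names, and `pthread_cond_wait()`/`pthread_cond_broadcast()` merely stand in for `wait_event()`/`wake_up()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum cdt_state { CDT_INIT, CDT_RUNNING, CDT_STOPPING };

/* Hypothetical userspace stand-in for the coordinator. */
struct coordinator {
	pthread_mutex_t lock;
	pthread_cond_t  waitq;   /* plays the role of cdt_waitq */
	enum cdt_state  state;
};

/* Coordinator thread: finish start-up, publish CDT_RUNNING and wake
 * anyone waiting for start-up to complete. */
static void *coordinator_main(void *arg)
{
	struct coordinator *cdt = arg;

	pthread_mutex_lock(&cdt->lock);
	cdt->state = CDT_RUNNING;
	pthread_cond_broadcast(&cdt->waitq);   /* analogue of wake_up() */
	pthread_mutex_unlock(&cdt->lock);
	return NULL;
}

/* Stop path: block until start-up has left CDT_INIT, then request stop. */
static void coordinator_stop(struct coordinator *cdt)
{
	pthread_mutex_lock(&cdt->lock);
	while (cdt->state == CDT_INIT)         /* analogue of wait_event() */
		pthread_cond_wait(&cdt->waitq, &cdt->lock);
	cdt->state = CDT_STOPPING;
	pthread_mutex_unlock(&cdt->lock);
}

/* Start a coordinator, stop it right away, and return the final state. */
static enum cdt_state start_then_stop(void)
{
	struct coordinator cdt = { .state = CDT_INIT };
	enum cdt_state final;
	pthread_t tid;
	int rc;

	rc = pthread_mutex_init(&cdt.lock, NULL);
	assert(rc == 0);
	rc = pthread_cond_init(&cdt.waitq, NULL);
	assert(rc == 0);

	rc = pthread_create(&tid, NULL, coordinator_main, &cdt);
	assert(rc == 0);
	coordinator_stop(&cdt);                /* may race with start-up */
	rc = pthread_join(tid, NULL);
	assert(rc == 0);

	final = cdt.state;
	pthread_cond_destroy(&cdt.waitq);
	pthread_mutex_destroy(&cdt.lock);
	return final;
}
```

Because the stop path cannot proceed while the state is still CDT_INIT, the thread's later write of CDT_RUNNING can never overwrite CDT_STOPPING in this model. On older toolchains the sketch may need to be built with `-pthread`.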

            bzzz Alex Zhuravlev added a comment -

            This patch causes a GPF:

            [ 9128.588763] Lustre: DEBUG MARKER: == conf-sanity test 132: hsm_actions processed after failover ========================================================== 05:32:30 (1693546350)
            ...
            [ 9196.856259] Lustre: Found index 0 for lustre-MDT0000, updating log
            [ 9197.308569] systemd[1]: mnt-lustre\x2dmds1.mount: Succeeded.
            [ 9197.541121] Lustre: server umount lustre-MDT0000 complete
            [ 9197.920157] BUG: unable to handle kernel paging request at ffff89866f0e2698
            [ 9197.920511] PGD 76e01067 P4D 76e01067 PUD 176f48067 PMD 176dcf067 PTE 800ffffed0f1d060
            [ 9197.920558] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
            [ 9197.920586] CPU: 1 PID: 481978 Comm: hsm_cdtr Tainted: G        W  O     --------- -  - 4.18.0 #2
            [ 9197.920636] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
            [ 9197.920690] RIP: 0010:mdt_coordinator+0xd7/0x1a10 [mdt]
            [ 9197.920728] Code: ff 01 db 74 31 31 db eb 0c 8b 05 7c 39 58 ff 01 c0 39 c3 73 21 bf 00 ca 9a 3b 83 c3 01 e8 01 81 28 e7 48 89 c7 e8 59 8a 71 e7 <49> 8b 84 24 98 06 00 00 a8 01 74 d3 49 8b 84 24 98 06 00 00 a8 01
            [ 9197.920826] RSP: 0000:ffff898681d17e00 EFLAGS: 00010282
            [ 9197.920853] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
            [ 9197.920891] RDX: 0000000000000000 RSI: ffffffffa8117897 RDI: 0000000000000246
            [ 9197.920936] RBP: ffff8985a1c14740 R08: 0000000000000000 R09: ffff8986b13e98c0
            [ 9197.920974] R10: 0000000000000000 R11: 000000000000004f R12: ffff89866f0e2000
            [ 9197.921012] R13: ffff89868a508380 R14: ffffffffc0e8eb10 R15: ffff898648cbc000
            [ 9197.921051] FS:  0000000000000000(0000) GS:ffff8986b1200000(0000) knlGS:0000000000000000
            [ 9197.921089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 9197.921120] CR2: ffff89866f0e2698 CR3: 0000000139335000 CR4: 00000000000006a0
            [ 9197.921161] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [ 9197.921208] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [ 9197.921246] Call Trace:
            [ 9197.921265]  ? _raw_spin_lock_irqsave+0x46/0x80
            [ 9197.921302]  ? finish_task_switch+0x1f1/0x280
            [ 9197.921350]  ? set_cdt_state+0x40/0x40 [mdt]
            [ 9197.921386]  kthread+0x129/0x140
            [ 9197.921415]  ? kthread_flush_work_fn+0x10/0x10
            [ 9197.921451]  ret_from_fork+0x1f/0x30
            [ 9197.921472] Modules linked in: lustre(O) ofd(O) osp(O) lod(O) ost(O) mdt(O) mdd(O) mgs(O) osd_ldiskfs(O) ldiskfs(O) lquota(O) lfsck(O) obdecho(O) mgc(O) mdc(O) lov(O) osc(O) lmv(O) fid(O) fld(O) ptlrpc(O) obdclass(O) ksocklnd(O) lnet(O) libcfs(O) zfs(O) zunicode(O) zzstd(O) zlua(O) zcommon(O) znvpair(O) zavl(O) icp(O) spl(O) [last unloaded: libcfs]
            [ 9197.921647] CR2: ffff89866f0e2698
            [ 9197.921668] ---[ end trace a7d48f6687796264 ]---

            Here is the backtrace:

            PID: 481978   TASK: ffff89868a508380  CPU: 1    COMMAND: "hsm_cdtr"
             #0 [ffff898681d17c68] panic at ffffffffa80b9786
                /tmp/kernel/kernel/panic.c: 299
             #1 [ffff898681d17d00] no_context at ffffffffa80a9563
                /tmp/kernel/arch/x86/mm/fault.c: 799
             #2 [ffff898681d17d50] page_fault at ffffffffa8600f0e
                /tmp/kernel/arch/x86/entry/entry_64.S: 1220
                [exception RIP: mdt_coordinator+215]
                RIP: ffffffffc0e8ebe7  RSP: ffff898681d17e00  RFLAGS: 00010282
                RAX: 0000000000000000  RBX: 0000000000000001  RCX: 0000000000000000
                RDX: 0000000000000000  RSI: ffffffffa8117897  RDI: 0000000000000246
                RBP: ffff8985a1c14740   R8: 0000000000000000   R9: ffff8986b13e98c0
                R10: 0000000000000000  R11: 000000000000004f  R12: ffff89866f0e2000
                R13: ffff89868a508380  R14: ffffffffc0e8eb10  R15: ffff898648cbc000
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
                /home/lustre/linux-4.18.0-305.25.1.el8_4/./arch/x86/include/asm/bitops.h: 324
             #3 [ffff898681d17f10] kthread at ffffffffa80d5199
                /tmp/kernel/kernel/kthread.c: 340
             #4 [ffff898681d17f50] ret_from_fork at ffffffffa860019f
                /tmp/kernel/arch/x86/entry/entry_64.S: 325

            The GPF was hit at:

                    while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
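The register dump can be cross-checked against the faulting instruction. The bytes at the `<49>` marker decode as a 64-bit load from `0x698(%r12)`, and CR2 is exactly R12 + 0x698, which is consistent with the coordinator thread reading a field of a structure whose page is no longer mapped (under DEBUG_PAGEALLOC this usually points at freed memory; here the log shows the MDT umount completing just before). The sketch below only redoes that arithmetic; `insn_disp32()` and `fault_offset()` are hypothetical helper names, and the link to `mdt->mdt_state` is an inference from the quoted source line, not something the oops states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the oops above. */
static const uint64_t oops_cr2 = 0xffff89866f0e2698ULL; /* faulting address */
static const uint64_t oops_r12 = 0xffff89866f0e2000ULL; /* base register */

/* Bytes at the faulting RIP, starting at the <49> marker in "Code:". */
static const uint8_t fault_insn[] = {
	0x49, 0x8b, 0x84, 0x24, 0x98, 0x06, 0x00, 0x00
};

/* Extract the little-endian 32-bit displacement from an instruction
 * laid out as: REX prefix, opcode, ModRM, SIB, disp32. */
static uint32_t insn_disp32(const uint8_t *insn, size_t len)
{
	assert(insn != NULL && len >= 8);
	return (uint32_t)insn[4] | (uint32_t)insn[5] << 8 |
	       (uint32_t)insn[6] << 16 | (uint32_t)insn[7] << 24;
}

/* Offset of the faulting access from the base pointer held in R12. */
static uint64_t fault_offset(void)
{
	return oops_cr2 - oops_r12;
}
```

Both `fault_offset()` and `insn_disp32(fault_insn, sizeof(fault_insn))` evaluate to 0x698, so the fault address matches the memory operand of the instruction at `mdt_coordinator+0xd7`.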

            People

              nangelinas Nikitas Angelinas
              nangelinas Nikitas Angelinas