[LU-16235] cdt_agent_record_hash_add() ASSERTION( carl0->carl_cat_idx == carl1->carl_cat_idx ) failed Created: 12/Oct/22  Updated: 12/Jan/24

Status: In Progress
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Nikitas Angelinas Assignee: Nikitas Angelinas
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-13689 Replace cdt_state_lock with cdt_llog_... Open
Related
is related to LU-16356 high contention on cdt_request_lock c... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When mdt_agent_record_add() adds a new record to the HSM actions llog, cdt_state may still be CDT_INIT. In that state the HSM actions llog may not have been fully processed, so cdt_last_cookie may not yet have been set to an appropriately large value. Cookie values can then be reused, which triggers the assertions in cdt_agent_record_hash_add(). LU-13689 attempted to fix this, but there might be a simpler solution.
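
For illustration only, the race can be modelled in a few lines of userspace C. This is a minimal sketch under stated assumptions, not the Lustre code: the `sk_` types and functions are hypothetical stand-ins, and returning -EAGAIN while the state is still "init" is an assumed behaviour, not necessarily what the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; these are not the Lustre definitions. */
enum sk_cdt_state { SK_CDT_INIT, SK_CDT_RUNNING };

struct sk_cdt {
	enum sk_cdt_state state;
	uint64_t last_cookie;	/* highest cookie seen so far */
};

/* Startup replay of one existing record: remember the highest cookie. */
static void sk_replay_record(struct sk_cdt *cdt, uint64_t cookie)
{
	assert(cdt->state == SK_CDT_INIT);
	if (cookie > cdt->last_cookie)
		cdt->last_cookie = cookie;
}

/*
 * Add a new record. While replay is still in progress last_cookie may be
 * too small, so refuse instead of handing out a cookie that could collide
 * with a record that has not been replayed yet.
 * The -EAGAIN return value is an assumption made for this sketch.
 */
static int sk_record_add(struct sk_cdt *cdt, uint64_t *cookie)
{
	if (cdt->state == SK_CDT_INIT)
		return -EAGAIN;
	*cookie = ++cdt->last_cookie;
	return 0;
}
```

Without the state check, a call made after only part of the log had been replayed would hand out a cookie that a later replayed record already uses.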



 Comments   
Comment by Gerrit Updater [ 12/Oct/22 ]

"Nikitas Angelinas <nikitas.angelinas@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48842
Subject: LU-16235 hsm: check cdt_state before adding actions llog record
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 10fd63dfe8cd712abc848a72719b44c8759f85e5

Comment by Gerrit Updater [ 02/Aug/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51850
Subject: LU-16235 hsm: get a valid cookie for RAoLU request
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 923f1081659826a10d3a10c43ed60453e934954c

Comment by Gerrit Updater [ 31/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48842/
Subject: LU-16235 hsm: check CDT state before adding actions llog
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fe5706e0c19f96e4f821790004f05ab265002e9d

Comment by Alex Zhuravlev [ 01/Sep/23 ]

This patch causes a page fault (kernel oops):

[ 9128.588763] Lustre: DEBUG MARKER: == conf-sanity test 132: hsm_actions processed after failover ========================================================== 05:32:30 (1693546350)
...
[ 9196.856259] Lustre: Found index 0 for lustre-MDT0000, updating log
[ 9197.308569] systemd[1]: mnt-lustre\x2dmds1.mount: Succeeded.
[ 9197.541121] Lustre: server umount lustre-MDT0000 complete
[ 9197.920157] BUG: unable to handle kernel paging request at ffff89866f0e2698
[ 9197.920511] PGD 76e01067 P4D 76e01067 PUD 176f48067 PMD 176dcf067 PTE 800ffffed0f1d060
[ 9197.920558] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 9197.920586] CPU: 1 PID: 481978 Comm: hsm_cdtr Tainted: G        W  O     --------- -  - 4.18.0 #2
[ 9197.920636] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9197.920690] RIP: 0010:mdt_coordinator+0xd7/0x1a10 [mdt]
[ 9197.920728] Code: ff 01 db 74 31 31 db eb 0c 8b 05 7c 39 58 ff 01 c0 39 c3 73 21 bf 00 ca 9a 3b 83 c3 01 e8 01 81 28 e7 48 89 c7 e8 59 8a 71 e7 <49> 8b 84 24 98 06 00 00 a8 01 74 d3 49 8b 84 24 98 06 00 00 a8 01
[ 9197.920826] RSP: 0000:ffff898681d17e00 EFLAGS: 00010282
[ 9197.920853] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 9197.920891] RDX: 0000000000000000 RSI: ffffffffa8117897 RDI: 0000000000000246
[ 9197.920936] RBP: ffff8985a1c14740 R08: 0000000000000000 R09: ffff8986b13e98c0
[ 9197.920974] R10: 0000000000000000 R11: 000000000000004f R12: ffff89866f0e2000
[ 9197.921012] R13: ffff89868a508380 R14: ffffffffc0e8eb10 R15: ffff898648cbc000
[ 9197.921051] FS:  0000000000000000(0000) GS:ffff8986b1200000(0000) knlGS:0000000000000000
[ 9197.921089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9197.921120] CR2: ffff89866f0e2698 CR3: 0000000139335000 CR4: 00000000000006a0
[ 9197.921161] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9197.921208] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9197.921246] Call Trace:
[ 9197.921265]  ? _raw_spin_lock_irqsave+0x46/0x80
[ 9197.921302]  ? finish_task_switch+0x1f1/0x280
[ 9197.921350]  ? set_cdt_state+0x40/0x40 [mdt]
[ 9197.921386]  kthread+0x129/0x140
[ 9197.921415]  ? kthread_flush_work_fn+0x10/0x10
[ 9197.921451]  ret_from_fork+0x1f/0x30
[ 9197.921472] Modules linked in: lustre(O) ofd(O) osp(O) lod(O) ost(O) mdt(O) mdd(O) mgs(O) osd_ldiskfs(O) ldiskfs(O) lquota(O) lfsck(O) obdecho(O) mgc(O) mdc(O) lov(O) osc(O) lmv(O) fid(O) fld(O) ptlrpc(O) obdclass(O) ksocklnd(O) lnet(O) libcfs(O) zfs(O) zunicode(O) zzstd(O) zlua(O) zcommon(O) znvpair(O) zavl(O) icp(O) spl(O) [last unloaded: libcfs]
[ 9197.921647] CR2: ffff89866f0e2698
[ 9197.921668] ---[ end trace a7d48f6687796264 ]---

Here is the backtrace:

PID: 481978   TASK: ffff89868a508380  CPU: 1    COMMAND: "hsm_cdtr"
 #0 [ffff898681d17c68] panic at ffffffffa80b9786
    /tmp/kernel/kernel/panic.c: 299
 #1 [ffff898681d17d00] no_context at ffffffffa80a9563
    /tmp/kernel/arch/x86/mm/fault.c: 799
 #2 [ffff898681d17d50] page_fault at ffffffffa8600f0e
    /tmp/kernel/arch/x86/entry/entry_64.S: 1220
    [exception RIP: mdt_coordinator+215]
    RIP: ffffffffc0e8ebe7  RSP: ffff898681d17e00  RFLAGS: 00010282
    RAX: 0000000000000000  RBX: 0000000000000001  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffffffffa8117897  RDI: 0000000000000246
    RBP: ffff8985a1c14740   R8: 0000000000000000   R9: ffff8986b13e98c0
    R10: 0000000000000000  R11: 000000000000004f  R12: ffff89866f0e2000
    R13: ffff89868a508380  R14: ffffffffc0e8eb10  R15: ffff898648cbc000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
    /home/lustre/linux-4.18.0-305.25.1.el8_4/./arch/x86/include/asm/bitops.h: 324
 #3 [ffff898681d17f10] kthread at ffffffffa80d5199
    /tmp/kernel/kernel/kthread.c: 340
 #4 [ffff898681d17f50] ret_from_fork at ffffffffa860019f
    /tmp/kernel/arch/x86/entry/entry_64.S: 325

The fault was hit at:

        while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {

Comment by Alex Zhuravlev [ 01/Sep/23 ]

This helped:

diff --git a/lustre/mdt/mdt_coordinator.c b/lustre/mdt/mdt_coordinator.c
index 439e0cc130..90f2c270df 100644
--- a/lustre/mdt/mdt_coordinator.c
+++ b/lustre/mdt/mdt_coordinator.c
@@ -605,6 +605,7 @@ static int mdt_coordinator(void *data)
 
        cdt_start_pending_restore(mdt, cdt);
        set_cdt_state(cdt, CDT_RUNNING);
+       wake_up(&cdt->cdt_waitq);
 
        while (1) {
                int i;
@@ -1227,6 +1228,7 @@ int mdt_hsm_cdt_stop(struct mdt_device *mdt)
        int rc;
 
        ENTRY;
+       wait_event(cdt->cdt_waitq, cdt->cdt_state != CDT_INIT);
        /* stop coordinator thread */
        rc = set_cdt_state(cdt, CDT_STOPPING);
        if (rc == 0) {
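
For illustration only, the ordering that the diff above enforces (the stop path waits until the coordinator has left CDT_INIT, and the coordinator wakes waiters once it is running) can be sketched in userspace C with a mutex and condition variable. This is an analogue under stated assumptions, not kernel code: the `sk_` names are hypothetical, and pthread primitives stand in for the kernel's wait_event() and wake_up().

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the coordinator states. */
enum sk_state { SK_INIT, SK_RUNNING, SK_STOPPING };

struct sk_coordinator {
	pthread_mutex_t lock;
	pthread_cond_t  waitq;
	enum sk_state   state;
};

static void sk_init(struct sk_coordinator *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->waitq, NULL);
	c->state = SK_INIT;
}

/* Change state and wake every waiter (analogue of set state + wake_up). */
static void sk_set_state(struct sk_coordinator *c, enum sk_state s)
{
	pthread_mutex_lock(&c->lock);
	c->state = s;
	pthread_cond_broadcast(&c->waitq);
	pthread_mutex_unlock(&c->lock);
}

/*
 * Block until startup has finished, then request the stop.
 * Returns the state that was observed just before stopping.
 */
static enum sk_state sk_stop(struct sk_coordinator *c)
{
	enum sk_state prev;

	pthread_mutex_lock(&c->lock);
	while (c->state == SK_INIT)	/* analogue of wait_event() */
		pthread_cond_wait(&c->waitq, &c->lock);
	prev = c->state;
	assert(prev != SK_INIT);
	c->state = SK_STOPPING;
	pthread_mutex_unlock(&c->lock);
	return prev;
}
```

In this sketch a stop request can never overtake a coordinator that is still initialising, so teardown cannot free state that the starting thread is about to read.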

Comment by Gerrit Updater [ 01/Sep/23 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52222
Subject: Revert "LU-16235 hsm: check CDT state before adding actions llog"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: da1da220fe5176331145bc967d0b1a86f8f26433

Comment by Nikitas Angelinas [ 17/Oct/23 ]

eaujames, could you please take a look at the questions in https://review.whamcloud.com/#/c/fs/lustre-release/+/52222 about whether we still need the revert patch?

Comment by Etienne Aujames [ 18/Oct/23 ]

I have abandoned the revert and merged it into https://review.whamcloud.com/51256 ("LU-16356 hsm: add running ref to the coordinator").

Comment by Gerrit Updater [ 12/Jan/24 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53660
Subject: LU-16235 hsm: check CDT state before adding actions llog
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 2217ee7e5b18aee5fe0c9d6135194487afa38db5

Generated at Sat Feb 10 03:25:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.