[LU-16235] cdt_agent_record_hash_add() ASSERTION( carl0->carl_cat_idx == carl1->carl_cat_idx ) failed Created: 12/Oct/22 Updated: 12/Jan/24 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Nikitas Angelinas | Assignee: | Nikitas Angelinas |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
When mdt_agent_record_add() adds new HSM actions llog records, cdt_state may still be CDT_INIT, so the HSM actions llog may not yet have been fully processed to raise cdt_last_cookie to an appropriately large value. Cookie values can then be reused, which triggers the assertions in cdt_agent_record_hash_add(). LU-13689 attempted to fix this, but there might be a simpler solution. |
| Comments |
| Comment by Gerrit Updater [ 12/Oct/22 ] |
|
"Nikitas Angelinas <nikitas.angelinas@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48842 |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51850 |
| Comment by Gerrit Updater [ 31/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48842/ |
| Comment by Alex Zhuravlev [ 01/Sep/23 ] |
|
this patch causes a GPF:
[ 9128.588763] Lustre: DEBUG MARKER: == conf-sanity test 132: hsm_actions processed after failover ========================================================== 05:32:30 (1693546350)
...
[ 9196.856259] Lustre: Found index 0 for lustre-MDT0000, updating log
[ 9197.308569] systemd[1]: mnt-lustre\x2dmds1.mount: Succeeded.
[ 9197.541121] Lustre: server umount lustre-MDT0000 complete
[ 9197.920157] BUG: unable to handle kernel paging request at ffff89866f0e2698
[ 9197.920511] PGD 76e01067 P4D 76e01067 PUD 176f48067 PMD 176dcf067 PTE 800ffffed0f1d060
[ 9197.920558] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 9197.920586] CPU: 1 PID: 481978 Comm: hsm_cdtr Tainted: G W O --------- - - 4.18.0 #2
[ 9197.920636] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9197.920690] RIP: 0010:mdt_coordinator+0xd7/0x1a10 [mdt]
[ 9197.920728] Code: ff 01 db 74 31 31 db eb 0c 8b 05 7c 39 58 ff 01 c0 39 c3 73 21 bf 00 ca 9a 3b 83 c3 01 e8 01 81 28 e7 48 89 c7 e8 59 8a 71 e7 <49> 8b 84 24 98 06 00 00 a8 01 74 d3 49 8b 84 24 98 06 00 00 a8 01
[ 9197.920826] RSP: 0000:ffff898681d17e00 EFLAGS: 00010282
[ 9197.920853] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 9197.920891] RDX: 0000000000000000 RSI: ffffffffa8117897 RDI: 0000000000000246
[ 9197.920936] RBP: ffff8985a1c14740 R08: 0000000000000000 R09: ffff8986b13e98c0
[ 9197.920974] R10: 0000000000000000 R11: 000000000000004f R12: ffff89866f0e2000
[ 9197.921012] R13: ffff89868a508380 R14: ffffffffc0e8eb10 R15: ffff898648cbc000
[ 9197.921051] FS: 0000000000000000(0000) GS:ffff8986b1200000(0000) knlGS:0000000000000000
[ 9197.921089] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9197.921120] CR2: ffff89866f0e2698 CR3: 0000000139335000 CR4: 00000000000006a0
[ 9197.921161] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9197.921208] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9197.921246] Call Trace:
[ 9197.921265] ? _raw_spin_lock_irqsave+0x46/0x80
[ 9197.921302] ? finish_task_switch+0x1f1/0x280
[ 9197.921350] ? set_cdt_state+0x40/0x40 [mdt]
[ 9197.921386] kthread+0x129/0x140
[ 9197.921415] ? kthread_flush_work_fn+0x10/0x10
[ 9197.921451] ret_from_fork+0x1f/0x30
[ 9197.921472] Modules linked in: lustre(O) ofd(O) osp(O) lod(O) ost(O) mdt(O) mdd(O) mgs(O) osd_ldiskfs(O) ldiskfs(O) lquota(O) lfsck(O) obdecho(O) mgc(O) mdc(O) lov(O) osc(O) lmv(O) fid(O) fld(O) ptlrpc(O) obdclass(O) ksocklnd(O) lnet(O) libcfs(O) zfs(O) zunicode(O) zzstd(O) zlua(O) zcommon(O) znvpair(O) zavl(O) icp(O) spl(O) [last unloaded: libcfs]
[ 9197.921647] CR2: ffff89866f0e2698
[ 9197.921668] ---[ end trace a7d48f6687796264 ]---
here is bt:
PID: 481978 TASK: ffff89868a508380 CPU: 1 COMMAND: "hsm_cdtr"
#0 [ffff898681d17c68] panic at ffffffffa80b9786
/tmp/kernel/kernel/panic.c: 299
#1 [ffff898681d17d00] no_context at ffffffffa80a9563
/tmp/kernel/arch/x86/mm/fault.c: 799
#2 [ffff898681d17d50] page_fault at ffffffffa8600f0e
/tmp/kernel/arch/x86/entry/entry_64.S: 1220
[exception RIP: mdt_coordinator+215]
RIP: ffffffffc0e8ebe7 RSP: ffff898681d17e00 RFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffa8117897 RDI: 0000000000000246
RBP: ffff8985a1c14740 R8: 0000000000000000 R9: ffff8986b13e98c0
R10: 0000000000000000 R11: 000000000000004f R12: ffff89866f0e2000
R13: ffff89868a508380 R14: ffffffffc0e8eb10 R15: ffff898648cbc000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
/home/lustre/linux-4.18.0-305.25.1.el8_4/./arch/x86/include/asm/bitops.h: 324
#3 [ffff898681d17f10] kthread at ffffffffa80d5199
/tmp/kernel/kernel/kthread.c: 340
#4 [ffff898681d17f50] ret_from_fork at ffffffffa860019f
/tmp/kernel/arch/x86/entry/entry_64.S: 325
The GPF was hit at:
while (!test_bit(MDT_FL_CFGLOG, &mdt->mdt_state) && i < obd_timeout) {
|
| Comment by Alex Zhuravlev [ 01/Sep/23 ] |
|
this helped:
diff --git a/lustre/mdt/mdt_coordinator.c b/lustre/mdt/mdt_coordinator.c
index 439e0cc130..90f2c270df 100644
--- a/lustre/mdt/mdt_coordinator.c
+++ b/lustre/mdt/mdt_coordinator.c
@@ -605,6 +605,7 @@ static int mdt_coordinator(void *data)
 	cdt_start_pending_restore(mdt, cdt);
 	set_cdt_state(cdt, CDT_RUNNING);
+	wake_up(&cdt->cdt_waitq);
 	while (1) {
 		int i;
@@ -1227,6 +1228,7 @@ int mdt_hsm_cdt_stop(struct mdt_device *mdt)
 	int rc;
 	ENTRY;
+	wait_event(cdt->cdt_waitq, cdt->cdt_state != CDT_INIT);
 	/* stop coordinator thread */
 	rc = set_cdt_state(cdt, CDT_STOPPING);
 	if (rc == 0) {
|
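The idea behind the diff above, as far as it can be read from the two hunks, is that the stop path waits until the coordinator has left CDT_INIT before requesting the stop. Below is a rough userspace analogue using a pthread mutex and condition variable in place of the kernel's `wait_event()`/`wake_up()`. All `sk_*` names are hypothetical, and the sketch makes no claim about how the real coordinator handles locking or teardown.

```c
#include <pthread.h>

/* Hypothetical, simplified coordinator states. */
enum sk_state { SK_INIT, SK_RUNNING, SK_STOPPING };

struct sk_cdt {
	pthread_mutex_t lock;
	pthread_cond_t waitq;	/* plays the role of cdt_waitq */
	enum sk_state state;
};

#define SK_CDT_INITIALIZER \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, SK_INIT }

/* Change state and wake every waiter, like set_cdt_state() + wake_up(). */
static void sk_set_state(struct sk_cdt *cdt, enum sk_state state)
{
	pthread_mutex_lock(&cdt->lock);
	cdt->state = state;
	pthread_cond_broadcast(&cdt->waitq);
	pthread_mutex_unlock(&cdt->lock);
}

/* Coordinator thread body: finish initialisation, then announce RUNNING. */
static void *sk_coordinator(void *data)
{
	struct sk_cdt *cdt = data;

	/* ... llog replay would happen here in the real code ... */
	sk_set_state(cdt, SK_RUNNING);
	return NULL;
}

/*
 * Stop path: block until the coordinator has left INIT, then request
 * the stop. Returns the state observed just before the transition.
 */
static enum sk_state sk_stop(struct sk_cdt *cdt)
{
	enum sk_state prev;

	pthread_mutex_lock(&cdt->lock);
	while (cdt->state == SK_INIT)
		pthread_cond_wait(&cdt->waitq, &cdt->lock);
	prev = cdt->state;
	cdt->state = SK_STOPPING;
	pthread_mutex_unlock(&cdt->lock);
	return prev;
}
```

With this ordering, a stop request issued right after the coordinator thread is created cannot overtake the thread's own initialisation: `sk_stop()` only proceeds once `sk_coordinator()` has published `SK_RUNNING`.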
| Comment by Gerrit Updater [ 01/Sep/23 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52222 |
| Comment by Nikitas Angelinas [ 17/Oct/23 ] |
|
eaujames, could you please take a look at the questions about whether we need the revert patch in https://review.whamcloud.com/#/c/fs/lustre-release/+/52222? |
| Comment by Etienne Aujames [ 18/Oct/23 ] |
|
I have abandoned the revert and merged it into https://review.whamcloud.com/51256 ("LU-16356 hsm: add running ref to the coordinator"). |
| Comment by Gerrit Updater [ 12/Jan/24 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53660 |