[LU-13146] crash in lod_sub_recovery_thread Created: 16/Jan/20 Updated: 16/Jan/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
hit this oops in my boilpot: [195498.474047] BUG: unable to handle kernel paging request at ffff8802fd732fc4 [195498.474922] IP: [<ffffffffa1283d70>] lod_sub_recovery_thread+0x4b0/0xd00 [lod] [195498.474922] PGD 241c067 PUD 33e9f9067 PMD 33e80d067 PTE 80000002fd732060 [195498.474922] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC [195498.474922] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_zfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) pcc_cpufreq zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd virtio_balloon pcspkr virtio_console i2c_piix4 ip_tables rpcsec_gss_krb5 ata_generic drm_kms_helper pata_acpi ttm drm crct10dif_pclmul crct10dif_common ata_piix drm_panel_orientation_quirks crc32c_intel serio_raw virtio_blk i2c_core libata floppy [last unloaded: libcfs] [195498.474922] CPU: 6 PID: 16186 Comm: lod0000_rec0002 Kdump: loaded Tainted: P OE ------------ 3.10.0-7.7-debug #1 [195498.474922] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [195498.474922] task: ffff88009f09c4c0 ti: ffff880323750000 task.ti: ffff880323750000 [195498.474922] RIP: 0010:[<ffffffffa1283d70>] [<ffffffffa1283d70>] lod_sub_recovery_thread+0x4b0/0xd00 [lod] [195498.474922] RSP: 0018:ffff880323753e10 EFLAGS: 00010286 [195498.474922] RAX: ffff8802fd732f00 RBX: 0000000000000000 RCX: 0000000000000000 [195498.474922] RDX: ffff8802c085b0b8 RSI: 000000000000006b RDI: 0000000000000286 [195498.474922] RBP: ffff880323753ea0 R08: 0000000000000010 R09: ffff8802b63aa900 [195498.474922] R10: 0000000000000000 R11: 000000000000000f R12: ffff8802edc34000 [195498.474922] R13: ffff8802dd5b2000 R14: ffff8802b63aa900 R15: 00000000fffffffb [195498.474922] FS: 0000000000000000(0000) GS:ffff88033db80000(0000) knlGS:0000000000000000 [195498.474922] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [195498.474922] CR2: ffff8802fd732fc4 CR3: 00000002b6bb6000 CR4: 00000000001607e0 [195498.474922] Call Trace: [195498.474922] [<ffffffffa12838c0>] ? lod_trans_stop+0x340/0x340 [lod] [195498.474922] [<ffffffff810b8254>] kthread+0xe4/0xf0 [195498.474922] [<ffffffff810b8170>] ? kthread_create_on_node+0x140/0x140 [195498.474922] [<ffffffff817e0ddd>] ret_from_fork_nospec_begin+0x7/0x21 [195498.474922] [<ffffffff810b8170>] ? kthread_create_on_node+0x140/0x140 [195498.474922] Code: f6 05 70 dd 5c ff 04 0f 85 f6 03 00 00 4c 89 f7 e8 86 3d f9 df 48 8b 45 88 48 8b 95 78 ff ff ff c7 40 18 01 00 00 00 48 8b 42 48 <f0> ff 88 c4 00 00 00 48 8b 7a 48 31 c9 ba 01 00 00 00 be 03 00 it's in this code: 0x2d9e is in lod_sub_recovery_thread (/home/green/git/lustre-release/lustre/lod/lod_dev.c:478). 473 EXIT; 474 475 out: 476 OBD_FREE_PTR(lrd); 477 thread->t_flags = SVC_STOPPED; 478 atomic_dec(&lut->lut_tdtd->tdtd_recovery_threads_count); 479 wake_up(&lut->lut_tdtd->tdtd_recovery_threads_waitq); 480 wake_up(&thread->t_ctl_waitq); 481 lu_env_fini(&env); 482 return rc; examining logs it's a use after free: : 00000004:00000010:6.0:1578997461.530712:0:16186:0:(lod_dev.c:476:lod_sub_recover y_thread()) kfreed 'lrd': 32 at ffff8802b63aa900. 00000004:00000010:1.0:1578997461.538400:0:17416:0:(lod_dev.c:869:lod_fini_distribute_txn()) kfreed 'lut->lut_tdtd': 256 at ffff8802fd732f00. the oops time converted from since bootup to wallclock is 1578997461.539770 the call to lod_fini_distribute_txn looks like this:
lod_sub_stop_recovery_threads(env, lod);
lod_fini_distribute_txn(env, lod);
and lod_sub_stop_recovery_threads:
lod_getref(&lod->lod_mdt_descs);
lod_foreach_mdt(lod, mdt) {
thread = mdt->ltd_recovery_thread;
if (thread && thread->t_flags & SVC_RUNNING) {
thread->t_flags = SVC_STOPPING;
wake_up(&thread->t_ctl_waitq);
wait_event(thread->t_ctl_waitq,
thread->t_flags & SVC_STOPPED);
OBD_FREE_PTR(mdt->ltd_recovery_thread);
mdt->ltd_recovery_thread = NULL;
}
}
lod_putref(lod, &lod->lod_mdt_descs);
there's no mdt->ltd_recovery_thread freeing in the logs from lod_sub_stop_recovery_threads so it appears by the time we got to the flags check the lod_sub_recovery_thread has already set it to stopped (it appears that the allocation would also be leaked in this case as I don't see anything else that would free it) hm, the fix is pobably going to be annoying if this code could be entered more than once. |