Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
Description
I just recorded the following crash in my testing:
[127672.042005] Lustre: DEBUG MARKER: == recovery-small test 132: long punch =============================================================== 11:06:23 (1469977583)
[127672.479285] Lustre: Mounted lustre-client
[127672.479815] Lustre: Skipped 1 previous similar message
[127745.047388] LNet: Service thread pid 22124 was inactive for 72.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[127745.064157] Pid: 22124, comm: ll_ost_io01_006
[127745.064689] Call Trace:
[127745.065676]  [<ffffffff81704339>] schedule+0x29/0x70
[127745.066206]  [<ffffffff81700972>] schedule_timeout+0x162/0x2a0
[127745.066737]  [<ffffffff810879f0>] ? process_timeout+0x0/0x10
[127745.067287]  [<ffffffffa02401c3>] __cfs_fail_timeout_set+0xe3/0x170 [libcfs]
[127745.067844]  [<ffffffffa0d46c0e>] ofd_punch_hdl+0x11e/0xa10 [ofd]
[127745.080775]  [<ffffffffa05f7325>] tgt_request_handle+0x925/0x1330 [ptlrpc]
[127745.081379]  [<ffffffffa05a4921>] ptlrpc_server_handle_request+0x231/0xac0 [ptlrpc]
[127745.082402]  [<ffffffffa05a24d8>] ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
[127745.082976]  [<ffffffffa05a8738>] ptlrpc_main+0xa58/0x1dd0 [ptlrpc]
[127745.084901]  [<ffffffffa05a7ce0>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
[127745.085451]  [<ffffffff810a2eda>] kthread+0xea/0xf0
[127745.085952]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[127745.086495]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[127745.087007]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[127745.087522]
[127745.087965] LustreError: dumping log to /tmp/lustre-log.1469977656.22124
[127792.994255] LustreError: 22124:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 236 awake
[127792.996636] LNet: Service thread pid 22124 completed after 120.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[127794.212798] LustreError: 6493:0:(obd_config.c:1262:class_process_config()) no device for: lustre-OST0000-osc-ffff8800655a9800
[127794.214028] BUG: unable to handle kernel paging request at ffff88000c218000
[127794.214770] IP: [<ffffffff8137fa54>] strnlen+0x24/0x40
[127794.215885] PGD 2e75067 PUD 2e76067 PMD bcdb9067 PTE 800000000c218060
[127794.216455] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[127794.216970] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) loop mbcache jbd2 sha512_generic crypto_null rpcsec_gss_krb5 syscopyarea sysfillrect sysimgblt ttm ata_generic pata_acpi drm_kms_helper i2c_piix4 drm ata_piix floppy virtio_console libata virtio_blk i2c_core virtio_balloon pcspkr serio_raw nfsd ip_tables [last unloaded: libcfs]
[127794.221635] CPU: 7 PID: 6493 Comm: umount Tainted: G OE ------------ 3.10.0-debug #1
[127794.225176] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[127794.225765] task: ffff88008d3b0940 ti: ffff880066bd8000 task.ti: ffff880066bd8000
[127794.228638] RIP: 0010:[<ffffffff8137fa54>] [<ffffffff8137fa54>] strnlen+0x24/0x40
[127794.229695] RSP: 0018:ffff880066bdba08 EFLAGS: 00010212
[127794.230509] RAX: ffff88000c218000 RBX: ffff88008d39d3fc RCX: fffffffffffffffe
[127794.231624] RDX: 0000000000001980 RSI: ffffffffffffffff RDI: ffff88000c216680
[127794.232753] RBP: ffff880066bdba08 R08: 000000000000ffff R09: 000000000000ffff
[127794.236251] R10: 0000000000000000 R11: ffff880066bdb9ce R12: ffff88000c216680
[127794.237295] R13: ffff88008d39e000 R14: 00000000ffffffff R15: 0000000000000000
[127794.238825] FS: 00007fc07d02a880(0000) GS:ffff8800bc7c0000(0000) knlGS:0000000000000000
[127794.239945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[127794.240535] CR2: ffff88000c218000 CR3: 000000003e53e000 CR4: 00000000000006e0
[127794.241721] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[127794.242733] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[127794.243846] Stack:
[127794.244433]  ffff880066bdba40 ffffffff81381f0b ffff88008d39d3fc ffff88008d39e000
[127794.246021]  ffff880066bdbb50 ffffffffa03a6937 ffffffffa03a6937 ffff880066bdbab0
[127794.247081]  ffffffff813836a1 00000000579effff 000000001ab70886 0000000000000002
[127794.248371] Call Trace:
[127794.262160]  [<ffffffff81381f0b>] string.isra.7+0x3b/0xf0
[127794.263031]  [<ffffffff813836a1>] vsnprintf+0x201/0x6a0
[127794.263722]  [<ffffffffa024214b>] libcfs_debug_vmsg2+0x31b/0xc10 [libcfs]
[127794.264364]  [<ffffffffa0242a97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[127794.265366]  [<ffffffffa023f6e3>] ? libcfs_log_goto+0x23/0x30 [libcfs]
[127794.266180]  [<ffffffffa03634fc>] ? class_process_config+0x1b9c/0x3110 [obdclass]
[127794.267557]  [<ffffffffa0242a97>] libcfs_debug_msg+0x57/0x80 [libcfs]
[127794.268426]  [<ffffffffa0365099>] class_manual_cleanup+0x629/0xab0 [obdclass]
[127794.270864]  [<ffffffffa0dc34c8>] ll_put_super+0x128/0x990 [lustre]
[127794.271749]  [<ffffffff8112b4ed>] ? call_rcu_sched+0x1d/0x20
[127794.272278]  [<ffffffffa0ded2ec>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[127794.272887]  [<ffffffff8120a3c8>] ? destroy_inode+0x38/0x60
[127794.274341]  [<ffffffff8120a4f6>] ? evict+0x106/0x170
[127794.275097]  [<ffffffff8120a59e>] ? dispose_list+0x3e/0x50
[127794.275868]  [<ffffffff8120b244>] ? evict_inodes+0x114/0x140
[127794.276390]  [<ffffffff811efa66>] generic_shutdown_super+0x56/0xe0
[127794.277036]  [<ffffffff811efe42>] kill_anon_super+0x12/0x20
[127794.277765]  [<ffffffffa03689b5>] lustre_kill_super+0x45/0x50 [obdclass]
[127794.278808]  [<ffffffff811f0349>] deactivate_locked_super+0x49/0x60
[127794.279437]  [<ffffffff811f0946>] deactivate_super+0x46/0x60
[127794.280373]  [<ffffffff8120f135>] mntput_no_expire+0xc5/0x120
[127794.281073]  [<ffffffff812102bf>] SyS_umount+0x9f/0x3c0
[127794.281674]  [<ffffffff8170fc89>] system_call_fastpath+0x16/0x1b
[127794.282515] Code: 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74 2a 80 3f 00 74 25 48 89 f8 31 d2 eb 10 0f 1f 80 00 00 00 00 48 83 c2 01 <80> 38 00 74 09 48 83 c0 01 48 39 d1 75 ee 48 29 f8 5d c3 31 c0
[127794.284816] RIP  [<ffffffff8137fa54>] strnlen+0x24/0x40
[127794.285410]  RSP <ffff880066bdba08>
[127794.285886] CR2: ffff88000c218000
The crash location is in:
(gdb) l *(class_manual_cleanup+0x620)
0x42090 is in class_manual_cleanup (/home/green/git/lustre-release/lustre/obdclass/obd_config.c:1943).
1938		if (lcfg == NULL)
1939			RETURN(-ENOMEM);
1940
1941		rc = class_process_config(lcfg);
1942		if (rc) {
1943			CERROR("cleanup failed %d: %s\n", rc, obd->obd_name);
1944			GOTO(out, rc);
1945		}
1946
1947		/* the lcfg is almost the same for both ops */
(gdb) quit
So it appears that when we entered this function, obd was still valid, but somebody then freed it; indeed, the class_process_config() call made from inside already cannot find this obd (hence the "no device for: lustre-OST0000-osc-..." message right before the oops), and the CERROR() at obd_config.c:1943 then formats the freed obd->obd_name with "%s", which is how we end up faulting in strnlen() under vsnprintf().
The call site for this is in ll_put_super() and does not look safe:
	while ((obd = class_devices_in_group(&sbi->ll_sb_uuid, &next)) != NULL) {
		class_manual_cleanup(obd);
	}
So this walks the OBDs without taking any reference? That does not look good,
though I am not sure which other caller would be dropping the final obd refcount here.
A crashdump is available:
/exports/crashdumps/192.168.10.223-2016-07-31-11:08:32
tag in my source tree: master-20160731