Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
Description
I just recorded the following crash in my testing:
[127672.042005] Lustre: DEBUG MARKER: == recovery-small test 132: long punch =============================================================== 11:06:23 (1469977583)
[127672.479285] Lustre: Mounted lustre-client
[127672.479815] Lustre: Skipped 1 previous similar message
[127745.047388] LNet: Service thread pid 22124 was inactive for 72.05s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[127745.064157] Pid: 22124, comm: ll_ost_io01_006
[127745.064689] Call Trace:
[127745.065676]  [<ffffffff81704339>] schedule+0x29/0x70
[127745.066206]  [<ffffffff81700972>] schedule_timeout+0x162/0x2a0
[127745.066737]  [<ffffffff810879f0>] ? process_timeout+0x0/0x10
[127745.067287]  [<ffffffffa02401c3>] __cfs_fail_timeout_set+0xe3/0x170 [libcfs]
[127745.067844]  [<ffffffffa0d46c0e>] ofd_punch_hdl+0x11e/0xa10 [ofd]
[127745.080775]  [<ffffffffa05f7325>] tgt_request_handle+0x925/0x1330 [ptlrpc]
[127745.081379]  [<ffffffffa05a4921>] ptlrpc_server_handle_request+0x231/0xac0 [ptlrpc]
[127745.082402]  [<ffffffffa05a24d8>] ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
[127745.082976]  [<ffffffffa05a8738>] ptlrpc_main+0xa58/0x1dd0 [ptlrpc]
[127745.084901]  [<ffffffffa05a7ce0>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
[127745.085451]  [<ffffffff810a2eda>] kthread+0xea/0xf0
[127745.085952]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[127745.086495]  [<ffffffff8170fbd8>] ret_from_fork+0x58/0x90
[127745.087007]  [<ffffffff810a2df0>] ? kthread+0x0/0xf0
[127745.087522]
[127745.087965] LustreError: dumping log to /tmp/lustre-log.1469977656.22124
[127792.994255] LustreError: 22124:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 236 awake
[127792.996636] LNet: Service thread pid 22124 completed after 120.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[127794.212798] LustreError: 6493:0:(obd_config.c:1262:class_process_config()) no device for: lustre-OST0000-osc-ffff8800655a9800
[127794.214028] BUG: unable to handle kernel paging request at ffff88000c218000
[127794.214770] IP: [<ffffffff8137fa54>] strnlen+0x24/0x40
[127794.215885] PGD 2e75067 PUD 2e76067 PMD bcdb9067 PTE 800000000c218060
[127794.216455] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[127794.216970] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) loop mbcache jbd2 sha512_generic crypto_null rpcsec_gss_krb5 syscopyarea sysfillrect sysimgblt ttm ata_generic pata_acpi drm_kms_helper i2c_piix4 drm ata_piix floppy virtio_console libata virtio_blk i2c_core virtio_balloon pcspkr serio_raw nfsd ip_tables [last unloaded: libcfs]
[127794.221635] CPU: 7 PID: 6493 Comm: umount Tainted: G OE ------------ 3.10.0-debug #1
[127794.225176] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[127794.225765] task: ffff88008d3b0940 ti: ffff880066bd8000 task.ti: ffff880066bd8000
[127794.228638] RIP: 0010:[<ffffffff8137fa54>] [<ffffffff8137fa54>] strnlen+0x24/0x40
[127794.229695] RSP: 0018:ffff880066bdba08 EFLAGS: 00010212
[127794.230509] RAX: ffff88000c218000 RBX: ffff88008d39d3fc RCX: fffffffffffffffe
[127794.231624] RDX: 0000000000001980 RSI: ffffffffffffffff RDI: ffff88000c216680
[127794.232753] RBP: ffff880066bdba08 R08: 000000000000ffff R09: 000000000000ffff
[127794.236251] R10: 0000000000000000 R11: ffff880066bdb9ce R12: ffff88000c216680
[127794.237295] R13: ffff88008d39e000 R14: 00000000ffffffff R15: 0000000000000000
[127794.238825] FS: 00007fc07d02a880(0000) GS:ffff8800bc7c0000(0000) knlGS:0000000000000000
[127794.239945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[127794.240535] CR2: ffff88000c218000 CR3: 000000003e53e000 CR4: 00000000000006e0
[127794.241721] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[127794.242733] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[127794.243846] Stack:
[127794.244433]  ffff880066bdba40 ffffffff81381f0b ffff88008d39d3fc ffff88008d39e000
[127794.246021]  ffff880066bdbb50 ffffffffa03a6937 ffffffffa03a6937 ffff880066bdbab0
[127794.247081]  ffffffff813836a1 00000000579effff 000000001ab70886 0000000000000002
[127794.248371] Call Trace:
[127794.262160]  [<ffffffff81381f0b>] string.isra.7+0x3b/0xf0
[127794.263031]  [<ffffffff813836a1>] vsnprintf+0x201/0x6a0
[127794.263722]  [<ffffffffa024214b>] libcfs_debug_vmsg2+0x31b/0xc10 [libcfs]
[127794.264364]  [<ffffffffa0242a97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[127794.265366]  [<ffffffffa023f6e3>] ? libcfs_log_goto+0x23/0x30 [libcfs]
[127794.266180]  [<ffffffffa03634fc>] ? class_process_config+0x1b9c/0x3110 [obdclass]
[127794.267557]  [<ffffffffa0242a97>] libcfs_debug_msg+0x57/0x80 [libcfs]
[127794.268426]  [<ffffffffa0365099>] class_manual_cleanup+0x629/0xab0 [obdclass]
[127794.270864]  [<ffffffffa0dc34c8>] ll_put_super+0x128/0x990 [lustre]
[127794.271749]  [<ffffffff8112b4ed>] ? call_rcu_sched+0x1d/0x20
[127794.272278]  [<ffffffffa0ded2ec>] ? ll_destroy_inode+0x1c/0x20 [lustre]
[127794.272887]  [<ffffffff8120a3c8>] ? destroy_inode+0x38/0x60
[127794.274341]  [<ffffffff8120a4f6>] ? evict+0x106/0x170
[127794.275097]  [<ffffffff8120a59e>] ? dispose_list+0x3e/0x50
[127794.275868]  [<ffffffff8120b244>] ? evict_inodes+0x114/0x140
[127794.276390]  [<ffffffff811efa66>] generic_shutdown_super+0x56/0xe0
[127794.277036]  [<ffffffff811efe42>] kill_anon_super+0x12/0x20
[127794.277765]  [<ffffffffa03689b5>] lustre_kill_super+0x45/0x50 [obdclass]
[127794.278808]  [<ffffffff811f0349>] deactivate_locked_super+0x49/0x60
[127794.279437]  [<ffffffff811f0946>] deactivate_super+0x46/0x60
[127794.280373]  [<ffffffff8120f135>] mntput_no_expire+0xc5/0x120
[127794.281073]  [<ffffffff812102bf>] SyS_umount+0x9f/0x3c0
[127794.281674]  [<ffffffff8170fc89>] system_call_fastpath+0x16/0x1b
[127794.282515] Code: 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74 2a 80 3f 00 74 25 48 89 f8 31 d2 eb 10 0f 1f 80 00 00 00 00 48 83 c2 01 <80> 38 00 74 09 48 83 c0 01 48 39 d1 75 ee 48 29 f8 5d c3 31 c0
[127794.284816] RIP  [<ffffffff8137fa54>] strnlen+0x24/0x40
[127794.285410]  RSP <ffff880066bdba08>
[127794.285886] CR2: ffff88000c218000
The crash location is in:
(gdb) l *(class_manual_cleanup+0x620)
0x42090 is in class_manual_cleanup (/home/green/git/lustre-release/lustre/obdclass/obd_config.c:1943).
1938		if (lcfg == NULL)
1939			RETURN(-ENOMEM);
1940
1941		rc = class_process_config(lcfg);
1942		if (rc) {
1943			CERROR("cleanup failed %d: %s\n", rc, obd->obd_name);
1944			GOTO(out, rc);
1945		}
1946
1947		/* the lcfg is almost the same for both ops */
(gdb) quit
So it appears that when we entered this function, obd was still valid, but somebody then freed it; indeed, the class_process_config() call made from inside already cannot find this obd (hence the "no device for: lustre-OST0000-osc-..." message right before the oops), and the CERROR() at obd_config.c:1943 then formats the freed obd->obd_name with "%s", which is how we end up faulting in strnlen() under vsnprintf().
The call site for this is in ll_put_super() and does not look safe:
	while ((obd = class_devices_in_group(&sbi->ll_sb_uuid, &next)) != NULL) {
		class_manual_cleanup(obd);
	}
So this walks the OBDs without taking any reference? That does not look good,
though I am not sure which other caller would be dropping the final obd refcount here.
A crashdump is available:
/exports/crashdumps/192.168.10.223-2016-07-31-11:08:32
tag in my source tree: master-20160731