Details
- Type: Bug
- Resolution: Duplicate
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.11.0
- Labels: None
- Severity: 3
- Rank: 14985
Description
This seems to reproduce from time to time on my systems.
It's a crash like the one below, most likely because lctl is still reading the proc file while a parallel unmount frees the stats; the 0x6b slab-poison pattern in R09/R12/R15 suggests the reader hit already-freed memory.
<4>[ 7018.594828] Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash == 17:20:23 (1405891223)
<0>[ 7021.841908] BUG: spinlock bad magic on CPU#0, lctl/27044 (Not tainted)
<4>[ 7021.842559] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[ 7021.843161] last sysfs file: /sys/devices/system/cpu/possible
<4>[ 7021.844006] CPU 0
<4>[ 7021.844006] Modules linked in: lustre ofd osp lod ost mdt mdd mgs nodemap osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_balloon virtio_console i2c_piix4 i2c_core virtio_net virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[ 7021.844006]
<4>[ 7021.844006] Pid: 27044, comm: lctl Not tainted 2.6.32-rhe6.5-debug #2 Red Hat KVM
<4>[ 7021.844006] RIP: 0010:[<ffffffff81292771>]  [<ffffffff81292771>] spin_bug+0x81/0x100
<4>[ 7021.844006] RSP: 0018:ffff880089aa1cd8  EFLAGS: 00010002
<4>[ 7021.844006] RAX: 0000000000000050 RBX: ffff88007aeec348 RCX: 00000000ffffffff
<4>[ 7021.844006] RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000046
<4>[ 7021.844006] RBP: ffff880089aa1cf8 R08: 0000000000000000 R09: 000000006b6b6b6b
<4>[ 7021.844006] R10: 0736072e07340735 R11: 073907360720075b R12: 6b6b6b6b6b6b6b6b
<4>[ 7021.844006] R13: ffffffff817e1c57 R14: 0000000000000000 R15: 6b6b6b6b6b6b6b6b
<4>[ 7021.844006] FS:  00007f92cc6d4700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
<4>[ 7021.844006] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 7021.844006] CR2: 00007f92cbf57800 CR3: 00000000748fa000 CR4: 00000000000006f0
<4>[ 7021.844006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 7021.844006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 7021.844006] Process lctl (pid: 27044, threadinfo ffff880089aa0000, task ffff8800502183c0)
<4>[ 7021.844006] Stack:
<4>[ 7021.844006]  0000006cffffffff ffff88007aeec348 0000000000000000 ffff880089aa1da8
<4>[ 7021.844006] <d> ffff880089aa1d48 ffffffff81292935 0000000000000000 ffff880089aa1e60
<4>[ 7021.844006] <d> ffff880089aa1d38 0000000000000292 0000000000000000 ffff880089aa1da8
<4>[ 7021.844006] Call Trace:
<4>[ 7021.844006]  [<ffffffff81292935>] _raw_spin_lock+0xa5/0x180
<4>[ 7021.844006]  [<ffffffff81516894>] _spin_lock_irqsave+0x24/0x30
<4>[ 7021.844006]  [<ffffffffa09f0a51>] lprocfs_stats_collect+0x161/0x180 [obdclass]
<4>[ 7021.844006]  [<ffffffffa09f0ae6>] lprocfs_stats_seq_show+0x76/0x150 [obdclass]
<4>[ 7021.844006]  [<ffffffff81170393>] ? kmem_cache_alloc_trace+0x143/0x250
<4>[ 7021.844006]  [<ffffffff811ae778>] seq_read+0xf8/0x420
<4>[ 7021.844006]  [<ffffffff811ae680>] ? seq_read+0x0/0x420
<4>[ 7021.844006]  [<ffffffff811f4ae5>] proc_reg_read+0x85/0xc0
<4>[ 7021.844006]  [<ffffffff81189c95>] vfs_read+0xb5/0x1a0
<4>[ 7021.844006]  [<ffffffff81189dd1>] sys_read+0x51/0x90
<4>[ 7021.844006]  [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
<4>[ 7021.844006] Code: 8d 8e a0 06 00 00 49 89 c1 4c 89 ee 31 c0 48 c7 c7 f8 1f 7e 81 65 8b 14 25 d8 e0 00 00 e8 72 06 28 00 4d 85 e4 44 8b 4b 08 74 6b <45> 8b 84 24 a8 04 00 00 49 8d 8c 24 a0 06 00 00 8b 53 04 48 89
<1>[ 7021.844006] RIP  [<ffffffff81292771>] spin_bug+0x81/0x100
There's also the unmount that does the freeing:
PID: 27018  TASK: ffff8800836f80c0  CPU: 5  COMMAND: "umount"
 #0 [ffff880049b216f8] schedule at ffffffff815133ca
 #1 [ffff880049b217c0] schedule_timeout at ffffffff815142b5
 #2 [ffff880049b21870] wait_for_common at ffffffff81513f2b
 #3 [ffff880049b21900] wait_for_completion at ffffffff8151403d
 #4 [ffff880049b21910] remove_proc_entry at ffffffff811fb7a7
 #5 [ffff880049b219b0] lprocfs_remove_nolock at ffffffffa09efa20 [obdclass]
 #6 [ffff880049b219f0] lprocfs_remove at ffffffffa09efc15 [obdclass]
 #7 [ffff880049b21a10] lprocfs_obd_cleanup at ffffffffa09efc84 [obdclass]
 #8 [ffff880049b21a30] osc_precleanup at ffffffffa0ef94ec [osc]
 #9 [ffff880049b21a60] class_cleanup at ffffffffa0a085c3 [obdclass]
#10 [ffff880049b21ae0] class_process_config at ffffffffa0a0a67a [obdclass]
#11 [ffff880049b21b70] class_manual_cleanup at ffffffffa0a0ad59 [obdclass]
#12 [ffff880049b21c30] lov_putref at ffffffffa0f83276 [lov]
#13 [ffff880049b21cb0] lov_disconnect at ffffffffa0f8a8a2 [lov]
#14 [ffff880049b21ce0] ll_put_super at ffffffffa07f51ce [lustre]
#15 [ffff880049b21e30] generic_shutdown_super at ffffffff8118bb0b
#16 [ffff880049b21e50] kill_anon_super at ffffffff8118bbf6
#17 [ffff880049b21e70] lustre_kill_super at ffffffffa0a0cbda [obdclass]
#18 [ffff880049b21e90] deactivate_super at ffffffff8118c397
#19 [ffff880049b21eb0] mntput_no_expire at ffffffff811ab40f
#20 [ffff880049b21ee0] sys_umount at ffffffff811abf7b
#21 [ffff880049b21f80] system_call_fastpath at ffffffff8100b0b2
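To make the suspected interleaving easier to see, here is a minimal userspace sketch of the race; it is not the Lustre code, and every name in it (demo_stats, proc_reader, unmounter) is made up. One thread plays the lctl side reading stats under their lock, the other plays the umount side, unpublishing the object and poisoning it the way a debug slab free would; the reader then finds 0x6b where the lock magic should be, which is what spin_bug() complains about above (the kernel's SPINLOCK_MAGIC is 0xdead4ead).

/*
 * Hypothetical userspace sketch of the suspected use-after-free race,
 * not the Lustre code.  The sleeps just force the losing interleaving.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define LOCK_MAGIC 0xdead4eadu          /* same value as the kernel's SPINLOCK_MAGIC */

struct demo_stats {
    uint32_t        magic;              /* stands in for the debug-spinlock magic */
    pthread_mutex_t lock;
    long            counter;
};

static struct demo_stats *stats;        /* object reachable through the proc file */

/* "lctl" side: the proc read handler collecting stats under the lock */
static void *proc_reader(void *unused)
{
    (void)unused;
    struct demo_stats *s = stats;       /* grab the pointer ... */
    usleep(10 * 1000);                  /* ... then lose the race on purpose */
    if (s->magic != LOCK_MAGIC) {       /* the check spin_bug() does in the kernel */
        fprintf(stderr, "BUG: spinlock bad magic: 0x%08x\n", s->magic);
        return NULL;
    }
    pthread_mutex_lock(&s->lock);
    printf("counter = %ld\n", s->counter);
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* "umount" side: tears the stats down without waiting for readers */
static void *unmounter(void *unused)
{
    (void)unused;
    usleep(1 * 1000);                   /* let the reader snapshot the pointer */
    struct demo_stats *s = stats;
    stats = NULL;                       /* unpublish ... */
    memset(s, 0x6b, sizeof(*s));        /* ... and poison like a debug slab free;
                                         * deliberately leaked rather than free()d
                                         * so the reader's access stays defined */
    return NULL;
}

int main(void)
{
    stats = calloc(1, sizeof(*stats));
    stats->magic = LOCK_MAGIC;
    pthread_mutex_init(&stats->lock, NULL);

    pthread_t r, u;
    pthread_create(&r, NULL, proc_reader, NULL);
    pthread_create(&u, NULL, unmounter, NULL);
    pthread_join(r, NULL);
    pthread_join(u, NULL);
    return 0;
}

Built with gcc -pthread, this typically prints "BUG: spinlock bad magic: 0x6b6b6b6b", mirroring the console output above: the reader holds a stale pointer across the teardown, so taking the embedded lock dereferences poisoned memory.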
This is probably somewhat similar to LU-106.
Sample crashdump in /exports/crashdumps/192.168.10.223-2014-07-20-17:20:28
tag in my tree: master-20140720
Attachments
Issue Links
- duplicates
  - LU-10224 recovery-small test_57: timeout (Resolved)