[LU-5378] recovery-small test 57 crash on invalid spinlock. Created: 21/Jul/14 Updated: 05/Aug/20 Resolved: 05/Aug/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 14985 | ||||||||
| Description |
|
This seems to reproduce from time to time on my systems.

<4>[ 7018.594828] Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash == 17:20:23 (1405891223)
<0>[ 7021.841908] BUG: spinlock bad magic on CPU#0, lctl/27044 (Not tainted)
<4>[ 7021.842559] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
<4>[ 7021.843161] last sysfs file: /sys/devices/system/cpu/possible
<4>[ 7021.844006] CPU 0
<4>[ 7021.844006] Modules linked in: lustre ofd osp lod ost mdt mdd mgs nodemap osd_ldiskfs ldiskfs lquota lfsck obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 jbd2 mbcache virtio_balloon virtio_console i2c_piix4 i2c_core virtio_net virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
<4>[ 7021.844006]
<4>[ 7021.844006] Pid: 27044, comm: lctl Not tainted 2.6.32-rhe6.5-debug #2 Red Hat KVM
<4>[ 7021.844006] RIP: 0010:[<ffffffff81292771>] [<ffffffff81292771>] spin_bug+0x81/0x100
<4>[ 7021.844006] RSP: 0018:ffff880089aa1cd8 EFLAGS: 00010002
<4>[ 7021.844006] RAX: 0000000000000050 RBX: ffff88007aeec348 RCX: 00000000ffffffff
<4>[ 7021.844006] RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000046
<4>[ 7021.844006] RBP: ffff880089aa1cf8 R08: 0000000000000000 R09: 000000006b6b6b6b
<4>[ 7021.844006] R10: 0736072e07340735 R11: 073907360720075b R12: 6b6b6b6b6b6b6b6b
<4>[ 7021.844006] R13: ffffffff817e1c57 R14: 0000000000000000 R15: 6b6b6b6b6b6b6b6b
<4>[ 7021.844006] FS: 00007f92cc6d4700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
<4>[ 7021.844006] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 7021.844006] CR2: 00007f92cbf57800 CR3: 00000000748fa000 CR4: 00000000000006f0
<4>[ 7021.844006] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 7021.844006] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 7021.844006] Process lctl (pid: 27044, threadinfo ffff880089aa0000, task ffff8800502183c0)
<4>[ 7021.844006] Stack:
<4>[ 7021.844006] 0000006cffffffff ffff88007aeec348 0000000000000000 ffff880089aa1da8
<4>[ 7021.844006] <d> ffff880089aa1d48 ffffffff81292935 0000000000000000 ffff880089aa1e60
<4>[ 7021.844006] <d> ffff880089aa1d38 0000000000000292 0000000000000000 ffff880089aa1da8
<4>[ 7021.844006] Call Trace:
<4>[ 7021.844006] [<ffffffff81292935>] _raw_spin_lock+0xa5/0x180
<4>[ 7021.844006] [<ffffffff81516894>] _spin_lock_irqsave+0x24/0x30
<4>[ 7021.844006] [<ffffffffa09f0a51>] lprocfs_stats_collect+0x161/0x180 [obdclass]
<4>[ 7021.844006] [<ffffffffa09f0ae6>] lprocfs_stats_seq_show+0x76/0x150 [obdclass]
<4>[ 7021.844006] [<ffffffff81170393>] ? kmem_cache_alloc_trace+0x143/0x250
<4>[ 7021.844006] [<ffffffff811ae778>] seq_read+0xf8/0x420
<4>[ 7021.844006] [<ffffffff811ae680>] ? seq_read+0x0/0x420
<4>[ 7021.844006] [<ffffffff811f4ae5>] proc_reg_read+0x85/0xc0
<4>[ 7021.844006] [<ffffffff81189c95>] vfs_read+0xb5/0x1a0
<4>[ 7021.844006] [<ffffffff81189dd1>] sys_read+0x51/0x90
<4>[ 7021.844006] [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
<4>[ 7021.844006] Code: 8d 8e a0 06 00 00 49 89 c1 4c 89 ee 31 c0 48 c7 c7 f8 1f 7e 81 65 8b 14 25 d8 e0 00 00 e8 72 06 28 00 4d 85 e4 44 8b 4b 08 74 6b <45> 8b 84 24 a8 04 00 00 49 8d 8c 24 a0 06 00 00 8b 53 04 48 89
<1>[ 7021.844006] RIP [<ffffffff81292771>] spin_bug+0x81/0x100

There is also an unmount in flight that does the freeing:

PID: 27018  TASK: ffff8800836f80c0  CPU: 5  COMMAND: "umount"
 #0 [ffff880049b216f8] schedule at ffffffff815133ca
 #1 [ffff880049b217c0] schedule_timeout at ffffffff815142b5
 #2 [ffff880049b21870] wait_for_common at ffffffff81513f2b
 #3 [ffff880049b21900] wait_for_completion at ffffffff8151403d
 #4 [ffff880049b21910] remove_proc_entry at ffffffff811fb7a7
 #5 [ffff880049b219b0] lprocfs_remove_nolock at ffffffffa09efa20 [obdclass]
 #6 [ffff880049b219f0] lprocfs_remove at ffffffffa09efc15 [obdclass]
 #7 [ffff880049b21a10] lprocfs_obd_cleanup at ffffffffa09efc84 [obdclass]
 #8 [ffff880049b21a30] osc_precleanup at ffffffffa0ef94ec [osc]
 #9 [ffff880049b21a60] class_cleanup at ffffffffa0a085c3 [obdclass]
#10 [ffff880049b21ae0] class_process_config at ffffffffa0a0a67a [obdclass]
#11 [ffff880049b21b70] class_manual_cleanup at ffffffffa0a0ad59 [obdclass]
#12 [ffff880049b21c30] lov_putref at ffffffffa0f83276 [lov]
#13 [ffff880049b21cb0] lov_disconnect at ffffffffa0f8a8a2 [lov]
#14 [ffff880049b21ce0] ll_put_super at ffffffffa07f51ce [lustre]
#15 [ffff880049b21e30] generic_shutdown_super at ffffffff8118bb0b
#16 [ffff880049b21e50] kill_anon_super at ffffffff8118bbf6
#17 [ffff880049b21e70] lustre_kill_super at ffffffffa0a0cbda [obdclass]
#18 [ffff880049b21e90] deactivate_super at ffffffff8118c397
#19 [ffff880049b21eb0] mntput_no_expire at ffffffff811ab40f
#20 [ffff880049b21ee0] sys_umount at ffffffff811abf7b
#21 [ffff880049b21f80] system_call_fastpath at ffffffff8100b0b2

This is probably somewhat similar to

Sample crashdump in /exports/crashdumps/192.168.10.223-2014-07-20-17:20:28 |
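The 6b6b6b6b values in R09, R12, and R15 above are the slab POISON_FREE pattern: the stats object was freed by the concurrent unmount while the reader was still using it, and CONFIG_DEBUG_SPINLOCK noticed that the lock's magic word had been overwritten. The following is a minimal user-space sketch of that detection mechanism, not Lustre code; all names here are hypothetical (the magic and poison constants match the kernel's `SPINLOCK_MAGIC` and `POISON_FREE`):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SPINLOCK_MAGIC 0xdead4eadU  /* magic the kernel's debug spinlock uses */
#define POISON_FREE    0x6b         /* byte the slab allocator writes on free */

struct dbg_spinlock {
	uint32_t magic;
	uint32_t raw_lock;
};

void dbg_spin_lock_init(struct dbg_spinlock *l)
{
	l->magic = SPINLOCK_MAGIC;
	l->raw_lock = 0;
}

/* Returns 0 on success, -1 when the lock sits in freed (poisoned) memory;
 * at this point the kernel would print "BUG: spinlock bad magic". */
int dbg_spin_lock(struct dbg_spinlock *l)
{
	if (l->magic != SPINLOCK_MAGIC)
		return -1;
	l->raw_lock = 1;
	return 0;
}

/* Stand-in for kfree() with slab debugging enabled: poison the object. */
void fake_kfree(void *obj, size_t size)
{
	memset(obj, POISON_FREE, size);
}
```

With poisoning enabled every freed byte becomes 0x6b, which is why whole registers read 6b6b6b6b6b6b6b6b in the oops when the freed stats structure is dereferenced.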
| Comments |
| Comment by Oleg Drokin [ 09/Dec/14 ] |
|
I just had another hit of this with current master |
| Comment by Vinayak (Inactive) [ 07/Jan/16 ] |
|
We have also hit this issue in recovery-small test 57, at lprocfs_stats_collect+177. Has anybody found a solution for it? |
| Comment by Jian Yu [ 15/Nov/17 ] |
|
While testing patch https://review.whamcloud.com/29088 based on the master branch, recovery-small test 57 hit the following oops on the client:

Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 10:10:34 (1510654234)
Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
Lustre: DEBUG MARKER: lsof -t /mnt/lustre
Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
BUG: unable to handle kernel paging request at 000000000000a078
IP: [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass]
PGD 6b37a067 PUD 5df6f067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl auth_rpcgss lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_blk drm libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel virtio_pci serio_raw virtio_ring virtio 8139cp i2c_core mii floppy [last unloaded: libcfs]
CPU: 1 PID: 1905 Comm: lctl Tainted: G W OE ------------ 3.10.0-693.5.2.el7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff88007c370fd0 ti: ffff880068c98000 task.ti: ffff880068c98000
RIP: 0010:[<ffffffffc0806cd1>] [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass]
RSP: 0018:ffff880068c9bdd8 EFLAGS: 00010246
RAX: 00000000000068c9 RBX: ffff880068c9be20 RCX: 0000000000000007
RDX: 000000000000a078 RSI: 000000000000a090 RDI: 0000000000000007
RBP: ffff880068c9be00 R08: 0000000000000078 R09: 0000000000000018
R10: 0000000000000000 R11: ffff880068c9bca6 R12: ffff880068c95300
R13: 0000000000000003 R14: ffff88007a3cc000 R15: ffff880065354380
FS: 00007f8896e2f740(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000a078 CR3: 00000000685ff000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000000 0000000068eb0a32 0000000000000003 ffff880065354380
 ffff880068c95300 ffff880068c9be70 ffffffffc08078b7 000000005a0ac11b
 0000000035b69261 0000000000000000 0000000000000000 0000000000000000
[17354.102007] Call Trace:
 [<ffffffffc08078b7>] lprocfs_stats_seq_show+0x47/0x140 [obdclass]
 [<ffffffff81226550>] seq_read+0x250/0x3b0
 [<ffffffff812702cd>] proc_reg_read+0x3d/0x80
 [<ffffffff81200b1c>] vfs_read+0x9c/0x170
 [<ffffffff812019df>] SyS_read+0x7f/0xe0
 [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
Code: c1 e0 03 0f 1f 80 00 00 00 00 48 63 d1 49 83 7c d4 18 00 74 48 4c 89 c2 49 03 54 fc 18 41 f6 44 24 04 02 4a 8d 34 0a 48 0f 45 d6 <48> 8b 32 48 01 33 48 8b 72 20 48 01 73 20 48 8b 72 08 48 3b 73
RIP [<ffffffffc0806cd1>] lprocfs_stats_collect+0xc1/0x140 [obdclass]
RSP <ffff880068c9bdd8>
CR2: 000000000000a078

Maloo report: https://testing.hpdd.intel.com/test_sets/dcbc4ef6-c937-11e7-9c63-52540065bddc |
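Both traces show the same lifetime race: the /proc reader is still inside lprocfs_stats_collect() when umount tears down the stats via lprocfs_obd_cleanup(). The usual cure for this class of bug is reference counting: the reader pins the object for the duration of the read, and the owner's release only frees it once the last reference is dropped. Below is a hedged single-threaded sketch of that pattern with hypothetical names (my_stats, stats_get, stats_put); it is not the actual Lustre patch:

```c
#include <stdlib.h>

struct my_stats {
	int refcount;   /* in real kernel code this would be an atomic,
	                 * or be protected by a lock */
	long value;
};

struct my_stats *stats_alloc(void)
{
	struct my_stats *s = calloc(1, sizeof(*s));
	if (s)
		s->refcount = 1;  /* reference held by the owner (the obd) */
	return s;
}

/* Taken by a reader (e.g. the /proc seq_file handler) before use. */
void stats_get(struct my_stats *s)
{
	s->refcount++;
}

/* Drops one reference; returns 1 iff the object was actually freed. */
int stats_put(struct my_stats *s)
{
	if (--s->refcount == 0) {
		free(s);        /* stands in for kfree() */
		return 1;
	}
	return 0;
}
```

With this scheme, even if cleanup drops the owner's reference while a read is in flight, the memory stays valid until the reader's final stats_put(), so the spinlock inside it can never be poisoned underneath the reader.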
| Comment by Jian Yu [ 21/Nov/17 ] |
|
More failure instances on master branch: |
| Comment by Bruno Faccini (Inactive) [ 23/Nov/17 ] |
|
I believe this is the same problem as the one I am working on with |
| Comment by Bruno Faccini (Inactive) [ 02/Feb/18 ] |
|
Oleg, now that my patch for |