Details
- Bug
- Resolution: Cannot Reproduce
- Minor
- None
- Lustre 2.4.0
- None
- rhel6, kernel 2.6.32-279.2.1.el6
- 3
- 7056
Description
Running the 2.3.62 tag with replay-single in a loop, I got test 73c locked up.
dmesg is full of soft lockup messages like this:
[267156.100008] BUG: soft lockup - CPU#4 stuck for 67s! [umount:7615]
[267156.100273] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 i2c_piix4 i2c_core virtio_balloon virtio_console virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[267156.101054] CPU 4
[267156.101054] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 i2c_piix4 i2c_core virtio_balloon virtio_console virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache nfs_acl auth_rpcgss sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[267156.101054]
[267156.101054] Pid: 7615, comm: umount Not tainted 2.6.32-debug #6 Bochs Bochs
[267156.101054] RIP: 0010:[<ffffffff810a78fa>]  [<ffffffff810a78fa>] smp_call_function_many+0x1ea/0x260
[267156.101054] RSP: 0018:ffff88007cc23968  EFLAGS: 00000202
[267156.101054] RAX: 0000000000000011 RBX: ffff88007cc239a8 RCX: ffff88007cc23828
[267156.101054] RDX: 0000000000000010 RSI: 800000006d1ad160 RDI: 0000000000000282
[267156.101054] RBP: ffffffff8100bc0e R08: 0000000000000001 R09: ffff880000000000
[267156.101054] R10: 0000000000000000 R11: 0000000087654321 R12: 0000000000000282
[267156.101054] R13: ffff88007cc23918 R14: ffffffff811adb30 R15: ffff8800ba6b2ef0
[267156.101054] FS:  00007fb972b2c740(0000) GS:ffff880006300000(0000) knlGS:0000000000000000
[267156.101054] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[267156.101054] CR2: 00007fb9722481a0 CR3: 0000000048165000 CR4: 00000000000006e0
[267156.101054] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[267156.101054] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[267156.101054] Process umount (pid: 7615, threadinfo ffff88007cc22000, task ffff88008800e440)
[267156.101054] Stack:
[267156.101054]  01ff88007cc239d8 0000000000000000 ffffffff81a91560 ffffffff811adb30
[267156.101054] <d> 0000000000000000 ffff8800989699c0 ffff880098969a40 0000000000000000
[267156.101054] <d> ffff88007cc239b8 ffffffff810a7992 ffff88007cc239e8 ffffffff810714f4
[267156.101054] Call Trace:
[267156.101054]  [<ffffffff811adb30>] ? invalidate_bh_lru+0x0/0x50
[267156.101054]  [<ffffffff810a7992>] ? smp_call_function+0x22/0x30
[267156.101054]  [<ffffffff810714f4>] ? on_each_cpu+0x24/0x50
[267156.101054]  [<ffffffff811ad7ac>] ? invalidate_bh_lrus+0x1c/0x20
[267156.101054]  [<ffffffff811ae665>] ? invalidate_bdev+0x25/0x50
[267156.101054]  [<ffffffffa043fa64>] ? ldiskfs_put_super+0x1f4/0x3e0 [ldiskfs]
[267156.101054]  [<ffffffff8117d6ab>] ? generic_shutdown_super+0x5b/0xe0
[267156.101054]  [<ffffffff8117d761>] ? kill_block_super+0x31/0x50
[267156.101054]  [<ffffffff8117e825>] ? deactivate_super+0x85/0xa0
[267156.101054]  [<ffffffff8119a89f>] ? mntput_no_expire+0xbf/0x110
[267156.101054]  [<ffffffffa050b5af>] ? osd_device_fini+0x32f/0x380 [osd_ldiskfs]
[267156.101054]  [<ffffffffa0a35cc7>] ? class_cleanup+0x577/0xda0 [obdclass]
[267156.101054]  [<ffffffffa0a0be9c>] ? class_name2dev+0x7c/0xe0 [obdclass]
[267156.101054]  [<ffffffffa0a375ac>] ? class_process_config+0x10bc/0x1c80 [obdclass]
[267156.101054]  [<ffffffffa0a30f93>] ? lustre_cfg_new+0x353/0x7e0 [obdclass]
[267156.101054]  [<ffffffffa0a382e9>] ? class_manual_cleanup+0x179/0x6e0 [obdclass]
[267156.101054]  [<ffffffffa0a09ce1>] ? class_export_put+0x101/0x2c0 [obdclass]
[267156.101054]  [<ffffffffa0512ce4>] ? osd_obd_disconnect+0x174/0x1e0 [osd_ldiskfs]
[267156.101054]  [<ffffffffa0a3ab2e>] ? lustre_put_lsi+0x17e/0xe20 [obdclass]
[267156.101054]  [<ffffffffa0a42ad8>] ? lustre_common_put_super+0x5d8/0xc20 [obdclass]
[267156.101054]  [<ffffffffa0a43eea>] ? server_put_super+0x1ca/0xe60 [obdclass]
[267156.101054]  [<ffffffff8117d6ab>] ? generic_shutdown_super+0x5b/0xe0
[267156.101054]  [<ffffffff8117d796>] ? kill_anon_super+0x16/0x60
[267156.101054]  [<ffffffffa0a3a0e6>] ? lustre_kill_super+0x36/0x60 [obdclass]
[267156.101054]  [<ffffffff8117e825>] ? deactivate_super+0x85/0xa0
[267156.101054]  [<ffffffff8119a89f>] ? mntput_no_expire+0xbf/0x110
[267156.101054]  [<ffffffff8119b34b>] ? sys_umount+0x7b/0x3a0
[267156.101054]  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
[267156.101054] Code: 36 45 00 0f ae f0 48 8b 7b 30 ff 15 29 c4 98 00 80 7d c7 00 0f 84 9f fe ff ff f6 43 20 01 0f 84 95 fe ff ff 0f 1f 44 00 00 f3 90 <f6> 43 20 01 75 f8 e9 83 fe ff ff 0f 1f 00 4c 89 ea 4c 89 f6 44
Examining the information, I found there is also another thread on a different CPU, in jbd2:
(gdb) bt
#0 __ticket_spin_trylock (lock=0xffff88006b9338e8)
at /home/green/bk/linux-2.6.32-279.2.1.el6-debug/arch/x86/include/asm/spinlock.h:155
#1 __raw_spin_trylock (lock=0xffff88006b9338e8)
at /home/green/bk/linux-2.6.32-279.2.1.el6-debug/arch/x86/include/asm/spinlock.h:217
#2 __spin_lock_debug (lock=0xffff88006b9338e8) at lib/spinlock_debug.c:109
#3 _raw_spin_lock (lock=0xffff88006b9338e8) at lib/spinlock_debug.c:132
#4 0xffffffff814fb054 in __spin_lock_irqsave (lock=<optimized out>)
at include/linux/spinlock_api_smp.h:256
#5 _spin_lock_irqsave (lock=<optimized out>) at kernel/spinlock.c:66
#6 0xffffffff81051f52 in __wake_up (q=0xffff88006b9338e8, mode=3,
nr_exclusive=1, key=0x0) at kernel/sched.c:6203
#7 0xffffffffa0385663 in kjournald2 ()
#8 0xffffffff8108fa16 in kthread (_create=0xffff880070a175e8)
at kernel/kthread.c:78
#9 0xffffffff8100c14a in child_rip () at arch/x86/kernel/entry_64.S:1211
I have a crashdump in /exports/crashdumps/t2/hung.dmp, though I am not sure if anybody would like to dig into this.
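For anyone who wants to pick it up, the dump should open with the crash utility against the matching debug vmlinux; the vmlinux path below is an assumption based on the build tree shown in the backtrace, so adjust it to wherever the 2.6.32-debug kernel actually lives:

# Hypothetical paths: the debug build tree is taken from the gdb backtrace above.
crash /home/green/bk/linux-2.6.32-279.2.1.el6-debug/vmlinux /exports/crashdumps/t2/hung.dmp

From the crash prompt, `bt -a` dumps the stacks of all CPUs at once, which is the quickest way to see both the stuck umount thread and the spinning kjournald2 thread together.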