Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
Lustre 2.5.3
-
None
-
autotest
-
3
-
17579
Description
lustre-rsync-test test 6 failed with 'test failed to respond and timed out' and the test suite could not progress beyond test 6. Test results are at https://testing.hpdd.intel.com/test_sets/4e4a5590-b9b9-11e4-ba8b-5254006e85c2 .
The client test_log is empty or just missing. The last entries in the suite_stdout are:
07:21:25:== lustre-rsync-test test 6: lustre_rsync large no of hard links == 23:21:22 (1424503282) 07:21:25:CMD: onyx-34vm7 lctl --device lustre-MDT0000 changelog_register -n 07:21:25:lustre-MDT0000: Registered changelog user cl12 07:21:26:CMD: onyx-34vm7 lctl get_param -n mdd.lustre-MDT0000.changelog_users 07:21:58:Lustre filesystem: lustre 07:21:58:MDT device: lustre-MDT0000 07:21:58:Source: /mnt/lustre 07:21:58:Target: /tmp/target 07:21:58:Target: /tmp/target2 07:21:58:Statuslog: /tmp/lustre_rsync.log 07:21:58:Changelog registration: cl12 07:21:58:Starting changelog record: 0 07:21:58:Clear changelog after use: no 07:21:58:Errors: 0 07:21:58:lustre_rsync took 24 seconds 07:21:58:Changelog records consumed: 128 07:21:58:CMD: onyx-34vm7 lctl --device lustre-MDT0000 changelog_deregister cl12 07:21:58:lustre-MDT0000: Deregistered changelog user 'cl12'
From the client console, we see the soft lockup error with call trace:
07:23:16:Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links == 23:21:22 (1424503282) 07:23:16:BUG: soft lockup - CPU#0 stuck for 67s! [ll_sa_26763:26764] 07:23:16:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 sha512_generic sha256_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs] 07:23:16:CPU 0 … 07:23:16:irq 11: nobody cared (try booting with the "irqpoll" option) 07:23:16:Pid: 1, comm: swapper Not tainted 2.6.32-431.29.2.el6.x86_64 #1 07:23:16:Call Trace: 07:23:16: <IRQ> [<ffffffff810e8d8b>] ? __report_bad_irq+0x2b/0xa0 07:23:16: [<ffffffff810e8f8c>] ? note_interrupt+0x18c/0x1d0 07:23:16: [<ffffffff810e972d>] ? handle_fasteoi_irq+0xcd/0xf0 07:23:16: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0 07:23:16: [<ffffffff8153251c>] ? do_IRQ+0x6c/0xf0 07:23:16: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11 07:23:16: [<ffffffff8107a5a3>] ? __do_softirq+0x73/0x1e0 07:23:16: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30 07:23:16: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0 07:23:16: [<ffffffff8107a4a5>] ? irq_exit+0x85/0x90 07:23:16: [<ffffffff81532525>] ? do_IRQ+0x75/0xf0 07:23:16: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11 07:23:16: <EOI> [<ffffffff8152b897>] ? _spin_unlock_irqrestore+0x17/0x20 07:23:16: [<ffffffff810e7b83>] ? __setup_irq+0x1b3/0x3c0 07:23:16: [<ffffffff813c4bb0>] ? usb_hcd_irq+0x0/0x90 07:23:16: [<ffffffff810e8553>] ? request_threaded_irq+0x133/0x230 07:23:16: [<ffffffff813c6b8e>] ? usb_add_hcd+0x50e/0x890 07:23:16: [<ffffffff813d73da>] ? usb_hcd_pci_probe+0x16a/0x3e0 07:23:16: [<ffffffff812a5877>] ? local_pci_probe+0x17/0x20 07:23:16: [<ffffffff812a6a61>] ? pci_device_probe+0x101/0x120 07:23:16: [<ffffffff8136e232>] ? driver_sysfs_add+0x62/0x90 07:23:16: [<ffffffff8136e3d0>] ? driver_probe_device+0xa0/0x2a0 07:23:16: [<ffffffff8136e67b>] ? __driver_attach+0xab/0xb0 07:23:16: [<ffffffff8136e5d0>] ? __driver_attach+0x0/0xb0 07:23:16: [<ffffffff8136d984>] ? bus_for_each_dev+0x64/0x90 07:23:16: [<ffffffff8136e16e>] ? driver_attach+0x1e/0x20 07:23:16: [<ffffffff8136d1b8>] ? bus_add_driver+0x1e8/0x2b0 07:23:16: [<ffffffff8136e9c6>] ? driver_register+0x76/0x140 07:23:16: [<ffffffff812a6cc6>] ? __pci_register_driver+0x56/0xd0 07:23:16: [<ffffffff81c62471>] ? uhci_hcd_init+0x0/0xca 07:23:16: [<ffffffff81c62471>] ? uhci_hcd_init+0x0/0xca 07:23:16: [<ffffffff81c624fd>] ? uhci_hcd_init+0x8c/0xca 07:23:16: [<ffffffff81c62442>] ? ohci_hcd_mod_init+0x61/0x90 07:23:16: [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0 07:23:16: [<ffffffff81c268e4>] ? kernel_init+0x29b/0x2f7 07:23:16: [<ffffffff8100c20a>] ? child_rip+0xa/0x20 07:23:16: [<ffffffff81c26649>] ? kernel_init+0x0/0x2f7 07:23:16: [<ffffffff8100c200>] ? child_rip+0x0/0x20 07:23:16:handlers: 07:23:16:[<ffffffff813c4bb0>] (usb_hcd_irq+0x0/0x90) 07:23:16:Disabling IRQ #11
This is probably caused by the client problem, after about an hour, the OST dmesg log has:
Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links == 23:21:22 (1424503282) Lustre: lustre-OST0000: haven't heard from client 9b13a821-4bea-aec2-907b-0135ad711c6b (at 10.2.4.130@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff880073e07000, cur 1424503359 expire 1424503329 last 1424503309 Lustre: Skipped 6 previous similar messages Lustre: lustre-OST0004: haven't heard from client 9b13a821-4bea-aec2-907b-0135ad711c6b (at 10.2.4.130@tcp) in 55 seconds. I think it's dead, and I am evicting it. exp ffff880066c78400, cur 1424503364 expire 1424503334 last 1424503309 Lustre: Skipped 3 previous similar messages LustreError: 9994:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_statfs: tot_granted 2379776 != fo_tot_granted 100167680 SysRq : Show State task PC stack pid father init S 0000000000000001 0 1 0 0x00000000 ffff88007e4c5908 0000000000000086 ffff88007e4c58d0 ffff88007e4c58cc 0000000000000000 ffff88007f823240 ffff880002216880 0000000000000400 ffff88007e4c3ab8 ffff88007e4c5fd8 000000000000fbc8 ffff88007e4c3ab8 Call Trace: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20 [<ffffffff8152b21d>] schedule_hrtimeout_range+0x13d/0x160 [<ffffffff8109b346>] ? add_wait_queue+0x46/0x60 [<ffffffff811a0ca5>] ? __pollwait+0x75/0xf0 [<ffffffff811a0ca5>] ? __pollwait+0x75/0xf0 [<ffffffff811a0b39>] poll_schedule_timeout+0x39/0x60 [<ffffffff811a1bfc>] do_select+0x57c/0x6c0 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20 [<ffffffff811a0c30>] ? __pollwait+0x0/0xf0 [<ffffffff811a0d20>] ? pollwake+0x0/0x60 [<ffffffff811a0d20>] ? pollwake+0x0/0x60 [<ffffffff811a0d20>] ? pollwake+0x0/0x60 [<ffffffff811a0d20>] ? pollwake+0x0/0x60 [<ffffffff811a0d20>] ? pollwake+0x0/0x60 [<ffffffff811a4e25>] ? d_lookup+0x35/0x60 [<ffffffff8152ad0e>] ? mutex_lock+0x1e/0x50 [<ffffffff81194787>] ? pipe_read+0x2a7/0x4e0 [<ffffffff811a1eca>] core_sys_select+0x18a/0x2c0 [<ffffffff81227806>] ? security_task_wait+0x16/0x20 [<ffffffff8107557d>] ? wait_consider_task+0x9d/0xb20 [<ffffffff8109b39c>] ? remove_wait_queue+0x3c/0x50 [<ffffffff8107617f>] ? do_wait+0x17f/0x240 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0 [<ffffffff811a2257>] sys_select+0x47/0x110 [<ffffffff81098091>] ? posix_ktime_get_ts+0x11/0x20 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b …
Attachments
Issue Links
- duplicates
-
LU-4410 sanityn test 40a: BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd_0:2892]
- Resolved