Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6270

lustre-rsync-test test 6: “BUG: soft lockup - CPU#0 stuck for 67s!”

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.5.3
    • None
    • autotest
    • 3
    • 17579

    Description

      lustre-rsync-test test 6 failed with 'test failed to respond and timed out' and the test suite could not progress beyond test 6. Test results are at https://testing.hpdd.intel.com/test_sets/4e4a5590-b9b9-11e4-ba8b-5254006e85c2 .

      The client test_log is empty or just missing. The last entries in the suite_stdout are:

      07:21:25:== lustre-rsync-test test 6: lustre_rsync large no of hard links == 23:21:22 (1424503282)
      07:21:25:CMD: onyx-34vm7 lctl --device lustre-MDT0000 changelog_register -n
      07:21:25:lustre-MDT0000: Registered changelog user cl12
      07:21:26:CMD: onyx-34vm7 lctl get_param -n mdd.lustre-MDT0000.changelog_users
      07:21:58:Lustre filesystem: lustre
      07:21:58:MDT device: lustre-MDT0000
      07:21:58:Source: /mnt/lustre
      07:21:58:Target: /tmp/target
      07:21:58:Target: /tmp/target2
      07:21:58:Statuslog: /tmp/lustre_rsync.log
      07:21:58:Changelog registration: cl12
      07:21:58:Starting changelog record: 0
      07:21:58:Clear changelog after use: no
      07:21:58:Errors: 0
      07:21:58:lustre_rsync took 24 seconds
      07:21:58:Changelog records consumed: 128
      07:21:58:CMD: onyx-34vm7 lctl --device lustre-MDT0000 changelog_deregister cl12
      07:21:58:lustre-MDT0000: Deregistered changelog user 'cl12'
      

      From the client console, we see the soft lockup error with call trace:

      07:23:16:Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links == 23:21:22 (1424503282)
      07:23:16:BUG: soft lockup - CPU#0 stuck for 67s! [ll_sa_26763:26764]
      07:23:16:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 sha512_generic sha256_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      07:23:16:CPU 0 
      …
      07:23:16:irq 11: nobody cared (try booting with the "irqpoll" option)
      07:23:16:Pid: 1, comm: swapper Not tainted 2.6.32-431.29.2.el6.x86_64 #1
      07:23:16:Call Trace:
      07:23:16: <IRQ>  [<ffffffff810e8d8b>] ? __report_bad_irq+0x2b/0xa0
      07:23:16: [<ffffffff810e8f8c>] ? note_interrupt+0x18c/0x1d0
      07:23:16: [<ffffffff810e972d>] ? handle_fasteoi_irq+0xcd/0xf0
      07:23:16: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
      07:23:16: [<ffffffff8153251c>] ? do_IRQ+0x6c/0xf0
      07:23:16: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
      07:23:16: [<ffffffff8107a5a3>] ? __do_softirq+0x73/0x1e0
      07:23:16: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
      07:23:16: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
      07:23:16: [<ffffffff8107a4a5>] ? irq_exit+0x85/0x90
      07:23:16: [<ffffffff81532525>] ? do_IRQ+0x75/0xf0
      07:23:16: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
      07:23:16: <EOI>  [<ffffffff8152b897>] ? _spin_unlock_irqrestore+0x17/0x20
      07:23:16: [<ffffffff810e7b83>] ? __setup_irq+0x1b3/0x3c0
      07:23:16: [<ffffffff813c4bb0>] ? usb_hcd_irq+0x0/0x90
      07:23:16: [<ffffffff810e8553>] ? request_threaded_irq+0x133/0x230
      07:23:16: [<ffffffff813c6b8e>] ? usb_add_hcd+0x50e/0x890
      07:23:16: [<ffffffff813d73da>] ? usb_hcd_pci_probe+0x16a/0x3e0
      07:23:16: [<ffffffff812a5877>] ? local_pci_probe+0x17/0x20
      07:23:16: [<ffffffff812a6a61>] ? pci_device_probe+0x101/0x120
      07:23:16: [<ffffffff8136e232>] ? driver_sysfs_add+0x62/0x90
      07:23:16: [<ffffffff8136e3d0>] ? driver_probe_device+0xa0/0x2a0
      07:23:16: [<ffffffff8136e67b>] ? __driver_attach+0xab/0xb0
      07:23:16: [<ffffffff8136e5d0>] ? __driver_attach+0x0/0xb0
      07:23:16: [<ffffffff8136d984>] ? bus_for_each_dev+0x64/0x90
      07:23:16: [<ffffffff8136e16e>] ? driver_attach+0x1e/0x20
      07:23:16: [<ffffffff8136d1b8>] ? bus_add_driver+0x1e8/0x2b0
      07:23:16: [<ffffffff8136e9c6>] ? driver_register+0x76/0x140
      07:23:16: [<ffffffff812a6cc6>] ? __pci_register_driver+0x56/0xd0
      07:23:16: [<ffffffff81c62471>] ? uhci_hcd_init+0x0/0xca
      07:23:16: [<ffffffff81c62471>] ? uhci_hcd_init+0x0/0xca
      07:23:16: [<ffffffff81c624fd>] ? uhci_hcd_init+0x8c/0xca
      07:23:16: [<ffffffff81c62442>] ? ohci_hcd_mod_init+0x61/0x90
      07:23:16: [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
      07:23:16: [<ffffffff81c268e4>] ? kernel_init+0x29b/0x2f7
      07:23:16: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      07:23:16: [<ffffffff81c26649>] ? kernel_init+0x0/0x2f7
      07:23:16: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      07:23:16:handlers:
      07:23:16:[<ffffffff813c4bb0>] (usb_hcd_irq+0x0/0x90)
      07:23:16:Disabling IRQ #11
      

      This is probably caused by the client problem, after about an hour, the OST dmesg log has:

      Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links == 23:21:22 (1424503282)
      Lustre: lustre-OST0000: haven't heard from client 9b13a821-4bea-aec2-907b-0135ad711c6b (at 10.2.4.130@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff880073e07000, cur 1424503359 expire 1424503329 last 1424503309
      Lustre: Skipped 6 previous similar messages
      Lustre: lustre-OST0004: haven't heard from client 9b13a821-4bea-aec2-907b-0135ad711c6b (at 10.2.4.130@tcp) in 55 seconds. I think it's dead, and I am evicting it. exp ffff880066c78400, cur 1424503364 expire 1424503334 last 1424503309
      Lustre: Skipped 3 previous similar messages
      LustreError: 9994:0:(ofd_grant.c:163:ofd_grant_sanity_check()) ofd_statfs: tot_granted 2379776 != fo_tot_granted 100167680
      SysRq : Show State
        task                        PC stack   pid father
      init          S 0000000000000001     0     1      0 0x00000000
       ffff88007e4c5908 0000000000000086 ffff88007e4c58d0 ffff88007e4c58cc
       0000000000000000 ffff88007f823240 ffff880002216880 0000000000000400
       ffff88007e4c3ab8 ffff88007e4c5fd8 000000000000fbc8 ffff88007e4c3ab8
      Call Trace:
       [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
       [<ffffffff8152b21d>] schedule_hrtimeout_range+0x13d/0x160
       [<ffffffff8109b346>] ? add_wait_queue+0x46/0x60
       [<ffffffff811a0ca5>] ? __pollwait+0x75/0xf0
       [<ffffffff811a0ca5>] ? __pollwait+0x75/0xf0
       [<ffffffff811a0b39>] poll_schedule_timeout+0x39/0x60
       [<ffffffff811a1bfc>] do_select+0x57c/0x6c0
       [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
       [<ffffffff811a0c30>] ? __pollwait+0x0/0xf0
       [<ffffffff811a0d20>] ? pollwake+0x0/0x60
       [<ffffffff811a0d20>] ? pollwake+0x0/0x60
       [<ffffffff811a0d20>] ? pollwake+0x0/0x60
       [<ffffffff811a0d20>] ? pollwake+0x0/0x60
       [<ffffffff811a0d20>] ? pollwake+0x0/0x60
       [<ffffffff811a4e25>] ? d_lookup+0x35/0x60
       [<ffffffff8152ad0e>] ? mutex_lock+0x1e/0x50
       [<ffffffff81194787>] ? pipe_read+0x2a7/0x4e0
       [<ffffffff811a1eca>] core_sys_select+0x18a/0x2c0
       [<ffffffff81227806>] ? security_task_wait+0x16/0x20
       [<ffffffff8107557d>] ? wait_consider_task+0x9d/0xb20
       [<ffffffff8109b39c>] ? remove_wait_queue+0x3c/0x50
       [<ffffffff8107617f>] ? do_wait+0x17f/0x240
       [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
       [<ffffffff811a2257>] sys_select+0x47/0x110
       [<ffffffff81098091>] ? posix_ktime_get_ts+0x11/0x20
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      …
      

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: