
[LU-5605] Interop 2.5.2<->2.7 lustre-rsync-test test_6: soft lockup on statahead

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Affects Versions: Lustre 2.7.0, Lustre 2.5.4
    • Environment: server: lustre-master build # 2639
      client: 2.5.2

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/7e94d1f8-32e1-11e4-9c61-5254006e85c2.

      The sub-test test_6 failed with the following error:

      test failed to respond and timed out

      test log

      22:57:14:Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links == 22:55:14 (1409637314)
      22:57:14:BUG: soft lockup - CPU#0 stuck for 67s! [ll_sa_27678:27679]
      22:57:14:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      22:57:14:BUG: soft lockup - CPU#1 stuck for 67s! [ptlrpcd_0:1399]
      22:57:14:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      22:57:14:CPU 1 
      22:57:14:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      22:57:14:
      22:57:14:Pid: 1399, comm: ptlrpcd_0 Not tainted 2.6.32-431.17.1.el6.x86_64 #1 Red Hat KVM
      22:57:14:RIP: 0010:[<ffffffff8152a5ee>]  [<ffffffff8152a5ee>] _spin_lock+0x1e/0x30
      22:57:14:RSP: 0018:ffff88007c48dc20  EFLAGS: 00000206
      22:57:14:RAX: 0000000000000002 RBX: ffff88007c48dc20 RCX: 5a5a5a5a5a5a5a5a
      22:57:14:RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88005302f440
      22:57:14:RBP: ffffffff8100bb8e R08: 5a5a5a5a5a5a5a5a R09: 5a5a5a5a5a5a5a5a
      22:57:14:R10: 0000000000000000 R11: 0000000000000078 R12: 0000000000000000
      22:57:14:R13: ffff88007cbe12d8 R14: 0000000000000000 R15: ffffffff00000000
      22:57:14:FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
      22:57:14:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      22:57:14:CR2: 0000000001eb81c0 CR3: 0000000001a85000 CR4: 00000000000006e0
      22:57:14:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      22:57:14:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      22:57:14:Process ptlrpcd_0 (pid: 1399, threadinfo ffff88007c48c000, task ffff88007a29d540)
      22:57:14:Stack:
      22:57:14: ffff88007c48dc80 ffffffffa1083c5b 0000000000000000 ffff880049195f48
      22:57:14:<d> ffff88005302f440 ffff880060ac6800 ffff880060ac6800 ffff88007adc0480
      22:57:14:<d> ffff880060ac6800 ffff880049195e00 0000000000000000 ffff88007c641400
      22:57:14:Call Trace:
      22:57:14: [<ffffffffa1083c5b>] ? ll_statahead_interpret+0x7b/0x560 [lustre]
      22:57:14: [<ffffffffa0efa736>] ? mdc_intent_getattr_async_interpret+0x1f6/0x540 [mdc]
      22:57:14: [<ffffffffa0d3caec>] ? ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
      22:57:14: [<ffffffffa0d6804b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      22:57:14: [<ffffffffa0d6856b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
      22:57:14: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      22:57:14: [<ffffffffa0d68360>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      22:57:14: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
      22:57:14: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      22:57:14: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      22:57:14: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      22:57:14:Code: 00 00 00 01 74 05 e8 22 40 d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 
      22:57:14:Call Trace:
      22:57:14: [<ffffffffa1083c5b>] ? ll_statahead_interpret+0x7b/0x560 [lustre]
      22:57:14: [<ffffffffa0efa736>] ? mdc_intent_getattr_async_interpret+0x1f6/0x540 [mdc]
      22:57:14: [<ffffffffa0d3caec>] ? ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
      22:57:14: [<ffffffffa0d6804b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
      22:57:14: [<ffffffffa0d6856b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
      22:57:14: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      22:57:14: [<ffffffffa0d68360>] ? ptlrpcd+0x0/0x370 [ptlrpc]
      22:57:14: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
      22:57:14: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      22:57:14: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      22:57:14: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      22:57:14:CPU 0 
      22:58:35:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) sha512_generic sha256_generic nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
      22:58:35:
      22:58:35:Pid: 27679, comm: ll_sa_27678 Not tainted 2.6.32-431.17.1.el6.x86_64 #1 Red Hat KVM
      22:58:35:RIP: 0010:[<ffffffff8152a5ee>]  [<ffffffff8152a5ee>] _spin_lock+0x1e/0x30
      22:58:35:RSP: 0018:ffff880068113d50  EFLAGS: 00000206
      22:58:35:RAX: 0000000000000001 RBX: ffff880068113d50 RCX: 0000000000000000
      22:58:35:RDX: 0000000000000000 RSI: ffff8800531765c0 RDI: ffff88005302f440
      22:58:35:RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 00000000fffffffe
      22:58:35:R10: 0000000000000000 R11: 0000000000000001 R12: ffff880068113d30
      22:58:35:R13: ffff8800228a4400 R14: 0000000000001000 R15: 0000000000000000
      22:58:35:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
      22:58:35:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      22:58:35:CR2: 00000000019ea0f8 CR3: 00000000253dd000 CR4: 00000000000006f0
      22:58:35:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      22:58:35:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      22:58:35:Process ll_sa_27678 (pid: 27679, threadinfo ffff880068112000, task ffff88002879aaa0)
      22:58:35:Stack:
      22:58:35: ffff880068113dc0 ffffffffa1084cc0 ffff88005302f178 ffff880033595670
      22:58:35:<d> 0000000000000000 ffff88005302f080 ffff88005dc395c0 ffff88002513f600
      22:58:35:<d> ffff880068113dc0 ffff880028433000 ffff880028433170 ffff88005302f400
      22:58:35:Call Trace:
      22:58:35: [<ffffffffa1084cc0>] ? ll_post_statahead+0x50/0xa20 [lustre]
      22:58:35: [<ffffffffa1088ce8>] ? ll_statahead_thread+0x228/0xfb0 [lustre]
      22:58:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      22:58:35: [<ffffffffa1088ac0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
      22:58:35: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
      22:58:35: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      22:58:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      22:58:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      22:58:35:Code: 00 00 00 01 74 05 e8 22 40 d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 
      22:58:35:Call Trace:
      22:58:35: [<ffffffffa1084cc0>] ? ll_post_statahead+0x50/0xa20 [lustre]
      22:58:35: [<ffffffffa1088ce8>] ? ll_statahead_thread+0x228/0xfb0 [lustre]
      22:58:35: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      22:58:35: [<ffffffffa1088ac0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
      22:58:35: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
      22:58:35: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      22:58:35: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      22:58:35: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Info required for matching: lustre-rsync-test 6

    Attachments

    Issue Links

    Activity

            yujian Jian Yu added a comment -

            This is likely fixed by the landing of http://review.whamcloud.com/9665 from LU-2272.

            The patch needs to be back-ported to Lustre b2_5 and b_ieel2_0 branches. The work will be tracked in LU-4410.

            jhammond John Hammond added a comment -

            This is likely fixed by the landing of http://review.whamcloud.com/9665 from LU-2272.

            yujian Jian Yu added a comment - One more instance on master branch: https://testing.hpdd.intel.com/test_sets/8883d38c-5b57-11e4-9243-5254006e85c2
            jhammond John Hammond added a comment -

            https://testing.hpdd.intel.com/test_logs/89fdd20c-5b58-11e4-9243-5254006e85c2
            https://testing.hpdd.intel.com/test_nodes/301f0368-5b4f-11e4-9243-5254006e85c2
            export/scratch/dumps/shadow-19vm10.shadow.whamcloud.com/10.1.4.233-2014-10-23-20:48:50/vmcore-dmesg.txt

             Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links == 20:47:00 (1414097220)
            BUG: soft lockup - CPU#0 stuck for 67s! [ptlrpcd_1:2595]
            ...
            CPU 0 
            BUG: soft lockup - CPU#1 stuck for 67s! [ll_sa_13692:13693]
            ...
            CPU 1 
            ...
            

             I don't know why it says procfs_read_helper below, but it's really ll_statahead_interpret().

            ptlrpcd_0 is in ll_statahead_interpret() holding lli_sa_lock and calling ll_intent_drop_lock().

            crash> bt 2594
            PID: 2594   TASK: ffff88007ad7f500  CPU: 1   COMMAND: "ptlrpcd_0"
             #0 [ffff880043be78c0] schedule at ffffffff81528e00
             #1 [ffff880043be7988] __cond_resched at ffffffff810695fa
             #2 [ffff880043be79a8] _cond_resched at ffffffff81529840
             #3 [ffff880043be79b8] kmem_cache_alloc_trace at ffffffff8116f445
             #4 [ffff880043be7a08] ldlm_bl_to_thread at ffffffffa0e134f8 [ptlrpc]
             #5 [ffff880043be7b08] ldlm_bl_to_thread_lock at ffffffffa0e13a29 [ptlrpc]
             #6 [ffff880043be7b18] ldlm_lock_decref_internal at ffffffffa0dfefc5 [ptlrpc]
             #7 [ffff880043be7b78] ldlm_lock_decref at ffffffffa0e00079 [ptlrpc]
             #8 [ffff880043be7ba8] ll_intent_drop_lock at ffffffffa1157bd1 [lustre]
             #9 [ffff880043be7bd8] procfs_read_helper at ffffffffa11bc2ae [lustre]
            #10 [ffff880043be7c38] mdc_intent_getattr_async_interpret at ffffffffa103fa5a [mdc]
            #11 [ffff880043be7cb8] ptlrpc_check_set at ffffffffa0e28a63 [ptlrpc]
            #12 [ffff880043be7d78] ptlrpcd_check at ffffffffa0e55153 [ptlrpc]
            #13 [ffff880043be7dd8] ptlrpcd at ffffffffa0e5579b [ptlrpc]
            #14 [ffff880043be7ee8] kthread at ffffffff8109abf6
            #15 [ffff880043be7f48] kernel_thread at ffffffff8100c20a
            

             This is the problem, since we have a non-atomic kmalloc() in ldlm_bl_to_thread():

            static int ldlm_bl_to_thread(struct ldlm_namespace *ns,
                                         struct ldlm_lock_desc *ld,
                                         struct ldlm_lock *lock,
                                         struct list_head *cancels, int count,
                                         ldlm_cancel_flags_t cancel_flags)
            {
                    ENTRY;
            
                    if (cancels && count == 0)
                            RETURN(0);
            
                    if (cancel_flags & LCF_ASYNC) {
                            struct ldlm_bl_work_item *blwi;
            
                            OBD_ALLOC(blwi, sizeof(*blwi));
                            if (blwi == NULL)
                                    RETURN(-ENOMEM);
                            init_blwi(blwi, ns, ld, cancels, count, lock, cancel_flags);
            
                            RETURN(__ldlm_bl_to_thread(blwi, cancel_flags));
                     }
            

            ptlrpcd_1 is in ll_statahead_interpret() trying to take lli_sa_lock:

            crash> bt 2595
            PID: 2595   TASK: ffff88007ad7eaa0  CPU: 0   COMMAND: "ptlrpcd_1"
             #0 [ffff880002207e90] crash_nmi_callback at ffffffff8102fee6
             #1 [ffff880002207ea0] notifier_call_chain at ffffffff8152ea55
             #2 [ffff880002207ee0] atomic_notifier_call_chain at ffffffff8152eaba
             #3 [ffff880002207ef0] notify_die at ffffffff810a11de
             #4 [ffff880002207f20] do_nmi at ffffffff8152c71b
             #5 [ffff880002207f50] nmi at ffffffff8152bfe0
                [exception RIP: native_read_tsc]
                RIP: ffffffff810156b0  RSP: ffff880002203b00  RFLAGS: 00000046
                RAX: 00000000673d3da9  RBX: 0000000000000000  RCX: 00000000673d3da9
                RDX: 00000000000006cc  RSI: 0000000000291984  RDI: 0000000000000a86
                RBP: ffff880002203b28   R8: ffffffff81c068c0   R9: 0000000000000000
                R10: 696c2073656c7564  R11: 3a6e692064656b6e  R12: 00000000673d36dd
                R13: 0000000000000000  R14: 0000000000000a86  R15: ffffffff81364eb0
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
            --- <NMI exception stack> ---
             #6 [ffff880002203b00] native_read_tsc at ffffffff810156b0
             #7 [ffff880002203b00] delay_tsc at ffffffff8128e2ba
             #8 [ffff880002203b30] __const_udelay at ffffffff8128e266
             #9 [ffff880002203b40] wait_for_xmitr at ffffffff81364e32
            #10 [ffff880002203b70] serial8250_console_putchar at ffffffff81364ed6
            #11 [ffff880002203b90] uart_console_write at ffffffff8136086e
            #12 [ffff880002203bd0] serial8250_console_write at ffffffff8136523d
            #13 [ffff880002203c20] __call_console_drivers at ffffffff81071cd5
            #14 [ffff880002203c50] _call_console_drivers at ffffffff81071d3a
            #15 [ffff880002203c70] release_console_sem at ffffffff81072388
            #16 [ffff880002203cb0] vprintk at ffffffff81072a88
            #17 [ffff880002203d50] printk at ffffffff81528845
            #18 [ffff880002203db0] print_modules at ffffffff810b77b7
            #19 [ffff880002203de0] show_registers at ffffffff8100fd46
            #20 [ffff880002203e50] show_regs at ffffffff81009c16
            #21 [ffff880002203e70] watchdog_timer_fn at ffffffff810e6392
            #22 [ffff880002203ea0] __run_hrtimer at ffffffff8109f6be
            #23 [ffff880002203ef0] hrtimer_interrupt at ffffffff8109fa26
            #24 [ffff880002203f70] local_apic_timer_interrupt at ffffffff81031f1d
            #25 [ffff880002203f90] smp_apic_timer_interrupt at ffffffff815325e5
            #26 [ffff880002203fb0] apic_timer_interrupt at ffffffff8100bb93
            --- <IRQ stack> ---
            #27 [ffff8800698f1b28] apic_timer_interrupt at ffffffff8100bb93
                [exception RIP: _spin_lock+30]
                RIP: ffffffff8152b84e  RSP: ffff8800698f1bd0  RFLAGS: 00000206
                RAX: 0000000000000001  RBX: ffff8800698f1bd0  RCX: 5a5a5a5a5a5a5a5a
                RDX: 0000000000000000  RSI: ffff88003dd9bc00  RDI: ffff88007b4aa478
                RBP: ffffffff8100bb8e   R8: 5a5a5a5a5a5a5a5a   R9: 00000000fffffffe
                R10: 0000000000000000  R11: 00000000000000be  R12: 0000000000000006
                R13: ffff88006b5ed378  R14: ffff8800698f1b60  R15: ffff88007b9fea00
                ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
            #28 [ffff8800698f1bd8] procfs_read_helper at ffffffffa11bc089 [lustre]
            #29 [ffff8800698f1c38] mdc_intent_getattr_async_interpret at ffffffffa103fa5a [mdc]
            #30 [ffff8800698f1cb8] ptlrpc_check_set at ffffffffa0e28a63 [ptlrpc]
            #31 [ffff8800698f1d78] ptlrpcd_check at ffffffffa0e55153 [ptlrpc]
            #32 [ffff8800698f1dd8] ptlrpcd at ffffffffa0e5579b [ptlrpc]
            #33 [ffff8800698f1ee8] kthread at ffffffff8109abf6
            #34 [ffff8800698f1f48] kernel_thread at ffffffff8100c20a
            

            ll_sa_13692 is in ll_post_statahead() trying to take lli_sa_lock:

            crash> bt
            PID: 13693  TASK: ffff88007cfb9540  CPU: 1   COMMAND: "ll_sa_13692"
             #0 [ffff880002303cc0] machine_kexec at ffffffff81038f3b
             #1 [ffff880002303d20] crash_kexec at ffffffff810c5af2
             #2 [ffff880002303df0] panic at ffffffff81528743
             #3 [ffff880002303e70] watchdog_timer_fn at ffffffff810e63ca
             #4 [ffff880002303ea0] __run_hrtimer at ffffffff8109f6be
             #5 [ffff880002303ef0] hrtimer_interrupt at ffffffff8109fa26
             #6 [ffff880002303f70] local_apic_timer_interrupt at ffffffff81031f1d
             #7 [ffff880002303f90] smp_apic_timer_interrupt at ffffffff815325e5
             #8 [ffff880002303fb0] apic_timer_interrupt at ffffffff8100bb93
            --- <IRQ stack> ---
             #9 [ffff8800737c1c68] apic_timer_interrupt at ffffffff8100bb93
                [exception RIP: _spin_lock+30]
                RIP: ffffffff8152b84e  RSP: ffff8800737c1d10  RFLAGS: 00000206
                RAX: 0000000000000002  RBX: ffff8800737c1d10  RCX: 0000000000000008
                RDX: 0000000000000000  RSI: 0000000000000005  RDI: ffff88007b4aa478
                RBP: ffffffff8100bb8e   R8: 0000000000000000   R9: 00000000fffffffe
                R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
                R13: 0000000000000000  R14: 0000000000000003  R15: ffff88003dcc55a8
                ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
            #10 [ffff8800737c1d18] ll_post_statahead at ffffffffa11bceed [lustre]
            #11 [ffff8800737c1dc8] ta.73028 at ffffffffa11c158b [lustre]
            #12 [ffff8800737c1ee8] kthread at ffffffff8109abf6
            #13 [ffff8800737c1f48] kernel_thread at ffffffff8100c20a
            
            jhammond John Hammond added a comment -

            There are now some crash dumps for this on shadow:

            [jlhammon@shadow-1 ~]$ find /export/scratch/dumps -name 'vmcore-dmesg.txt' -exec grep softlockup {} \; -print
            <0>Kernel panic - not syncing: softlockup: hung tasks
            /export/scratch/dumps/shadow-22vm5.shadow.whamcloud.com/10.1.5.13-2014-10-11-16:59:41/vmcore-dmesg.txt
            <0>Kernel panic - not syncing: softlockup: hung tasks
            /export/scratch/dumps/shadow-11vm5.shadow.whamcloud.com/10.1.4.124-2014-10-11-12:41:27/vmcore-dmesg.txt
            
            <4>Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links
            == 16:57:53 (1413046673)
            <0>BUG: soft lockup - CPU#0 stuck for 68s! [ptlrpcd_0:8469]
            <4>Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 sha512_generic sha256_generic nfsd exportfs nfs lockd fscache auth_rpcgss nfs_\
            acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_a\
            cpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
            <4>CPU 0
            <4>Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 sha512_generic
            <0>BUG: soft lockup - CPU#1 stuck for 68s! [ll_sa_18280:18281]
            <4>Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 sha512_generic sha256_generic nfsd exportfs nfs lockd fscache auth_rpcgss nfs_\
            acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_a\
            cpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
            <4>CPU 1
            <4>Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 sha512_generic sha256_generic nfsd exportfs nfs lockd fscache auth_rpcgss nfs_\
            acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_a\
            cpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
            <4>
            <4>Pid: 18281, comm: ll_sa_18280 Not tainted 2.6.32-431.29.2.el6.x86_64 #1 Red Hat KVM
            <4>RIP: 0010:[<ffffffff8152b84e>]  [<ffffffff8152b84e>] _spin_lock+0x1e/0x30
            <4>RSP: 0018:ffff880071027d50  EFLAGS: 00000206
            <4>RAX: 0000000000000001 RBX: ffff880071027d50 RCX: 0000000000000000
            <4>RDX: 0000000000000000 RSI: ffff88006f314140 RDI: ffff8800699cb440
            <4>RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
            <4>R10: 0000000000000800 R11: 0000000000000400 R12: ffff880071027d30
            <4>R13: ffff880070618000 R14: 0000000000001000 R15: 0000000000000000
            <4>FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
            <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
            <4>CR2: 000000000091362c CR3: 000000006e5d9000 CR4: 00000000000006e0
            <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            <4>Process ll_sa_18280 (pid: 18281, threadinfo ffff880071026000, task ffff88006e824040)
            <4>Stack:
            <4> ffff880071027dc0 ffffffffa10840b0 ffff8800699cb178 ffff880060b7b630
            <4><d> 0000000000000000 ffff8800699cb080 ffff88007a9a1280 ffff88006e44e800
            <4><d> ffff880071027dc0 ffff88006b166800 ffff88006b166970 ffff8800699cb400
            <4>Call Trace:
            <4> [<ffffffffa10840b0>] ? ll_post_statahead+0x50/0xa20 [lustre]
            <4> [<ffffffffa10880d8>] ? ll_statahead_thread+0x228/0xfb0 [lustre]
            <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
            <4> [<ffffffffa1087eb0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
            <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
            <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
            <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
            <4>Code: 00 00 00 01 74 05 e8 92 3a d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> 1f 44 00 00 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
            <4>Call Trace:
            <4> [<ffffffffa10840b0>] ? ll_post_statahead+0x50/0xa20 [lustre]
            <4> [<ffffffffa10880d8>] ? ll_statahead_thread+0x228/0xfb0 [lustre]
            <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
            <4> [<ffffffffa1087eb0>] ? ll_statahead_thread+0x0/0xfb0 [lustre]
            <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
            <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
            <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
            <0>Kernel panic - not syncing: softlockup: hung tasks
            
            jhammond John Hammond added a comment -

            There are now some crash dumps for this on shadow:

            [jlhammon@shadow-1 ~]$ find /export/scratch/dumps -name 'vmcore-dmesg.txt' -exec grep softlockup {} \; -print
            <0>Kernel panic - not syncing: softlockup: hung tasks
            /export/scratch/dumps/shadow-22vm5.shadow.whamcloud.com/10.1.5.13-2014-10-11-16:59:41/vmcore-dmesg.txt
            <0>Kernel panic - not syncing: softlockup: hung tasks
            /export/scratch/dumps/shadow-11vm5.shadow.whamcloud.com/10.1.4.124-2014-10-11-12:41:27/vmcore-dmesg.txt

            <4>Lustre: DEBUG MARKER: == lustre-rsync-test test 6: lustre_rsync large no of hard links == 16:57:53 (1413046673)
            <0>BUG: soft lockup - CPU#0 stuck for 68s! [ptlrpcd_0:8469]
            <4>Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) ext2 sha512_generic sha256_generic nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
            <0>BUG: soft lockup - CPU#1 stuck for 68s! [ll_sa_18280:18281]
            <4>Pid: 18281, comm: ll_sa_18280 Not tainted 2.6.32-431.29.2.el6.x86_64 #1 Red Hat KVM
            <4>RIP: 0010:[<ffffffff8152b84e>]  [<ffffffff8152b84e>] _spin_lock+0x1e/0x30
            <0>Kernel panic - not syncing: softlockup: hung tasks
            yujian Jian Yu added a comment -

            One more instance on master branch: https://testing.hpdd.intel.com/test_sets/72654654-4deb-11e4-8fdd-5254006e85c2
            bogl Bob Glossman (Inactive) added a comment - - edited

            another instance on master:
            https://testing.hpdd.intel.com/test_sets/37d77be2-4553-11e4-92e5-5254006e85c2

            this instance was not an interop run: both client and server were running master

            yujian Jian Yu added a comment -

            While testing patch http://review.whamcloud.com/11006 on Lustre b2_5 branch, the same failure occurred: https://testing.hpdd.intel.com/test_sets/61990438-416a-11e4-a787-5254006e85c2

            People

              Assignee: laisiyao Lai Siyao
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4
