Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9568

CPU soft lockups

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Blocker
    • None
    • Lustre 2.10.0
    • None
    • RHEL 7.3
      Lustre: Build Version: 2.9.58_8_gf7df236
    • 3
    • 9223372036854775807

    Description

      We are seeing CPU soft lockups on master:

      [ 9912.332205] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [lwp_notify_test:5462] 
      [ 9912.335476] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) ldiskfs(OE) binfmt_misc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate libcfs(OE) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device ppdev snd_pcm sg pcspkr virtio_balloon snd_timer snd soundcore i2c_piix4 parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic pata_acpi virtio_scsi cirrus virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_net drm serio_raw libata floppy virtio_pci virtio_ring virtio i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet]
      [ 9912.371388] CPU: 1 PID: 5462 Comm: lwp_notify_test Tainted: P OE ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1
      [ 9912.376700] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 9912.379337] task: ffff880041fc0000 ti: ffff88004406c000 task.ti: ffff88004406c000
      [ 9912.382108] RIP: 0010:[<ffffffff81327529>] [<ffffffff81327529>] __write_lock_failed+0x9/0x20
      [ 9912.384974] RSP: 0018:ffff88004406fe40 EFLAGS: 00000297
      [ 9912.387509] RAX: ffff88006002cc00 RBX: ffff88004406fe18 RCX: 0000000000000000
      [ 9912.390164] RDX: 0000000000000016 RSI: ffff880046669bc7 RDI: ffff88002930b184
      [ 9912.392790] RBP: ffff88004406fe40 R08: 0000000000019b20 R09: ffffffffa089ebe1
      [ 9912.395369] R10: ffff88007fd19b20 R11: ffffea0001101a80 R12: ffff88004406fdd0
      [ 9912.397903] R13: ffff880040f8a780 R14: ffff88004406fe18 R15: 0000000000000028
      [ 9912.400429] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      [ 9912.402946] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 9912.405252] CR2: 00007f0172657000 CR3: 00000000019be000 CR4: 00000000000006e0
      [ 9912.407703] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
      [ 9912.410157] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
      [ 9912.412536] Stack: 
      [ 9912.414420] ffff88004406fe50 ffffffff8168e3a7 ffff88004406fe70 ffffffffa0a2a95a 
      [ 9912.417000] ffff880046669b80 ffff880045622c00 ffff88004406fe98 ffffffffa08b86da 
      [ 9912.419580] ffff880045622c00 ffff88006019e800 ffff88006019e8b0 ffff88004406fec0 
      [ 9912.422016] Call Trace: 
      [ 9912.423918] [<ffffffff8168e3a7>] _raw_write_lock+0x17/0x20 
      [ 9912.426043] [<ffffffffa0a2a95a>] qsd_conn_callback+0x5a/0x160 [lquota] 
      [ 9912.428407] [<ffffffffa08b86da>] lustre_notify_lwp_list+0xba/0x100 [obdclass]
      [ 9912.430743] [<ffffffffa10157e6>] lwp_notify_main+0x56/0xc0 [osp]
      [ 9912.432914] [<ffffffffa1015790>] ? lwp_import_event+0xb0/0xb0 [osp]
      [ 9912.435170] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 9912.437340] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 9912.439581] [<ffffffff81697318>] ret_from_fork+0x58/0x90
      [ 9912.441709] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 9912.443795] Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2 ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 
      
      
      

      Also:

      [ 9946.068084] INFO: rcu_sched self-detected stall on CPU { 1} (t=60001 jiffies g=958769 c=958768 q=177406)
      [ 9946.069092] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies, g=958769, c=958768, q=177406)
      [ 9946.069093] Task dump for CPU 1:
      [ 9946.069095] lwp_notify_test R running task 0 5462 2 0x00000088
      [ 9946.069097] ffff88004406fe70 ffffffffa0a2a95a ffff880046669b80 ffff880045622c00
      [ 9946.069098] ffff88004406fe98 ffffffffa08b86da ffff880045622c00 ffff88006019e800
      [ 9946.069099] ffff88006019e8b0 ffff88004406fec0 ffffffffa10157e6 ffff880049fd7a98
      [ 9946.069100] Call Trace:
      [ 9946.069136] [<ffffffffa0a2a95a>] ? qsd_conn_callback+0x5a/0x160 [lquota]
      [ 9946.069211] [<ffffffffa08b86da>] lustre_notify_lwp_list+0xba/0x100 [obdclass]
      [ 9946.069243] [<ffffffffa10157e6>] lwp_notify_main+0x56/0xc0 [osp]
      [ 9946.069247] [<ffffffffa1015790>] ? lwp_import_event+0xb0/0xb0 [osp]
      [ 9946.069251] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 9946.069252] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 9946.069254] [<ffffffff81697318>] ret_from_fork+0x58/0x90
      [ 9946.069256] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 9946.100247] Task dump for CPU 1:
      [ 9946.101766] lwp_notify_test R running task 0 5462 2 0x00000088
      [ 9946.103672] ffff880041fc0000 00000000f4aa25e6 ffff88007fd03db0 ffffffff810c47e8
      [ 9946.105519] 0000000000000001 ffffffff81a1e5c0 ffff88007fd03dc8 ffffffff810c8079
      [ 9946.107357] 0000000000000002 ffff88007fd03df8 ffffffff811378e0 ffff88007fd101c0
      [ 9946.109190] Call Trace:
      [ 9946.110570] <IRQ> [<ffffffff810c47e8>] sched_show_task+0xa8/0x110
      [ 9946.112300] [<ffffffff810c8079>] dump_cpu_task+0x39/0x70
      [ 9946.113884] [<ffffffff811378e0>] rcu_dump_cpu_stacks+0x90/0xd0
      [ 9946.115511] [<ffffffff8113b032>] rcu_check_callbacks+0x442/0x720
      [ 9946.117136] [<ffffffff810f35c0>] ? tick_sched_handle.isra.13+0x60/0x60
      [ 9946.118828] [<ffffffff81099697>] update_process_times+0x47/0x80
      [ 9946.120453] [<ffffffff810f3585>] tick_sched_handle.isra.13+0x25/0x60
      [ 9946.122129] [<ffffffff810f3601>] tick_sched_timer+0x41/0x70
      [ 9946.123694] [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260
      [ 9946.125297] [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0
      [ 9946.126877] [<ffffffff81698e5c>] ? call_softirq+0x1c/0x30
      [ 9946.128461] [<ffffffff81050fd7>] local_apic_timer_interrupt+0x37/0x60
      [ 9946.130105] [<ffffffff81699acf>] smp_apic_timer_interrupt+0x3f/0x60
      [ 9946.131845] [<ffffffff8169801d>] apic_timer_interrupt+0x6d/0x80
      [ 9946.133574] <EOI> [<ffffffffa089ebe1>] ? keys_fini+0xb1/0x1d0 [obdclass]
      [ 9946.135525] [<ffffffff81327529>] ? __write_lock_failed+0x9/0x20
      [ 9946.137226] [<ffffffff8168e3a7>] _raw_write_lock+0x17/0x20
      [ 9946.138858] [<ffffffffa0a2a95a>] qsd_conn_callback+0x5a/0x160 [lquota]
      [ 9946.140578] [<ffffffffa08b86da>] lustre_notify_lwp_list+0xba/0x100 [obdclass]
      [ 9946.142362] [<ffffffffa10157e6>] lwp_notify_main+0x56/0xc0 [osp]
      [ 9946.143999] [<ffffffffa1015790>] ? lwp_import_event+0xb0/0xb0 [osp]
      [ 9946.145672] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 9946.147162] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 9946.148768] [<ffffffff81697318>] ret_from_fork+0x58/0x90
      [ 9946.150263] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      
      
      

       

       

       

      Attachments

        Activity

          People

            wc-triage WC Triage
            brian Brian Murrell (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: