[LU-9568] CPU soft lockups Created: 28/May/17  Updated: 05/Aug/20  Resolved: 05/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Brian Murrell (Inactive) Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL 7.3
Lustre: Build Version: 2.9.58_8_gf7df236


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are seeing CPU soft lockups on master. The lwp_notify_test kthread is stuck spinning in __write_lock_failed(), called from qsd_conn_callback() via lustre_notify_lwp_list():

[ 9912.332205] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [lwp_notify_test:5462] 
[ 9912.335476] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) ldiskfs(OE) binfmt_misc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate libcfs(OE) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device ppdev snd_pcm sg pcspkr virtio_balloon snd_timer snd soundcore i2c_piix4 parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic pata_acpi virtio_scsi cirrus virtio_blk drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix virtio_net drm serio_raw libata floppy virtio_pci virtio_ring virtio i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet]
[ 9912.371388] CPU: 1 PID: 5462 Comm: lwp_notify_test Tainted: P OE ------------ 3.10.0-514.16.1.el7_lustre.x86_64 #1
[ 9912.376700] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 9912.379337] task: ffff880041fc0000 ti: ffff88004406c000 task.ti: ffff88004406c000
[ 9912.382108] RIP: 0010:[<ffffffff81327529>] [<ffffffff81327529>] __write_lock_failed+0x9/0x20
[ 9912.384974] RSP: 0018:ffff88004406fe40 EFLAGS: 00000297
[ 9912.387509] RAX: ffff88006002cc00 RBX: ffff88004406fe18 RCX: 0000000000000000
[ 9912.390164] RDX: 0000000000000016 RSI: ffff880046669bc7 RDI: ffff88002930b184
[ 9912.392790] RBP: ffff88004406fe40 R08: 0000000000019b20 R09: ffffffffa089ebe1
[ 9912.395369] R10: ffff88007fd19b20 R11: ffffea0001101a80 R12: ffff88004406fdd0
[ 9912.397903] R13: ffff880040f8a780 R14: ffff88004406fe18 R15: 0000000000000028
[ 9912.400429] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 9912.402946] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 9912.405252] CR2: 00007f0172657000 CR3: 00000000019be000 CR4: 00000000000006e0
[ 9912.407703] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[ 9912.410157] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
[ 9912.412536] Stack: 
[ 9912.414420] ffff88004406fe50 ffffffff8168e3a7 ffff88004406fe70 ffffffffa0a2a95a 
[ 9912.417000] ffff880046669b80 ffff880045622c00 ffff88004406fe98 ffffffffa08b86da 
[ 9912.419580] ffff880045622c00 ffff88006019e800 ffff88006019e8b0 ffff88004406fec0 
[ 9912.422016] Call Trace: 
[ 9912.423918] [<ffffffff8168e3a7>] _raw_write_lock+0x17/0x20 
[ 9912.426043] [<ffffffffa0a2a95a>] qsd_conn_callback+0x5a/0x160 [lquota] 
[ 9912.428407] [<ffffffffa08b86da>] lustre_notify_lwp_list+0xba/0x100 [obdclass]
[ 9912.430743] [<ffffffffa10157e6>] lwp_notify_main+0x56/0xc0 [osp]
[ 9912.432914] [<ffffffffa1015790>] ? lwp_import_event+0xb0/0xb0 [osp]
[ 9912.435170] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 9912.437340] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 9912.439581] [<ffffffff81697318>] ret_from_fork+0x58/0x90
[ 9912.441709] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 9912.443795] Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2 ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00 00 00 


Also:

[ 9946.068084] INFO: rcu_sched self-detected stall on CPU { 1} (t=60001 jiffies g=958769 c=958768 q=177406)
[ 9946.069092] INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 0, t=60002 jiffies, g=958769, c=958768, q=177406)
[ 9946.069093] Task dump for CPU 1:
[ 9946.069095] lwp_notify_test R running task 0 5462 2 0x00000088
[ 9946.069097] ffff88004406fe70 ffffffffa0a2a95a ffff880046669b80 ffff880045622c00
[ 9946.069098] ffff88004406fe98 ffffffffa08b86da ffff880045622c00 ffff88006019e800
[ 9946.069099] ffff88006019e8b0 ffff88004406fec0 ffffffffa10157e6 ffff880049fd7a98
[ 9946.069100] Call Trace:
[ 9946.069136] [<ffffffffa0a2a95a>] ? qsd_conn_callback+0x5a/0x160 [lquota]
[ 9946.069211] [<ffffffffa08b86da>] lustre_notify_lwp_list+0xba/0x100 [obdclass]
[ 9946.069243] [<ffffffffa10157e6>] lwp_notify_main+0x56/0xc0 [osp]
[ 9946.069247] [<ffffffffa1015790>] ? lwp_import_event+0xb0/0xb0 [osp]
[ 9946.069251] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 9946.069252] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 9946.069254] [<ffffffff81697318>] ret_from_fork+0x58/0x90
[ 9946.069256] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 9946.100247] Task dump for CPU 1:
[ 9946.101766] lwp_notify_test R running task 0 5462 2 0x00000088
[ 9946.103672] ffff880041fc0000 00000000f4aa25e6 ffff88007fd03db0 ffffffff810c47e8
[ 9946.105519] 0000000000000001 ffffffff81a1e5c0 ffff88007fd03dc8 ffffffff810c8079
[ 9946.107357] 0000000000000002 ffff88007fd03df8 ffffffff811378e0 ffff88007fd101c0
[ 9946.109190] Call Trace:
[ 9946.110570] <IRQ> [<ffffffff810c47e8>] sched_show_task+0xa8/0x110
[ 9946.112300] [<ffffffff810c8079>] dump_cpu_task+0x39/0x70
[ 9946.113884] [<ffffffff811378e0>] rcu_dump_cpu_stacks+0x90/0xd0
[ 9946.115511] [<ffffffff8113b032>] rcu_check_callbacks+0x442/0x720
[ 9946.117136] [<ffffffff810f35c0>] ? tick_sched_handle.isra.13+0x60/0x60
[ 9946.118828] [<ffffffff81099697>] update_process_times+0x47/0x80
[ 9946.120453] [<ffffffff810f3585>] tick_sched_handle.isra.13+0x25/0x60
[ 9946.122129] [<ffffffff810f3601>] tick_sched_timer+0x41/0x70
[ 9946.123694] [<ffffffff810b4d72>] __hrtimer_run_queues+0xd2/0x260
[ 9946.125297] [<ffffffff810b5310>] hrtimer_interrupt+0xb0/0x1e0
[ 9946.126877] [<ffffffff81698e5c>] ? call_softirq+0x1c/0x30
[ 9946.128461] [<ffffffff81050fd7>] local_apic_timer_interrupt+0x37/0x60
[ 9946.130105] [<ffffffff81699acf>] smp_apic_timer_interrupt+0x3f/0x60
[ 9946.131845] [<ffffffff8169801d>] apic_timer_interrupt+0x6d/0x80
[ 9946.133574] <EOI> [<ffffffffa089ebe1>] ? keys_fini+0xb1/0x1d0 [obdclass]
[ 9946.135525] [<ffffffff81327529>] ? __write_lock_failed+0x9/0x20
[ 9946.137226] [<ffffffff8168e3a7>] _raw_write_lock+0x17/0x20
[ 9946.138858] [<ffffffffa0a2a95a>] qsd_conn_callback+0x5a/0x160 [lquota]
[ 9946.140578] [<ffffffffa08b86da>] lustre_notify_lwp_list+0xba/0x100 [obdclass]
[ 9946.142362] [<ffffffffa10157e6>] lwp_notify_main+0x56/0xc0 [osp]
[ 9946.143999] [<ffffffffa1015790>] ? lwp_import_event+0xb0/0xb0 [osp]
[ 9946.145672] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 9946.147162] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 9946.148768] [<ffffffff81697318>] ret_from_fork+0x58/0x90
[ 9946.150263] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
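
Both traces show the same thing: the lwp_notify_test kthread is spinning in _raw_write_lock() / __write_lock_failed() from qsd_conn_callback(), so the write lock it needs is apparently held (or never released) elsewhere, and on this non-preemptible kernel the spinning CPU eventually trips both the NMI soft-lockup watchdog and the RCU stall detector. For illustration only (this is not the Lustre code; the thread roles and names below are hypothetical), a minimal userspace sketch of the same failure mode, a writer busy-waiting on a reader/writer lock that a stalled reader never drops, looks like this:

/*
 * Minimal userspace sketch (NOT the Lustre code) of the pattern in the
 * traces above: a writer spins waiting for a reader/writer lock that a
 * stalled reader never releases, keeping one CPU 100% busy. In the
 * kernel, _raw_write_lock() spins the same way in __write_lock_failed(),
 * and the NMI watchdog reports the spinning CPU as a soft lockup.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

/* Hypothetical "stuck reader": takes the lock for read and never returns. */
static void *stuck_reader(void *arg)
{
        pthread_rwlock_rdlock(&lock);
        for (;;)
                pause();        /* never unlocks - simulates the stalled holder */
        return NULL;
}

/* Writer analogous to the qsd_conn_callback() path: spins until it gets the lock. */
static void *notify_writer(void *arg)
{
        /* trywrlock in a busy loop mimics a spinning kernel write_lock() */
        while (pthread_rwlock_trywrlock(&lock) != 0)
                ;               /* burns CPU, like __write_lock_failed() */
        puts("writer got the lock (never reached here)");
        pthread_rwlock_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t r, w;

        pthread_create(&r, NULL, stuck_reader, NULL);
        sleep(1);               /* let the reader grab the lock first */
        pthread_create(&w, NULL, notify_writer, NULL);

        pthread_join(w, NULL);  /* blocks forever; one CPU stays pegged */
        return 0;
}

Compiled with -pthread, this pegs one CPU in the writer's spin loop indefinitely, which is the userspace analogue of the 22-second-plus stall the watchdog reports above.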


 Comments   
Comment by Brian Murrell (Inactive) [ 03/Oct/17 ]

"Brian Murrell, are we still hitting this often?"

I can't say with any certainty, since I'm not sure how frequently others are looking for this kind of failure. It does not present itself very obviously in our testing, because we are not specifically testing for these failures.

"Could you verify on 2.10.1 GA?"

We don't have an on-demand reproducer for it, so we can't really target any testing at it.

Comment by Andreas Dilger [ 05/Aug/20 ]

Closing old ticket that has not been seen in a long time.
