[LU-12416] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount.lustre:11956] Created: 10/Jun/19  Updated: 25/Nov/19  Resolved: 08/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Vladimir Saveliev Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Attachments: PNG File soft_lockup_during_mount.png    
Issue Links:
Duplicate
duplicates LU-10931 failed peer discovery still taking to... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After the multi-rail branch merge, mount fails with:

[   80.154928] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mount.lustre:11956]
[   80.157527] Kernel panic - not syncing: softlockup: hung tasks
[   80.158948] CPU: 0 PID: 11956 Comm: mount.lustre Tainted: G           OEL ------------   3.10.0-862.14.4.el7.x86_64 #22
[   80.161585] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[   80.163751] Call Trace:
[   80.164430]  <IRQ>  [<ffffffff92c22b0e>] dump_stack+0x19/0x1b
[   80.165605]  [<ffffffff92c1d7af>] panic+0xe8/0x21f
[   80.166601]  [<ffffffff9262c838>] ? show_regs+0x58/0x210
[   80.167670]  [<ffffffff9273a761>] watchdog_timer_fn+0x231/0x240
[   80.169127]  [<ffffffff9273a530>] ? watchdog+0x40/0x40
[   80.170551]  [<ffffffff926bdd93>] __hrtimer_run_queues+0xf3/0x270
[   80.172526]  [<ffffffff926be31f>] hrtimer_interrupt+0xaf/0x1d0
[   80.173902]  [<ffffffff926573ab>] local_apic_timer_interrupt+0x3b/0x60
[   80.175189]  [<ffffffff92c38a13>] smp_apic_timer_interrupt+0x43/0x60
[   80.176421]  [<ffffffff92c352b2>] apic_timer_interrupt+0x162/0x170
[   80.178102]  <EOI>  [<ffffffffc074db71>] ? lnet_peer_ni_alloc+0x61/0x390 [lnet]
[   80.180384]  [<ffffffff92703494>] ? __raw_callee_save___pv_queued_spin_unlock+0x10/0x17
[   80.182187]  [<ffffffffc06c47b8>] cfs_percpt_unlock+0x38/0xb0 [libcfs]
[   80.183421]  [<ffffffffc0756757>] lnet_discover_peer_locked+0x77/0x3d0 [lnet]
[   80.184929]  [<ffffffff926bab40>] ? wake_up_atomic_t+0x30/0x30
[   80.186831]  [<ffffffffc0756b20>] LNetPrimaryNID+0x70/0x1a0 [lnet]
[   80.188202]  [<ffffffffc0b295ee>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[   80.190003]  [<ffffffffc0b1d94c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[   80.192225]  [<ffffffffc0aefcd2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
[   80.193890]  [<ffffffffc0af1d49>] client_obd_setup+0xd19/0x1430 [ptlrpc]
[   80.195201]  [<ffffffffc06b994f>] ? cfs_hash_buckets_realloc+0x1bf/0x690 [libcfs]
[   80.196932]  [<ffffffffc0e85aae>] mgc_setup+0x3e/0x650 [mgc]
[   80.198397]  [<ffffffffc084259c>] obd_setup+0x15c/0x280 [obdclass]
[   80.199982]  [<ffffffffc06ba18c>] ? cfs_hash_create+0x36c/0xa20 [libcfs]
[   80.201576]  [<ffffffffc0843888>] class_setup+0x2a8/0x840 [obdclass]
[   80.203390]  [<ffffffffc0846b2e>] class_process_config+0x191e/0x2840 [obdclass]
[   80.205245]  [<ffffffffc0838e92>] ? class_add_uuid+0x282/0x4c0 [obdclass]
[   80.206614]  [<ffffffffc084ae78>] do_lcfg+0x258/0x500 [obdclass]
[   80.207822]  [<ffffffffc084f6a8>] lustre_start_simple+0x88/0x210 [obdclass]
[   80.209462]  [<ffffffffc08504a5>] lustre_start_mgc+0xc75/0x2420 [obdclass]
[   80.210917]  [<ffffffffc084f6a8>] ? lustre_start_simple+0x88/0x210 [obdclass]
[   80.213187]  [<ffffffffc087d2eb>] server_fill_super+0xbfb/0x1890 [obdclass]
[   80.214570]  [<ffffffffc08526b8>] lustre_fill_super+0x328/0x950 [obdclass]
[   80.216089]  [<ffffffffc0852390>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[   80.218447]  [<ffffffff928100bf>] mount_nodev+0x4f/0xb0
[   80.219779]  [<ffffffffc084a888>] lustre_mount+0x38/0x60 [obdclass]
[   80.221244]  [<ffffffff92810c3e>] mount_fs+0x3e/0x1b0
[   80.222538]  [<ffffffff9282e177>] vfs_kern_mount+0x67/0x110
[   80.224375]  [<ffffffff9283079f>] do_mount+0x1ef/0xce0
[   80.226543]  [<ffffffff927e836c>] ? kmem_cache_alloc_trace+0x3c/0x200
[   80.228959]  [<ffffffff928315d3>] SyS_mount+0x83/0xd0
[   80.230329]  [<ffffffff92c3429b>] system_call_fastpath+0x22/0x27

Important: this occurs when there is only one CPU.
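
To make the call path in the trace above easier to read, here is a minimal Python sketch that keeps only the frames attributed to a kernel module and drops the ones marked with "?", which the kernel prints for addresses found on the stack that may not belong to the actual call chain. The names FRAME_RE, module_frames and SAMPLE are hypothetical helpers introduced for illustration, not part of Lustre or any kernel tool, and SAMPLE is a six-line excerpt copied from the trace.

```python
import re

# Matches frames such as:
#   [   80.183421]  [<ffffffffc0756757>] lnet_discover_peer_locked+0x77/0x3d0 [lnet]
# "unreliable" captures the "?" marker; "module" is absent for core-kernel symbols.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def module_frames(trace, skip_unreliable=True):
    """Return (module, symbol) pairs for module frames, innermost first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack frame line
        if skip_unreliable and match.group("unreliable"):
            continue  # "?" frame: possibly stale stack contents
        if match.group("module") is None:
            continue  # core-kernel symbol, no [module] suffix
        frames.append((match.group("module"), match.group("symbol")))
    return frames


# Excerpt copied from the trace in this report.
SAMPLE = """\
[   80.178102]  <EOI>  [<ffffffffc074db71>] ? lnet_peer_ni_alloc+0x61/0x390 [lnet]
[   80.180384]  [<ffffffff92703494>] ? __raw_callee_save___pv_queued_spin_unlock+0x10/0x17
[   80.182187]  [<ffffffffc06c47b8>] cfs_percpt_unlock+0x38/0xb0 [libcfs]
[   80.183421]  [<ffffffffc0756757>] lnet_discover_peer_locked+0x77/0x3d0 [lnet]
[   80.186831]  [<ffffffffc0756b20>] LNetPrimaryNID+0x70/0x1a0 [lnet]
[   80.188202]  [<ffffffffc0b295ee>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
"""

if __name__ == "__main__":
    for module, symbol in module_frames(SAMPLE):
        print(f"{module:8s} {symbol}")
```

On this excerpt the sketch leaves cfs_percpt_unlock, lnet_discover_peer_locked, LNetPrimaryNID and ptlrpc_connection_get, which is the part of the mount path that was interrupted by the watchdog.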



 Comments   
Comment by Arshad Hussain [ 11/Jun/19 ]

I am also hitting this issue on a single-CPU VM after the latest pull. It fails starting from commit f9ad0d1; every commit before that, up to and including deb31c2, works fine.

 

1aae733 LU-11297 lnet: MR Routing Feature - Failed (HEAD)
...
00a2932 LU-11297 lnet: handle router health off - Failed
f9ad0d1 LU-11641 lnet: handle discovery - Failed
deb31c2 LU-11470 lnet: drop all rule - Pass
...
4344562 LU-11300 lnet: consider alive_router_check_interval - Pass

Kernel version & Distribution

# uname -r
3.10.0-862.9.1.el7_lustre.x86_64
#
# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
#
Comment by Oleg Drokin [ 22/Jun/19 ]

There's a patch being tracked there that should fix this.

Comment by Peter Jones [ 08/Jul/19 ]

tracked under LU-10931
