[LU-14282] WARNING: CPU: 22 PID: 5953 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0 Created: 29/Dec/20  Updated: 08/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

lustre-2.12.5_10.llnl
RHEL 7.9 derivative
3.10.0-1160.4.1.1chaos.ch6.x86_64


Attachments: Text File dk.opal187.1609800485.txt     Text File dmesg.opal187.1609800485.txt    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Lustre router produces the following stack dump when LNet is started:

LNet: Added LNI 192.168.129.134@o2ib27 [8/1024/0/180]
LNet: Using FastReg for registration
------------[ cut here ]------------
WARNING: CPU: 22 PID: 5953 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0
Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) opa_vnic rpcrdma ib_iser mlx5_ib iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi hfi1 kvm sg rdmavt irqbypass joydev pcspkr mlx5_core lpc_ich mlxfw devlink i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq sch_fq_codel ib_ipoib rdma_ucm ib_uverbs binfmt_misc msr_safe(OE) ib_umad iw_cxgb4 rdma_cm iw_cm ib_cm iw_cxgb3 ib_core ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif[  234.953405] LNet: Added LNI 172.19.3.75@o2ib600 [8/1024/0/180]
crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi 8021q libcxgb garp mrp stp qla4xxx llc iscsi_boot_sysfs mgag200 drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect crct10dif_common sysimgblt crc32_pclmul fb_sys_fops crc32c_intel ttm ghash_clmulni_intel igb mxm_wmi drm ahci aesni_intel lrw libahci gf128mul dca glue_helper ablk_helper ptp libata cryptd drm_panel_orientation_quirks pps_core i2c_algo_bit dm_multipath wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse
CPU: 22 PID: 5953 Comm: kworker/22:2H Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1
Hardware name: Penguin Computing Relion 2900e/S2600WT2R, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018
Workqueue: rdmavt_cq send_complete [rdmavt]
Call Trace:
 [<ffffffff865ae498>] dump_stack+0x19/0x1b
 [<ffffffff85e9e0e8>] __warn+0xd8/0x100
 [<ffffffff85e9e22d>] warn_slowpath_null+0x1d/0x20
 [<ffffffff85ea7652>] __local_bh_enable_ip+0x82/0xb0
 [<ffffffff865b737e>] _raw_spin_unlock_bh+0x1e/0x20
 [<ffffffffc09673c5>] cfs_trace_unlock_tcd+0x65/0xb0 [libcfs]
 [<ffffffffc096de18>] libcfs_debug_vmsg2+0x728/0xbb0 [libcfs]
 [<ffffffff85eed858>] ? enqueue_task_fair+0x208/0x6c0
 [<ffffffff85edd4df>] ? ttwu_do_activate+0x6f/0x80
 [<ffffffffc096e2f7>] libcfs_debug_msg+0x57/0x80 [libcfs]
 [<ffffffffc086f32a>] kiblnd_cq_completion+0x11a/0x160 [ko2iblnd]
 [<ffffffffc0d745ae>] send_complete+0x3e/0x60 [rdmavt]
 [<ffffffff85ec298f>] process_one_work+0x18f/0x4a0
 [<ffffffff85ec3776>] worker_thread+0x126/0x3e0
 [<ffffffff85ec3650>] ? rescuer_thread+0x430/0x430
 [<ffffffff85ecafc1>] kthread+0xd1/0xe0
 [<ffffffff85ecaef0>] ? insert_kthread_work+0x40/0x40
 [<ffffffff865c1ff7>] ret_from_fork_nospec_begin+0x21/0x21
 [<ffffffff85ecaef0>] ? insert_kthread_work+0x40/0x40
---[ end trace 07f572a412f12f32 ]---
OK Started lnet management.

And the warning indicates that _raw_spin_unlock_bh() was called when either IRQs are disabled, or we are in an IRQ handler.

(gdb) l *(__local_bh_enable_ip+0x82/0xb0)
0xffffffff810a75d0 is in __local_bh_enable_ip (kernel/softirq.c:150).
145	}
146	
147	EXPORT_SYMBOL(_local_bh_enable);
148	
149	void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
150	{
151		WARN_ON_ONCE(in_irq() || irqs_disabled());
152	#ifdef CONFIG_TRACE_IRQFLAGS
153		local_irq_disable();
154	#endif 


 Comments   
Comment by Olaf Faaland [ 29/Dec/20 ]

For my tracking, my local ticket is TOSS4981

Comment by Peter Jones [ 30/Dec/20 ]

Serguei

Can you please advise?

Thanks

Peter

Comment by Serguei Smirnov [ 30/Dec/20 ]

Hi Olaf,

I do believe I can see the problem in the code, I'm just curious when you started seeing the warning. The related code is fairly old as far as I can tell. Did something change in the way the router is used?

Thanks,

Serguei.

 

 

Comment by Olaf Faaland [ 31/Dec/20 ]

Hi Serguei,
Most of our clusters, including that one, switched from Lustre 2.10.x to 2.12.4 or .5 this year between roughly June and August. And I can see that at least some of our clusters were logging that error in September and I just failed to notice. I'll get more detail.
thanks!

Comment by Olaf Faaland [ 31/Dec/20 ]

Hi Serguei,
Actually, it looks like we started seeing it in October 2020, after these clusters had already switched from Lustre 2.10 to 2.12. It appeared on multiple clusters around the same time. I don't know what changed.

Comment by Serguei Smirnov [ 31/Dec/20 ]

Olaf,

Do you have net-level debug logging enabled on these machines? (lctl set_param debug=+net)

 

Comment by Olaf Faaland [ 04/Jan/21 ]

Hi Serguei,
We do have that enabled (at least on some). I'm looking for an instance recent enough that startup is still in the debug buffer and will upload as soon as I find it.
Can you identify the suspect function(s)?
Thanks

Comment by Olaf Faaland [ 04/Jan/21 ]

Hi Serguei,
I've uploaded console and debug logs (*.1609800485.txt). This particular node was running lustre-2.12.6_2.llnl-1.ch6.x86_64 on kernel 3.10.0-1160.11.1.1chaos.ch6.x86_64. It has an OPA interface (internal fabric of a compute cluster) and a 10GigE interface.
thanks

Comment by Serguei Smirnov [ 04/Jan/21 ]

Hi Olaf,

The suspect functions are kiblnd_conn_addref and kiblnd_conn_decref. They call CDEBUG which resolves to libcfs_debug_vmsg2 which calls cfs_trace_lock_tcd/cfs_trace_unlock_tcd which may call spin_lock_bh/spin_unlock_bh respectively.

kiblnd_conn_addref and kiblnd_conn_decref are sometimes called with interrupts disabled, like in kiblnd_cq_completion from the stack dump in this ticket.

If the suspicion is correct, this warning shouldn't occur when running without lnet debug logging enabled.

Thanks,

Serguei.

Comment by Olaf Faaland [ 05/Jan/21 ]

Hi Serguei,

I see.  That fits with the timing I saw.  I changed our default debug mask to add +net on Oct 13th.  I guess I'll be changing that back temporarily .  And it might explain another bug I'm seeing.  Thanks!

Comment by Gerrit Updater [ 19/Jan/21 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41270
Subject: LU-14282 libcfs: refine use of locks when in softirq
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 16dead15d8e2eaf696c8f688387c2d1db993aa8f

Comment by Serguei Smirnov [ 18/Jan/24 ]

Olaf,

Andreas brought to my attention that the patch for this never got merged to master. Are you still seeing this warning sometimes? 

Comment by Olaf Faaland [ 18/Jan/24 ]

Hi Serguei,
Yes, we still see this intermittently.

Comment by Serguei Smirnov [ 22/Jan/24 ]

Olaf,

Are you seeing it with "+net" in debug mask?

Comment by Olaf Faaland [ 08/Feb/24 ]

Hi Serguei, sorry for the slow response. Actually, I do not think we are seeing this now. My prior search was wrong and I found different stacks that are completely unrelated.

Generated at Sat Feb 10 03:08:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.