[LU-14282] WARNING: CPU: 22 PID: 5953 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0 Created: 29/Dec/20 Updated: 08/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.12.5_10.llnl |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Lustre router produces the following stack dump when LNet is started: LNet: Added LNI 192.168.129.134@o2ib27 [8/1024/0/180] LNet: Using FastReg for registration ------------[ cut here ]------------ WARNING: CPU: 22 PID: 5953 at kernel/softirq.c:151 __local_bh_enable_ip+0x82/0xb0 Modules linked in: ko2iblnd(OE) lnet(OE) libcfs(OE) opa_vnic rpcrdma ib_iser mlx5_ib iTCO_wdt iTCO_vendor_support sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi hfi1 kvm sg rdmavt irqbypass joydev pcspkr mlx5_core lpc_ich mlxfw devlink i2c_i801 ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq sch_fq_codel ib_ipoib rdma_ucm ib_uverbs binfmt_misc msr_safe(OE) ib_umad iw_cxgb4 rdma_cm iw_cm ib_cm iw_cxgb3 ib_core ip_tables nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache overlay(T) ext4 mbcache jbd2 dm_service_time sd_mod crc_t10dif[ 234.953405] LNet: Added LNI 172.19.3.75@o2ib600 [8/1024/0/180] crct10dif_generic be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi 8021q libcxgb garp mrp stp qla4xxx llc iscsi_boot_sysfs mgag200 drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect crct10dif_common sysimgblt crc32_pclmul fb_sys_fops crc32c_intel ttm ghash_clmulni_intel igb mxm_wmi drm ahci aesni_intel lrw libahci gf128mul dca glue_helper ablk_helper ptp libata cryptd drm_panel_orientation_quirks pps_core i2c_algo_bit dm_multipath wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse CPU: 22 PID: 5953 Comm: kworker/22:2H Kdump: loaded Tainted: G OE ------------ T 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1 Hardware name: Penguin Computing Relion 2900e/S2600WT2R, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018 Workqueue: rdmavt_cq send_complete [rdmavt] Call Trace: [<ffffffff865ae498>] dump_stack+0x19/0x1b [<ffffffff85e9e0e8>] __warn+0xd8/0x100 [<ffffffff85e9e22d>] warn_slowpath_null+0x1d/0x20 [<ffffffff85ea7652>] __local_bh_enable_ip+0x82/0xb0 [<ffffffff865b737e>] _raw_spin_unlock_bh+0x1e/0x20 [<ffffffffc09673c5>] cfs_trace_unlock_tcd+0x65/0xb0 [libcfs] [<ffffffffc096de18>] libcfs_debug_vmsg2+0x728/0xbb0 [libcfs] [<ffffffff85eed858>] ? enqueue_task_fair+0x208/0x6c0 [<ffffffff85edd4df>] ? ttwu_do_activate+0x6f/0x80 [<ffffffffc096e2f7>] libcfs_debug_msg+0x57/0x80 [libcfs] [<ffffffffc086f32a>] kiblnd_cq_completion+0x11a/0x160 [ko2iblnd] [<ffffffffc0d745ae>] send_complete+0x3e/0x60 [rdmavt] [<ffffffff85ec298f>] process_one_work+0x18f/0x4a0 [<ffffffff85ec3776>] worker_thread+0x126/0x3e0 [<ffffffff85ec3650>] ? rescuer_thread+0x430/0x430 [<ffffffff85ecafc1>] kthread+0xd1/0xe0 [<ffffffff85ecaef0>] ? insert_kthread_work+0x40/0x40 [<ffffffff865c1ff7>] ret_from_fork_nospec_begin+0x21/0x21 [<ffffffff85ecaef0>] ? insert_kthread_work+0x40/0x40 ---[ end trace 07f572a412f12f32 ]--- OK Started lnet management. And the warning indicates that _raw_spin_unlock_bh() was called when either IRQs are disabled, or we are in an IRQ handler. (gdb) l *(__local_bh_enable_ip+0x82/0xb0)
0xffffffff810a75d0 is in __local_bh_enable_ip (kernel/softirq.c:150).
145 }
146
147 EXPORT_SYMBOL(_local_bh_enable);
148
149 void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
150 {
151 WARN_ON_ONCE(in_irq() || irqs_disabled());
152 #ifdef CONFIG_TRACE_IRQFLAGS
153 local_irq_disable();
154 #endif
|
| Comments |
| Comment by Olaf Faaland [ 29/Dec/20 ] |
|
For my tracking, my local ticket is TOSS4981 |
| Comment by Peter Jones [ 30/Dec/20 ] |
|
Serguei Can you please advise? Thanks Peter |
| Comment by Serguei Smirnov [ 30/Dec/20 ] |
|
Hi Olaf, I do believe I can see the problem in the code, I'm just curious when you started seeing the warning. The related code is fairly old as far as I can tell. Did something change in the way the router is used? Thanks, Serguei.
|
| Comment by Olaf Faaland [ 31/Dec/20 ] |
|
Hi Serguei, |
| Comment by Olaf Faaland [ 31/Dec/20 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 31/Dec/20 ] |
|
Olaf, Do you have net-level debug logging enabled on these machines? (lctl set_param debug=+net)
|
| Comment by Olaf Faaland [ 04/Jan/21 ] |
|
Hi Serguei, |
| Comment by Olaf Faaland [ 04/Jan/21 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 04/Jan/21 ] |
|
Hi Olaf, The suspect functions are kiblnd_conn_addref and kiblnd_conn_decref. They call CDEBUG which resolves to libcfs_debug_vmsg2 which calls cfs_trace_lock_tcd/cfs_trace_unlock_tcd which may call spin_lock_bh/spin_unlock_bh respectively. kiblnd_conn_addref and kiblnd_conn_decref are sometimes called with interrupts disabled, like in kiblnd_cq_completion from the stack dump in this ticket. If the suspicion is correct, this warning shouldn't occur when running without lnet debug logging enabled. Thanks, Serguei. |
| Comment by Olaf Faaland [ 05/Jan/21 ] |
|
Hi Serguei, I see. That fits with the timing I saw. I changed our default debug mask to add +net on Oct 13th. I guess I'll be changing that back temporarily |
| Comment by Gerrit Updater [ 19/Jan/21 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41270 |
| Comment by Serguei Smirnov [ 18/Jan/24 ] |
|
Olaf, Andreas brought to my attention that the patch for this never got merged to master. Are you still seeing this warning sometimes? |
| Comment by Olaf Faaland [ 18/Jan/24 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 22/Jan/24 ] |
|
Olaf, Are you seeing it with "+net" in debug mask? |
| Comment by Olaf Faaland [ 08/Feb/24 ] |
|
Hi Serguei, sorry for the slow response. Actually, I do not think we are seeing this now. My prior search was wrong and I found different stacks that are completely unrelated. |