[LU-13675] LNetError: 14769:0:(o2iblnd.h:1003:kiblnd_queue2str()) LBUG Created: 15/Jun/20  Updated: 23/Jun/20  Resolved: 23/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug
Priority: Major
Reporter: Shuichi Ihara
Assignee: Andreas Dilger
Resolution: Fixed
Votes: 0
Labels: None
Environment:

2.13.54_44_gf3fef81


Issue Links:
Related
is related to LU-1742 Fix 'Timed out tx' error message Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

1 x server (CentOS 7.8) and 1 x client (CentOS 8.1); both server and client have OFED-5.0 installed.

# ofed_info | head -1
MLNX_OFED_LINUX-5.0-2.1.8.0 (OFED-5.0-2.1.8):

When the client mounts Lustre, both the server and the client crash with the following LBUG.

Server

[482108.891327] LNetError: 14769:0:(o2iblnd.h:1003:kiblnd_queue2str()) LBUG
[482108.891395] Pid: 14769, comm: kiblnd_connd 3.10.0-1127.10.1.el7.x86_64 #1 SMP Wed Jun 3 14:28:03 UTC 2020
[482108.891397] Call Trace:
[482108.891412]  [<ffffffffc146f67c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[482108.891436]  [<ffffffffc146f99c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[482108.891448]  [<ffffffffc15b82cb>] kiblnd_need_noop.part.21+0x0/0x36 [ko2iblnd]
[482108.891463]  [<ffffffffc15aa581>] kiblnd_check_txs_locked+0x421/0x490 [ko2iblnd]
[482108.891474]  [<ffffffffc15b107b>] kiblnd_check_conns+0x3cb/0x880 [ko2iblnd]
[482108.891485]  [<ffffffffc15b6273>] kiblnd_connd+0x813/0x9e0 [ko2iblnd]
[482108.891495]  [<ffffffff9bec6691>] kthread+0xd1/0xe0
[482108.891506]  [<ffffffff9c592d37>] ret_from_fork_nospec_end+0x0/0x39
[482108.891514]  [<ffffffffffffffff>] 0xffffffffffffffff
[482108.891553] Kernel panic - not syncing: LBUG
[482108.891593] CPU: 3 PID: 14769 Comm: kiblnd_connd Kdump: loaded Tainted: P           OE  ------------   3.10.0-1127.10.1.el7.x86_64 #1
[482108.891682] Hardware name: Supermicro SYS-2028U-TN24R4T+/X10DRU-i+, BIOS 3.2 06/11/2019
[482108.891742] Call Trace:
[482108.891773]  [<ffffffff9c57ffa5>] dump_stack+0x19/0x1b
[482108.891817]  [<ffffffff9c579541>] panic+0xe8/0x21f
[482108.891869]  [<ffffffffc146f9eb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[482108.891925]  [<ffffffffc15b82cb>] kiblnd_queue2str.part.17+0x1a/0x1a [ko2iblnd]
[482108.891988]  [<ffffffffc15aa581>] kiblnd_check_txs_locked+0x421/0x490 [ko2iblnd]
[482108.892053]  [<ffffffffc15b107b>] kiblnd_check_conns+0x3cb/0x880 [ko2iblnd]
[482108.892110]  [<ffffffff9beae150>] ? __internal_add_timer+0x130/0x130
[482108.892168]  [<ffffffffc15b6273>] kiblnd_connd+0x813/0x9e0 [ko2iblnd]
[482108.892221]  [<ffffffff9c585942>] ? __schedule+0x402/0x840
[482108.892268]  [<ffffffff9bedb990>] ? wake_up_state+0x20/0x20
[482108.892321]  [<ffffffffc15b5a60>] ? kiblnd_cm_callback+0x2380/0x2380 [ko2iblnd]
[482108.892380]  [<ffffffff9bec6691>] kthread+0xd1/0xe0
[482108.892423]  [<ffffffff9bec65c0>] ? insert_kthread_work+0x40/0x40
[482108.892473]  [<ffffffff9c592d37>] ret_from_fork_nospec_begin+0x21/0x21
[482108.892527]  [<ffffffff9bec65c0>] ? insert_kthread_work+0x40/0x40

Client

[487085.899074] LNetError: 32398:0:(o2iblnd.h:1003:kiblnd_queue2str()) LBUG
[487085.900509] Pid: 32398, comm: kiblnd_connd 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Thu Apr 9 13:49:54 UTC 2020
[487085.900510] Call Trace:
[487085.900531]  libcfs_call_trace+0x86/0xc0 [libcfs]
[487085.900537]  lbug_with_loc+0x43/0x80 [libcfs]
[487085.900546]  kiblnd_queue2str.part.19+0x16/0x20 [ko2iblnd]
[487085.900551]  kiblnd_check_txs_locked+0x39c/0x3a0 [ko2iblnd]
[487085.900556]  kiblnd_check_conns+0x58b/0x920 [ko2iblnd]
[487085.900561]  kiblnd_connd+0x9c2/0xa60 [ko2iblnd]
[487085.900564]  kthread+0x112/0x130
[487085.900567]  ret_from_fork+0x1f/0x40
[487085.900568]  0xffffffffffffffff
[487085.900569] Kernel panic - not syncing: LBUG
[487085.901751] CPU: 4 PID: 32398 Comm: kiblnd_connd Kdump: loaded Tainted: G           OE    --------- -t - 4.18.0-147.8.1.el8_1.x86_64 #1
[487085.904110] Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0010.010620200716 01/06/2020
[487085.905298] Call Trace:
[487085.906489]  dump_stack+0x5c/0x80
[487085.907663]  panic+0xe7/0x247
[487085.908837]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[487085.910002]  kiblnd_queue2str.part.19+0x16/0x20 [ko2iblnd]
[487085.911147]  kiblnd_check_txs_locked+0x39c/0x3a0 [ko2iblnd]
[487085.912287]  kiblnd_check_conns+0x58b/0x920 [ko2iblnd]
[487085.913424]  kiblnd_connd+0x9c2/0xa60 [ko2iblnd]
[487085.914557]  ? wake_up_q+0x70/0x70
[487085.915677]  ? kiblnd_cm_callback+0x2230/0x2230 [ko2iblnd]
[487085.916799]  kthread+0x112/0x130
[487085.917912]  ? kthread_flush_work_fn+0x10/0x10
[487085.919036]  ret_from_fork+0x1f/0x40
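
For triage, the LBUG line in both traces above has the same shape: a subsystem tag, two numeric fields, then the source file, line number, and function in parentheses. A minimal sketch of pulling those fields out of a console log follows. The helper name `parse_lbug` and the field name `field2` are hypothetical, and the pattern is only an assumption based on the two lines shown in this ticket (the meaning of the second numeric field is not stated here).

```python
import re

# Pattern assumed from the lines in this ticket, e.g.
# "LNetError: 14769:0:(o2iblnd.h:1003:kiblnd_queue2str()) LBUG"
LBUG_RE = re.compile(
    r"(?P<subsystem>\w+): (?P<pid>\d+):(?P<field2>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) LBUG"
)


def parse_lbug(log_line):
    """Return the LBUG location fields as a dict, or None if the line does not match."""
    match = LBUG_RE.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields we can name with confidence.
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


if __name__ == "__main__":
    server = "[482108.891327] LNetError: 14769:0:(o2iblnd.h:1003:kiblnd_queue2str()) LBUG"
    client = "[487085.899074] LNetError: 32398:0:(o2iblnd.h:1003:kiblnd_queue2str()) LBUG"
    for entry in (server, client):
        print(parse_lbug(entry))
```

Run against the server and client lines, this reports the same file, line, and function (`o2iblnd.h`, 1003, `kiblnd_queue2str`) with different PIDs, which matches both nodes hitting the same assertion.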


 Comments   
Comment by Shuichi Ihara [ 15/Jun/20 ]

It seems that the regression came from commit 7308662efc. With that commit reverted, the crashes no longer occur.

Comment by Andreas Dilger [ 17/Jun/20 ]

That is patch https://review.whamcloud.com/33235 "LU-1742 o2iblnd: 'Timed out tx' error message", and it should probably just be reverted. The patch was almost 2 years old, but it was rebased and only ran "trivial" testing, so it is no longer correct.

Comment by Gerrit Updater [ 17/Jun/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38958
Subject: LU-13675 o2iblnd: revert 'Timed out tx' patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9c88ef87b7800db61a3a1944bb8dba2a25ad8be9

Comment by Gerrit Updater [ 23/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38958/
Subject: LU-13675 o2iblnd: revert 'Timed out tx' patch
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e08ac764867a0e36b303f511c94a0fa27e3dd53d

Comment by Peter Jones [ 23/Jun/20 ]

Landed for 2.14
