[LU-10459] LBUG o2iblnd_cb.c:991:kiblnd_check_sends_locked()) ASSERTION( conn->ibc_nsends_posted <= conn->ibc_queue_depth ) failed: Created: 04/Jan/18  Updated: 07/Jan/19  Resolved: 19/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Soak performance cluster, version=2.10.56_84_gd645c72, RHEL 7.4 kernel


Attachments: vmcore-dmesg.txt
Issue Links:
Related
is related to LU-10291 remove concurrent_sends tunable Resolved
Severity: 3

 Description   

The LBUG occurs immediately when we try to do any I/O on the clients. Multiple clients are impacted.

Jan  4 21:57:38 soak-17 kernel: LNetError: 12570:0:(o2iblnd_cb.c:991:kiblnd_check_sends_locked()) ASSERTION( conn->ibc_nsends_posted <= conn->ibc_queue_depth ) failed:
Jan  4 21:57:38 soak-17 kernel: LNetError: 12570:0:(o2iblnd_cb.c:991:kiblnd_check_sends_locked()) LBUG
Jan  4 21:57:38 soak-17 kernel: Pid: 12570, comm: kiblnd_sd_00_00
Jan  4 21:57:38 soak-17 kernel: Call Trace:
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc097c7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc097c83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0c2666b>] kiblnd_check_sends_locked+0xd8b/0xd90 [ko2iblnd]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0538b5c>] ? mlx4_ib_post_recv+0x1dc/0x310 [mlx4_ib]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0c27f50>] kiblnd_post_rx+0x160/0x520 [ko2iblnd]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0c284ea>] kiblnd_recv+0x1da/0x7b0 [ko2iblnd]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0a00573>] lnet_ni_recv+0xc3/0x320 [lnet]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0a02e06>] lnet_parse_local+0x4c6/0xd40 [lnet]
Jan  4 21:57:38 soak-17 kernel: [<ffffffff810c7705>] ? sched_clock_cpu+0x85/0xc0 
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0a03f4a>] lnet_parse+0x8ca/0xfc0 [lnet]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0c261ac>] ? kiblnd_check_sends_locked+0x8cc/0xd90 [ko2iblnd]
Jan  4 21:57:38 soak-17 kernel: [<ffffffff81029557>] ? __switch_to+0xd7/0x510
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0c28e63>] kiblnd_handle_rx+0x213/0x6b0 [ko2iblnd]
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0c2facf>] kiblnd_scheduler+0xf0f/0x1150 [ko2iblnd]
Jan  4 21:57:38 soak-17 kernel: [<ffffffff810ce55e>] ? dequeue_task_fair+0x41e/0x660
Jan  4 21:57:38 soak-17 kernel: [<ffffffff810c7705>] ? sched_clock_cpu+0x85/0xc0
Jan  4 21:57:38 soak-17 kernel: [<ffffffff810c4820>] ? default_wake_function+0x0/0x20
Jan  4 21:57:38 soak-17 kernel: [<ffffffffc0c2ebc0>] ? kiblnd_scheduler+0x0/0x1150 [ko2iblnd]
Jan  4 21:57:38 soak-17 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
Jan  4 21:57:38 soak-17 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
Jan  4 21:57:38 soak-17 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
Jan  4 21:57:38 soak-17 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
Jan  4 21:57:38 soak-17 kernel:

Multiple crash dumps are available on Spirit.



 Comments   
Comment by Amir Shehata (Inactive) [ 05/Jan/18 ]

This is most likely related to: LU-10291 lnd: remove concurrent_sends tunable

This only affects master. I'm investigating.

Comment by Gerrit Updater [ 05/Jan/18 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/30751
Subject: LU-10459 lnd: throttle tx based on queue depth
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 86d289fed3c0eeccc3a0650d7e5a842391d11c3e
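
The patch subject above names the approach: throttle transmits based on queue depth. The following is a minimal user-space sketch of that idea, not the code from the patch. Only the field names `ibc_nsends_posted` and `ibc_queue_depth` come from the assertion in the log; `struct toy_conn`, its `pending` counter, and the `toy_*` functions are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a connection. Only the first two field
 * names appear in the ticket; everything else is assumed. */
struct toy_conn {
    int ibc_nsends_posted; /* sends posted and not yet completed */
    int ibc_queue_depth;   /* per-connection limit on posted sends */
    int pending;           /* hypothetical: sends deferred by the throttle */
};

/* Try to post one send. If the connection is already at its queue
 * depth, defer the send instead of posting it.
 * Returns true if posted, false if deferred. */
static bool toy_post_tx(struct toy_conn *conn)
{
    if (conn->ibc_nsends_posted >= conn->ibc_queue_depth) {
        conn->pending++;
        return false;
    }
    conn->ibc_nsends_posted++;
    return true;
}

/* A posted send completed: release its slot, then post one deferred
 * send if any is waiting. */
static void toy_tx_complete(struct toy_conn *conn)
{
    assert(conn->ibc_nsends_posted > 0);
    conn->ibc_nsends_posted--;
    if (conn->pending > 0) {
        conn->pending--;
        conn->ibc_nsends_posted++;
    }
}

/* The invariant from the failed assertion in the log. */
static void toy_check_sends(const struct toy_conn *conn)
{
    assert(conn->ibc_nsends_posted <= conn->ibc_queue_depth);
}
```

With the check in `toy_post_tx` in place, `toy_check_sends` cannot fire in this model, because a send is only counted as posted while a slot is free. Without such a check, nothing stops the counter from exceeding the queue depth, which is the condition the assertion reports.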

Comment by Cliff White (Inactive) [ 05/Jan/18 ]

Testing the above patch on soak; it appears to fix the immediate LBUG.
Soak has now been running for about 30 minutes; will see how we do.

Comment by Amir Shehata (Inactive) [ 10/Jan/18 ]

I checked b2_10, and it doesn't look like LU-10291 (lnd: remove concurrent_sends tunable) was ported over, so I'm wondering whether this is the same issue. That assert was hit due to the above patch.

Comment by Cliff White (Inactive) [ 10/Jan/18 ]

Ah, sorry, wrong bug. My bad.

Comment by Gerrit Updater [ 12/Jan/18 ]

Sorry, commit was added against the wrong ticket.

Comment by Cliff White (Inactive) [ 17/Jan/18 ]

I am currently seeing this on a lustre-review-ib build, version=2.10.56_86_gd8827a8.

Comment by Amir Shehata (Inactive) [ 18/Jan/18 ]

The patch which fixes the issue hasn't landed yet.

Comment by Gerrit Updater [ 19/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30751/
Subject: LU-10459 lnd: throttle tx based on queue depth
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e86f55798ca7bc8f7fe22dd48c9d9f52c1bb029a

Comment by Joseph Gmitter (Inactive) [ 19/Jan/18 ]

Landed to master for 2.11.0

Comment by Gerrit Updater [ 12/Sep/18 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33150
Subject: LU-10459 lnd: throttle tx based on queue depth
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: dce2da916afe3fa474e2199b4993c91ced4e45cf
