[LU-12906] LBUG ASSERTION( rspt->rspt_cpt == cpt ) failed Created: 26/Oct/19  Updated: 31/Jul/20  Resolved: 12/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS 7.6


Attachments: File LU-12568.patch    
Issue Links:
Duplicate
Related
is related to LU-12568 LNetError: 28086:0:(lib-move.c:2862:l... Resolved
is related to LU-12441 Response tracker is not detached on r... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Using 2.12.3 servers and clients, we hit this bug once on an OSS, once on a client:

[197803.220678] LNetError: 34981:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed: 
[197803.220682] LNetError: 34981:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
[197803.220684] Pid: 34981, comm: kiblnd_sd_01_00 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
[197803.220684] Call Trace:
[197803.220706]  [<ffffffffc0e777cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[197803.220712]  [<ffffffffc0e7787c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[197803.220730]  [<ffffffffc0f1849b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[197803.220739]  [<ffffffffc0f08d3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[197803.220748]  [<ffffffffc0f12a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[197803.220757]  [<ffffffffc0f149a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[197803.220765]  [<ffffffffc0f075ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[197803.220772]  [<ffffffffc0f089b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[197803.220783]  [<ffffffffc14c861d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[197803.220789]  [<ffffffffc14d3b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[197803.220793]  [<ffffffff8d4c2e81>] kthread+0xd1/0xe0
[197803.220797]  [<ffffffff8db76c37>] ret_from_fork_nospec_end+0x0/0x39
[197803.220822]  [<ffffffffffffffff>] 0xffffffffffffffff
[197803.220823] Kernel panic - not syncing: LBUG
[197803.220826] CPU: 37 PID: 34981 Comm: kiblnd_sd_01_00 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7.x86_64 #1
[197803.220827] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.9.1 12/04/2018
[197803.220828] Call Trace:
[197803.220833]  [<ffffffff8db64147>] dump_stack+0x19/0x1b
[197803.220835]  [<ffffffff8db5d850>] panic+0xe8/0x21f
[197803.220842]  [<ffffffffc0e778cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[197803.220851]  [<ffffffffc0f1849b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[197803.220860]  [<ffffffffc0f08d3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[197803.220863]  [<ffffffff8d502372>] ? ktime_get_ts64+0x52/0xf0
[197803.220872]  [<ffffffffc0f12a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[197803.220880]  [<ffffffffc0f149a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[197803.220888]  [<ffffffffc0f075ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[197803.220896]  [<ffffffffc0f089b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[197803.220901]  [<ffffffffc14c861d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[197803.220906]  [<ffffffffc14d3b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[197803.220908]  [<ffffffff8d4e220e>] ? dequeue_task_fair+0x41e/0x660
[197803.220911]  [<ffffffff8d42a59e>] ? __switch_to+0xce/0x580
[197803.220913]  [<ffffffff8d4d7c40>] ? wake_up_state+0x20/0x20
[197803.220919]  [<ffffffffc14d3270>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[197803.220920]  [<ffffffff8d4c2e81>] kthread+0xd1/0xe0
[197803.220923]  [<ffffffff8d4c2db0>] ? insert_kthread_work+0x40/0x40
[197803.220925]  [<ffffffff8db76c37>] ret_from_fork_nospec_begin+0x21/0x21
[197803.220927]  [<ffffffff8d4c2db0>] ? insert_kthread_work+0x40/0x40

We're using Mellanox OFED 4.7 on servers / routers / this client (not all clients have been upgraded yet).

We can provide crash dumps upon request. This seems to be a new problem, introduced either by the move from 2.12.0 to 2.12.3 or by the move from Mellanox OFED 4.5 to 4.7.

Thanks!
Stephane



 Comments   
Comment by Stephane Thiell [ 26/Oct/19 ]

On a server (no crash dump found, I'll investigate why...):

[Fri Oct 25 10:06:46 2019][772788.297828] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9960e18c5200
[Fri Oct 25 10:06:46 2019][772788.412180] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff993f267bfa00
[Fri Oct 25 10:06:46 2019][772788.414316] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff993cb80dd400
[Fri Oct 25 10:06:46 2019][772788.414321] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff99504a97e800
[Fri Oct 25 10:06:46 2019][772788.414328] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff99417c564a00
[Fri Oct 25 10:06:46 2019][772788.414334] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff994561e9f600
[Fri Oct 25 10:06:46 2019][772788.523551] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff996515360200
[Fri Oct 25 10:06:46 2019][772788.534421] LNetError: 49526:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed:
[Fri Oct 25 10:06:46 2019][772788.545237] LNetError: 49526:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
[Fri Oct 25 10:06:46 2019][772788.552642] Pid: 49526, comm: kiblnd_sd_01_01 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
[Fri Oct 25 10:06:46 2019][772788.563420] Call Trace:
[Fri Oct 25 10:06:46 2019][772788.565978]  [<ffffffffc0c467cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Fri Oct 25 10:06:46 2019][772788.572640]  [<ffffffffc0c4687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Fri Oct 25 10:06:46 2019][772788.578940]  [<ffffffffc0d4e49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[Fri Oct 25 10:06:46 2019][772788.585943]  [<ffffffffc0d3ed3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[Fri Oct 25 10:06:46 2019][772788.592251]  [<ffffffffc0d48a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[Fri Oct 25 10:06:46 2019][772788.599255]  [<ffffffffc0d4a9a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[Fri Oct 25 10:06:46 2019][772788.606942]  [<ffffffffc0d3d5ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[Fri Oct 25 10:06:46 2019][772788.613514]  [<ffffffffc0d3e9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[Fri Oct 25 10:06:46 2019][772788.619814]  [<ffffffffc0cbb61d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[Fri Oct 25 10:06:46 2019][772788.626556]  [<ffffffffc0cc6b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[Fri Oct 25 10:06:46 2019][772788.633550]  [<ffffffff976c2e81>] kthread+0xd1/0xe0
[Fri Oct 25 10:06:46 2019][772788.638551]  [<ffffffff97d77c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Oct 25 10:06:46 2019][772788.645112]  [<ffffffffffffffff>] 0xffffffffffffffff
[Fri Oct 25 10:06:46 2019][772788.650230] Kernel panic - not syncing: LBUG
[Fri Oct 25 10:06:46 2019][772788.654593] CPU: 1 PID: 49526 Comm: kiblnd_sd_01_01 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1
[Fri Oct 25 10:06:46 2019][772788.667789] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
[Fri Oct 25 10:06:46 2019][772788.675529] Call Trace:
[Fri Oct 25 10:06:46 2019][772788.678071]  [<ffffffff97d65147>] dump_stack+0x19/0x1b
[Fri Oct 25 10:06:47 2019][772788.683296]  [<ffffffff97d5e850>] panic+0xe8/0x21f
[Fri Oct 25 10:06:47 2019][772788.688179]  [<ffffffffc0c468cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[Fri Oct 25 10:06:47 2019][772788.694458]  [<ffffffffc0d4e49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[Fri Oct 25 10:06:47 2019][772788.701424]  [<ffffffffc0d3ed3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[Fri Oct 25 10:06:47 2019][772788.707695]  [<ffffffff97702372>] ? ktime_get_ts64+0x52/0xf0
[Fri Oct 25 10:06:47 2019][772788.713447]  [<ffffffffc0d48a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[Fri Oct 25 10:06:47 2019][772788.720414]  [<ffffffffc0d4a9a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[Fri Oct 25 10:06:47 2019][772788.728075]  [<ffffffffc0d3d5ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[Fri Oct 25 10:06:47 2019][772788.734610]  [<ffffffffc0d3e9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[Fri Oct 25 10:06:47 2019][772788.740881]  [<ffffffffc0cbb61d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[Fri Oct 25 10:06:47 2019][772788.747581]  [<ffffffffc0cc6b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[Fri Oct 25 10:06:47 2019][772788.754546]  [<ffffffff976e220e>] ? dequeue_task_fair+0x41e/0x660
[Fri Oct 25 10:06:47 2019][772788.760728]  [<ffffffff9762a59e>] ? __switch_to+0xce/0x580
[Fri Oct 25 10:06:47 2019][772788.766299]  [<ffffffff976d7c40>] ? wake_up_state+0x20/0x20
[Fri Oct 25 10:06:47 2019][772788.771959]  [<ffffffffc0cc6270>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[Fri Oct 25 10:06:47 2019][772788.778743]  [<ffffffff976c2e81>] kthread+0xd1/0xe0
[Fri Oct 25 10:06:47 2019][772788.783709]  [<ffffffff976c2db0>] ? insert_kthread_work+0x40/0x40
[Fri Oct 25 10:06:47 2019][772788.789889]  [<ffffffff97d77c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Oct 25 10:06:47 2019][772788.796414]  [<ffffffff976c2db0>] ? insert_kthread_work+0x40/0x40
Comment by Peter Jones [ 28/Oct/19 ]

Amir

Could you please advise?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 28/Oct/19 ]

There are two patches which I think should resolve this issue:

LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
LU-12441 lnet: Detach rspt when md_threshold is infinite

LU-12568 fixes a memory corruption that could be causing both LU-12906 and LU-12907.
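
For readers unfamiliar with the assertion, the following is only a toy model of the kind of invariant that `ASSERTION( rspt->rspt_cpt == cpt )` checks; it is not the Lustre code. The class and function names (`ResponseTracker`, `MemoryDescriptor`, `attach_tracker`, `detach_tracker`) are hypothetical, and the ticket does not spell out the exact corruption mechanism. The sketch assumes a tracker records the CPU partition (CPT) it was attached on, and that detaching it from a different partition is treated as a fatal inconsistency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseTracker:
    # Hypothetical stand-in for the tracker: remembers its owning partition.
    cpt: int


@dataclass
class MemoryDescriptor:
    # Hypothetical stand-in for an MD that may carry one tracker.
    tracker: Optional[ResponseTracker] = None


def attach_tracker(md: MemoryDescriptor, cpt: int) -> None:
    """Attach a tracker to the MD, recording the partition it belongs to."""
    md.tracker = ResponseTracker(cpt=cpt)


def detach_tracker(md: MemoryDescriptor, cpt: int) -> bool:
    """Detach the tracker; the caller's partition must match the recorded one."""
    tracker = md.tracker
    if tracker is None:
        return False  # nothing attached, nothing to do
    if tracker.cpt != cpt:
        # Toy equivalent of the LBUG: the recorded partition is inconsistent.
        raise AssertionError("ASSERTION( rspt->rspt_cpt == cpt ) failed")
    md.tracker = None
    return True


def demo() -> str:
    md = MemoryDescriptor()
    attach_tracker(md, cpt=1)
    md.tracker.cpt = 0  # simulate a stale or corrupted tracker
    try:
        detach_tracker(md, cpt=1)
    except AssertionError as exc:
        return str(exc)
    return "detached cleanly"
```

In this model a clean attach/detach on the same partition succeeds, while a tracker whose recorded partition has been overwritten trips the check, which is consistent with memory corruption surfacing as this assertion.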

I can port these back to b2_12. Would you be able to test whether they resolve the issue?

Comment by Amir Shehata (Inactive) [ 28/Oct/19 ]

The patch for LU-12441 can be applied as is on b2_12.

Attached is an updated patch for LU-12568 (LU-12568.patch), which should work on b2_12.

Would we be able to try these two patches out and see if they resolve the problems you're seeing in LU-12906 and LU-12907? If the router problem is happening continuously, I think it would be quicker to test these two patches on the routers.

Comment by Stephane Thiell [ 28/Oct/19 ]

Thanks Amir! I'll see what we can do to test this patch, at least on the LNet routers first. I'll keep you posted.

Comment by Peter Jones [ 12/Dec/19 ]

As both of the mentioned patches have landed for 2.12.4, can we close out this ticket as a duplicate?

Comment by Stephane Thiell [ 12/Dec/19 ]

Peter, that sounds good to me. We haven't hit this issue since we added the two patches above. Thanks!

Comment by Peter Jones [ 12/Dec/19 ]

Thanks Stephane!
