[LU-12906] LBUG ASSERTION( rspt->rspt_cpt == cpt ) failed Created: 26/Oct/19 Updated: 31/Jul/20 Resolved: 12/Dec/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: |
CentOS 7.6 |
| Attachments: | |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Using 2.12.3 servers and clients, we hit this bug once on an OSS, once on a client:

[197803.220678] LNetError: 34981:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed:
[197803.220682] LNetError: 34981:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
[197803.220684] Pid: 34981, comm: kiblnd_sd_01_00 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
[197803.220684] Call Trace:
[197803.220706] [<ffffffffc0e777cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[197803.220712] [<ffffffffc0e7787c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[197803.220730] [<ffffffffc0f1849b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[197803.220739] [<ffffffffc0f08d3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[197803.220748] [<ffffffffc0f12a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[197803.220757] [<ffffffffc0f149a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[197803.220765] [<ffffffffc0f075ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[197803.220772] [<ffffffffc0f089b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[197803.220783] [<ffffffffc14c861d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[197803.220789] [<ffffffffc14d3b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[197803.220793] [<ffffffff8d4c2e81>] kthread+0xd1/0xe0
[197803.220797] [<ffffffff8db76c37>] ret_from_fork_nospec_end+0x0/0x39
[197803.220822] [<ffffffffffffffff>] 0xffffffffffffffff
[197803.220823] Kernel panic - not syncing: LBUG
[197803.220826] CPU: 37 PID: 34981 Comm: kiblnd_sd_01_00 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7.x86_64 #1
[197803.220827] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.9.1 12/04/2018
[197803.220828] Call Trace:
[197803.220833] [<ffffffff8db64147>] dump_stack+0x19/0x1b
[197803.220835] [<ffffffff8db5d850>] panic+0xe8/0x21f
[197803.220842] [<ffffffffc0e778cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[197803.220851] [<ffffffffc0f1849b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[197803.220860] [<ffffffffc0f08d3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[197803.220863] [<ffffffff8d502372>] ? ktime_get_ts64+0x52/0xf0
[197803.220872] [<ffffffffc0f12a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[197803.220880] [<ffffffffc0f149a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[197803.220888] [<ffffffffc0f075ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[197803.220896] [<ffffffffc0f089b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[197803.220901] [<ffffffffc14c861d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[197803.220906] [<ffffffffc14d3b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[197803.220908] [<ffffffff8d4e220e>] ? dequeue_task_fair+0x41e/0x660
[197803.220911] [<ffffffff8d42a59e>] ? __switch_to+0xce/0x580
[197803.220913] [<ffffffff8d4d7c40>] ? wake_up_state+0x20/0x20
[197803.220919] [<ffffffffc14d3270>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[197803.220920] [<ffffffff8d4c2e81>] kthread+0xd1/0xe0
[197803.220923] [<ffffffff8d4c2db0>] ? insert_kthread_work+0x40/0x40
[197803.220925] [<ffffffff8db76c37>] ret_from_fork_nospec_begin+0x21/0x21
[197803.220927] [<ffffffff8d4c2db0>] ? insert_kthread_work+0x40/0x40

We're using Mellanox OFED 4.7 on servers / routers / this client (not all clients have been upgraded yet). We can provide crash dumps upon request. This seems to be a new problem, either with 2.12.3 vs 2.12.0 or with Mellanox OFED 4.7 vs 4.5. Thanks! |
| Comments |
| Comment by Stephane Thiell [ 26/Oct/19 ] |
|
On a server (no crash dump found, I'll investigate why...):

[Fri Oct 25 10:06:46 2019][772788.297828] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9960e18c5200
[Fri Oct 25 10:06:46 2019][772788.412180] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff993f267bfa00
[Fri Oct 25 10:06:46 2019][772788.414316] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff993cb80dd400
[Fri Oct 25 10:06:46 2019][772788.414321] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff99504a97e800
[Fri Oct 25 10:06:46 2019][772788.414328] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff99417c564a00
[Fri Oct 25 10:06:46 2019][772788.414334] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff994561e9f600
[Fri Oct 25 10:06:46 2019][772788.523551] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff996515360200
[Fri Oct 25 10:06:46 2019][772788.534421] LNetError: 49526:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed:
[Fri Oct 25 10:06:46 2019][772788.545237] LNetError: 49526:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
[Fri Oct 25 10:06:46 2019][772788.552642] Pid: 49526, comm: kiblnd_sd_01_01 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019
[Fri Oct 25 10:06:46 2019][772788.563420] Call Trace:
[Fri Oct 25 10:06:46 2019][772788.565978] [<ffffffffc0c467cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[Fri Oct 25 10:06:46 2019][772788.572640] [<ffffffffc0c4687c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[Fri Oct 25 10:06:46 2019][772788.578940] [<ffffffffc0d4e49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[Fri Oct 25 10:06:46 2019][772788.585943] [<ffffffffc0d3ed3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[Fri Oct 25 10:06:46 2019][772788.592251] [<ffffffffc0d48a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[Fri Oct 25 10:06:46 2019][772788.599255] [<ffffffffc0d4a9a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[Fri Oct 25 10:06:46 2019][772788.606942] [<ffffffffc0d3d5ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[Fri Oct 25 10:06:46 2019][772788.613514] [<ffffffffc0d3e9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[Fri Oct 25 10:06:46 2019][772788.619814] [<ffffffffc0cbb61d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[Fri Oct 25 10:06:46 2019][772788.626556] [<ffffffffc0cc6b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[Fri Oct 25 10:06:46 2019][772788.633550] [<ffffffff976c2e81>] kthread+0xd1/0xe0
[Fri Oct 25 10:06:46 2019][772788.638551] [<ffffffff97d77c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Oct 25 10:06:46 2019][772788.645112] [<ffffffffffffffff>] 0xffffffffffffffff
[Fri Oct 25 10:06:46 2019][772788.650230] Kernel panic - not syncing: LBUG
[Fri Oct 25 10:06:46 2019][772788.654593] CPU: 1 PID: 49526 Comm: kiblnd_sd_01_01 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1
[Fri Oct 25 10:06:46 2019][772788.667789] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
[Fri Oct 25 10:06:46 2019][772788.675529] Call Trace:
[Fri Oct 25 10:06:46 2019][772788.678071] [<ffffffff97d65147>] dump_stack+0x19/0x1b
[Fri Oct 25 10:06:47 2019][772788.683296] [<ffffffff97d5e850>] panic+0xe8/0x21f
[Fri Oct 25 10:06:47 2019][772788.688179] [<ffffffffc0c468cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[Fri Oct 25 10:06:47 2019][772788.694458] [<ffffffffc0d4e49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
[Fri Oct 25 10:06:47 2019][772788.701424] [<ffffffffc0d3ed3a>] lnet_finalize+0x72a/0x9a0 [lnet]
[Fri Oct 25 10:06:47 2019][772788.707695] [<ffffffff97702372>] ? ktime_get_ts64+0x52/0xf0
[Fri Oct 25 10:06:47 2019][772788.713447] [<ffffffffc0d48a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
[Fri Oct 25 10:06:47 2019][772788.720414] [<ffffffffc0d4a9a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
[Fri Oct 25 10:06:47 2019][772788.728075] [<ffffffffc0d3d5ec>] lnet_msg_decommit+0xec/0x700 [lnet]
[Fri Oct 25 10:06:47 2019][772788.734610] [<ffffffffc0d3e9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
[Fri Oct 25 10:06:47 2019][772788.740881] [<ffffffffc0cbb61d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[Fri Oct 25 10:06:47 2019][772788.747581] [<ffffffffc0cc6b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[Fri Oct 25 10:06:47 2019][772788.754546] [<ffffffff976e220e>] ? dequeue_task_fair+0x41e/0x660
[Fri Oct 25 10:06:47 2019][772788.760728] [<ffffffff9762a59e>] ? __switch_to+0xce/0x580
[Fri Oct 25 10:06:47 2019][772788.766299] [<ffffffff976d7c40>] ? wake_up_state+0x20/0x20
[Fri Oct 25 10:06:47 2019][772788.771959] [<ffffffffc0cc6270>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[Fri Oct 25 10:06:47 2019][772788.778743] [<ffffffff976c2e81>] kthread+0xd1/0xe0
[Fri Oct 25 10:06:47 2019][772788.783709] [<ffffffff976c2db0>] ? insert_kthread_work+0x40/0x40
[Fri Oct 25 10:06:47 2019][772788.789889] [<ffffffff97d77c24>] ret_from_fork_nospec_begin+0xe/0x21
[Fri Oct 25 10:06:47 2019][772788.796414] [<ffffffff976c2db0>] ? insert_kthread_work+0x40/0x40 |
| Comment by Peter Jones [ 28/Oct/19 ] |
|
Amir, could you please advise? Thanks. Peter |
| Comment by Amir Shehata (Inactive) [ 28/Oct/19 ] |
|
There are two patches which I think should resolve this issue:
LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
LU-12441 lnet: Detach rspt when md_threshold is infinite
I can port these back to b2.12. Would you be able to test if they resolve the issue? |
| Comment by Amir Shehata (Inactive) [ 28/Oct/19 ] |
|
Attached is an updated patch for LU-12568 (LU-12568.patch). Would you be able to try these two patches out and see if they resolve the problems you're seeing? |
| Comment by Stephane Thiell [ 28/Oct/19 ] |
|
Thanks Amir! I'll see what we can do to test this patch, at least on the LNet routers first. I'll keep you posted. |
| Comment by Peter Jones [ 12/Dec/19 ] |
|
As both of the mentioned patches have landed for 2.12.4, can we close out this ticket as a duplicate? |
| Comment by Stephane Thiell [ 12/Dec/19 ] |
|
Peter, that sounds good to me. We haven't hit this issue since we added the two patches above. Thanks! |
| Comment by Peter Jones [ 12/Dec/19 ] |
|
Thanks Stephane! |