
LBUG ASSERTION( rspt->rspt_cpt == cpt ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.12.3
    • Environment: CentOS 7.6

    Description

      Using 2.12.3 servers and clients, we hit this bug once on an OSS, once on a client:

      [197803.220678] LNetError: 34981:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed: 
      [197803.220682] LNetError: 34981:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG
      [197803.220684] Pid: 34981, comm: kiblnd_sd_01_00 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019
      [197803.220684] Call Trace:
      [197803.220706]  [<ffffffffc0e777cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [197803.220712]  [<ffffffffc0e7787c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [197803.220730]  [<ffffffffc0f1849b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
      [197803.220739]  [<ffffffffc0f08d3a>] lnet_finalize+0x72a/0x9a0 [lnet]
      [197803.220748]  [<ffffffffc0f12a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
      [197803.220757]  [<ffffffffc0f149a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [197803.220765]  [<ffffffffc0f075ec>] lnet_msg_decommit+0xec/0x700 [lnet]
      [197803.220772]  [<ffffffffc0f089b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
      [197803.220783]  [<ffffffffc14c861d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
      [197803.220789]  [<ffffffffc14d3b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
      [197803.220793]  [<ffffffff8d4c2e81>] kthread+0xd1/0xe0
      [197803.220797]  [<ffffffff8db76c37>] ret_from_fork_nospec_end+0x0/0x39
      [197803.220822]  [<ffffffffffffffff>] 0xffffffffffffffff
      [197803.220823] Kernel panic - not syncing: LBUG
      [197803.220826] CPU: 37 PID: 34981 Comm: kiblnd_sd_01_00 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7.x86_64 #1
      [197803.220827] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.9.1 12/04/2018
      [197803.220828] Call Trace:
      [197803.220833]  [<ffffffff8db64147>] dump_stack+0x19/0x1b
      [197803.220835]  [<ffffffff8db5d850>] panic+0xe8/0x21f
      [197803.220842]  [<ffffffffc0e778cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [197803.220851]  [<ffffffffc0f1849b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]
      [197803.220860]  [<ffffffffc0f08d3a>] lnet_finalize+0x72a/0x9a0 [lnet]
      [197803.220863]  [<ffffffff8d502372>] ? ktime_get_ts64+0x52/0xf0
      [197803.220872]  [<ffffffffc0f12a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]
      [197803.220880]  [<ffffffffc0f149a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]
      [197803.220888]  [<ffffffffc0f075ec>] lnet_msg_decommit+0xec/0x700 [lnet]
      [197803.220896]  [<ffffffffc0f089b7>] lnet_finalize+0x3a7/0x9a0 [lnet]
      [197803.220901]  [<ffffffffc14c861d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
      [197803.220906]  [<ffffffffc14d3b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
      [197803.220908]  [<ffffffff8d4e220e>] ? dequeue_task_fair+0x41e/0x660
      [197803.220911]  [<ffffffff8d42a59e>] ? __switch_to+0xce/0x580
      [197803.220913]  [<ffffffff8d4d7c40>] ? wake_up_state+0x20/0x20
      [197803.220919]  [<ffffffffc14d3270>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
      [197803.220920]  [<ffffffff8d4c2e81>] kthread+0xd1/0xe0
      [197803.220923]  [<ffffffff8d4c2db0>] ? insert_kthread_work+0x40/0x40
      [197803.220925]  [<ffffffff8db76c37>] ret_from_fork_nospec_begin+0x21/0x21
      [197803.220927]  [<ffffffff8d4c2db0>] ? insert_kthread_work+0x40/0x40
      

      We're using Mellanox OFED 4.7 on servers / routers / this client (not all clients have been upgraded yet).

      We can provide crash dumps upon request. This seems to be a new problem either with 2.12.3 vs 2.12.0 or with Mellanox OFED 4.7 vs 4.5.

      Thanks!
      Stephane
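
      For readers unfamiliar with the failing check: lnet_detach_rsp_tracker() asserts that the CPT (CPU partition) recorded in the response tracker when it was queued matches the CPT it is now being removed from. Below is a minimal, compilable userspace sketch of that invariant; it is an illustration only, not the LNet source, and everything other than the rspt_cpt field name (taken from the log) is assumed.

      /*
       * Illustrative sketch of the invariant behind the LBUG above.
       * NOT the actual LNet code; field and helper names other than
       * rspt_cpt are invented for illustration.
       */
      #include <assert.h>

      struct rsp_tracker {
              int rspt_cpt;   /* CPT partition the tracker was queued on */
              int attached;   /* non-zero while on a per-CPT list */
      };

      /* Attach: remember which per-CPT list the tracker went onto. */
      static void attach_rsp_tracker(struct rsp_tracker *rspt, int cpt)
      {
              rspt->rspt_cpt = cpt;
              rspt->attached = 1;
      }

      /*
       * Detach: must be called with the same CPT the tracker was attached
       * under.  The real code LBUGs (panics) when this does not hold,
       * producing the "ASSERTION( rspt->rspt_cpt == cpt ) failed" message.
       */
      static void detach_rsp_tracker(struct rsp_tracker *rspt, int cpt)
      {
              assert(rspt->rspt_cpt == cpt);
              rspt->attached = 0;
      }

      int main(void)
      {
              struct rsp_tracker rspt = { 0 };

              attach_rsp_tracker(&rspt, 0);
              detach_rsp_tracker(&rspt, 0);   /* OK: same CPT */
              /* detach_rsp_tracker(&rspt, 1) would trip the assert, which is
               * the userspace analogue of the kernel LBUG reported here. */
              return 0;
      }

      A mismatch like that suggests the tracker (or the MD pointing at it) was freed or reused concurrently, which lines up with the memory-corruption explanation given in the comments below.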

    Attachments

    Issue Links

    Activity

            [LU-12906] LBUG ASSERTION( rspt->rspt_cpt == cpt ) failed
            pjones Peter Jones added a comment -

            Thanks Stephane!


            sthiell Stephane Thiell added a comment -

            Peter, that sounds good to me. We haven't hit this issue since we added the two patches above. Thanks!
            pjones Peter Jones added a comment -

            As both of the mentioned patches have landed for 2.12.4, can we close out this ticket as a duplicate?


            sthiell Stephane Thiell added a comment -

            Thanks Amir! I'll see what we can do to test this patch, at least on the LNet routers first. I'll keep you posted.

            ashehata Amir Shehata (Inactive) added a comment -

            The patch for LU-12441 could be applied as is on b2_12.

            Attached is an updated patch, LU-12568.patch, which should work on b2_12.

            Would you be able to try these two patches out and see if they resolve the problems you're seeing on LU-12906 and LU-12907? I think if the router problem is happening continuously, it would be quicker to test these two patches on the routers.

            ashehata Amir Shehata (Inactive) added a comment -

            There are two patches which I think should resolve this issue:

            LU-12568 lnet: Defer rspt cleanup when MD queued for unlink
            LU-12441 lnet: Detach rspt when md_threshold is infinite

            LU-12568 fixes a memory corruption which could be leading to both LU-12906 and LU-12907.

            I can port these back to b2_12. Would you be able to test if they resolve the issue?
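
            For context on the fix: the LU-12568 patch title, "Defer rspt cleanup when MD queued for unlink", suggests the general pattern of not freeing a response tracker while its MD still has outstanding operations. The sketch below illustrates only that generic defer-until-idle pattern; it is not the actual patch, and the structures, fields, and helpers are invented for illustration.

            /*
             * Minimal userspace sketch of "defer cleanup while the MD is busy".
             * NOT the LU-12568 patch itself; all names here are assumptions.
             */
            #include <assert.h>
            #include <stdlib.h>

            struct rsp_tracker {
                    int zombie;                 /* set when cleanup was deferred */
            };

            struct md {
                    int refcount;               /* pending network operations */
                    struct rsp_tracker *rspt;   /* response tracker, if any */
            };

            /*
             * Unlink path: if operations are still outstanding, only mark the
             * tracker instead of freeing it, so a later completion cannot
             * touch freed memory.
             */
            static void md_unlink(struct md *md)
            {
                    if (md->rspt == NULL)
                            return;
                    if (md->refcount > 0) {
                            md->rspt->zombie = 1;   /* defer: reaped later */
                            return;
                    }
                    free(md->rspt);
                    md->rspt = NULL;
            }

            /* Cleanup path (e.g. a monitor thread): reap deferred trackers. */
            static void reap_tracker(struct md *md)
            {
                    if (md->rspt != NULL && md->rspt->zombie && md->refcount == 0) {
                            free(md->rspt);
                            md->rspt = NULL;
                    }
            }

            int main(void)
            {
                    struct md md = { .refcount = 1,
                                     .rspt = calloc(1, sizeof(struct rsp_tracker)) };

                    md_unlink(&md);     /* still busy: cleanup is deferred */
                    md.refcount = 0;    /* last operation completes */
                    reap_tracker(&md);  /* now the tracker can be freed safely */
                    assert(md.rspt == NULL);
                    return 0;
            }

            The point of deferring is that a completion arriving after the unlink never dereferences an already-freed tracker; freeing too early is one way to get the kind of memory corruption mentioned in the comment above.
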
            pjones Peter Jones added a comment -

            Amir

            Could you please advise?

            Thanks

            Peter


            sthiell Stephane Thiell added a comment -

            On a server (no crash dump found, I'll investigate why...):

            [Fri Oct 25 10:06:46 2019][772788.297828] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff9960e18c5200^M
            [Fri Oct 25 10:06:46 2019][772788.412180] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff993f267bfa00^M
            [Fri Oct 25 10:06:46 2019][772788.414316] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff993cb80dd400^M
            [Fri Oct 25 10:06:46 2019][772788.414321] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff99504a97e800^M
            [Fri Oct 25 10:06:46 2019][772788.414328] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff99417c564a00^M
            [Fri Oct 25 10:06:46 2019][772788.414334] LustreError: 49525:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff994561e9f600^M
            [Fri Oct 25 10:06:46 2019][772788.523551] LustreError: 49526:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff996515360200^M
            [Fri Oct 25 10:06:46 2019][772788.534421] LNetError: 49526:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) ASSERTION( rspt->rspt_cpt == cpt ) failed: ^M
            [Fri Oct 25 10:06:46 2019][772788.545237] LNetError: 49526:0:(lib-move.c:2729:lnet_detach_rsp_tracker()) LBUG^M
            [Fri Oct 25 10:06:46 2019][772788.552642] Pid: 49526, comm: kiblnd_sd_01_01 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019^M
            [Fri Oct 25 10:06:46 2019][772788.563420] Call Trace:^M
            [Fri Oct 25 10:06:46 2019][772788.565978]  [<ffffffffc0c467cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]^M
            [Fri Oct 25 10:06:46 2019][772788.572640]  [<ffffffffc0c4687c>] lbug_with_loc+0x4c/0xa0 [libcfs]^M
            [Fri Oct 25 10:06:46 2019][772788.578940]  [<ffffffffc0d4e49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]^M
            [Fri Oct 25 10:06:46 2019][772788.585943]  [<ffffffffc0d3ed3a>] lnet_finalize+0x72a/0x9a0 [lnet]^M
            [Fri Oct 25 10:06:46 2019][772788.592251]  [<ffffffffc0d48a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]^M
            [Fri Oct 25 10:06:46 2019][772788.599255]  [<ffffffffc0d4a9a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]^M
            [Fri Oct 25 10:06:46 2019][772788.606942]  [<ffffffffc0d3d5ec>] lnet_msg_decommit+0xec/0x700 [lnet]^M
            [Fri Oct 25 10:06:46 2019][772788.613514]  [<ffffffffc0d3e9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]^M
            [Fri Oct 25 10:06:46 2019][772788.619814]  [<ffffffffc0cbb61d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]^M
            [Fri Oct 25 10:06:46 2019][772788.626556]  [<ffffffffc0cc6b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]^M
            [Fri Oct 25 10:06:46 2019][772788.633550]  [<ffffffff976c2e81>] kthread+0xd1/0xe0^M
            [Fri Oct 25 10:06:46 2019][772788.638551]  [<ffffffff97d77c24>] ret_from_fork_nospec_begin+0xe/0x21^M
            [Fri Oct 25 10:06:46 2019][772788.645112]  [<ffffffffffffffff>] 0xffffffffffffffff^M
            [Fri Oct 25 10:06:46 2019][772788.650230] Kernel panic - not syncing: LBUG^M
            [Fri Oct 25 10:06:46 2019][772788.654593] CPU: 1 PID: 49526 Comm: kiblnd_sd_01_01 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1^M
            [Fri Oct 25 10:06:46 2019][772788.667789] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019^M
            [Fri Oct 25 10:06:46 2019][772788.675529] Call Trace:^M
            [Fri Oct 25 10:06:46 2019][772788.678071]  [<ffffffff97d65147>] dump_stack+0x19/0x1b^M
            [Fri Oct 25 10:06:47 2019][772788.683296]  [<ffffffff97d5e850>] panic+0xe8/0x21f^M
            [Fri Oct 25 10:06:47 2019][772788.688179]  [<ffffffffc0c468cb>] lbug_with_loc+0x9b/0xa0 [libcfs]^M
            [Fri Oct 25 10:06:47 2019][772788.694458]  [<ffffffffc0d4e49b>] lnet_detach_rsp_tracker+0x5b/0x60 [lnet]^M
            [Fri Oct 25 10:06:47 2019][772788.701424]  [<ffffffffc0d3ed3a>] lnet_finalize+0x72a/0x9a0 [lnet]^M
            [Fri Oct 25 10:06:47 2019][772788.707695]  [<ffffffff97702372>] ? ktime_get_ts64+0x52/0xf0^M
            [Fri Oct 25 10:06:47 2019][772788.713447]  [<ffffffffc0d48a51>] lnet_post_send_locked+0x751/0x9c0 [lnet]^M
            [Fri Oct 25 10:06:47 2019][772788.720414]  [<ffffffffc0d4a9a8>] lnet_return_tx_credits_locked+0x2a8/0x490 [lnet]^M
            [Fri Oct 25 10:06:47 2019][772788.728075]  [<ffffffffc0d3d5ec>] lnet_msg_decommit+0xec/0x700 [lnet]^M
            [Fri Oct 25 10:06:47 2019][772788.734610]  [<ffffffffc0d3e9b7>] lnet_finalize+0x3a7/0x9a0 [lnet]^M
            [Fri Oct 25 10:06:47 2019][772788.740881]  [<ffffffffc0cbb61d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]^M
            [Fri Oct 25 10:06:47 2019][772788.747581]  [<ffffffffc0cc6b0d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]^M
            [Fri Oct 25 10:06:47 2019][772788.754546]  [<ffffffff976e220e>] ? dequeue_task_fair+0x41e/0x660^M
            [Fri Oct 25 10:06:47 2019][772788.760728]  [<ffffffff9762a59e>] ? __switch_to+0xce/0x580^M
            [Fri Oct 25 10:06:47 2019][772788.766299]  [<ffffffff976d7c40>] ? wake_up_state+0x20/0x20^M
            [Fri Oct 25 10:06:47 2019][772788.771959]  [<ffffffffc0cc6270>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]^M
            [Fri Oct 25 10:06:47 2019][772788.778743]  [<ffffffff976c2e81>] kthread+0xd1/0xe0^M
            [Fri Oct 25 10:06:47 2019][772788.783709]  [<ffffffff976c2db0>] ? insert_kthread_work+0x40/0x40^M
            [Fri Oct 25 10:06:47 2019][772788.789889]  [<ffffffff97d77c24>] ret_from_fork_nospec_begin+0xe/0x21^M
            [Fri Oct 25 10:06:47 2019][772788.796414]  [<ffffffff976c2db0>] ? insert_kthread_work+0x40/0x40^M
            

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 6
