[LU-13976] duplicate IB_WR_LOCAL_INV causing ice driver failure (RoCE/iWarp) Created: 22/Sep/20 Updated: 23/Mar/22 Resolved: 23/Mar/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James Erwin | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 8.1, Intel Corporation Ethernet Controller E810-C, iWARP or RoCE mode |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During an lnet_selftest read of more than 4K (writes work without this issue), the ice driver detects a duplicate IB_WR_LOCAL_INV on the same key and the test fails. We also cannot mount a Lustre file system because of this issue. We instrumented o2iblnd.c and the irdma driver and found the duplicate IB_WR_LOCAL_INV in the traces below.
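The MKI- prefixed lines in the traces below come from our added printk instrumentation. As a rough illustration only (the function and variable names here are made up, and the real instrumentation also prints irdma-internal stag fields), the irdma-side check amounts to remembering the last locally invalidated rkey in the post-send path and flagging a repeat:

/*
 * Hypothetical sketch of the irdma-side instrumentation that produced the
 * "MKI-" lines below; names are illustrative, not the actual patch.
 * The idea is simply to remember the last rkey that was locally
 * invalidated and flag a second IB_WR_LOCAL_INV on the same value.
 */
#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

static u32 irdma_dbg_last_inv_rkey;	/* per-QP in a real patch, global here */

static void irdma_dbg_trace_wr(struct ib_send_wr *wr)
{
	switch (wr->opcode) {
	case IB_WR_LOCAL_INV:
		if (wr->ex.invalidate_rkey == irdma_dbg_last_inv_rkey)
			pr_info("MKI-IB_WR_LOCAL_INV rkey = 0x%x [MKI] This key was already invalidated\n",
				wr->ex.invalidate_rkey);
		else
			pr_info("MKI-IB_WR_LOCAL_INV rkey = 0x%x\n",
				wr->ex.invalidate_rkey);
		irdma_dbg_last_inv_rkey = wr->ex.invalidate_rkey;
		break;
	case IB_WR_REG_MR:
		pr_info("MKI-IB_WR_REG_MR: ib_wr->wr_id=0x%llx, reg_wr(ib_wr)->key=0x%x\n",
			wr->wr_id, reg_wr(wr)->key);
		break;
	default:
		break;
	}
}

The sketch only shows the shape of the check that produced the "[MKI] This key was already invalidated" message; the actual hook sits in the driver's post-send path.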
In the example below, skl01 is the client and skl02 is the server during an lst read operation.
skl01, irdma trace:
[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xb16f427b
[Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.1@o2ib
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x8e89e91b
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x8e89e91c, info.stag_key=0x1c, info.stag_idx=0x8e89e9
[Mon Sep 21 10:14:08 2020] ice 0000:18:00.0: abnormal ae_id = 0x50a bool qp=1 qp_id = 6
[Mon Sep 21 10:14:08 2020] LNetError: 4268:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.2@o2ib
[Mon Sep 21 10:14:08 2020] LustreError: 47768:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.1.2@o2ib
skl01, Lustre trace:
skl02, irdma trace:
[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0x968276e9
[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xd477355c
[Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.2@o2ib
[Mon Sep 21 10:14:06 2020] LNet: 36382:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601 [MKI] This key was already invalidated
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4
[Mon Sep 21 10:14:06 2020] LNetError: 36497:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.1@o2ib
[Mon Sep 21 10:14:07 2020] MKI-IB_WR_LOCAL_INV rkey = 0x76b95769
[Mon Sep 21 10:14:07 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x76b9576a, info.stag_key=0x6a, info.stag_idx=0x76b957
[Mon Sep 21 10:14:07 2020] LNet: 36261:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.1@o2ib
skl02, Lustre trace:
The relevant excerpt from the skl02 irdma trace is:
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
The next IB_WR_LOCAL_INV below is a duplicate: rkey = 0x31172601 was already invalidated above. We think this invalidate should have used 0x31172602, which would avoid invalidating an already-invalidated key and also avoid a duplicate REG_MR on the same key. Is there something else doing an invalidate somewhere in the code? (A sketch of the fast-registration pattern we believe is involved follows the excerpt below.)
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4 |
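For context on the key sequence above, here is a hedged sketch (illustrative only, not the actual o2iblnd code and not a proposed fix) of the usual kernel verbs fast-registration pattern: invalidate the MR's current rkey, advance the low 8-bit key portion with ib_inc_rkey()/ib_update_fast_reg_key(), and post IB_WR_REG_MR under the new key, with the LOCAL_INV chained in front of the REG_MR. The function name, page size, and access flags are assumptions for the sketch.

/*
 * Illustrative fast-registration pattern behind the trace above:
 * invalidate the MR's current rkey, bump the 8-bit key portion, and
 * register again under the new key.  If this sequence is posted twice
 * without the cached rkey being refreshed in between, the second
 * LOCAL_INV targets an already-invalidated key, which is what the
 * skl02 trace shows.
 */
#include <rdma/ib_verbs.h>

static int post_fastreg(struct ib_qp *qp, struct ib_mr *mr,
			struct scatterlist *sg, int nents)
{
	struct ib_send_wr inv_wr = {};
	struct ib_reg_wr fastreg_wr = {};
	const struct ib_send_wr *bad_wr;
	int n;

	n = ib_map_mr_sg(mr, sg, nents, NULL, PAGE_SIZE);
	if (n < nents)
		return n < 0 ? n : -EINVAL;

	/* Invalidate the key the hardware currently holds for this MR. */
	inv_wr.opcode = IB_WR_LOCAL_INV;
	inv_wr.ex.invalidate_rkey = mr->rkey;	/* e.g. 0x31172601 */
	inv_wr.next = &fastreg_wr.wr;

	/* Advance the 8-bit key portion: 0x...01 -> 0x...02. */
	ib_update_fast_reg_key(mr, ib_inc_rkey(mr->rkey));

	/* Register the MR again under the new key. */
	fastreg_wr.wr.opcode = IB_WR_REG_MR;
	fastreg_wr.mr = mr;
	fastreg_wr.key = mr->rkey;		/* e.g. 0x31172602 */
	fastreg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
			    IB_ACCESS_REMOTE_READ;

	return ib_post_send(qp, &inv_wr, &bad_wr);
}

The first LOCAL_INV/REG_MR pair in the trace matches this pattern (invalidate 0x31172601, register 0x31172602); the second pair invalidates 0x31172601 again instead of the current key, which is why we are asking whether something else in the code path issues an extra invalidate or reuses a stale cached rkey.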
| Comments |
| Comment by James Erwin [ 26/Oct/20 ] |
|
Hello, is there any update on this issue? |
| Comment by Mike Marciniszyn [ 18/Mar/22 ] |
|
I suspect this issue is a duplicate. |
| Comment by Mike Marciniszyn [ 23/Mar/22 ] |
|
I have confirmed that a client built off of the 2.12.8 branch doesn't see the issue. This is indeed a duplicate.