Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.12.5
-
None
-
RHEL 8.1, Intel Corporation Ethernet Controller E810-C, iwarp or RoCE mode
-
3
-
9223372036854775807
Description
During lnet_selftest reads larger than 4K (writes do not hit this issue), the ice driver detects a duplicate IB_WR_LOCAL_INV on the same key and the test fails. We also cannot mount a file system over Lustre because of this issue.
We instrumented o2iblnd.c and the irdma driver and found the duplicate IB_WR_LOCAL_INV:
In the example below, skl01 is the client and skl02 is the server during an lst read operation.
skl01, irdma trace:
[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xb16f427b
[Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.1@o2ib [8/256/0/180]
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x8e89e91b
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x8e89e91c, info.stag_key=0x1c, info.stag_idx=0x8e89e9
[Mon Sep 21 10:14:08 2020] ice 0000:18:00.0: abnormal ae_id = 0x50a bool qp=1 qp_id = 6
[Mon Sep 21 10:14:08 2020] LNetError: 4268:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.2@o2ib: Async QP event type 1
[Mon Sep 21 10:14:08 2020] LustreError: 47768:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.1.2@o2ib failed with -103
skl01, Lustre trace:
00000800:00000200:57.0:1600697647.117610:0:47619:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 8e89e91b
00000800:00000200:57.0:1600697647.117611:0:47619:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 8e89e91c
skl02, irdma trace:
[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0x968276e9
[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xd477355c
[Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.2@o2ib [8/256/0/180]
[Mon Sep 21 10:14:06 2020] LNet: 36382:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601 [MKI] This key was already invalidated
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4
[Mon Sep 21 10:14:06 2020] LNetError: 36497:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.1@o2ib: Async QP event type 1
[Mon Sep 21 10:14:07 2020] MKI-IB_WR_LOCAL_INV rkey = 0x76b95769
[Mon Sep 21 10:14:07 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x76b9576a, info.stag_key=0x6a, info.stag_idx=0x76b957
[Mon Sep 21 10:14:07 2020] LNet: 36261:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.1@o2ib
skl02, Lustre trace:
00000800:00000200:28.0:1600697647.441213:0:36436:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 31172601
00000800:00000200:28.0:1600697647.441215:0:36436:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 31172602
00000800:00000200:59.0:1600697647.677998:0:36262:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 76b95769
00000800:00000200:59.0:1600697647.678000:0:36262:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 76b9576a
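The key values in these traces are consistent with a bump that increments only the low 8 bits of the key (which is how the kernel's ib_inc_rkey() helper behaves), with irdma reporting those low 8 bits as info.stag_key and the upper 24 bits as info.stag_idx. The Python sketch below reproduces that arithmetic for the three keys seen in the logs. The function names are hypothetical, and the bit layout is an assumption inferred from the logged values rather than taken from the driver source.

```python
KEY_MASK = 0xFF  # assumption: the low 8 bits hold the consumer-owned key


def bump_key(key):
    """Increment the low 8 bits (wrapping) and keep the upper 24 bits."""
    return ((key + 1) & KEY_MASK) | (key & ~KEY_MASK & 0xFFFFFFFF)


def split_stag(key):
    """Return (stag_idx, stag_key) as they appear in the irdma trace."""
    return key >> 8, key & KEY_MASK


# Keys logged by kiblnd_fmr_pool_map() on skl01 and skl02.
for key in (0x8E89E91B, 0x31172601, 0x76B95769):
    bumped = bump_key(key)
    stag_idx, stag_key = split_stag(bumped)
    print(f"key {key:08x} -> after bump {bumped:08x} "
          f"(stag_idx={stag_idx:#x}, stag_key={stag_key:#x})")
```

Each bumped value matches the key that the corresponding IB_WR_REG_MR line reports, so the bump itself appears to work as intended; the open question is why the pre-bump key is invalidated a second time.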
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
The next line below is a duplicate: rkey = 0x31172601 was already invalidated above. We think this should have been 0x31172602, which would avoid invalidating an already invalidated key and also avoid a duplicate REG_MR on the same key.
Is there something else doing an invalidate somewhere in the code?
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4
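The duplicate can also be spotted mechanically. The Python sketch below scans irdma trace lines in the instrumented MKI- format shown above and flags any IB_WR_LOCAL_INV whose rkey was already invalidated and has not been registered again by an IB_WR_REG_MR. The regular expressions and the function name are illustrative assumptions tied to this particular debug output; they are not part of Lustre or irdma.

```python
import re

# Patterns match the ad hoc "MKI-" debug lines added to the irdma driver.
INV_RE = re.compile(r"MKI-IB_WR_LOCAL_INV rkey = (0x[0-9a-fA-F]+)")
REG_RE = re.compile(r"MKI-IB_WR_REG_MR:.*reg_wr\(ib_wr\)->key=(0x[0-9a-fA-F]+)")


def find_duplicate_invalidates(lines):
    """Return rkeys that were invalidated again without being re-registered."""
    invalidated = set()
    duplicates = []
    for line in lines:
        match = INV_RE.search(line)
        if match:
            rkey = int(match.group(1), 16)
            if rkey in invalidated:
                duplicates.append(rkey)
            invalidated.add(rkey)
            continue
        match = REG_RE.search(line)
        if match:
            # A registration under this key makes it valid again.
            invalidated.discard(int(match.group(1), 16))
    return duplicates


# Work requests copied from the skl02 irdma trace in this report.
TRACE = """\
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601
[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726
[Mon Sep 21 10:14:07 2020] MKI-IB_WR_LOCAL_INV rkey = 0x76b95769
[Mon Sep 21 10:14:07 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x76b9576a, info.stag_key=0x6a, info.stag_idx=0x76b957
""".splitlines()

for rkey in find_duplicate_invalidates(TRACE):
    print(f"duplicate IB_WR_LOCAL_INV on rkey {rkey:#010x}")
```

Running the same check over a full dmesg capture from both nodes would show whether the second invalidate always targets the pre-bump key, as it does here.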
Issue Links
- duplicates LU-14733 brw_bulk_ready() BRW bulk READ failed for RPC from 12345-192.168.128.126@o2ib18: -103 (Resolved)