[LU-13976] duplicate IB_WR_LOCAL_INV causing ice driver failure (RoCE/iWarp) Created: 22/Sep/20  Updated: 23/Mar/22  Resolved: 23/Mar/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: James Erwin
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

RHEL 8.1, Intel Corporation Ethernet Controller E810-C, iwarp or RoCE mode


Issue Links:
Duplicate
duplicates LU-14733 brw_bulk_ready() BRW bulk READ failed... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During lnet_selftest reads larger than 4K (writes are not affected), the ice driver detects a duplicate IB_WR_LOCAL_INV on the same key and the test fails. We also cannot mount a Lustre file system because of this issue.

We instrumented o2iblnd.c and the irdma driver and found the duplicate IB_WR_LOCAL_INV:

 

In the example below, skl01 is the client and skl02 is the server during an lst read operation.

 

skl01, irdma trace:

[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xb16f427b

[Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.1@o2ib [8/256/0/180]

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x8e89e91b

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x8e89e91c, info.stag_key=0x1c, info.stag_idx=0x8e89e9

[Mon Sep 21 10:14:08 2020] ice 0000:18:00.0: abnormal ae_id = 0x50a bool qp=1 qp_id = 6

[Mon Sep 21 10:14:08 2020] LNetError: 4268:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.2@o2ib: Async QP event type 1

[Mon Sep 21 10:14:08 2020] LustreError: 47768:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.1.2@o2ib failed with -103

 

skl01, Lustre trace:

00000800:00000200:57.0:1600697647.117610:0:47619:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 8e89e91b
00000800:00000200:57.0:1600697647.117611:0:47619:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 8e89e91c
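
The skl01 traces above are consistent with the key bump changing only the low byte of the key: 0x8e89e91b becomes 0x8e89e91c, and irdma reports info.stag_key=0x1c with info.stag_idx=0x8e89e9. The following is a minimal user-space sketch of that arithmetic, assuming the bump matches the low-8-bit increment performed by the kernel's ib_inc_rkey() helper. The names bump_rkey(), stag_idx() and stag_key() are hypothetical and used only for illustration; they are not part of o2iblnd or irdma.

```c
#include <stdint.h>

/* Hypothetical helper: increment only the low 8 bits (the "key" byte),
 * leaving the upper 24 bits (the index) untouched. The low byte wraps
 * from 0xff to 0x00 without carrying into the index. */
uint32_t bump_rkey(uint32_t rkey)
{
    const uint32_t mask = 0x000000ffu;

    return ((rkey + 1) & mask) | (rkey & ~mask);
}

/* Hypothetical helper: upper 24 bits, as printed in info.stag_idx. */
uint32_t stag_idx(uint32_t rkey)
{
    return rkey >> 8;
}

/* Hypothetical helper: low 8 bits, as printed in info.stag_key. */
uint32_t stag_key(uint32_t rkey)
{
    return rkey & 0xffu;
}
```

With these helpers, bump_rkey(0x8e89e91b) yields 0x8e89e91c, which splits into index 0x8e89e9 and key 0x1c, matching the values in the irdma trace.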

 

 

skl02, irdma trace:

[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0x968276e9

[Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xd477355c

[Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.2@o2ib [8/256/0/180]

[Mon Sep 21 10:14:06 2020] LNet: 36382:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601  [MKI] This key was already invalidated

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

[Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4

[Mon Sep 21 10:14:06 2020] LNetError: 36497:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.1@o2ib: Async QP event type 1

[Mon Sep 21 10:14:07 2020] MKI-IB_WR_LOCAL_INV rkey = 0x76b95769

[Mon Sep 21 10:14:07 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x76b9576a, info.stag_key=0x6a, info.stag_idx=0x76b957

[Mon Sep 21 10:14:07 2020] LNet: 36261:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.1@o2ib

 

skl02, Lustre trace:

00000800:00000200:28.0:1600697647.441213:0:36436:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 31172601
00000800:00000200:28.0:1600697647.441215:0:36436:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 31172602
00000800:00000200:59.0:1600697647.677998:0:36262:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 76b95769
00000800:00000200:59.0:1600697647.678000:0:36262:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 76b9576a

 

 

 

 

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

 

The LOCAL_INV shown next is a duplicate: rkey 0x31172601 was already invalidated above. We think it should have been 0x31172602, which would avoid both invalidating an already invalidated key and issuing a duplicate REG_MR on the same key.

Is there something else doing an invalidate somewhere in the code?

 

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601

[Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

[Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4
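
The reasoning above can be expressed as a small check over the sequence of work requests seen for one MR. This is only a simplified model under stated assumptions, not the check the ice/irdma driver actually performs: it assumes a single MR whose initial key is valid, that LOCAL_INV must name the currently valid key, and that REG_MR must follow an invalidate. The types and the function first_bad_event() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum wr_op { WR_LOCAL_INV, WR_REG_MR };

struct wr_event {
    enum wr_op op;
    uint32_t key;
};

/* Hypothetical model of one MR. Returns the index of the first event that
 * either invalidates a key that is not the currently valid one, or registers
 * a key while the MR is still valid. Returns -1 if the sequence is clean. */
int first_bad_event(const struct wr_event *ev, size_t n, uint32_t initial_key)
{
    uint32_t valid_key = initial_key; /* assumption: MR starts out valid */
    int is_valid = 1;

    for (size_t i = 0; i < n; i++) {
        if (ev[i].op == WR_LOCAL_INV) {
            /* Stale or repeated invalidate */
            if (!is_valid || ev[i].key != valid_key)
                return (int)i;
            is_valid = 0;
        } else {
            /* REG_MR on an MR that was never invalidated */
            if (is_valid)
                return (int)i;
            valid_key = ev[i].key;
            is_valid = 1;
        }
    }
    return -1;
}
```

Fed the skl02 sequence (LOCAL_INV 0x31172601, REG_MR 0x31172602, LOCAL_INV 0x31172601, REG_MR 0x31172602) with an initial key of 0x31172601, this model flags the third request, the second LOCAL_INV. The sequence we expected (the second pair using 0x31172602 and 0x31172603) passes.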



 Comments   
Comment by James Erwin [ 26/Oct/20 ]

Hello, is there any update on this issue? 

Comment by Mike Marciniszyn [ 18/Mar/22 ]

I suspect this issue is a duplicate.

Comment by Mike Marciniszyn [ 23/Mar/22 ]

I have confirmed that a client built from the 2.12.8 branch does not see the issue.

This is indeed a duplicate.

Generated at Sat Feb 10 03:05:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.