Lustre / LU-13976

duplicate IB_WR_LOCAL_INV causing ice driver failure (RoCE/iWarp)

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.12.5
    • None
    • RHEL 8.1, Intel Corporation Ethernet Controller E810-C, iWARP or RoCE mode
    • 3

    Description

      During an lnet_selftest read larger than 4K (writes do not trigger the issue), the ice driver detects a duplicate IB_WR_LOCAL_INV on the same key and the test fails. We also cannot mount a Lustre file system because of this issue.

      We instrumented o2iblnd.c and the irdma driver and found the duplicate IB_WR_LOCAL_INV:

       

      In the example below, skl01 is the client and skl02 is the server during an lst read operation.

       

      skl01, irdma trace:

      [Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xb16f427b

      [Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.1@o2ib [8/256/0/180]

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x8e89e91b

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x8e89e91c, info.stag_key=0x1c, info.stag_idx=0x8e89e9

      [Mon Sep 21 10:14:08 2020] ice 0000:18:00.0: abnormal ae_id = 0x50a bool qp=1 qp_id = 6

      [Mon Sep 21 10:14:08 2020] LNetError: 4268:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.2@o2ib: Async QP event type 1

      [Mon Sep 21 10:14:08 2020] LustreError: 47768:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.1.2@o2ib failed with -103

       

      skl01, Lustre trace:

      00000800:00000200:57.0:1600697647.117610:0:47619:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 8e89e91b
      00000800:00000200:57.0:1600697647.117611:0:47619:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 8e89e91c
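
The "key" and "key after bump" values in the Lustre trace differ only in the low byte. That matches the way the kernel's ib_inc_rkey() helper advances the 8-bit key portion of an rkey while leaving the upper 24 bits untouched. A minimal Python sketch of that arithmetic follows, assuming the stag_idx/stag_key split shown in the irdma trace; split_rkey and bump_rkey are illustrative names, not driver functions.

```python
KEY_MASK = 0xFF  # low 8 bits carry the key; the upper 24 bits carry the index


def split_rkey(rkey):
    """Return (stag_idx, stag_key) for a 32-bit rkey."""
    return rkey >> 8, rkey & KEY_MASK


def bump_rkey(rkey):
    """Increment only the 8-bit key portion, wrapping at 0xff."""
    return (rkey & ~KEY_MASK & 0xFFFFFFFF) | ((rkey + 1) & KEY_MASK)


# Keys taken from the Lustre traces in this report.
for before in (0x8E89E91B, 0x31172601, 0x76B95769):
    after = bump_rkey(before)
    idx, key = split_rkey(after)
    print(f"{before:#010x} -> {after:#010x} (stag_idx={idx:#x}, stag_key={key:#x})")
```

Under these assumptions, each bumped value agrees with the reg_wr(ib_wr)->key, info.stag_key and info.stag_idx fields logged by the instrumented irdma driver.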

       

       

      skl02, irdma trace:

      [Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0x968276e9

      [Mon Sep 21 10:11:42 2020] MKI-irdma_create_stag: returning stag = 0xd477355c

      [Mon Sep 21 10:11:42 2020] LNet: Added LNI 192.168.1.2@o2ib [8/256/0/180]

      [Mon Sep 21 10:14:06 2020] LNet: 36382:0:(rpc.c:612:srpc_service_add_buffers()) waiting for adding buffer

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601  [MKI] This key was already invalidated

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

      [Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4

      [Mon Sep 21 10:14:06 2020] LNetError: 36497:0:(o2iblnd_cb.c:3676:kiblnd_qp_event()) 192.168.1.1@o2ib: Async QP event type 1

      [Mon Sep 21 10:14:07 2020] MKI-IB_WR_LOCAL_INV rkey = 0x76b95769

      [Mon Sep 21 10:14:07 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x76b9576a, info.stag_key=0x6a, info.stag_idx=0x76b957

      [Mon Sep 21 10:14:07 2020] LNet: 36261:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 192.168.1.1@o2ib

       

      skl02, Lustre trace:

      00000800:00000200:28.0:1600697647.441213:0:36436:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 31172601
      00000800:00000200:28.0:1600697647.441215:0:36436:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 31172602
      00000800:00000200:59.0:1600697647.677998:0:36262:0:(o2iblnd.c:1913:kiblnd_fmr_pool_map()) jpe key 76b95769
      00000800:00000200:59.0:1600697647.678000:0:36262:0:(o2iblnd.c:1919:kiblnd_fmr_pool_map()) jpe key after bump 76b9576a

       

       

       

       

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

       

      The LOCAL_INV below is a duplicate: rkey = 0x31172601 was already invalidated above. We think it should have targeted 0x31172602, which would avoid both invalidating an already-invalidated key and issuing a duplicate REG_MR for the same key.

      Is something else in the code issuing an invalidate?

       

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_LOCAL_INV rkey = 0x31172601

      [Mon Sep 21 10:14:06 2020] MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602, info.stag_key=0x2, info.stag_idx=0x311726

      [Mon Sep 21 10:14:06 2020] ice 0000:18:00.0: abnormal ae_id = 0x106 bool qp=1 qp_id = 4
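
To check other captures for the same pattern, the instrumented trace can be scanned for any LOCAL_INV whose rkey no longer matches the key most recently registered for that stag index. The sketch below is a hypothetical helper, not part of Lustre or irdma, and it assumes the MKI- message formats shown in this report.

```python
import re

LOCAL_INV = re.compile(r"IB_WR_LOCAL_INV rkey = (0x[0-9a-fA-F]+)")
REG_MR = re.compile(r"reg_wr\(ib_wr\)->key=(0x[0-9a-fA-F]+)")


def find_stale_invalidates(lines):
    """Return (line_no, rkey, expected) for each LOCAL_INV on a superseded key."""
    last_registered = {}  # stag index (upper 24 bits) -> key last seen in REG_MR
    findings = []
    for line_no, line in enumerate(lines, 1):
        m = LOCAL_INV.search(line)
        if m:
            rkey = int(m.group(1), 16)
            expected = last_registered.get(rkey >> 8)
            if expected is not None and expected != rkey:
                findings.append((line_no, rkey, expected))
            continue
        m = REG_MR.search(line)
        if m:
            key = int(m.group(1), 16)
            last_registered[key >> 8] = key
    return findings


# skl02 excerpt from this report, timestamps and unrelated fields trimmed.
sample = [
    "MKI-IB_WR_LOCAL_INV rkey = 0x31172601",
    "MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602",
    "MKI-IB_WR_LOCAL_INV rkey = 0x31172601",
    "MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x31172602",
    "MKI-IB_WR_LOCAL_INV rkey = 0x76b95769",
    "MKI-IB_WR_REG_MR: ib_wr->wr_id=0x4, reg_wr(ib_wr)->key=0x76b9576a",
]

for line_no, rkey, expected in find_stale_invalidates(sample):
    print(f"line {line_no}: LOCAL_INV on {rkey:#010x}, expected {expected:#010x}")
```

On the skl02 excerpt this flags only the second LOCAL_INV on 0x31172601, the one issued after 0x31172602 had already been registered.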

            People

              wc-triage WC Triage
              jerwin James Erwin