Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0
    • Lustre 2.7.0
    • None
    • mlnx ofed3.2
      lustre-2.7.2-2nas-fe
      Linux elrtr1 3.0.101-77.1.20160630-nasa #1 SMP Thu Jun 30 00:56:32 UTC 2016 (a082ea6) x86_64 x86_64 x86_64 GNU/Linux
    • 2
    • 9223372036854775807

    Description

      Running LNet selftest on an mlx5 card produces these errors:

      [1477328975.069684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10912): dump error cqe
      [1477328975.085684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10906): dump error cqe
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 08007806 2500002f 00085dd0
      [1477328975.085684] LustreError: 11028:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.25@o2ib: -5
      [1477328975.085684] LustreError: 11028:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer to 12345-10.151.27.25@o2ib has failed: -5
      [1477328975.093683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10922): dump error cqe
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 08007806 25000030 000842d0
      [1477328975.105683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 08007806 25000031 000843d0
      [1477328975.113683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 08007806 25000032 000840d0
      [1477328975.121683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 08007806 25000033 000841d0
      [1477328975.129683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 08007806 2500002e 00085cd0
      [1477328975.133683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10907): dump error cqe
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 08007806 25000034 000846d0
      [1477328975.205682] 00000000 00000000 00000000 00000000
      [1477328975.281682] 00000000 00000000 00000000 00000000
      [1477328975.305681] 00000000 00000000 00000000 00000000
      [1477328975.305681] 00000000 08007806 2500002d 000b57d0
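In each dump above, only the last row carries non-zero words. As a rough aid for comparing those rows across dumps, the following sketch pulls them out of a saved log. The function name and the sample text are illustrative assumptions, and the script does not try to interpret the CQE fields.

```python
import re

# Matches a kernel log line that carries exactly four 32-bit hex words, e.g.
# "[1477328975.085684] 00000000 08007806 2500002f 00085dd0".
HEX_ROW = re.compile(r"^\s*\[(\d+\.\d+)\]\s+((?:[0-9a-f]{8}\s+){3}[0-9a-f]{8})\s*$")


def nonzero_cqe_rows(log_text):
    """Return (timestamp, [w0, w1, w2, w3]) for every hex row that is not all zero."""
    rows = []
    for line in log_text.splitlines():
        match = HEX_ROW.match(line)
        if not match:
            continue  # header lines and LustreError lines are skipped
        words = [int(word, 16) for word in match.group(2).split()]
        if any(words):
            rows.append((match.group(1), words))
    return rows


# Sample lines copied from the log in this ticket.
sample = """\
[1477328975.085684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10906): dump error cqe
[1477328975.085684] 00000000 00000000 00000000 00000000
[1477328975.085684] 00000000 08007806 2500002f 00085dd0
[1477328975.085684] LustreError: 11028:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.25@o2ib: -5
"""

for timestamp, words in nonzero_cqe_rows(sample):
    print(timestamp, " ".join(f"{word:08x}" for word in words))
```

Run against the full log, this gives one line per error CQE, which makes it easier to see which words change from one dump to the next.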
      

          Activity

            [LU-8752] mlx5_warn:mlx5_0:dump_cqe:257:
            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24306/
            Subject: LU-8752 lnet: Stop MLX5 triggering a dump_cqe
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 783428b60a98874b4783f8da48c66019d68d84d6

            doug Doug Oucharek (Inactive) added a comment -

            Mahmoud:

            I suspect there is a bug somewhere in the Mellanox MLX5 driver. My debugging showed that the first time we use the L-key/R-key for an RDMA operation (reading from MLX5), it is rejected as invalid. I spent a lot of time tracing the code to see whether we had invoked anything that could have changed the state of the R-key/L-key, and found nothing.

            That is why I came up with the solution of just invalidating the first key we set and advancing to the next one. I have not found an issue with invalidating a free key, so I am hoping this fix will not cause any problems down the road (Dmitry and Amir's concerns in the code review).

            Doug
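The workaround Doug describes (invalidate the first key, then advance to the next one) can be modelled in a few lines. This is only a toy sketch, not the code from the merged patch: the class, the `needs_invalidate` flag and the operation names are hypothetical, and it assumes that only the low 8 bits of a 32-bit key are advanced while the upper 24 bits stay fixed.

```python
from dataclasses import dataclass, field

KEY_MASK = 0xFF  # assumption: only the low 8 bits of a 32-bit key are advanced


def next_key(key: int) -> int:
    """Advance the low 8 bits of a 32-bit key, wrapping within those bits."""
    return ((key + 1) & KEY_MASK) | (key & 0xFFFFFF00)


@dataclass
class FastRegDescriptor:
    """Hypothetical stand-in for one pooled fast-registration descriptor."""
    key: int
    needs_invalidate: bool = True  # the workaround: treat a brand-new key as stale
    posted: list = field(default_factory=list)  # work requests, in posting order


def map_region(desc: FastRegDescriptor) -> list:
    """Queue the requests for one mapping and return everything posted so far."""
    if desc.needs_invalidate:
        desc.posted.append(("LOCAL_INV", desc.key))  # invalidate the current key
        desc.key = next_key(desc.key)                # then advance to the next one
    desc.posted.append(("FAST_REG", desc.key))
    desc.needs_invalidate = True  # the key is now in use, so invalidate before reuse
    return desc.posted


desc = FastRegDescriptor(key=0x12345678)
for op, key in map_region(desc):
    print(f"{op:9s} key=0x{key:08x}")
```

In this model the very first mapping already posts an invalidate before the registration, which is the behaviour the comment describes; without the initial flag, the first registration would reuse the key exactly as it was allocated.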

            jaylan Jay Lan (Inactive) added a comment -

            Hi Peter,
            NASA Ames is running 2.7.2+ in production, but we have started testing the 2.9.0 client on SLES12 SP2.

            jay
            yujian Jian Yu added a comment -

            Hi James,
            Here is the patch for FE 2.8.x release: https://review.whamcloud.com/24365


            simmonsja James A Simmons added a comment -

            ORNL needs it for the 2.8 release.

            pjones Peter Jones added a comment -

            Jay

            Are you running with 2.9.x? I had thought that you were using 2.7.x...

            Peter


            jaylan Jay Lan (Inactive) added a comment -

            Do we need this patch for the Lustre 2.9 release?

            mhanafi Mahmoud Hanafi added a comment -

            Here is some feedback from MLX:

            "According to IB spec state should be set to Free.

            Under "MEMORY REGION TYPES" section there is an explanation of states, specifically:
            Table: "Memory Region States Summary"
            "The following table summarizes the states of Memory Regions L_Keys and R_Keys and the operations allowed on each state:"
            Looking at Property / Operation Allowed: for "Fast Register", there are 3 possible states: Invalid, Free, Valid.
            The only allowed state is - Free."

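The rule quoted from Mellanox can be written down as a small check. This sketch encodes only what the quote states (three states, with Fast Register permitted only from Free); the enum and function names are illustrative and are not taken from any driver or from the IB specification text.

```python
from enum import Enum


class MRState(Enum):
    """The three memory region states named in the quoted summary."""
    INVALID = "Invalid"
    FREE = "Free"
    VALID = "Valid"


def fast_register_allowed(state: MRState) -> bool:
    """Per the quoted summary, Fast Register is permitted only from the Free state."""
    return state is MRState.FREE


for state in MRState:
    print(f"{state.value:8s} fast register allowed: {fast_register_allowed(state)}")
```

Read this way, a key has to be returned to Free, for example by a local invalidate, before it is used for a Fast Register.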
            doug Doug Oucharek (Inactive) added a comment -

            The 2.7FE patch is: https://review.whamcloud.com/24336

            People

              doug Doug Oucharek (Inactive)
              mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 15
