Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0
    • Lustre 2.7.0
    • None
    • mlnx ofed3.2
      lustre-2.7.2-2nas-fe
      Linux elrtr1 3.0.101-77.1.20160630-nasa #1 SMP Thu Jun 30 00:56:32 UTC 2016 (a082ea6) x86_64 x86_64 x86_64 GNU/Linux
    • 2
    • 9223372036854775807

    Description

      Running LNet selftest on an mlx5 card produces these errors:

      [1477328975.069684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10912): dump error cqe
      [1477328975.085684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10906): dump error cqe
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 08007806 2500002f 00085dd0
      [1477328975.085684] LustreError: 11028:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.25@o2ib: -5
      [1477328975.085684] LustreError: 11028:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer to 12345-10.151.27.25@o2ib has failed: -5
      [1477328975.093683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10922): dump error cqe
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 08007806 25000030 000842d0
      [1477328975.105683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 08007806 25000031 000843d0
      [1477328975.113683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 08007806 25000032 000840d0
      [1477328975.121683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 08007806 25000033 000841d0
      [1477328975.129683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 08007806 2500002e 00085cd0
      [1477328975.133683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10907): dump error cqe
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 08007806 25000034 000846d0
      [1477328975.205682] 00000000 00000000 00000000 00000000
      [1477328975.281682] 00000000 00000000 00000000 00000000
      [1477328975.305681] 00000000 00000000 00000000 00000000
      [1477328975.305681] 00000000 08007806 2500002d 000b57d0
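Each dump above is a 64-byte error CQE printed as four rows of 32-bit words. The following is a minimal sketch of how such a dump might be decoded. The field offsets and the syndrome table are assumptions recalled from `struct mlx5_err_cqe` and the `MLX5_CQE_SYNDROME_*` constants in the Linux kernel headers, and the sketch also assumes each printed word is in big-endian byte order; verify all of these against the headers for your kernel or MOFED build. `decode_error_cqe` is a hypothetical helper, not part of any Mellanox tool.

```python
import struct

# ASSUMPTION: syndrome codes as recalled from the kernel's mlx5 headers.
SYNDROMES = {
    0x01: "local length error",
    0x02: "local QP operation error",
    0x04: "local protection error",
    0x05: "work request flushed",
    0x06: "memory window bind error",
    0x10: "bad response",
    0x11: "local access error",
    0x12: "remote invalid request",
    0x13: "remote access error",
    0x14: "remote operation error",
    0x15: "transport retry counter exceeded",
    0x16: "RNR retry counter exceeded",
    0x22: "remote aborted",
}


def decode_error_cqe(dump_rows):
    """Decode four dump_cqe rows (four hex words each) into named fields."""
    words = [int(w, 16) for row in dump_rows for w in row.split()]
    if len(words) != 16:
        raise ValueError("expected 16 words, got %d" % len(words))
    # ASSUMPTION: each printed word is the big-endian value of 4 CQE bytes.
    raw = struct.pack(">16I", *words)
    # ASSUMED offsets: byte 54 vendor syndrome, byte 55 syndrome,
    # bytes 56-59 WQE opcode + QP number, bytes 60-61 WQE counter,
    # byte 63 CQE opcode (high nibble) and ownership bits.
    (opcode_qpn,) = struct.unpack_from(">I", raw, 56)
    (wqe_counter,) = struct.unpack_from(">H", raw, 60)
    return {
        "vendor_err_synd": raw[54],
        "syndrome": raw[55],
        "syndrome_name": SYNDROMES.get(raw[55], "unknown"),
        "wqe_opcode": opcode_qpn >> 24,
        "qpn": opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "cqe_opcode": raw[63] >> 4,
    }


if __name__ == "__main__":
    rows = [
        "00000000 00000000 00000000 00000000",
        "00000000 00000000 00000000 00000000",
        "00000000 00000000 00000000 00000000",
        "00000000 08007806 2500002f 00085dd0",
    ]
    for name, value in decode_error_cqe(rows).items():
        print(name, value if isinstance(value, str) else hex(value))
```

Under these assumptions, every dump in the log carries syndrome 0x06 and WQE opcode 0x25. If the recalled constants are right, those are the memory-window bind error and the UMR opcode that mlx5 uses for fast registration, which would be consistent with the FastReg explanation given in the comments.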
      

          Activity

            [LU-8752] mlx5_warn:mlx5_0:dump_cqe:257:

            OmniPath is not affected by this patch as it uses FMR and not FastReg.  So this change should only affect MLX5 based cards.

            Did you need me to push a 2.7FE patch?

            doug Doug Oucharek (Inactive) added a comment

            Thanks, we will build and test it.

            mhanafi Mahmoud Hanafi added a comment

            I have submitted a fix to this issue above (ignore the earlier Gerrit patch as it was just for my debugging).  I have validated it on our mlx4 / mlx5 mixed cluster.  I still need to validate it on OmniPath to ensure it does not cause a problem there.

            doug Doug Oucharek (Inactive) added a comment

            Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/24306
            Subject: LU-8752 lnet: Stop MLX5 triggering a dump_cqe
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3dcf679c7f1e2aa99d99a10deb65ac2faa30a629

            gerrit Gerrit Updater added a comment

            Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/24162
            Subject: LU-8752 lnet: Debugging Patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 235b2ce7d07ae80e31a170cb7e424f859dce97b1

            gerrit Gerrit Updater added a comment

            Thank you for that feedback.  I will dig into the code for the path where we are making this mistake.

            doug Doug Oucharek (Inactive) added a comment

            Here is an update from Mellanox engineering:

            "The failure is happening because of fast reg mr called twice, second time the mkey is not free - but it set for a check if free and its meaning the operation will succeed only if mkey is free.

            In order to call for second fast reg mr customer needed to do local invalidate before that call.

            (These operation are explained in IB spec)."
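The rule in the Mellanox note can be illustrated with a toy state machine. This is only a sketch of the described behavior, assuming a memory key has just two states; `ToyMKey`, `MKeyNotFree`, and `reregister` are hypothetical names and do not correspond to the verbs API or to the ko2iblnd code.

```python
from enum import Enum


class MKeyState(Enum):
    FREE = "free"
    VALID = "valid"


class MKeyNotFree(Exception):
    """Raised when a fast registration targets a key that is still valid."""


class ToyMKey:
    """Toy model of a memory key whose registration checks for 'free'."""

    def __init__(self):
        self.state = MKeyState.FREE

    def fast_reg(self):
        # The registration is posted with a check-if-free condition,
        # so it succeeds only when the key is currently free.
        if self.state is not MKeyState.FREE:
            raise MKeyNotFree("fast reg posted on a key that is not free")
        self.state = MKeyState.VALID

    def local_invalidate(self):
        # A local invalidate returns the key to the free state.
        self.state = MKeyState.FREE


def reregister(mkey, invalidate_first):
    """Register a key a second time; return True if the second call succeeds."""
    mkey.fast_reg()
    if invalidate_first:
        mkey.local_invalidate()
    try:
        mkey.fast_reg()
    except MKeyNotFree:
        return False
    return True


if __name__ == "__main__":
    print("second fast reg without invalidate:", reregister(ToyMKey(), False))
    print("second fast reg after invalidate:  ", reregister(ToyMKey(), True))
```

In this model the second registration fails unless a local invalidate is issued in between, which is the sequence the note says is required.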

            mhanafi Mahmoud Hanafi added a comment

            Another interesting data point:

            I flipped the from and to parameters in your test making it "--from mlx5_host --to mlx4_host". In theory, the issue should happen when I then do a write as this is almost the same thing as what you have. However, in this case, I am not able to reproduce the problem with either read or write.

            I believe the difference when you flip the from and to parameters is who initiates the test.

            doug Doug Oucharek (Inactive) added a comment

            Mellanox has also been able to reproduce this issue in their lab and is looking at it.

            mhanafi Mahmoud Hanafi added a comment
            doug Doug Oucharek (Inactive) added a comment - - edited

            I've been able to get mixed mlx4 and mlx5 cards in the same cluster. Using your test examples, I have been able to reproduce this issue getting the same results as you.

            I verified that the issue only occurs when map_on_demand is nonzero. I have reproduced the issue with both upstream OFED and MOFED (latest version for both). I have also reproduced it with RHEL 6.8 and 7.3.

            No solution yet, but at least I have a way to investigate this now.

            mhanafi Mahmoud Hanafi added a comment - - edited

            After the host is rebooted, the error can be reproduced by running a read test from mlx4 to mlx5.
            This produces the dump_cqe error:

            #test1
            lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw read size=1M check=full
            

            But this will work!

            #test2
            lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw write size=1M check=full 
            

            What's interesting is that once the write test has been run, the read test starts to work.
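The direction and operation combinations reported in this thread can be collected in one place. The sketch below rebuilds the `lst add_test` lines from the comments and pairs each with the outcome reported for it; `add_test_cmd` and `OBSERVED` are hypothetical names, the host names are the placeholders used in the comments, and the outcomes come from two different clusters, so treat the table as a summary rather than a controlled comparison.

```python
def add_test_cmd(src, dst, op, batch="bulk_rw"):
    """Build the lst add_test line used in the thread for one direction and op."""
    return (
        "lst add_test --batch %s --from %s --to %s brw %s size=1M check=full"
        % (batch, src, dst, op)
    )


# (from, to, operation, outcome reported in the comments)
OBSERVED = [
    ("mlx4_host", "mlx5_host", "read", "dump_cqe error after reboot"),
    ("mlx4_host", "mlx5_host", "write", "works; read works afterwards"),
    ("mlx5_host", "mlx4_host", "read", "not reproduced"),
    ("mlx5_host", "mlx4_host", "write", "not reproduced"),
]

if __name__ == "__main__":
    for src, dst, op, outcome in OBSERVED:
        print(add_test_cmd(src, dst, op))
        print("    reported outcome:", outcome)
```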


            People

              doug Doug Oucharek (Inactive)
              mhanafi Mahmoud Hanafi