[LU-12901] Failing to create a properly sized IB queue pair

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.13.0
    • Labels: None
    • Environment: Seen with newer Mellanox ConnectX-4 devices
    • Severity: 3

    Description

      While bringing up a file system in our test bed with the latest Lustre version (2.13), I saw this new error during LNet bring-up:

      [ 472.738363] LNet: 8481:0:(o2iblnd_cb.c:3395:kiblnd_check_conns()) Timed out tx for 10.37.248.232@o2ib1: 471 seconds
      [ 473.739295] LNetError: 2014:0:(o2iblnd.c:929:kiblnd_create_conn()) Can't create QP: -12, send_wr: 16317, recv_wr: 128, send_sge: 2, recv_sge: 1
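
      For readers decoding the log line above: kernel code conventionally reports failures as negative errno values, and on Linux errno 12 is ENOMEM, so "Can't create QP: -12" indicates that the queue pair allocation ran out of resources. A minimal sketch of that decoding in user space follows; the helper name describe_kernel_rc is hypothetical and is not part of Lustre.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Lustre): map a kernel-style return
 * code, where errors are negative errno values, to a readable string.
 */
static const char *describe_kernel_rc(int rc)
{
    if (rc >= 0)
        return "success";
    /* Flip the sign to recover the positive errno value. */
    return strerror(-rc);
}
```

      Calling describe_kernel_rc(-12) returns the platform's message for ENOMEM.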

      I found that lowering peer_credits works around this, but that is not a proper fix.

          Activity


            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45901
            Subject: LU-12901 o2iblnd: retry qp creation with reduced queue depth
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 9e0736f2306286f2f2c653c4e06c17d2201d1c0f

            pjones Peter Jones added a comment -

            Landed for 2.14


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40748/
            Subject: LU-12901 o2iblnd: retry qp creation with reduced queue depth
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8a3ef5713cc4aed1ac7bd3ce177895caa597cc4c
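
            The patch subject, "retry qp creation with reduced queue depth", describes a general pattern that can be sketched as below. This is an illustrative simulation only, not the merged change: fake_device, fake_create_qp, send_wrs_for_depth, and the wrs_per_credit sizing rule are all hypothetical stand-ins, and the one-step decrement is an assumed retry policy.

```c
#include <errno.h>

/* Hypothetical stand-in for an HCA with a limit on send work requests. */
struct fake_device {
    int max_send_wr;
};

/*
 * Hypothetical stand-in for the verbs call that creates a queue pair.
 * It fails with -ENOMEM when the request exceeds the simulated limit.
 */
static int fake_create_qp(const struct fake_device *dev, int send_wr)
{
    return send_wr > dev->max_send_wr ? -ENOMEM : 0;
}

/* Assumed sizing rule: send work requests scale linearly with depth. */
static int send_wrs_for_depth(int queue_depth, int wrs_per_credit)
{
    return queue_depth * wrs_per_credit;
}

/*
 * Try to create the QP at the requested depth. On -ENOMEM, reduce the
 * depth by one and retry. Returns the depth that succeeded, or a
 * negative errno if no depth >= 1 works or another error occurs.
 */
static int create_qp_with_retry(const struct fake_device *dev,
                                int queue_depth, int wrs_per_credit)
{
    int rc = -EINVAL; /* returned as-is if queue_depth < 1 on entry */

    while (queue_depth >= 1) {
        rc = fake_create_qp(dev,
                            send_wrs_for_depth(queue_depth, wrs_per_credit));
        if (rc == 0)
            return queue_depth;
        if (rc != -ENOMEM)
            return rc;
        queue_depth--;
    }
    return rc;
}
```

            In this sketch the caller learns the depth that was actually granted, so it can log the reduction instead of failing the connection outright.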


            Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40748
            Subject: LU-12901 o2iblnd: retry qp creation with reduced queue depth
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fe4fcd922196355b08981d9015f1635c88904fd3

            knweiss Karsten Weiss added a comment - - edited

            May I suggest changing the "Affects Version/s" attribute of this bug from 2.13.0 to 2.12.x (including 2.12.5, which is an LTS release)? See e.g. the comments here or the reports on lustre-discuss.

            aeonjeff Jeff Johnson added a comment -

            I see this issue as well. CentOS 7.8, Lustre 2.13.0, MOFED 5.0-2.1.8, ConnectX-6. Setting peer_credits to 128 fails as described. Lowering peer_credits to 48 results in a functioning LNet.

            mneff Michael Neff added a comment -

            I also see this on a Lustre client with CentOS 7.7 and ConnectX-6 using Mellanox OFED 4.7.

            knweiss Karsten Weiss added a comment -

            I also still see this on a Lustre client v2_12_3-98-g6db0c4f082 (+ LU-12637 patch) on CentOS 8.1 (4.18.0-147.3.1.el8_1.x86_64) with ConnectX-4 using the CentOS RDMA drivers (mlx5_core). I can also confirm that peer_credits=42 (instead of 63) works.

            simmonsja James A Simmons added a comment -

            Sorry, we are also having issues with our IB switch, so I'm currently not using our IB network.

            nathan.crawford@uci.edu Nathan Crawford added a comment -

            Also seeing this with a Lustre 2.12.3 client on CentOS 7.6 (3.10.0-957.27.2.el7.x86_64), with both ConnectX-4 and ConnectX-5 on client nodes.

            Server nodes are running Lustre 2.10.6 on CentOS 7.5 (3.10.0-862.14.4.el7.x86_64). Servers are also mixed Mellanox EDR and Intel/Qlogic QDR.

            All nodes are using the in-box RDMA drivers.

            As most client nodes are still on Intel QDR, we previously set the ib0 peer_credits to 62 across all nodes. Lowering peer_credits to 42 on the problematic clients allows the file system to mount.

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: simmonsja James A Simmons
              Votes: 1
              Watchers: 11
