Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.10.0
    • Lustre 2.7.0, Lustre 2.8.0, Lustre 2.9.0
    • 3
    • 16043

    Description

      Got an IOR failure on the soak cluster with the following errors:

      Oct  7 21:54:01 lola-23 kernel: LNetError: 3613:0:(o2iblnd_cb.c:1134:kiblnd_init_rdma()) RDMA too fragmented for 192.168.1.115@o2ib100 (256): 128/256 src 128/256 dst frags
      Oct  7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.1.114@o2ib100: -90
      Oct  7 21:54:01 lola-23 kernel: LNetError: 3618:0:(o2iblnd_cb.c:428:kiblnd_handle_rx()) Skipped 7 previous similar messages
      

      Liang told me that this is a known issue with routing. That said, the IOR process is not killable and the only option is to reboot the client node. We should at least fail "gracefully" by returning the error to the application.
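
When triaging these messages across many nodes, it can help to pull the numbers out of the `kiblnd_init_rdma()` line mechanically. The following is a minimal sketch, not part of Lustre: the names `RdmaFragError` and `parse_rdma_frag_error` are hypothetical, and reading the two pairs in the older message as index/fragment-count is an assumption based on the newer wording ("src idx/frags", "dst idx/frags") seen in the router log later in this ticket. The `-90` in the companion `kiblnd_handle_rx()` line is `-EMSGSIZE` on Linux.

```python
import re
from typing import NamedTuple, Optional


class RdmaFragError(NamedTuple):
    """Fields extracted from a kiblnd_init_rdma() fragmentation error (hypothetical helper type)."""
    peer_nid: str   # e.g. "192.168.1.115@o2ib100"
    max_frags: int  # the number printed in parentheses after the NID
    src_idx: int
    src_frags: int
    dst_idx: int
    dst_frags: int


# Older wording, as in the first log of this ticket.
_OLD = re.compile(
    r"RDMA too fragmented for (?P<nid>\S+) \((?P<max>\d+)\): "
    r"(?P<si>\d+)/(?P<sf>\d+) src (?P<di>\d+)/(?P<df>\d+) dst frags"
)

# Newer wording, as in the router log later in this ticket.
_NEW = re.compile(
    r"RDMA has too many fragments for peer (?P<nid>\S+) \((?P<max>\d+)\), "
    r"src idx/frags: (?P<si>\d+)/(?P<sf>\d+) "
    r"dst idx/frags: (?P<di>\d+)/(?P<df>\d+)"
)


def parse_rdma_frag_error(line: str) -> Optional[RdmaFragError]:
    """Return the parsed fields, or None if the line is not a fragmentation error."""
    for pattern in (_OLD, _NEW):
        m = pattern.search(line)
        if m:
            return RdmaFragError(
                m["nid"], int(m["max"]),
                int(m["si"]), int(m["sf"]),
                int(m["di"]), int(m["df"]),
            )
    return None
```

Feeding each kernel log line through `parse_rdma_frag_error` and grouping the results by `peer_nid` gives a quick view of which peers are affected.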

          Activity

            [LU-5718] RDMA too fragmented with router
            spitzcor Cory Spitz added a comment -

            LUDOC-378 is linked to this issue.

            spitzcor Cory Spitz added a comment -

            Looks like we should have opened a LUDOC ticket to document wrq_sge.


            srcc Stanford Research Computing Center added a comment -

            Ah! Thanks for the clarification, Chris and Doug! I was a bit lost as the parameters changed along the work done in this ticket. We'll test this right away.
            All the best,
            Stephane

            doug Doug Oucharek (Inactive) added a comment -

            Your router needs wrq_sge=2.

            hornc Chris Horn added a comment -

            You need to set wrq_sge=2 on the routers, too.

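
Since the setting has to match on clients and routers alike, a small check over each node's module configuration can catch a node that was missed. The sketch below assumes that `wrq_sge` is set as a `ko2iblnd` module option through a modprobe-style line such as `options ko2iblnd wrq_sge=2`; that assumption, the file location you read the text from, and the helper name `wrq_sge_from_modprobe` are illustrative and are not confirmed by this ticket.

```python
from typing import Optional


def wrq_sge_from_modprobe(conf_text: str) -> Optional[int]:
    """Return the last wrq_sge value found on an 'options ko2iblnd' line, or None.

    Assumes modprobe.d syntax: 'options <module> key=value ...' with '#' comments.
    """
    value = None
    for raw in conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated tokens.
        tokens = raw.split("#", 1)[0].split()
        if len(tokens) >= 2 and tokens[0] == "options" and tokens[1] == "ko2iblnd":
            for token in tokens[2:]:
                key, sep, val = token.partition("=")
                if key == "wrq_sge" and sep:
                    value = int(val)  # a later line overrides an earlier one
    return value
```

A node where this returns `None` or `1` would, per the comments above, still be running with the default and is a candidate for the failure described here.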

            sthiell Stephane Thiell added a comment -

            Hi,
            Could you please explain what is required to make the landed patches work? We have tried 2.9 FE + patches from both LU-5718 and LU-9420 but are still seeing the problem on the routers. We have set wrq_sge=2 on the clients and left the default wrq_sge=1 on the routers. We have not been able to patch the servers yet (running IEEL3); see DELL-221.

            On the router with wrq_sge=1 (10.210.34.213@o2ib1 is an unpatched OSS):

            [ 1111.504575] LNetError: 8688:0:(o2iblnd_cb.c:1093:kiblnd_init_rdma()) RDMA has too many fragments for peer 10.210.34.213@o2ib1 (256), src idx/frags: 128/147 dst idx/frags: 128/147
            [ 1111.522352] LNetError: 8688:0:(o2iblnd_cb.c:430:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.210.34.213@o2ib1: -90

            Clients and routers are using mlx5, servers are using mlx4.

            Thanks,
            Stephane

            People

              doug Doug Oucharek (Inactive)
              johann Johann Lombardi (Inactive)