
(sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(2097152)

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.10.3
    • None
    • RHEL 7.4 ARM client vs x86 server
    • 3
    • 9223372036854775807

    Description

      Running IOR on a locally built Lustre branch b2_10 at commit 0f6c448, a couple of initial data transfers work, but transfers then quickly start to fail, with server-side messages like:

      (sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(4194304)  req@ffff880f052d8050 x1593867370500512/t0(0) o4->d0c9fb64-cf93-52c4-8daf-a80ac8484f6b@194.1.0.2@o2ib4:76/0 lens 608/448 e 0 to 0 dl 1520037046 ref 1 fl Interpret:H/2/0 rc 0/0
      

       

      config arg: --disable-gss 

      Module options are all defaults on both sides; perhaps something needs to be changed for the ARM client?

      The server has an MDT + 3 OSTs on one node for testing, no LNet routers.

      IB mlx5 connections.

          Activity

            [LU-10775] (sec.c:2363:sptlrpc_svc_unwrap_bulk()) @@@ truncated bulk GET 1048576(2097152)

            Now that LU-10157 has landed, this ticket can be closed. As a note, we need to document on the wiki that for ARM/Power8 systems you need to set map_on_demand to 16 on the back-end x86 servers for Lustre 2.10.

             

            simmonsja James A Simmons added a comment

            yes I'll be on

             

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            Ruth, can you join the LWG call today?

            simmonsja James A Simmons added a comment

            Amir, the server side is RHEL 7.4, and I built 2.10 at 0f6c448. The OFED is MLNX_OFED_LINUX-4.2-1.0.0.0.

            configure reports yes to checking if 'ib_get_dma_mr' exists, but also:

            WARNING: "ib_get_dma_mr" [/build_area/lustre-release/build/conftest.ko] undefined!

            > nm /lib/modules/3.10.0-693.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/core/ib_core.ko | grep ib_get_dma

            0000000000006e50 T ib_get_dma_mr

            Setting map_on_demand=16 on the server works; traffic is moving, thanks. (I guess that would not work if there were other clients mounting with a different setting, though.)

            The client side is now:

            2.11.0_RC2 + 'LU-10157 lnet: make LNET_MAX_IOV dependent on page size' + 'LU-10560 libcfs: Use kernel_write when appropriate'

            RHEL 7.5, kernel 4.14.0-49.el7a.aarch64 and MLNX_OFED_LINUX-4.3-1.0.1.0

             

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment - edited
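For reference, the map_on_demand setting discussed in this thread is a ko2iblnd module parameter, usually set through a modprobe options line on the server. The following is a minimal shell sketch, not a procedure confirmed in this ticket. The helper name set_map_on_demand is hypothetical, and the config file path shown in the usage comment is an assumption about the server's layout.

```shell
# Hypothetical helper: append a ko2iblnd options line to a modprobe config file.
# $1 = config file to append to, $2 = value for map_on_demand.
set_map_on_demand() {
    conf_file="$1"
    value="$2"
    printf 'options ko2iblnd map_on_demand=%s\n' "$value" >> "$conf_file"
}

# Example usage (assumed path; needs root, and the module must be reloaded
# before the new value takes effect):
#   set_map_on_demand /etc/modprobe.d/ko2iblnd.conf 16
```

As noted in the comment above, the same value would have to suit every client that connects to that server.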

            Setting map_on_demand to 16 is not going to help; I have tried it before. We are going to need the map-on-demand changes for b2_10. As he pointed out, using the 64K page patch that is slated for 2.12 will break interop with an x86 2.10 server when using ARM clients, since 2.10 lacks all the changes needed to make it possible. So we have a choice here: state that in order to use ARM clients you must use at least a 2.11 server, or backport a bunch of o2iblnd patches to make it possible. Also, many of the changes missing from 2.10 are what make using newer MOFED possible. Do we say you have to stay on a MOFED 3.X version for 2.10?

            simmonsja James A Simmons added a comment - edited

            Ruth, is the server side running RHEL 7.2 or earlier?

            Looking through the code the reason you'd get:

            LNet: 14703:0:(o2iblnd_cb.c:2355:kiblnd_passive_connect()) Can't accept conn from 194.1.0.2@o2ib4 (version 12): max_frags 16 incompatible without FMR pool (256 wanted) 

            is because you're not using FMR. This would occur if HAVE_IB_GET_DMA_MR is defined. I believe this is defined for RHEL 7.2 and earlier.

            You would be able to avoid this issue by setting map_on_demand to 16 on the server side as well.

            Can you try that and see if it resolves the issue?

            James, I consider the map-on-demand changes to be a mini-feature. I'm not sure it's the best decision to backport that to 2.10.

            However, we might consider porting the patch below to 2.10, because it fixes a bug:

            LU-10213 lnd: calculate qp max_send_wrs properly 
            ashehata Amir Shehata (Inactive) added a comment
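The 16 and 256 in the kiblnd_passive_connect() message quoted above are consistent with a 1 MiB bulk transfer split into page-sized fragments. The sketch below assumes 64 KiB pages on the ARM client and 4 KiB pages on the x86 server. These page sizes are not stated in this ticket, and frags is a hypothetical helper, not a Lustre function.

```shell
# Hypothetical helper: number of page-sized fragments needed to describe one
# bulk transfer. $1 = page size in bytes, $2 = transfer size in bytes
# (defaults to 1 MiB, the bulk size seen in the log lines of this ticket).
frags() {
    page_size="$1"
    bulk_bytes="${2:-1048576}"
    # Round up so that a partial page still counts as one fragment.
    echo $(( (bulk_bytes + page_size - 1) / page_size ))
}

frags 4096     # assumed x86 page size (4 KiB)
frags 65536    # assumed ARM page size (64 KiB)
```

Under these assumptions the two calls print 256 and 16, which matches the "max_frags 16 ... (256 wanted)" mismatch reported by the server.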

            whoa, thanks but no need to patch 2.10 to make this work, I'm fine with moving the server to 2.11, it's a tiny toy fs, no routers.

            I'm far from grasping the whole map_on_demand ish, but maybe I just needed to set it to 256, don't think I did that.

            ps the test cluster is under work again...so next try will be in a while.

             

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            Actually, I realized I have been testing with an unpatched 2.11 server and it does work. The problem is that Lustre 2.10 is missing a bunch of fixes needed to properly support the newer MOFED stack. Things like queue pair management and map_on_demand have changed dramatically. Amir, can you put together a list of the patches missing from 2.10 to make this work?

            simmonsja James A Simmons added a comment

            That is unexpected, considering x86 sends 1 MB packets with and without the patch. It's ARM/Power8 that is sending 16 MB packets. I can tell you that the patch on x86 platforms will work with x86 systems without the patch. I have run the upstream client, which lacks the patch, against patched servers. So we have:

            patched x86 <-> patched x86 works

            unpatched x86 <-> unpatched x86 works

            unpatched x86 <-> patched x86 works

            patched x86 <-> unpatched x86 ??? should work

            patched ARM <-> patched x86 works

            unpatched ARM <-> unpatched x86 fails

            unpatched ARM <-> patched x86 ?? should fail since ARM is not addressed

            patched ARM <-> unpatched x86 fails

            Did you try the server side with the patch to see if it works?

            simmonsja James A Simmons added a comment

            I was at 2.11 RC1 with the patch on the client side, and 2.10 without the patch on the server.

            After removing the patch, the client connects fine. I'm using the defaults for all module options.

            This seems like an interop issue to me.

             

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            People

              simmonsja James A Simmons
              ruth.klundt@gmail.com Ruth Klundt (Inactive)