
Lustre client mount fails after updating the IB driver and applying a Lustre patch.

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • CentOS7.3
      Lustre 2.9.0 + cherry-picked as e4297ef38561f1e788ba73ca0c8078a09dc8c303
      MLNX_OFED_LINUX-4.0-2.0.0.1-rhel7.3
      IB: Mellanox ConnectX-4 adapter EDR

    Description

      The Lustre client mount fails after updating the IB driver and applying the Lustre client patch (LU-9026).
      Should I apply any other patch for the new IB driver?

      Mount failure error message:
      [ 5713.280039] LNet: Using FastReg for registration
      [ 5713.370689] LNet: Added LNI 192.168.2.220@o2ib0 [8/256/0/180]
      [ 5736.543149] LNetError: 0:0:(o2iblnd_cb.c:3436:kiblnd_qp_event()) 192.168.2.201@o2ib0: Async QP event type 3
      [ 5743.539710] Lustre: 15524:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493897510/real 1493897510] req@ffff881011ef0300 x1566465051328608/t0(0) o503->MGC192.168.2.201@o2ib0@192.168.2.201@o2ib0:26/25 lens 272/8416 e 0 to 1 dl 1493897517 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 5743.539728] LustreError: 166-1: MGC192.168.2.201@o2ib0: Connection to MGS (at 192.168.2.201@o2ib0) was lost; in progress operations using this service will fail
      [ 5743.539899] LustreError: 15c-8: MGC192.168.2.201@o2ib0: The configuration from log 'hpcfs-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [ 5743.540282] Lustre: Unmounted hpcfs-client
      [ 5743.544658] LustreError: 15524:0:(obd_mount.c:1449:lustre_fill_super()) Unable to mount (-5)
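
A note on reading the log above: Lustre reports failures as negative errno values, so -5 corresponds to EIO on Linux ("Input/output error"). The "Async QP event type 3" line prints a raw enum ib_event_type value. Assuming MOFED 4.0 keeps the upstream kernel ordering from include/rdma/ib_verbs.h, 3 would be IB_EVENT_QP_ACCESS_ERR. The following is only a user-space decoding sketch, not Lustre or driver code, and every model_* name is made up for illustration.

```c
#include <errno.h>
#include <string.h>

/*
 * Copy of the first entries of enum ib_event_type as found in the upstream
 * Linux kernel header include/rdma/ib_verbs.h. ASSUMPTION: the MOFED 4.0
 * header uses the same ordering. The MODEL_ prefix marks these as
 * illustrative names, not the kernel's own symbols.
 */
enum model_ib_event_type {
    MODEL_IB_EVENT_CQ_ERR = 0,
    MODEL_IB_EVENT_QP_FATAL,
    MODEL_IB_EVENT_QP_REQ_ERR,
    MODEL_IB_EVENT_QP_ACCESS_ERR,
    MODEL_IB_EVENT_COMM_EST
};

/* Translate the number printed by kiblnd_qp_event() into a readable name. */
static const char *model_ib_event_name(int type)
{
    switch (type) {
    case MODEL_IB_EVENT_CQ_ERR:        return "IB_EVENT_CQ_ERR";
    case MODEL_IB_EVENT_QP_FATAL:      return "IB_EVENT_QP_FATAL";
    case MODEL_IB_EVENT_QP_REQ_ERR:    return "IB_EVENT_QP_REQ_ERR";
    case MODEL_IB_EVENT_QP_ACCESS_ERR: return "IB_EVENT_QP_ACCESS_ERR";
    case MODEL_IB_EVENT_COMM_EST:      return "IB_EVENT_COMM_EST";
    default:                           return "unknown";
    }
}

/* Lustre return codes are negative errno values; flip the sign to look them up. */
static int model_rc_to_errno(int rc)
{
    return rc < 0 ? -rc : rc;
}

/* Human-readable text for a Lustre return code such as -5. */
static const char *model_rc_text(int rc)
{
    return strerror(model_rc_to_errno(rc));
}
```

Calling model_ib_event_name(3) and model_rc_text(-5) gives the names to search for when matching this log against other reports.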

          Activity

            [LU-9461] lustre client mount fail after update IB driver and Lustre patch.
            pjones Peter Jones added a comment -

            This issue is being tracked under LU-9958


            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            Hi,

            I get an error when creating a striped directory in Lustre 2.10 with the LU-9500 patch (https://review.whamcloud.com/#/c/28237/) for OFED4.0:

            [root@hsm client]# lfs mkdir -c 2 dir1
            error on LL_IOC_LMV_SETSTRIPE 'dir1' (3): Input/output error
            error: mkdir: create stripe dir 'dir1' failed

            Should I apply any other patch for this issue? Thanks
            pjones Peter Jones added a comment -

            dinatale2 can you please open a new ticket to track this request?


            dinatale2 Giuseppe Di Natale (Inactive) added a comment -

            We have just encountered this issue as well. Is it possible to have LU-9026, LU-9500, and LU-9472 backported to the Lustre 2.5 and 2.8 branches?
            doug Doug Oucharek (Inactive) added a comment - - edited

            Li: Thank you for the information.  I checked and you are right, ib_map_mr_sg() does set mr->iova so that line is not needed (could cause a problem).  I will update LU-9500 with a removal of that line.

             sebg-crd-pm: Correct, you only need --LU-9026-- + LU-9500 (with the removal of setting mr->iova) + LU-9472.


            sebg-crd-pm sebg-crd-pm (Inactive) added a comment -

            I can mount Lustre OK with 2.9.57_69_g0bc1964 + LU-9472 / LU-9500 and this patch (removing the line mr->iova = iov).

            Is it OK to apply only LU-9026 + LU-9472 / LU-9500 and the patch removing mr->iova = iov to 2.9.0?
            Should I apply any other patches to Lustre 2.9.0 for mlx5 cards using MOFED4? Thanks for your suggestions.

            lidongyang Li Dongyang (Inactive) added a comment -

            Hi Doug,

            I believe the issue only applies to mlx5 cards using MOFED4.

            In MOFED4, mr->iova is set by ib_map_mr_sg()->ib_sg_to_pages().

            It doesn't make sense to reset mr->iova after calling ib_map_mr_sg().

            That line of code was introduced to address a similar issue; see the comments on

            https://review.whamcloud.com/#/c/19168/

            I've also done some testing using MOFED4 + lustre-release with mlx4 cards, forcing fast reg, and so far I've seen no problems.
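
The point about not resetting mr->iova after ib_map_mr_sg() can be pictured with a toy user-space model. This is a sketch under loud assumptions: the structs and model_* functions are hypothetical stand-ins, not the kernel's struct ib_mr or struct scatterlist; the model assumes the mapping call records the first segment's DMA address as the region's base; and the overwrite value 0 is only a placeholder, not what o2iblnd actually passed. It shows the general hazard, not the exact driver behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatter/gather entry. */
struct model_sg {
    uint64_t dma_addr;
    uint32_t len;
};

/* Hypothetical stand-in for the two memory-region fields that matter here. */
struct model_mr {
    uint64_t iova;    /* base address the HCA will accept for this region */
    uint64_t length;  /* total bytes covered by the region */
};

/*
 * Models the behavior described in the thread: the mapping call itself
 * fills in iova (from the first entry) and the total length.
 * Returns the number of entries mapped, or -1 on bad input.
 */
static int model_map_mr_sg(struct model_mr *mr, const struct model_sg *sg, int n)
{
    int i;

    if (mr == NULL || sg == NULL || n <= 0)
        return -1;

    mr->iova = sg[0].dma_addr;
    mr->length = 0;
    for (i = 0; i < n; i++)
        mr->length += sg[i].len;
    return n;
}

/* Does an access of len bytes at addr fall inside the registered window? */
static int model_mr_covers(const struct model_mr *mr, uint64_t addr, uint64_t len)
{
    return addr >= mr->iova &&
           len <= mr->length &&
           addr - mr->iova <= mr->length - len;
}

/*
 * Register a buffer, optionally overwriting iova afterwards the way the
 * removed "mr->iova = iov;" line did.
 */
static struct model_mr model_register(const struct model_sg *sg, int n,
                                      uint64_t iov, int overwrite_iova)
{
    struct model_mr mr = { 0, 0 };

    model_map_mr_sg(&mr, sg, n);
    if (overwrite_iova)
        mr.iova = iov;  /* clobbers the base that the mapping call computed */
    return mr;
}
```

In this model, a region registered without the overwrite covers an access starting at the first segment's address, while the same region with iova forced to the placeholder does not. That is one plausible way an access error could surface on the queue pair.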

            doug Doug Oucharek (Inactive) added a comment -

            I'm curious as to why you would not want to set the mr->iova value?  Is this an unneeded step?
            lidongyang Li Dongyang (Inactive) added a comment - - edited

            I think it's my fault.
            Could you try this patch on top of LU-9472 and LU-9500?

            diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
            index 047fe3c..ba7829b 100644
            --- a/lnet/klnds/o2iblnd/o2iblnd.c
            +++ b/lnet/klnds/o2iblnd/o2iblnd.c
            @@ -1900,8 +1900,6 @@ again:
                                                    return n < 0 ? n : -EINVAL;
                                            }
             
            -                               mr->iova = iov;
            -
                                            wr = &frd->frd_fastreg_wr;
                                            memset(wr, 0, sizeof(*wr));
            
            
            

            doug Doug Oucharek (Inactive) added a comment -

            I ran into this "Async" error at the same time as the issues I talk about in LU-9500.  They are related.  When I have a fix for LU-9500, this issue will be addressed as well.

            People

              ashehata Amir Shehata (Inactive)
              sebg-crd-pm sebg-crd-pm (Inactive)