Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0
    • None
    • None
    • 3

    Description

      When I try to mount Lustre with GSS (SSK) enabled, I receive checksum errors on a multi-rail client that I do not see when using only a single interface.  My guess is that the NID is encoded in the checksum, though I haven't dug into the cause yet.  I also saw many errors when using GSS on multi-rail servers, although those errors were different.

      [154311.786639] LustreError: 194908:0:(gss_sk_mech.c:388:sk_verify_hmac()) checksum mismatch
      [154311.798154] LustreError: 194908:0:(sec_gss.c:242:gss_verify_msg()) mic verify error: 00060000
      [154311.810015] LustreError: 194908:0:(sec_gss.c:2125:gss_svc_verify_request()) failed to verify request: 60000
      

        Activity

          [LU-15047] GSS and multi-rail incompatibility

          When I brought the servers up with multi-rail, they also failed with GSS, so I moved them to a single interface and they came up fine.  Adding the client with multi-rail then fails with the checksum error, but with a single interface it works fine.

          I'm pretty sure the ARP settings Whamcloud keeps recommending are wrong.  I went through this with Amir a few months ago: arp_filter and rp_filter should be set to 1 for multi-rail to function correctly.  In every other case it was intermittent.

          Are GSS and multi-rail actually being tested together?  When I filed this I was more or less assuming that they were only being tested independently.  If I get a chance this week I'll try to dig into it more to get to the bottom of what's happening.
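
For comparison with the values recommended elsewhere in this thread, the settings described above can be written out as commands. This is a minimal sketch: the comment does not say whether the values belong under `all` or under each interface, so the per-interface scope, the helper name `mr_filter_cmds`, and the interface names in the usage note are all assumptions. The function only prints the commands and does not run them.

```shell
# Emit (do not execute) sysctl commands that set arp_filter and rp_filter
# to 1 for each interface name given as an argument.
# Assumption: per-interface scope; the original comment does not specify it.
mr_filter_cmds() {
    for iface in "$@"; do
        printf 'sysctl -w net.ipv4.conf.%s.arp_filter=1\n' "$iface"
        printf 'sysctl -w net.ipv4.conf.%s.rp_filter=1\n' "$iface"
    done
}
```

Calling `mr_filter_cmds ib0 ib1` (hypothetical interface names) prints four commands for review; they would need to be run as root to take effect.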

          jfilizetti Jeremy Filizetti added the comment above.

          Sebastien,

          Yes, there were LNet patches that went into 2.14.54 which could be related (LU-14668, LU-14661), but I just didn't think these patches would cause LNet to switch the peer's primary NID somehow. Perhaps ashehata can confirm.

          Not sure if this can affect SSK, but another thing to check for an MR client would be the Linux routing setup, to make sure that the intended interface is actually used for sending. For example:

          sysctl -w net.ipv4.conf.all.rp_filter=0
          sysctl -w net.ipv4.conf.all.arp_filter=0
          sysctl -w net.ipv4.conf.ib0.arp_ignore=1
          sysctl -w net.ipv4.conf.ib0.arp_filter=0
          sysctl -w net.ipv4.conf.ib0.arp_announce=2
          sysctl -w net.ipv4.conf.ib0.rp_filter=0 
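
Before changing any of these keys, it can help to record what the kernel currently has, since the right values are disputed in this thread. A minimal read-only sketch, assuming the interface is named ib0 as in the example above; the helper name `show_mr_sysctls` and its optional second argument are illustrative and not part of Lustre or LNet.

```shell
# Print the current value of each ARP/rp_filter key for one interface.
# $1: interface name as it appears under net/ipv4/conf (e.g. all, ib0).
# $2: root of the sysctl tree, default /proc/sys. This argument is an
#     illustrative convenience so the function can read a copy of the tree.
show_mr_sysctls() {
    iface=$1
    root=${2:-/proc/sys}
    for key in rp_filter arp_filter arp_ignore arp_announce; do
        f="$root/net/ipv4/conf/$iface/$key"
        if [ -r "$f" ]; then
            printf '%s.%s=%s\n' "$iface" "$key" "$(cat "$f")"
        else
            printf '%s.%s=unavailable\n' "$iface" "$key"
        fi
    done
}
```

Calling `show_mr_sysctls all` and `show_mr_sysctls ib0` prints one `interface.key=value` line per key without modifying anything.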

          If tcp is used, the routes also need to be added. Manual steps are described here: https://wiki.whamcloud.com/display/LNet/MR+Cluster+Setup
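
One way to check that the intended interface is actually used for sending is to ask the kernel which device it would pick for a given source and destination address. The sketch below is not taken from the wiki page; the helper name `route_dev` and both addresses are hypothetical. It only parses the output of `ip route get`, which normally contains a `dev <name>` pair.

```shell
# Print the device name found in `ip route get` output read on stdin.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Hypothetical addresses: local interface 192.168.1.10, peer 192.168.1.20.
# Uncomment and adjust to run on a real client:
# ip route get 192.168.1.20 from 192.168.1.10 | route_dev
```

If the printed device differs from the interface that owns the source address, traffic for that NID is leaving through another interface and the routes or rules need attention.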

          The following patch (still under review) automates adding the routes for tcp interfaces:

          https://review.whamcloud.com/#/c/44065/

           

          ssmirnov Serguei Smirnov added the comment above.

          Also, I thought that in case of multirail, the primary NID was always used as the unique identifier of the connection. ssmirnov are you aware of any change in this area in (recent) master?

          sebastien Sebastien Buisson added the comment above.

          Hi,

          gss_svc_verify_request computes the checksum on req->rq_reqbuf, so even if it contains the NID, that should not be a problem for checksum calculation.
          Are the log messages in the description of the ticket seen on the client or the server side? I can see that you are running the Lustre master branch on your client (2.14.54). Is this issue with multi-rail new? Have you been able to successfully mount a multi-rail client with older versions of Lustre, and if so, which is the most recent version that worked?
          Also, which SK flavor are you running?
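
The reasoning above can be illustrated with a generic keyed checksum. This is not the code in gss_sk_mech.c; it is a sketch assuming HMAC-SHA256 via the `openssl` command-line tool, with a made-up key and made-up buffer contents. It shows that a checksum computed over the same bytes with the same key verifies regardless of what the bytes contain, so a mismatch points to differing bytes or a differing key.

```shell
# HMAC-SHA256 of stdin under the key given as $1, printed as hex.
hmac_hex() {
    openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

key='example-shared-key'                 # hypothetical key
buf='opc=mount src=192.168.1.10@o2ib'    # hypothetical request bytes with a NID

# Sender and receiver compute over the same buffer with the same key.
sent=$(printf '%s' "$buf" | hmac_hex "$key")
recv=$(printf '%s' "$buf" | hmac_hex "$key")
[ "$sent" = "$recv" ] && echo "match"

# A different key (or different bytes) produces a different checksum.
other=$(printf '%s' "$buf" | hmac_hex 'different-key')
[ "$sent" = "$other" ] || echo "checksum mismatch"
```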

          Cheers,
          Sebastien.

          sebastien Sebastien Buisson added the comment above.
          pjones Peter Jones added a comment -

          Sebastien

          What are your thoughts here?

          Peter


          Servers were running 2.12.7; the client is at tag 2.14.54.

          jfilizetti Jeremy Filizetti added the comment above.
          pjones Peter Jones added a comment -

          What version of Lustre are you using here, Jeremy?


          People

            sebastien Sebastien Buisson
            jfilizetti Jeremy Filizetti