Cannot mount client if SSK is set up over IB network

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.11.0

    Description

      This is an IB-specific issue; it has been seen on 2.10.1 and on the master branch. The same test with the same configuration over TCP passes.

      This is what I got over the IB network when trying to mount the client. I have tried different flavors (skpi, ski, gssnull) and all failed with the same error. The patch https://review.whamcloud.com/30937 does not help with this issue.

      [104174.396583] LustreError: 48209:0:(gss_keyring.c:1426:gss_kt_update()) negotiation: rpc err -13, gss err 0
      [104174.410110] LustreError: 48209:0:(gss_keyring.c:1426:gss_kt_update()) Skipped 4 previous similar messages
      [104174.423357] Lustre: 48209:0:(sec_gss.c:316:cli_ctx_expire()) ctx ffff880824f1e540(0->lustre-MDT0000_UUID) get expired: 1517010602(+200s)
      [104174.441414] Lustre: 48209:0:(sec_gss.c:316:cli_ctx_expire()) Skipped 4 previous similar messages
      [104199.411600] Lustre: 45094:0:(sec_gss.c:1226:gss_cli_ctx_fini_common()) gss.keyring@ffff8807f2a62400: destroy ctx ffff88105aa0dec0(0->lustre-MDT0000_UUID)
      [104199.432714] Lustre: 45094:0:(sec_gss.c:1226:gss_cli_ctx_fini_common()) Skipped 5 previous similar messages
      [104249.405563] LustreError: 48222:0:(gss_keyring.c:1426:gss_kt_update()) negotiation: rpc err -13, gss err 0
      [104249.418953] LustreError: 48222:0:(gss_keyring.c:1426:gss_kt_update()) Skipped 2 previous similar messages
      [104249.432234] Lustre: 48222:0:(sec_gss.c:316:cli_ctx_expire()) ctx ffff88105aa0c180(0->lustre-MDT0000_UUID) get expired: 1517010677(+200s)
      [104249.450371] Lustre: 48222:0:(sec_gss.c:316:cli_ctx_expire()) Skipped 2 previous similar messages
      [104299.408796] Lustre: 45094:0:(sec_gss.c:1226:gss_cli_ctx_fini_common()) gss.keyring@ffff8807f2a62400: destroy ctx ffff88105aa0cf00(0->lustre-MDT0000_UUID)
      [104299.429934] Lustre: 45094:0:(sec_gss.c:1226:gss_cli_ctx_fini_common()) Skipped 3 previous similar messages
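
      (For reference, a minimal sketch of how an SSK flavor is typically selected and used at mount time, based on the standard srpc flavor tunables and the skpath mount option; the filesystem name, MGS NID and key path below are illustrative, not the exact commands used in this report.)

      # Sketch only: pick an SSK flavor on the MGS, then mount the client with its key.
      # "lustre", mgsnode@o2ib and /secure_directory are illustrative values.
      lctl conf_param lustre.srpc.flavor.default=skpi        # run on the MGS
      mount -t lustre -o skpath=/secure_directory \
            mgsnode@o2ib:/lustre /mnt/lustre                 # run on the client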
      

      Attachments

        Activity

          [LU-10593] Cannot mount client if SSK is set up over IB network

          Just make sure that when you do an "nslookup <ipoib address>" it resolves to a hostname, and things should progress beyond that issue.
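
          A quick sketch of that check (the address below is illustrative; use your peers' actual IPoIB addresses):

          # Reverse-resolve the peer's IPoIB address; it must return a hostname,
          # otherwise lgss_keyring cannot construct the GSS service string.
          nslookup 192.168.1.80
          getent hosts 192.168.1.80    # same check via the system resolver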

          jfilizetti Jeremy Filizetti added a comment

          The log ssk_20190213.log seems to show the same hostname resolution issue.

          > You will need to resolve the reverse lookup for your IB addresses.

          => What should I do to resolve the IB addresses?

          sebg-crd-pm sebg-crd-pm (Inactive) added a comment
          jfilizetti Jeremy Filizetti added a comment - edited

          The problem Sebastien pointed out with hostname resolution is probably your issue.  You will need to resolve the reverse lookup for your IB addresses.

           

          You can increase debugging on the client with:

           

          [root@r01svr1 ~]# echo 3 > /proc/fs/lustre/sptlrpc/gss/lgss_keyring/debug_level
          

           

          And look in your logs for something similar to:

           

          Feb 12 14:48:45 r01svr1 lgss_keyring: [22410]:INFO:main(): key 441386067, desc 0@e, ugid 0:0, sring 279671070, coinfo 14:sk:0:0:r:n:1:0x500000a0a0a13:SiteA2-MDT0000-mdc-ffff964c12745800:0x500000a0a0601:1
          Feb 12 14:48:45 r01svr1 lgss_keyring: [22410]:DEBUG:parse_callout_info(): parse call out info: secid 14, mech sk, ugid 0:0, is_root 1, is_mdt 0, is_ost 0, svc type n, svc 1, nid 0x500000a0a0a13, tgt SiteA2-MDT0000-mdc-ffff964c12745800, self nid 0x500000a0a0601, pid 1
          Feb 12 14:48:45 r01svr1 lgss_keyring: [22410]:INFO:sk_create_cred(): Creating credentials for target: SiteA2-MDT0000-mdc-ffff964c12745800 with nodemap: (null)
          Feb 12 14:48:45 r01svr1 lgss_keyring: [22410]:INFO:sk_create_cred(): Searching for key with description: lustre:SiteA2
          Feb 12 14:48:45 r01svr1 lgss_keyring: [22411]:ERROR:ipv4_nid2hostname(): O2IBLND: can't resolve 0x130a0a0a
          Feb 12 14:48:45 r01svr1 lgss_keyring: [22411]:ERROR:lgss_get_service_str(): cannot resolve hostname from nid 500000a0a0a13
          Feb 12 14:48:45 r01svr1 lgss_keyring: [22411]:ERROR:lgssc_kr_negotiate_manual(): key 1a4f0453: failed to construct service string

           

           


          Is the SSK function available with Lustre 2.12 or 2.10.6 on an IB network?

          I have tested Lustre 2.12 and 2.10.6 (IB network) and also got the same error.

          [ 877.959783] LustreError: 14203:0:(gss_keyring.c:1423:gss_kt_update()) negotiation: rpc err -13, gss err 0
          [ 877.959878] Lustre: 14203:0:(sec_gss.c:315:cli_ctx_expire()) ctx ffff8b207acc3200(0->testfs-MDT0000_UUID) get expired: 1549873937(+200s)
          [ 877.968204] Lustre: 13621:0:(sec_gss.c:1225:gss_cli_ctx_fini_common()) gss.keyring@ffff8b207c897300: destroy ctx ffff8b2077e8fec0(0->testfs-MDT0000_UUID)
          [ 877.968213] Lustre: 13621:0:(sec_gss.c:1225:gss_cli_ctx_fini_common()) Skipped 1 previous similar message
          [ 978.206863] LustreError: 14303:0:(gss_keyring.c:1423:gss_kt_update()) negotiation: rpc err -13, gss err 0

           

          sebg-crd-pm sebg-crd-pm (Inactive) added a comment

          > Not all interconnects have IP addresses.

          For this kind of interconnect there is an existing mechanism in Lustre: you can put a script named /etc/lustre/nid2hostname on all your nodes, which takes three parameters:

          • $lnd is a string identifying the LND, like "PTL4LND"
          • $netid is the network identifier in hex format, like "0x12"
          • $nid is the NID in hex format

          The script is supposed to output the corresponding hostname, or an error message starting with '@' for error logging.

          Note that at the moment this script is only called for the PTL4LND interconnect type, as other interconnects such as QSWLND or GMLND were deprecated by patch https://review.whamcloud.com/23621 some time ago.
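
          Purely as an illustration of the interface described above, a minimal /etc/lustre/nid2hostname sketch (the lookup table /etc/lustre/nid_table is hypothetical; adapt the mapping to your fabric):

          #!/bin/bash
          # /etc/lustre/nid2hostname -- invoked as: nid2hostname <lnd> <netid> <nid>
          # Must print the hostname for the NID, or a message starting with '@' on error.
          lnd=$1      # LND name, e.g. "PTL4LND"
          netid=$2    # network identifier in hex, e.g. "0x12"
          nid=$3      # NID in hex

          # Illustrative mapping: look the NID up in a local table of "nid hostname" pairs.
          host=$(awk -v n="$nid" '$1 == n { print $2 }' /etc/lustre/nid_table)

          if [ -n "$host" ]; then
              echo "$host"
          else
              echo "@cannot resolve nid $nid on net $netid (lnd $lnd)"
              exit 1
          fi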

          sbuisson Sebastien Buisson (Inactive) added a comment
          sarah Sarah Liu added a comment -

          Sebastien,

          1. the above logs were captured with debug enabled.
          2. I used the hostname when mounting.

          On the MDS I can ping the client, and vice versa:

          [root@onyx-80 ~]# lctl ping onyx-79-ib@o2ib
          This command has been deprecated. Plesae use 'lnetctl ping'
          12345-0@lo
          12345-192.168.1.79@o2ib
          12345-10.2.2.51@tcp
          [root@onyx-80 ~]# 
          
          [root@onyx-79 ~]# lctl ping onyx-80-ib@o2ib
          This command has been deprecated. Plesae use 'lnetctl ping'
          12345-0@lo
          12345-192.168.1.80@o2ib
          12345-10.2.2.52@tcp
          [root@onyx-79 ~]
          

          James,

          I would tend to consider what you suggest an enhancement or a feature request. Even in the case of using the IB hardware address instead of an IP address as a lookup in the LNet layer, Kerberos credentials must be associated with nodes. So that would probably mean adapting the way name resolution is carried out today, once this change for IB hardware addresses is done...

          Sebastien.

          sbuisson Sebastien Buisson (Inactive) added a comment

          No, I would consider this a real bug. Not all interconnects have IP addresses. Consider the Cray Gemini interconnect. Also, discussion is under way about using the IB hardware address instead of an IP address as a lookup in the LNet layer. That change would then make IB totally unusable. We do need a proper solution.

          simmonsja James A Simmons added a comment

          True Jeremy

          Hopefully I managed to reproduce the issue on a test system at DDN (negotiation: rpc err -13, gss err 0). I figured out how to make it work with Kerberos, but I guess it would be the same with Shared Key.

          The issue stems from the fact that the context negotiation process needs to perform name resolution at some point:

          lgssc_kr_negotiate_{krb,manual}
             lgss_get_service_str
                ipv4_nid2hostname
                   lnet_nid2hostname
          

          So in order to make it work, your Lustre nodes' NIDs on IB must have an associated hostname that can be resolved by every other node. With Kerberos, the credentials must be created for these IB-based hostnames.

          This explains why it works out of the box when using a TCP-based interconnect network: NIDs naturally match the nodes' hostnames.
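
          In practice this can be as simple as making the IB-side addresses reverse-resolvable, either in DNS or in /etc/hosts on every node (a sketch reusing the illustrative names from the lctl ping output above):

          # /etc/hosts on every client and server: the o2ib addresses must
          # reverse-resolve to a hostname (names and addresses are examples).
          192.168.1.79   onyx-79-ib
          192.168.1.80   onyx-80-ib

          # Verify from each node:
          getent hosts 192.168.1.80    # should print "onyx-80-ib"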

          Do you agree to close this ticket with 'configuration issue' as the reason?

          Cheers,
          Sebastien.

          sbuisson Sebastien Buisson (Inactive) added a comment - edited

          Did you add the debugging that Sebastien asked for? Without that information from the client it will be hard to see what is wrong, since it appears to happen during the request-key portion.

          I have run into an issue where the server-side lsvcgss loses access to the key for some reason. This may require me to rework the key handling so that keys remain associated with Lustre processes. However, that returns a GSS error to the client, not an RPC error, so it's a different issue.

          jfilizetti Jeremy Filizetti added a comment

          People

            Assignee: simmonsja James A Simmons
            Reporter: sarah Sarah Liu
            Votes: 1
            Watchers: 9
