
LNet fails to come up when using lctl but works with lnetctl

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.1, Lustre 2.11.0
    • Environment: Seen on various systems with Lustre 2.10 and Lustre 2.11.
    • Severity: 3
    • 9223372036854775807

    Description

      On several systems, the following is reported when attempting to bring up a Lustre client:

      [188273.054578] LNet: Added LNI 10.0.1.22@tcp [8/256/0/180]
      [188273.054724] LNet: Accept secure, port 988
      [191295.504584] Lustre: Lustre: Build Version: 2.10.0_dirty
      [191300.858629] Lustre: 22140:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1501789735/real 1501789735]  req@ffff800fb09cfc80 x1574740673167376/t0(0) o250->MGC128.219.141.4@tcp@128.219.141.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1501789740 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [191301.858634] LustreError: 22036:0:(mgc_request.c:251:do_config_log_add()) MGC128.219.141.4@tcp: failed processing log, type 1: rc = -5
      [191330.858099] Lustre: 22140:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1501789760/real 1501789760]  req@ffff800fb4910980 x1574740673167424/t0(0) o250->MGC128.219.141.4@tcp@128.219.141.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1501789770 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [191332.858106] LustreError: 15c-8: MGC128.219.141.4@tcp: The configuration from log 'legs-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [191332.858399] Lustre: Unmounted legs-client
      [191332.859241] LustreError: 22036:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-5)
      
      

      After investigation, this turned out to be a symptom of a communication failure at the LNet layer. It occurs when LNet has been set up with lctl; when lnetctl is used instead, the issue appears to go away.
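
      The return codes in the log above are negative Linux errno values, so -5 corresponds to EIO (an I/O error), which fits a communication failure rather than a problem with the configuration log itself. The following is a minimal Python sketch, not part of Lustre, that decodes the return code and pulls out the MGS NID the client was trying to reach. The names `decode`, `RC_RE`, and `MGC_RE` are illustrative assumptions; the two log lines are copied from this ticket.

      ```python
      import errno
      import os
      import re

      # Two lines copied from the console log in this ticket.
      LOG = """\
      [191301.858634] LustreError: 22036:0:(mgc_request.c:251:do_config_log_add()) MGC128.219.141.4@tcp: failed processing log, type 1: rc = -5
      [191332.859241] LustreError: 22036:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-5)
      """

      # Matches either "rc = -N" or a trailing "(-N)" at the end of a line.
      RC_RE = re.compile(r"rc = (-\d+)|\((-\d+)\)\s*$")
      # Matches the NID embedded in an MGC device name, e.g. MGC128.219.141.4@tcp.
      MGC_RE = re.compile(r"MGC(\d+\.\d+\.\d+\.\d+@\w+)")


      def decode(line):
          """Return (errno name, message) for the return code on a log line, or None."""
          match = RC_RE.search(line)
          if match is None:
              return None
          code = -int(match.group(1) or match.group(2))
          return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


      if __name__ == "__main__":
          for line in LOG.splitlines():
              line = line.strip()
              nid = MGC_RE.search(line)
              print(decode(line), nid.group(1) if nid else None)
      ```

      The extracted MGS NID can then be compared against the local NID reported in the "LNet: Added LNI" line to check which network the client is expected to use.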

          Activity

            [LU-9823] LNet fails to come up when using lctl but works with lnetctl
            adilger Andreas Dilger made changes -
            Labels New: IPv6
            adilger Andreas Dilger made changes -
            Resolution New: Not a Bug [ 6 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            adilger Andreas Dilger added a comment - Ended up being a configuration issue.
            simmonsja James A Simmons made changes -
            Link New: This issue is related to LU-10391 [ LU-10391 ]

            adilger Andreas Dilger added a comment - Lustre doesn't support IPv6, though it is definitely something that we should keep in mind moving forward (LU-10391).
            simmonsja James A Simmons made changes -
            Description Original: While attempting to bring up my ARM test bed with a Lustre 2.10 client I ran into this error which I have never seen before. (Console log identical to the one in the Description above.) Has anyone run into this before?
            New: The current text of the Description above.
            Environment Original: ARM64 running SLES12 with latest lustre 2.10 New: Seen on various systems with Lustre 2.10 and Lustre 2.11.
            Summary Original: mgc sptlrpc log process error New: LNet fails to come up when using lctl but works with lnetctl

            simmonsja James A Simmons added a comment - I think I see why we have a problem. The network interface has both an IPv4 and an IPv6 address. Have you ever tried this setup?

            simmonsja James A Simmons added a comment - For our sultan OSS nodes it was a configuration issue. We placed the two other IB ports on a different subnet and that seems to have worked. As for the ARM system, it does have multiple Ethernet interfaces for the computes, but only one has been set up with an IP address.

            ip addr show
            1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
                link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
                inet 127.0.0.1/8 scope host lo
                   valid_lft forever preferred_lft forever
                inet6 ::1/128 scope host
                   valid_lft forever preferred_lft forever
            2: enP2p1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
                link/ether 00:22:4d:c8:10:9f brd ff:ff:ff:ff:ff:ff
            3: enP2p1s0f2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
                link/ether 00:22:4d:c8:10:a0 brd ff:ff:ff:ff:ff:ff
                inet 10.0.1.22/24 brd 10.0.1.255 scope global enP2p1s0f2
                   valid_lft forever preferred_lft forever
                inet6 fe80::222:4dff:fec8:10a0/64 scope link
                   valid_lft forever preferred_lft forever
            4: enP6p1s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
                link/ether 00:22:4d:c8:10:a1 brd ff:ff:ff:ff:ff:ff
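
            To check the observation about interfaces carrying both IPv4 and IPv6 addresses, the `ip addr show` output can be summarised per interface. The following Python sketch is only an illustration: the function names `addresses_by_interface` and `dual_stack` are made up for this example, and the sample text is an abridged copy of the output posted above (the `link/...` and `valid_lft` lines are omitted).

            ```python
            import re

            # Abridged from the `ip addr show` output posted in this ticket.
            IP_ADDR_OUTPUT = """\
            1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
                inet 127.0.0.1/8 scope host lo
                inet6 ::1/128 scope host
            2: enP2p1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
            3: enP2p1s0f2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
                inet 10.0.1.22/24 brd 10.0.1.255 scope global enP2p1s0f2
                inet6 fe80::222:4dff:fec8:10a0/64 scope link
            4: enP6p1s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
            """

            # "3: enP2p1s0f2: <...>" starts a new interface section.
            IFACE_RE = re.compile(r"^\s*\d+:\s+([^:@\s]+)")
            # "inet 10.0.1.22/24 ..." or "inet6 fe80::.../64 ..." lists an address.
            ADDR_RE = re.compile(r"^\s*(inet6?)\s+(\S+)")


            def addresses_by_interface(text):
                """Map each interface name to its IPv4 ("inet") and IPv6 ("inet6") addresses."""
                result = {}
                current = None
                for line in text.splitlines():
                    match = IFACE_RE.match(line)
                    if match:
                        current = match.group(1)
                        result[current] = {"inet": [], "inet6": []}
                        continue
                    match = ADDR_RE.match(line)
                    if match and current is not None:
                        result[current][match.group(1)].append(match.group(2))
                return result


            def dual_stack(interfaces):
                """Return the names of interfaces that have both IPv4 and IPv6 addresses."""
                return [name for name, addrs in interfaces.items()
                        if addrs["inet"] and addrs["inet6"]]


            if __name__ == "__main__":
                print(dual_stack(addresses_by_interface(IP_ADDR_OUTPUT)))
            ```

            On the sample above this reports the loopback device and enP2p1s0f2, the only interface on the ARM node that has been given an IP address; its IPv6 address is the link-local one.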

            ashehata Amir Shehata (Inactive) added a comment - Do any of the nodes have multiple interfaces? If so, please make sure you follow these general Linux routing guidelines: https://wiki.hpdd.intel.com/display/LNet/MR+Cluster+Setup
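
            One quick way to answer the question about multiple interfaces is to check whether several of a node's interfaces sit on the same IP subnet, which is the situation the later comment about moving the IB ports to a different subnet suggests. The sketch below uses Python's standard `ipaddress` module. It does not reproduce the linked guideline, and the interface names and addresses in `example` are hypothetical, not taken from the affected systems.

            ```python
            import ipaddress
            from collections import defaultdict


            def interfaces_sharing_a_subnet(addresses):
                """Group interfaces by IP network and return networks used by more than one.

                `addresses` maps an interface name to an address in CIDR notation.
                """
                by_network = defaultdict(list)
                for name, cidr in addresses.items():
                    network = ipaddress.ip_interface(cidr).network
                    by_network[str(network)].append(name)
                return {net: names for net, names in by_network.items() if len(names) > 1}


            if __name__ == "__main__":
                # Hypothetical addresses, for illustration only.
                example = {
                    "ib0": "10.10.0.5/24",
                    "ib1": "10.10.0.6/24",
                    "ib2": "10.10.1.5/24",
                }
                print(interfaces_sharing_a_subnet(example))
            ```

            An empty result means every interface is on its own subnet; any entry in the result names a subnet where the routing setup deserves a closer look.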

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: simmonsja James A Simmons