Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9823

LNet fails to come up when using lctl but works with lnetctl

Details

    • Bug
    • Resolution: Not a Bug
    • Critical
    • None
    • Lustre 2.10.1, Lustre 2.11.0
    • Seen on various systems with Lustre 2.10 and Lustre 2.11.
    • 3
    • 9223372036854775807

    Description

      On several systems when attempting to bring a lustre system this is reported:

      [188273.054578] LNet: Added LNI 10.0.1.22@tcp [8/256/0/180]
      [188273.054724] LNet: Accept secure, port 988
      [191295.504584] Lustre: Lustre: Build Version: 2.10.0_dirty
      [191300.858629] Lustre: 22140:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1501789735/real 1501789735]  req@ffff800fb09cfc80 x1574740673167376/t0(0) o250->MGC128.219.141.4@tcp@128.219.141.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1501789740 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [191301.858634] LustreError: 22036:0:(mgc_request.c:251:do_config_log_add()) MGC128.219.141.4@tcp: failed processing log, type 1: rc = -5
      [191330.858099] Lustre: 22140:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1501789760/real 1501789760]  req@ffff800fb4910980 x1574740673167424/t0(0) o250->MGC128.219.141.4@tcp@128.219.141.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1501789770 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [191332.858106] LustreError: 15c-8: MGC128.219.141.4@tcp: The configuration from log 'legs-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      [191332.858399] Lustre: Unmounted legs-client
      [191332.859241] LustreError: 22036:0:(obd_mount.c:1505:lustre_fill_super()) Unable to mount  (-5)
      
      

      After investigation this is a symptom of the LNet layer communication failure. This occurs when LNet has been setup with lctl but if one uses lnetctl then this issue appears to go away.

      Attachments

        Issue Links

          Activity

            [LU-9823] LNet fails to come up when using lctl but works with lnetctl

            Okay I git bisect to see when this failure started to happen and its due to the multirail support landing. Currently people moving to lustre 2.10 might find they can't mount lustre at all when deploying a production system.

            simmonsja James A Simmons added a comment - Okay I git bisect to see when this failure started to happen and its due to the multirail support landing. Currently people moving to lustre 2.10 might find they can't mount lustre at all when deploying a production system.

            So on the MDS I see the following lctl dump:

            00000100:00080000:5.0:1508798556.578627:0:4131:0:(pinger.c:405:ptlrpc_pinger_add_import()) adding pingable import 19afc095-abef-a794-2f84-9099c3e67329->MGS
            00000020:01000004:5.0:1508798556.578635:0:4131:0:(obd_mount_server.c:1303:server_start_targets()) starting target sultan-MDT0000
            00000020:01000004:5.0:1508798556.578694:0:4131:0:(obd_mount.c:193:lustre_start_simple()) Starting obd MDS (typ=mds)
            00000020:00000080:5.0:1508798556.578696:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf001
            00000020:00000080:5.0:1508798556.623854:0:4131:0:(genops.c:414:class_newdev()) Allocate new device MDS (ffff8817d14c8000)
            00000020:00000080:5.0:1508798556.623939:0:4131:0:(obd_config.c:431:class_attach()) OBD: dev 2 attached type mds with refcount 1
            00000020:00000080:5.0:1508798556.623945:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf003
            00000020:00000080:7.0:1508798556.670941:0:4131:0:(obd_config.c:542:class_setup()) finished setup of obd MDS (uuid MDS_uuid)
            00000020:01000004:7.0:1508798556.670956:0:4131:0:(obd_mount_server.c:294:server_mgc_set_fs()) Set mgc disk for /dev/sda
            00000040:01000000:7.0:1508798556.673106:0:4131:0:(llog_obd.c:210:llog_setup()) obd MGC10.37.248.67@o2ib1 ctxt 0 is initialized
            00000020:01000004:7.0:1508798556.673119:0:4131:0:(obd_mount_server.c:1208:server_register_target()) Registration sultan-MDT0000, fs=sultan, 10.37.248.155@o2ib1
            , index=0000, flags=0x1
            10000000:01000000:7.0:1508798556.673122:0:4131:0:(mgc_request.c:1253:mgc_set_info_async()) register_target sultan-MDT0000 0x10000001
            10000000:01000000:7.0:1508798556.673144:0:4131:0:(mgc_request.c:1203:mgc_target_register()) register sultan-MDT0000
            00000100:00080000:7.0:1508798556.673152:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000
            x1582089946267680/t0(0) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000100:00000400:3.0F:1508798561.578402:0:3989:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1508798556/r
            eal 0] req@ffff8817d1b40000 x1582089946267664/t0(0) o250->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 520/544 e 0 to 1 dl 1508798561 ref 2 fl Rpc:XN/0
            /ffffffff rc 0/-1
            00000100:00080000:7.0:1508798567.672382:0:4131:0:(client.c:1170:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8817cc750000 x1582089946267680/t0(0
            ) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000020:00080000:7.0:1508798567.672401:0:4131:0:(obd_mount_server.c:1233:server_register_target()) sultan-MDT0000: error registering with the MGS: rc = -110 (
            not fatal)
            00000020:01000004:7.0:1508798567.672406:0:4131:0:(obd_mount_server.c:117:server_register_mount()) register mount ffff8817df73f800 from sultan-MDT0000
            10000000:01000000:7.0:1508798567.672412:0:4131:0:(mgc_request.c:2197:mgc_process_config()) parse_log sultan-MDT0000 from 0
            10000000:01000000:7.0:1508798567.672413:0:4131:0:(mgc_request.c:331:config_log_add()) adding config log sultan-MDT0000: (null)
            10000000:01000000:7.0:1508798567.672416:0:4131:0:(mgc_request.c:211:do_config_log_add()) do adding config log sultan-sptlrpc: (null)
            10000000:01000000:7.0:1508798567.672419:0:4131:0:(mgc_request.c:90:mgc_name2resid()) log sultan-sptlrpc to resid 0x6e61746c7573/0x0 (sultan)
            10000000:01000000:7.0:1508798567.672425:0:4131:0:(mgc_request.c:2062:mgc_process_log()) Process log sultan-sptlrpc: (null) from 1
            10000000:01000000:7.0:1508798567.672427:0:4131:0:(mgc_request.c:1130:mgc_enqueue()) Enqueue for sultan-sptlrpc (res 0x6e61746c7573)
            00000100:00080000:7.0:1508798567.672459:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000
            x1582089946267696/t0(0) o101->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000800:00000400:6.0:1508798569.567382:0:3842:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.37.248.67@o2ib1: 4295778 seconds
            00000100:00080000:1.0:1508798569.578255:0:3989:0:(import.c:1289:ptlrpc_connect_interpret()) ffff8817e66a2800 MGS: changing import state from CONNECTING to DISC
            ONN
            00000100:00080000:1.0:1508798569.578260:0:3989:0:(import.c:1336:ptlrpc_connect_interpret()) recovery of MGS on MGC10.37.248.67@o2ib1_0 failed (-110)

            simmonsja James A Simmons added a comment - So on the MDS I see the following lctl dump: 00000100:00080000:5.0:1508798556.578627:0:4131:0:(pinger.c:405:ptlrpc_pinger_add_import()) adding pingable import 19afc095-abef-a794-2f84-9099c3e67329->MGS 00000020:01000004:5.0:1508798556.578635:0:4131:0:(obd_mount_server.c:1303:server_start_targets()) starting target sultan-MDT0000 00000020:01000004:5.0:1508798556.578694:0:4131:0:(obd_mount.c:193:lustre_start_simple()) Starting obd MDS (typ=mds) 00000020:00000080:5.0:1508798556.578696:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf001 00000020:00000080:5.0:1508798556.623854:0:4131:0:(genops.c:414:class_newdev()) Allocate new device MDS (ffff8817d14c8000) 00000020:00000080:5.0:1508798556.623939:0:4131:0:(obd_config.c:431:class_attach()) OBD: dev 2 attached type mds with refcount 1 00000020:00000080:5.0:1508798556.623945:0:4131:0:(obd_config.c:1144:class_process_config()) processing cmd: cf003 00000020:00000080:7.0:1508798556.670941:0:4131:0:(obd_config.c:542:class_setup()) finished setup of obd MDS (uuid MDS_uuid) 00000020:01000004:7.0:1508798556.670956:0:4131:0:(obd_mount_server.c:294:server_mgc_set_fs()) Set mgc disk for /dev/sda 00000040:01000000:7.0:1508798556.673106:0:4131:0:(llog_obd.c:210:llog_setup()) obd MGC10.37.248.67@o2ib1 ctxt 0 is initialized 00000020:01000004:7.0:1508798556.673119:0:4131:0:(obd_mount_server.c:1208:server_register_target()) Registration sultan-MDT0000, fs=sultan, 10.37.248.155@o2ib1 , index=0000, flags=0x1 10000000:01000000:7.0:1508798556.673122:0:4131:0:(mgc_request.c:1253:mgc_set_info_async()) register_target sultan-MDT0000 0x10000001 10000000:01000000:7.0:1508798556.673144:0:4131:0:(mgc_request.c:1203:mgc_target_register()) register sultan-MDT0000 00000100:00080000:7.0:1508798556.673152:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000 x1582089946267680/t0(0) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 00000100:00000400:3.0F:1508798561.578402:0:3989:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1508798556/r eal 0] req@ffff8817d1b40000 x1582089946267664/t0(0) o250->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 520/544 e 0 to 1 dl 1508798561 ref 2 fl Rpc:XN/0 /ffffffff rc 0/-1 00000100:00080000:7.0:1508798567.672382:0:4131:0:(client.c:1170:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8817cc750000 x1582089946267680/t0(0 ) o253->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 00000020:00080000:7.0:1508798567.672401:0:4131:0:(obd_mount_server.c:1233:server_register_target()) sultan-MDT0000: error registering with the MGS: rc = -110 ( not fatal) 00000020:01000004:7.0:1508798567.672406:0:4131:0:(obd_mount_server.c:117:server_register_mount()) register mount ffff8817df73f800 from sultan-MDT0000 10000000:01000000:7.0:1508798567.672412:0:4131:0:(mgc_request.c:2197:mgc_process_config()) parse_log sultan-MDT0000 from 0 10000000:01000000:7.0:1508798567.672413:0:4131:0:(mgc_request.c:331:config_log_add()) adding config log sultan-MDT0000: (null) 10000000:01000000:7.0:1508798567.672416:0:4131:0:(mgc_request.c:211:do_config_log_add()) do adding config log sultan-sptlrpc: (null) 10000000:01000000:7.0:1508798567.672419:0:4131:0:(mgc_request.c:90:mgc_name2resid()) log sultan-sptlrpc to resid 0x6e61746c7573/0x0 (sultan) 10000000:01000000:7.0:1508798567.672425:0:4131:0:(mgc_request.c:2062:mgc_process_log()) Process log sultan-sptlrpc: (null) from 1 10000000:01000000:7.0:1508798567.672427:0:4131:0:(mgc_request.c:1130:mgc_enqueue()) Enqueue for sultan-sptlrpc (res 0x6e61746c7573) 00000100:00080000:7.0:1508798567.672459:0:4131:0:(client.c:1562:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff8817cc750000 x1582089946267696/t0(0) o101->MGC10.37.248.67@o2ib1@10.37.248.67@o2ib1:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 00000800:00000400:6.0:1508798569.567382:0:3842:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 10.37.248.67@o2ib1: 4295778 seconds 00000100:00080000:1.0:1508798569.578255:0:3989:0:(import.c:1289:ptlrpc_connect_interpret()) ffff8817e66a2800 MGS: changing import state from CONNECTING to DISC ONN 00000100:00080000:1.0:1508798569.578260:0:3989:0:(import.c:1336:ptlrpc_connect_interpret()) recovery of MGS on MGC10.37.248.67@o2ib1_0 failed (-110)

            The problem is far worst with the latest master. It takes about 15 minutes to mount any back end disk. Once it does mount after many hours with a 56 OST/16 MDT system the client fails to mount.

            simmonsja James A Simmons added a comment - The problem is far worst with the latest master. It takes about 15 minutes to mount any back end disk. Once it does mount after many hours with a 56 OST/16 MDT system the client fails to mount.

            I just attempted to bring up our regular testing file system on normal RHEL7 x86 with the latest lustre 2.10.1 and I'm seeing this error. Will try lustre 2.10.54 next.

            simmonsja James A Simmons added a comment - I just attempted to bring up our regular testing file system on normal RHEL7 x86 with the latest lustre 2.10.1 and I'm seeing this error. Will try lustre 2.10.54 next.

            Some one reported this problem also on Power8. I gathered a debug log from the client side. I looked on the server side and I saw this bug which is this new to me.

            [606005.245327] LNetError: 2606:0:(acceptor.c:406:lnet_acceptor()) Refusing connection from 128.219.141.3: insecure port 55084
            [606023.489869] LNetError: 2606:0:(acceptor.c:406:lnet_acceptor()) Refusing connection from 128.219.141.3: insecure port 37668

            simmonsja James A Simmons added a comment - Some one reported this problem also on Power8. I gathered a debug log from the client side. I looked on the server side and I saw this bug which is this new to me. [606005.245327] LNetError: 2606:0:(acceptor.c:406:lnet_acceptor()) Refusing connection from 128.219.141.3: insecure port 55084 [606023.489869] LNetError: 2606:0:(acceptor.c:406:lnet_acceptor()) Refusing connection from 128.219.141.3: insecure port 37668

            cat /etc/SuSE-release
            SUSE Linux Enterprise Server 12 (aarch64)
            VERSION = 12
            PATCHLEVEL = 2

            1. This file is deprecated and will be removed in a future service pack or release.
            2. Please check /etc/os-release for details about this release.

            >uname -r
            4.4.59-92.20-default

            simmonsja James A Simmons added a comment - cat /etc/SuSE-release SUSE Linux Enterprise Server 12 (aarch64) VERSION = 12 PATCHLEVEL = 2 This file is deprecated and will be removed in a future service pack or release. Please check /etc/os-release for details about this release. >uname -r 4.4.59-92.20-default
            pjones Peter Jones added a comment -

            James what version of SLES12 do you mean?

            pjones Peter Jones added a comment - James what version of SLES12 do you mean?

            People

              ashehata Amir Shehata (Inactive)
              simmonsja James A Simmons
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: