Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Environment: CentOS 7.6 (3.10.0-957.5.1.el7.x86_64), Lustre 2.12.0
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      On clients, we're using lnet.service with the following config:

      [root@sh-112-12 ~]# cat /etc/lnet.conf 
      net:
          - net type: o2ib4
            local NI(s):
              - nid:
                interfaces:
                    0: ib0
      route: 
          - net: o2ib1
            gateway: 10.9.0.[31-32]@o2ib4
          - net: o2ib5
            gateway: 10.9.0.[41-42]@o2ib4
          - net: o2ib7
            gateway: 10.9.0.[21-24]@o2ib4
      [root@sh-112-12 ~]# lctl list_nids
      10.10.112.12@tcp
      10.9.112.12@o2ib4
      
      [root@sh-112-12 ~]# dmesg | grep -i lnet
      [  397.762804] LNet: HW NUMA nodes: 2, HW CPU cores: 20, npartitions: 2
      [  398.995449] LNet: 13837:0:(socklnd.c:2655:ksocknal_enumerate_interfaces()) Ignoring interface enp4s0f1 (down)
      [  399.005708] LNet: Added LNI 10.10.112.12@tcp [8/256/0/180]
      [  399.011316] LNet: Accept secure, port 988
      [  399.060725] LNet: Using FastReg for registration
      [  399.075936] LNet: Added LNI 10.9.112.12@o2ib4 [8/256/0/180]
      

      It is unclear why LNet brings up the 10.10.112.12@tcp NID at this point: /etc/lnet.conf only configures the o2ib4 network on ib0, yet lctl list_nids and dmesg both show a tcp LNI being added.
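
      For reference, the unwanted NI can be removed by hand once it has been added (this is the same command used in the drop-in workaround at the bottom of this ticket; shown here only as a sketch):

      # remove the dynamically added tcp network, leaving the o2ib4 NI in place
      lnetctl net del --net tcp
      lctl list_nids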

       

      client network config:

      [root@sh-112-12 ~]# ip addr
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host 
             valid_lft forever preferred_lft forever
      2: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
          link/ether 0c:c4:7a:dc:96:ae brd ff:ff:ff:ff:ff:ff
          inet 10.10.112.12/16 brd 10.10.255.255 scope global enp4s0f0
             valid_lft forever preferred_lft forever
          inet6 fe80::ec4:7aff:fedc:96ae/64 scope link 
             valid_lft forever preferred_lft forever
      3: enp4s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 0c:c4:7a:dc:96:af brd ff:ff:ff:ff:ff:ff
      4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
          link/infiniband 20:00:10:8b:fe:80:00:00:00:00:00:00:24:8a:07:03:00:a0:9e:20 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
          inet 10.9.112.12/16 brd 10.9.255.255 scope global ib0
             valid_lft forever preferred_lft forever
          inet6 fe80::268a:703:a0:9e20/64 scope link 
             valid_lft forever preferred_lft forever
      

       

      lnet.service origin:

      [root@sh-112-12 ~]# rpm -qf /usr/lib/systemd/system/lnet.service 
      lustre-client-2.12.0-1.el7.x86_64
      [root@sh-112-12 ~]# rpm -q --info lustre-client
      Name        : lustre-client
      Version     : 2.12.0
      Release     : 1.el7
      Architecture: x86_64
      Install Date: Wed 06 Feb 2019 10:13:52 AM PST
      Group       : System Environment/Kernel
      Size        : 2007381
      License     : GPL
      Signature   : (none)
      Source RPM  : lustre-client-2.12.0-1.el7.src.rpm
      Build Date  : Fri 21 Dec 2018 01:53:18 PM PST
      Build Host  : trevis-307-el7-x8664-3.trevis.whamcloud.com
      Relocations : (not relocatable)
      URL         : https://wiki.whamcloud.com/
      Summary     : Lustre File System
      Description :
      Userspace tools and files for the Lustre file system.
      [root@sh-112-12 ~]# cat /usr/lib/systemd/system/lnet.service 
      [Unit]
      Description=lnet management
      
      Requires=network-online.target
      After=network-online.target openibd.service rdma.service
      
      ConditionPathExists=!/proc/sys/lnet/
      
      [Service]
      Type=oneshot
      RemainAfterExit=true
      ExecStart=/sbin/modprobe lnet
      ExecStart=/usr/sbin/lnetctl lnet configure
      ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
      ExecStop=/usr/sbin/lustre_rmmod ptlrpc
      ExecStop=/usr/sbin/lnetctl lnet unconfigure
      ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs
      
      [Install]
      WantedBy=multi-user.target
      

      This leads to many issues server-side with 2.12, as reported in LU-11888 and LU-11936.

      Thanks!
      Stephane

      Attachments

        Activity

          [LU-11937] lnet.service randomly loads tcp NIDs
          pjones Peter Jones added a comment -

          ok - thanks

          sthiell Stephane Thiell added a comment - - edited

          Yes, good news! We appreciate the help, thanks! Sorry it took us so much time to figure that out. We've also added some "rogue NID monitoring" on the server side, just in case some clients continue to be misconfigured. We prefer to leave the default lnet discovery enabled for now, but it's good to know that we do have the option to disable it if we want to. I'm OK with considering this ticket resolved at this point.

          pjones Peter Jones added a comment -

          Good news - so can we consider this ticket resolved?


          sthiell Stephane Thiell added a comment -

          Hi Amir,

          Thanks for your help! This was useful. I confirm that lnetctl lnet configure does not configure any networks, my bad!

          I guess we've just figured out what was wrong in our setup, and you were very close: a service that mounts our Lustre filesystems was doing a modprobe lustre, and in some (rare) cases the filesystem mount that followed ran at the same time as lnet.service, leading to a tcp NID being propagated to the servers and causing the trouble I described on the server side. We have fixed the dependencies of the boot-time services on our clients, so this hopefully should not happen anymore!

          modprobe lustre seems to do the same as lctl net up and does configure a default tcp network if lnet is not configured yet.

          Sorry for the noise, after all, it looks like this was never a problem of lnet.service.
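
          As an illustration of such a dependency fix (a sketch only, not the exact configuration used here), a systemd drop-in that forces the mount service to start after lnet.service could look like the following; the shine.service name is taken from the logs later in this ticket and the drop-in path is an assumption:

          # /etc/systemd/system/shine.service.d/after-lnet.conf  (hypothetical path)
          [Unit]
          # do not start the Lustre mount service until lnet.service has finished
          # running "lnetctl import /etc/lnet.conf"
          Requires=lnet.service
          After=lnet.service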


          ashehata Amir Shehata (Inactive) added a comment -

          Hi Stephane,

          lnetctl lnet configure should not configure any networks. The default tcp would get configured if somewhere you're doing lctl net up. That would load the default tcp network.

          To disable discovery you can add

          options lnet lnet_peer_discovery_disabled=1

          on all the nodes.
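
          For example, as a persistent setting this module option would typically live in a modprobe configuration file (a sketch; the exact file name is an assumption):

          # /etc/modprobe.d/lnet.conf  (example placement)
          # disable LNet peer discovery so only explicitly configured peer NIDs are used
          options lnet lnet_peer_discovery_disabled=1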

          My hunch at the moment is that there are some nodes which are using lctl net up or lnetctl lnet configure --all. This would lead to the tcp network being loaded, especially if you don't have an "options lnet networks=..." line in your "modprobe.d/lnet.conf" file.

          Would you be able to check that?
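
          For reference, the "options lnet networks" line mentioned above might look like the following (a sketch using the o2ib4/ib0 names from the client config in this ticket; the values are illustrative):

          # /etc/modprobe.d/lnet.conf  (example)
          # pin LNet to the o2ib4 network on ib0 so an implicit "net up" does not
          # fall back to the default tcp network
          options lnet networks="o2ib4(ib0)"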


          sthiell Stephane Thiell added a comment -

          Amir, our lnet.conf on the clients is as follows:

          [root@sh-101-01 ~]# cat /etc/lnet.conf 
          global:
              - retry_count: 0
              - health_sensitivity: 0
              - transaction_timeout: 10
          net:
              - net type: o2ib4
                local NI(s):
                  - nid:
                    interfaces:
                        0: ib0
          route: 
              - net: o2ib5
                gateway: 10.9.0.[41-42]@o2ib4
              - net: o2ib7
                gateway: 10.9.0.[21-24]@o2ib4
          

          But when lnet.service runs, it does lnetctl lnet configure before importing that file, which I think might propagate a tcp NID in some rare cases.

          How can we be sure to disable discovery everywhere without any race condition? How do you do that? We really don't use multi-rail at all in our case. Thanks!!
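
          For a node that is already up, discovery can also be toggled at runtime (a sketch; assuming the lnetctl "set" and "global show" sub-commands behave as documented for 2.12):

          # disable peer discovery on a running node (runtime only, not persistent)
          lnetctl set discovery 0
          # confirm the current global settings, including the discovery flag
          lnetctl global show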


          ashehata Amir Shehata (Inactive) added a comment -

          If the TCP network is configured on the node, then it'll be propagated due to the discovery feature. There are three solutions:

          1. Remove the tcp nid from all the nodes if you don't need it.
          2. Turn off discovery on all the nodes.
          3. Explicitly configure the peers (but that would be a lot of config). I believe you can use ip2nets syntax though.
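
          For reference, ip2nets is an lnet module option that selects which local network(s) to bring up based on the node's IP addresses; a sketch using the subnet from this ticket (the values are illustrative, not a recommendation):

          # /etc/modprobe.d/lnet.conf  (example)
          # only bring up o2ib4 on ib0 for nodes whose address matches 10.9.*.*;
          # no tcp network is listed, so no tcp NID is ever created
          options lnet ip2nets="o2ib4(ib0) 10.9.*.*"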

          sthiell Stephane Thiell added a comment -

          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
          peer:
              - primary nid: 10.10.23.14@tcp
                Multi-Rail: True
                peer ni:
                  - nid: 10.8.23.14@o2ib6
                    state: NA
                  - nid: 10.10.23.14@tcp
                    state: NA
          [root@fir-md1-s1 fir-MDT0000]# lctl ping 10.10.23.14@tcp
          failed to ping 10.10.23.14@tcp: Input/output error
          [root@fir-md1-s1 fir-MDT0000]# lctl ping 10.8.23.14@o2ib6
          12345-0@lo
          12345-10.8.23.14@o2ib6
          

          I was able to manually remove the TCP nid with this:

          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer del --prim_nid 10.10.23.14@tcp --nid 10.10.23.14@tcp
          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.10.23.14@tcp
          show:
              - peer:
                    errno: -2
                    descr: "cannot get peer information: No such file or directory"
          
          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
          show:
              - peer:
                    errno: -2
                    descr: "cannot get peer information: No such file or directory"
          
          [root@fir-md1-s1 fir-MDT0000]# lnetctl peer show --nid 10.8.23.14@o2ib6
          peer:
              - primary nid: 10.8.23.14@o2ib6
                Multi-Rail: True
                peer ni:
                  - nid: 10.8.23.14@o2ib6
                    state: NA
          
          sthiell Stephane Thiell added a comment - - edited

          Hello Peter,

          This problem is still there and caused some trouble for us last weekend. Apparently, despite our lnet.service workaround on the clients, a tcp NID was able to make its way to the Fir servers (2.12.3), which after a few hours caused an MDT deadlock.

          NOTE: For us, this is a major blocker for migrating Oak from 2.10 to 2.12, as multi-rail can cause this kind of issue, especially when storage and compute are separated and managed by different teams. A misconfigured client can cause this kind of trouble on the server side. Note that in our case we don't want any tcp NID at all, but as far as I know there is no way to avoid that in 2.12. In 2.10, there is no risk of having this situation on the servers.

          We tracked down the problem today to the lnet.service script, which induces a race between lnet configure and lnet import:

           

          # cat /usr/lib/systemd/system/lnet.service
          [Unit]
          Description=lnet management
          
          Requires=network-online.target
          After=network-online.target openibd.service rdma.service opa.service
          
          ConditionPathExists=!/proc/sys/lnet/
          
          [Service]
          Type=oneshot
          RemainAfterExit=true
          ExecStart=/sbin/modprobe lnet
          ExecStart=/usr/sbin/lnetctl lnet configure
          ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
          ExecStop=/usr/sbin/lustre_rmmod ptlrpc
          ExecStop=/usr/sbin/lnetctl lnet unconfigure
          ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs
          
          [Install]
          WantedBy=multi-user.target
           

          Even with our workaround, which works in most cases, some clients can still show up with a tcp0 NID at lnet configure, and thus there is a risk of them announcing themselves with a tcp NID when the filesystem tries to mount:

          2019-11-05T22:10:26-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...
          2019-11-05T22:10:26-08:00 sh-117-11 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.10.117.11@tcp [8/256/0/180]
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Accept secure, port 988
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Using FastReg for registration
          2019-11-05T22:10:27-08:00 sh-117-11 kernel: LNet: Added LNI 10.9.117.11@o2ib4 [8/256/0/180]
          2019-11-05T22:10:29-08:00 sh-117-11 kernel: LNet: Removed LNI 10.10.117.11@tcp
          2019-11-05T22:10:43-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 38 seconds
          2019-11-05T22:11:06-08:00 sh-117-11 kernel: LNet: 8157:0:(o2iblnd_cb.c:3396:kiblnd_check_conns()) Timed out tx for 10.9.0.24@o2ib4: 61 seconds
          2019-11-05T22:11:08-08:00 sh-117-11 shine: Starting shine:  WARNING: Nothing to mount on sh-117-11 for `regal'
          2019-11-05T22:11:08-08:00 sh-117-11 shine: WARNING: Nothing was done for `regal'.
          2019-11-05T22:11:08-08:00 sh-117-11 shine: Mount of fir on /scratch failed
          2019-11-05T22:11:08-08:00 sh-117-11 shine: >> mount.lustre: mount 10.0.10.51@o2ib7:10.0.10.52@o2ib7:/fir at /scratch failed: Input/output error
          2019-11-05T22:11:08-08:00 sh-117-11 shine: Is the MGS running?
          2019-11-05T22:11:08-08:00 sh-117-11 shine: [FAILED]
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service: control process exited, code=exited status=16
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: Failed to start SYSV: Lustre shine mounting script.
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: Unit shine.service entered failed state.
          2019-11-05T22:11:08-08:00 sh-117-11 systemd: shine.service failed.
          2019-11-05T22:11:33-08:00 sh-117-11 systemd: Starting SYSV: Lustre shine mounting script...
          

          Then, the servers keep trying to contact these clients via the erroneous tcp NID, even though they have neither a tcp interface nor a route for it, and we end up with problems like these:

          [Sat Nov  9 23:58:14 2019][461981.327905] LustreError: 80884:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0fe1e7f7400 ns: mdt-fir-MDT0000_UUID lock: ffffa11943671200/0x675684f65f9baf7 lrc: 3/0,0 mode: CR/CR res: [0x200038966:0x418:0x0].0x0 bits 0x9/0x0 rrc: 2 type: IBT flags: 0x50200000000000 nid: 10.10.117.11@tcp remote: 0x541e831b11b117da expref: 250 pid: 80884 timeout: 0 lvb_type: 0^M
          

          We also think that the backtraces below are due to MDT threads being stuck with tcp NIDs:

          [Sun Nov 10 06:24:59 2019][485187.377753] Pid: 67098, comm: mdt03_047 3.10.0-957.27.2.el7_lustre.pl1.x86_64 #1 SMP Mon Aug 5 15:28:37 PDT 2019^M
          [Sun Nov 10 06:25:00 2019][485187.388020] Call Trace:^M
          [Sun Nov 10 06:25:00 2019][485187.390566]  [<ffffffffc10ccb75>] ldlm_completion_ast+0x4e5/0x860 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.397597]  [<ffffffffc10cd5e1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.404884]  [<ffffffffc15d850b>] mdt_object_local_lock+0x50b/0xb20 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.411808]  [<ffffffffc15d8b90>] mdt_object_lock_internal+0x70/0x360 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.418892]  [<ffffffffc15da40d>] mdt_getattr_name_lock+0x101d/0x1c30 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.425989]  [<ffffffffc15e1d25>] mdt_intent_getattr+0x2b5/0x480 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.432638]  [<ffffffffc15debb5>] mdt_intent_policy+0x435/0xd80 [mdt]^M
          [Sun Nov 10 06:25:00 2019][485187.439230]  [<ffffffffc10b3d46>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.446073]  [<ffffffffc10dc336>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.453272]  [<ffffffffc1164a12>] tgt_enqueue+0x62/0x210 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.459514]  [<ffffffffc116936a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.466549]  [<ffffffffc111024b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.474357]  [<ffffffffc1113bac>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]^M
          [Sun Nov 10 06:25:00 2019][485187.480780]  [<ffffffffbe8c2e81>] kthread+0xd1/0xe0^M
          [Sun Nov 10 06:25:00 2019][485187.485775]  [<ffffffffbef77c24>] ret_from_fork_nospec_begin+0xe/0x21^M
          [Sun Nov 10 06:25:00 2019][485187.492327]  [<ffffffffffffffff>] 0xffffffffffffffff^M
          

          Yesterday, MDT0 on Fir went down and was completely hung, with thousands of earlier messages showing tcp NIDs and backtraces like the one above, all of that apparently due to 2 clients announcing themselves as having a tcp NID.

          Please advise if there is a way to completely disable multi-rail and avoid this situation. I would recommend increasing the severity of this issue, as it has caused a lot of trouble since 2.12, but I'm glad we're finally making progress. Thanks much!
           

          sthiell Stephane Thiell added a comment - - edited

          Thanks, we're trying this drop-in file as a workaround on all clients:

          /etc/systemd/system/lnet.service.d/deps.conf

          [Unit]
          After=dkms.service network.service
          
          [Service]
          # we don't want tcp nids
          ExecStartPost=-/usr/sbin/lnetctl net del --net tcp
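
          After installing the drop-in, a sketch of how the result can be verified (standard systemd and lnetctl commands):

          systemctl daemon-reload
          systemctl restart lnet
          # confirm that only the o2ib4 NI remains and no tcp NI is configured
          lnetctl net show
          lctl list_nids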
          

          People

            ashehata Amir Shehata (Inactive)
            sthiell Stephane Thiell
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: