
[LU-11888] Unreachable client NID confusing Lustre 2.12

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6

    Description

      Just wanted to report this, although probably not critical. During testing of 2.12.0 on IB only (o2ib with routers), we mistakenly set up a client with two NIDs, one on tcp0:

      [root@sh-06-33 ~]# lctl list_nids
      10.10.6.33@tcp
      10.8.6.33@o2ib6
      
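      For reference, a minimal sketch (interface names are assumptions, not taken from this cluster) of how such an extra tcp0 NID typically appears and how it can be removed on the client:

      # If the client's LNet config lists both networks, e.g. in
      # /etc/modprobe.d/lustre.conf:
      #   options lnet networks="tcp0(eth0),o2ib6(ib0)"
      # then LNet brings up a NID on each of them.
      #
      # Show the configured nets and drop the unintended tcp0 one at runtime:
      lnetctl net show
      lnetctl net del --net tcp
      # Then make the change persistent, e.g. networks="o2ib6(ib0)", and verify:
      lctl list_nids
      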

      This confused the Lustre servers a LOT:

      [663974.083382] LNetError: 124939:0:(peer.c:2480:lnet_peer_merge_data()) Error deleting NID 10.10.6.33@tcp from peer 10.10.6.33@tcp: -16
      [663974.095393] Lustre: MGS: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
      [663981.577418] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) no route to 10.10.6.33@tcp from <?>
      [663981.588032] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) Skipped 6171721 previous similar messages
      [663981.599239] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548442345/real 1548442345]  req@ffff9149c37fb600 x1623580872768976/t0(0) o104->fir-MDT0000@10.10.6.33@tcp:15/16 lens 296/224 e 0 to 1 dl 1548442356 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
      [663981.626855] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6167666 previous similar messages
      [664132.508056] LustreError: 127396:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.10.6.33@tcp) failed to reply to blocking AST (req@ffff9149c37fb600 x1623580872768976 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 109 pid: 125204 timeout: 664264 lvb_type: 0
      [664132.550562] LustreError: 138-a: fir-MDT0000: A client on nid 10.10.6.33@tcp was evicted due to a lock blocking callback time out: rc -110
      [664132.563014] LustreError: 125084:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 10.10.6.33@tcp  ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 110 pid: 125204 timeout: 0 lvb_type: 0
      

      But the main problem is that even after the client rebooted without the tcp0 NID, the server was still logging things like:

      [664150.993807] Lustre: fir-MDT0000: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
      

      Re: this last line, it was logged after the client had rebooted. It looks like the server only prints the first client NID, but in that case it remembered the client's old tcp0 NID from before the reboot, which is weird...
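      One way to confirm this (a sketch; the hostname and NID are the ones shown above) is to inspect the server's LNet peer table, since 2.12 keeps a Multi-Rail peer entry learned through discovery:

      [root@fir-md1-s1 ~]# lnetctl peer show --nid 10.10.6.33@tcp
      # If the old tcp NID still appears under "peer ni" for this peer, the
      # server is still treating it as an interface of that client even after
      # the client rebooted without it.
      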

      The servers are using o2ib only:

      [root@fir-md1-s1 ~]# lctl list_nids
      10.0.10.51@o2ib7
      [root@fir-md1-s1 ~]# lctl route_list
      net              o2ib4 hops 4294967295 gw                10.0.10.210@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.209@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.211@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.212@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.202@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.204@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.201@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.203@o2ib7 up pri 0
      

      We were wondering how this is even possible. The only way we found to fix this in a timely manner was to restart the Lustre servers.

      Stephane

          Activity


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hello,
            I believe I just ran into this same issue as well.
            Both clients and servers are RHEL 7.8, running 2.12.5, MOFED 4.9.

            We had a single client refusing to mount the filesystem. This client has NID:

            [root@cpu-p-10 ~]# lctl list_nids
            10.44.161.10@o2ib2
            

            However the server logged the following, similar to Stephane's case:

            Nov 02 13:29:53 rds-mds7 kernel: LNetError: 168122:0:(peer.c:2453:lnet_peer_merge_data()) Error deleting NID 10.43.161.10@tcp from peer 10.43.161.10@tcp: -16
            

            That IP address is the IP of the ethernet interface on this node (it only differs in the second octet). This error likely started when LNet was previously misconfigured, or when no config was given to lnetctl, so it fell back to the first TCP interface on the node.
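            As a sketch of that failure mode (interface names here are assumptions): with no explicit LNet configuration, loading the Lustre modules brings up tcp0 on the first ethernet interface, so the client briefly advertises a tcp NID that can later get stuck in the servers' peer tables. Pinning the intended network avoids this:

            # Either via module options in /etc/modprobe.d/lustre.conf:
            #   options lnet networks="o2ib2(ib0)"
            # or dynamically with lnetctl, then saved for reboots:
            lnetctl net add --net o2ib2 --if ib0
            lnetctl export > /etc/lnet.conf
            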

            Similar to Stephane's case, this error persisted across reboots. Fortunately, I could fix it by just manually deleting the peer entry on the server:

            # Before
            [root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp
            peer:
                - primary nid: 10.43.161.10@tcp
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.44.161.10@o2ib2
                      state: NA
                    - nid: 10.43.161.10@tcp
                      state: NA
            
            [root@rds-mds7 ~]# lnetctl peer del --prim_nid 10.43.161.10@tcp                                                                                                                                                     
            [root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp                                                                                                                                                         
            show:
                - peer:
                      errno: -2
                      descr: "cannot get peer information: No such file or directory"
            

            Now the mount works correctly.
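            In case it helps others, a small sketch (server hostnames other than rds-mds7 are placeholders) for applying the same cleanup on every server that learned the stale peer, without restarting Lustre:

            for srv in rds-mds7 rds-oss1 rds-oss2; do
                ssh "$srv" "lnetctl peer del --prim_nid 10.43.161.10@tcp"
            done
            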

            I don't think this adds anything, but just wanted to +1 this ticket. First time we've seen this, so will keep an eye out if this happens again.

            I took a look at the linked LU-11840, but the workaround described there (disabling discovery on the client) didn't fix this for me.
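            For completeness, this is the LU-11840 workaround that was tried (a sketch; as noted above, it did not help in this case):

            # Disable dynamic peer discovery on the client and check the setting:
            lnetctl set discovery 0
            lnetctl global show
            # To keep it across reboots, the setting can also live in the client's
            # /etc/lnet.conf (a "discovery: 0" entry under the global section).
            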

            Cheers,
            Matt

            ashehata Amir Shehata (Inactive) added a comment - This looks similar to this issue: https://jira.whamcloud.com/browse/LU-11840
            pjones Peter Jones added a comment -

            Sonia

            Could you please investigate?

            Peter


            People

              Assignee: sharmaso Sonia Sharma (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 6
