[LU-11888] Unreachable client NID confusing Lustre 2.12 Created: 25/Jan/19  Updated: 02/Nov/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Stephane Thiell Assignee: Sonia Sharma (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.6


Issue Links:
Duplicate
is duplicated by LU-11936 High ldlm load, slow/unusable filesystem Reopened
Related
is related to LU-14107 Keep peer representation consistent a... Open
Rank (Obsolete): 9223372036854775807

 Description   

Just wanted to report this, although probably not critical. During testing of 2.12.0 on IB only (o2ib with routers), we mistakenly set up a client with two NIDs, one on tcp0:

[root@sh-06-33 ~]# lctl list_nids
10.10.6.33@tcp
10.8.6.33@o2ib6

This confused the Lustre servers a LOT:

[663974.083382] LNetError: 124939:0:(peer.c:2480:lnet_peer_merge_data()) Error deleting NID 10.10.6.33@tcp from peer 10.10.6.33@tcp: -16
[663974.095393] Lustre: MGS: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
[663981.577418] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) no route to 10.10.6.33@tcp from <?>
[663981.588032] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) Skipped 6171721 previous similar messages
[663981.599239] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548442345/real 1548442345]  req@ffff9149c37fb600 x1623580872768976/t0(0) o104->fir-MDT0000@10.10.6.33@tcp:15/16 lens 296/224 e 0 to 1 dl 1548442356 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
[663981.626855] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6167666 previous similar messages
[664132.508056] LustreError: 127396:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.10.6.33@tcp) failed to reply to blocking AST (req@ffff9149c37fb600 x1623580872768976 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 109 pid: 125204 timeout: 664264 lvb_type: 0
[664132.550562] LustreError: 138-a: fir-MDT0000: A client on nid 10.10.6.33@tcp was evicted due to a lock blocking callback time out: rc -110
[664132.563014] LustreError: 125084:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 10.10.6.33@tcp  ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 110 pid: 125204 timeout: 0 lvb_type: 0

But the main problem is that even after the client rebooted without the tcp0 NID, the server was still logging things like:

[664150.993807] Lustre: fir-MDT0000: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)

Re: this last line, it was logged after the client had rebooted. It looks like the server only prints the first client NID, but in this case it remembered the client's previous tcp0 NID, which is weird...

The servers are using o2ib only:

[root@fir-md1-s1 ~]# lctl list_nids
10.0.10.51@o2ib7
[root@fir-md1-s1 ~]# lctl route_list
net              o2ib4 hops 4294967295 gw                10.0.10.210@o2ib7 up pri 0
net              o2ib4 hops 4294967295 gw                10.0.10.209@o2ib7 up pri 0
net              o2ib4 hops 4294967295 gw                10.0.10.211@o2ib7 up pri 0
net              o2ib4 hops 4294967295 gw                10.0.10.212@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.202@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.204@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.201@o2ib7 up pri 0
net              o2ib6 hops 4294967295 gw                10.0.10.203@o2ib7 up pri 0

We were wondering how this is even possible. To fix this in a timely manner, we had to restart the Lustre servers.
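
The symptom above comes down to the server holding a NID on a network it has neither locally nor via a route. A minimal sketch of that check follows, written as plain shell over the text output of `lctl list_nids` and `lctl route_list`. The function names (`norm_net`, `nid_net`, `net_reachable`) are hypothetical helpers, not Lustre tools, and treating only routes marked `up` as usable is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Lustre.

# Normalize an LNet network name so that "tcp" and "tcp0" compare equal.
norm_net() {
    case "$1" in
        *[0-9]) printf '%s\n' "$1" ;;
        *)      printf '%s0\n' "$1" ;;
    esac
}

# Print the normalized network part of a NID, e.g. 10.10.6.33@tcp -> tcp0.
nid_net() {
    norm_net "${1##*@}"
}

# net_reachable NET LOCAL_NIDS ROUTES
#   LOCAL_NIDS: text in the format of `lctl list_nids`
#   ROUTES:     text in the format of `lctl route_list`
# Returns 0 if NET is a local network or has a route marked "up"
# (assumption: routes in any other state are treated as unusable).
net_reachable() {
    want=$(norm_net "$1")
    for nid in $2; do
        if [ "$(nid_net "$nid")" = "$want" ]; then
            return 0
        fi
    done
    printf '%s\n' "$3" | awk -v want="$want" '
        $1 == "net" {
            net = $2
            if (net !~ /[0-9]$/) net = net "0"
            if (net == want && $0 ~ / up /) found = 1
        }
        END { exit !found }'
}
```

With the outputs shown above, `net_reachable tcp "$(lctl list_nids)" "$(lctl route_list)"` would return non-zero on fir-md1-s1, since that server has only o2ib7 locally and routes only to o2ib4 and o2ib6.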

Stephane



 Comments   
Comment by Peter Jones [ 25/Jan/19 ]

Sonia

Could you please investigate?

Peter

Comment by Amir Shehata (Inactive) [ 28/Jan/19 ]

This looks similar to this issue:

https://jira.whamcloud.com/browse/LU-11840

Comment by Matt Rásó-Barnett (Inactive) [ 02/Nov/20 ]

Hello,
I believe I just ran into this same issue as well.
Both clients and servers are RHEL 7.8, running 2.12.5, MOFED 4.9.

We had a single client refusing to mount the filesystem. This client has NID:

[root@cpu-p-10 ~]# lctl list_nids
10.44.161.10@o2ib2

However, the server logged the following, similar to what Stephane saw:

Nov 02 13:29:53 rds-mds7 kernel: LNetError: 168122:0:(peer.c:2453:lnet_peer_merge_data()) Error deleting NID 10.43.161.10@tcp from peer 10.43.161.10@tcp: -16

That IP address is the IP of the Ethernet interface on this node (it differs from the o2ib2 address only in the second octet). This error likely started when LNet was previously misconfigured on this client, or when no config was given to lnetctl, so it used the first TCP interface on the node.
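
To catch this kind of misconfiguration before mounting, one option is a small pre-flight check on the client. This is only a sketch: `unexpected_nids` is a hypothetical helper that reads text in the format of `lctl list_nids` on stdin and prints any NID whose network is not in an allowed list. It treats `tcp` and `tcp0` as the same network.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Lustre.
# unexpected_nids ALLOWED_NETS
#   Reads `lctl list_nids`-style text on stdin and prints every NID whose
#   network is not in the space-separated ALLOWED_NETS list.
unexpected_nids() {
    awk -v allowed="$1" '
        # Append the default network number 0 when none is given.
        function norm(n) { return (n ~ /[0-9]$/) ? n : n "0" }
        BEGIN {
            count = split(allowed, a, " ")
            for (i = 1; i <= count; i++) ok[norm(a[i])] = 1
        }
        NF {
            net = $1
            sub(/^.*@/, "", net)
            if (!(norm(net) in ok)) print $1
        }'
}
```

For example, `lctl list_nids | unexpected_nids "o2ib2"` prints nothing on a correctly configured client here, and would print the stray `@tcp` NID otherwise.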

As in Stephane's case, this error persisted across reboots. Fortunately, I could fix it by manually deleting the peer entry on the server:

# Before
[root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp
peer:
    - primary nid: 10.43.161.10@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.44.161.10@o2ib2
          state: NA
        - nid: 10.43.161.10@tcp
          state: NA

[root@rds-mds7 ~]# lnetctl peer del --prim_nid 10.43.161.10@tcp
[root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp
show:
    - peer:
          errno: -2
          descr: "cannot get peer information: No such file or directory"

Now the mount works correctly.
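
If several clients are affected, the same cleanup could be scripted. The following is a rough sketch, not a tested procedure: `stale_primary_nids` and `print_peer_del_cmds` are hypothetical helpers that parse text in the format of the `lnetctl peer show` output above and only print the matching `lnetctl peer del --prim_nid` commands as a dry run. It assumes any peer whose primary NID is on the given network (here `tcp`) is stale, which holds only on servers that have no such network configured.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Lustre.

# stale_primary_nids NET
#   Reads `lnetctl peer show`-style text on stdin and prints every primary
#   NID whose network is NET ("tcp" and "tcp0" are treated as equal).
stale_primary_nids() {
    awk -v want="$1" '
        /primary nid:/ {
            nid = $NF
            net = nid
            sub(/^.*@/, "", net)
            if (net !~ /[0-9]$/) net = net "0"
            w = want
            if (w !~ /[0-9]$/) w = w "0"
            if (net == w) print nid
        }'
}

# print_peer_del_cmds NET
#   Dry run: prints the delete command for each match instead of running it.
print_peer_del_cmds() {
    stale_primary_nids "$1" | while read -r nid; do
        printf 'lnetctl peer del --prim_nid %s\n' "$nid"
    done
}
```

Running `lnetctl peer show | print_peer_del_cmds tcp` on the server would list the commands to review before executing any of them by hand.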

I don't think this adds anything, but just wanted to +1 this ticket. First time we've seen this, so will keep an eye out if this happens again.

I took a look at the linked LU-11840, but the workaround described there (disabling discovery on the client) didn't fix this for me.

Cheers,
Matt
