
[LU-11888] Unreachable client NID confusing Lustre 2.12

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.0
    • Environment: CentOS 7.6

    Description

      Just wanted to report this, although probably not critical. During testing of 2.12.0 on IB only (o2ib with routers), we mistakenly set up a client with two NIDs, one on tcp0:

      [root@sh-06-33 ~]# lctl list_nids
      10.10.6.33@tcp
      10.8.6.33@o2ib6
      
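      For reference, a minimal sketch (interface names are assumptions, not taken from this cluster) of how such an extra tcp0 NID typically appears and how it can be removed on the client:

      # If the client's LNet config lists both networks, e.g. in
      # /etc/modprobe.d/lustre.conf:
      #   options lnet networks="tcp0(eth0),o2ib6(ib0)"
      # then LNet brings up a NID on each of them.
      #
      # Show the configured nets and drop the unintended tcp0 one at runtime:
      lnetctl net show
      lnetctl net del --net tcp
      # Then make the change persistent, e.g. networks="o2ib6(ib0)", and verify:
      lctl list_nids
      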

      This confused the Lustre servers a LOT:

      [663974.083382] LNetError: 124939:0:(peer.c:2480:lnet_peer_merge_data()) Error deleting NID 10.10.6.33@tcp from peer 10.10.6.33@tcp: -16
      [663974.095393] Lustre: MGS: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
      [663981.577418] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) no route to 10.10.6.33@tcp from <?>
      [663981.588032] LNetError: 127396:0:(lib-move.c:1980:lnet_handle_find_routed_path()) Skipped 6171721 previous similar messages
      [663981.599239] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1548442345/real 1548442345]  req@ffff9149c37fb600 x1623580872768976/t0(0) o104->fir-MDT0000@10.10.6.33@tcp:15/16 lens 296/224 e 0 to 1 dl 1548442356 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
      [663981.626855] Lustre: 127396:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 6167666 previous similar messages
      [664132.508056] LustreError: 127396:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.10.6.33@tcp) failed to reply to blocking AST (req@ffff9149c37fb600 x1623580872768976 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 4/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 109 pid: 125204 timeout: 664264 lvb_type: 0
      [664132.550562] LustreError: 138-a: fir-MDT0000: A client on nid 10.10.6.33@tcp was evicted due to a lock blocking callback time out: rc -110
      [664132.563014] LustreError: 125084:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 10.10.6.33@tcp  ns: mdt-fir-MDT0000_UUID lock: ffff9129d5ad1f80/0xffdc3423e2d5331a lrc: 3/0,0 mode: PR/PR res: [0x200000007:0x1:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.10.6.33@tcp remote: 0x12093504df377bf7 expref: 110 pid: 125204 timeout: 0 lvb_type: 0
      

      But the main problem is that even after the client rebooted without the tcp0 NID, the server was still logging things like:

      [664150.993807] Lustre: fir-MDT0000: Connection restored to 49c3dc53-716e-1689-4b27-666ee552afb7 (at 10.10.6.33@tcp)
      

      Re: this last line, it was logged after the client had rebooted. It looks like the server only prints the first client NID, but in that case it remembered the client's old tcp0 NID from before the reboot, which is weird...
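      One way to confirm this (a sketch; the hostname and NID are the ones shown above) is to inspect the server's LNet peer table, since 2.12 keeps a Multi-Rail peer entry learned through discovery:

      [root@fir-md1-s1 ~]# lnetctl peer show --nid 10.10.6.33@tcp
      # If the old tcp NID still appears under "peer ni" for this peer, the
      # server is still treating it as an interface of that client even after
      # the client rebooted without it.
      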

      The servers are using o2ib only:

      [root@fir-md1-s1 ~]# lctl list_nids
      10.0.10.51@o2ib7
      [root@fir-md1-s1 ~]# lctl route_list
      net              o2ib4 hops 4294967295 gw                10.0.10.210@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.209@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.211@o2ib7 up pri 0
      net              o2ib4 hops 4294967295 gw                10.0.10.212@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.202@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.204@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.201@o2ib7 up pri 0
      net              o2ib6 hops 4294967295 gw                10.0.10.203@o2ib7 up pri 0
      

      We were wondering how this is even possible. The only way we found to fix this in a timely manner was to restart the Lustre servers.

      Stephane

          Activity


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hello,
            I believe I just ran into this same issue as well.
            Both clients and servers are RHEL 7.8, running 2.12.5, MOFED 4.9.

            We had a single client refusing to mount the filesystem. This client has NID:

            [root@cpu-p-10 ~]# lctl list_nids
            10.44.161.10@o2ib2
            

            However the server logged the following, similar to Stephane's case:

            Nov 02 13:29:53 rds-mds7 kernel: LNetError: 168122:0:(peer.c:2453:lnet_peer_merge_data()) Error deleting NID 10.43.161.10@tcp from peer 10.43.161.10@tcp: -16
            

            That IP address is the IP of the ethernet interface on this node (it only differs in the second octet). This error likely started when LNet was previously misconfigured, or when no config was given to lnetctl, so it fell back to the first TCP interface on the node.
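            As a sketch of that failure mode (interface names here are assumptions): with no explicit LNet configuration, loading the Lustre modules brings up tcp0 on the first ethernet interface, so the client briefly advertises a tcp NID that can later get stuck in the servers' peer tables. Pinning the intended network avoids this:

            # Either via module options in /etc/modprobe.d/lustre.conf:
            #   options lnet networks="o2ib2(ib0)"
            # or dynamically with lnetctl, then saved for reboots:
            lnetctl net add --net o2ib2 --if ib0
            lnetctl export > /etc/lnet.conf
            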

            Similar to Stephane's case, this error persisted across reboots. Fortunately, I could fix it by just manually deleting the peer entry on the server:

            # Before
            [root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp
            peer:
                - primary nid: 10.43.161.10@tcp
                  Multi-Rail: True
                  peer ni:
                    - nid: 10.44.161.10@o2ib2
                      state: NA
                    - nid: 10.43.161.10@tcp
                      state: NA
            
            [root@rds-mds7 ~]# lnetctl peer del --prim_nid 10.43.161.10@tcp                                                                                                                                                     
            [root@rds-mds7 ~]# lnetctl peer show --nid 10.43.161.10@tcp                                                                                                                                                         
            show:
                - peer:
                      errno: -2
                      descr: "cannot get peer information: No such file or directory"
            

            Now the mount works correctly.
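            In case it helps others, a small sketch (server hostnames other than rds-mds7 are placeholders) for applying the same cleanup on every server that learned the stale peer, without restarting Lustre:

            for srv in rds-mds7 rds-oss1 rds-oss2; do
                ssh "$srv" "lnetctl peer del --prim_nid 10.43.161.10@tcp"
            done
            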

            I don't think this adds anything, but just wanted to +1 this ticket. First time we've seen this, so will keep an eye out if this happens again.

            I took a look at the linked LU-11840, but the workaround described there (disabling discovery on the client) didn't fix this for me.
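            For completeness, this is the LU-11840 workaround that was tried (a sketch; as noted above, it did not help in this case):

            # Disable dynamic peer discovery on the client and check the setting:
            lnetctl set discovery 0
            lnetctl global show
            # To keep it across reboots, the setting can also live in the client's
            # /etc/lnet.conf (a "discovery: 0" entry under the global section).
            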

            Cheers,
            Matt

            ashehata Amir Shehata (Inactive) added a comment - This looks similar to this issue: https://jira.whamcloud.com/browse/LU-11840
            pjones Peter Jones added a comment -

            Sonia

            Could you please investigate?

            Peter


            People

              Assignee: sharmaso Sonia Sharma (Inactive)
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 6
