[LU-4531] frequent evictions and timeouts on routed lnet Created: 23/Jan/14  Updated: 18/Jul/14  Resolved: 18/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Manish Patel (Inactive) Assignee: Isaac Huang (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File 2014-01-23-SR29545-jan19.errs     File dc2_lnet_router.log.gz     File lnet_router_logs_02_21_2014.gz    
Severity: 3
Rank (Obsolete): 12392

 Description   

IU's dc2 filesystem is primarily served out via IB, but there is also a TCP-connected cluster routed through four LNET routers (dc2xfer[01-04]).

We've been getting a lot of client evictions recently and aren't sure of the best way to troubleshoot them.

Here's an example of one of the evictions that happened on Jan 19:
Client [149.165.226.203@tcp] logs from Jan 19:

Jan 19 03:33:40 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection to service dc2-OST0038 via nid 10.10.0.6@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 03:33:40 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Jan 19 03:33:43 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0038; in progress operations using this service will fail.
Jan 19 03:33:43 c3 kernel: Lustre: Server dc2-OST0038_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0038-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887793d31200 (12440460/0/0/0) (rc: 2)
Jan 19 03:33:43 c3 kernel: LustreError: 38780:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jan 19 03:33:43 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection restored to service dc2-OST0038 using nid 10.10.0.6@o2ib.
Jan 19 03:33:43 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Jan 19 04:12:18 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection to service dc2-OST0054 via nid 10.10.0.9@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 04:12:21 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Jan 19 04:12:21 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0054; in progress operations using this service will fail.
Jan 19 04:12:21 c3 kernel: Lustre: Server dc2-OST0054_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0054-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887fd34beb40 (12159828/0/0/0) (rc: 2)
Jan 19 04:12:21 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection restored to service dc2-OST0054 using nid 10.10.0.9@o2ib.
Jan 19 04:12:21 c3 kernel: LustreError: 46324:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jan 19 04:12:22 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Jan 19 08:30:58 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786163853766 sent from dc2-OST0095-osc-ffff8860122a1000 to NID 10.10.0.15@o2ib 22s ago has timed out (22s prior to deadline).
Jan 19 08:30:58 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection to service dc2-OST0095 via nid 10.10.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 08:31:00 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection restored to service dc2-OST0095 using nid 10.10.0.15@o2ib.
Jan 19 08:31:00 c3 kernel: Lustre: Server dc2-OST0095_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 15:10:06 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984600 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 22s ago has timed out (22s prior to deadline).
Jan 19 15:10:06 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection to service dc2-OST003f via nid 10.10.0.7@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 15:10:07 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984602 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 24s ago has timed out (22s prior to deadline).
Jan 19 15:10:09 c3 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.7@o2ib. The ost_connect operation failed with -16
Jan 19 15:10:09 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984604 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 26s ago has timed out (22s prior to deadline).
Jan 19 15:10:11 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984607 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 28s ago has timed out (22s prior to deadline).
Jan 19 15:10:15 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection restored to service dc2-OST003f using nid 10.10.0.7@o2ib.
Jan 19 15:10:15 c3 kernel: Lustre: Server dc2-OST003f_UUID version (2.1.6.0) is much newer than client version (1.8.9)

Router logs

dc2xfer01: Jan 19 02:19:47 dc2xfer01 kernel: : LustreError: 3529:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.226.207@tcp: -113
dc2xfer01: Jan 19 02:20:46 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs for peer 149.165.226.205@tcp detected; the oldest(ffff880ffee94000) timed out 3 secs ago, resid: 0, wmem: 8020736
dc2xfer01: Jan 19 02:21:12 dc2xfer01 kernel: : LustreError: 3530:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.226.205@tcp: -113
dc2xfer01: Jan 19 08:02:42 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs for peer 149.165.226.205@tcp detected; the oldest(ffff880fcd326000) timed out 8 secs ago, resid: 0, wmem: 7883680

I'll attach the OSS logs. It looks like there are some bulk IO errors, but do those indicate an issue between the OSS and the router, or between the router and the client? In the past when I've seen those errors, there hasn't been a router in between to confuse the issue.
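
A rough way to localize which hop is misbehaving (a sketch only, using NIDs from this ticket as placeholders) is to ping each segment with lctl and watch the per-peer credit counters on the routers:

# From an OSS, ping a router's IB NID; from a router, ping a client's TCP NID:
lctl ping 10.10.0.151@o2ib
lctl ping 149.165.226.203@tcp
# On the routers, look for negative 'min' credit counts and large queued byte counts per peer:
cat /proc/sys/lnet/peers
cat /proc/sys/lnet/nis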



 Comments   
Comment by Peter Jones [ 24/Jan/14 ]

Thanks for the submission

Comment by Bruno Faccini (Inactive) [ 24/Jan/14 ]

Hello Kit,
Are all the impacted clients running 1.8.9?

Comment by Kit Westneat (Inactive) [ 27/Jan/14 ]

They are all at 1.8.x. The majority are at 1.8.9 I think, but we have seen it on older versions as well.

Comment by Manish Patel (Inactive) [ 05/Feb/14 ]

Hi Bruno,

Upon further investigation on the network side, we are seeing negative min values in the rtr min and tx min columns for TCP connections.

Here is the output from:

cat /proc/sys/lnet/peers

nid                refs  state  last  max  rtr  min  tx  min     queue
10.10.10.154@o2ib  3     up     -1    8    8    8    8   -176    0
10.10.10.156@o2ib  1     NA     -1    8    8    8    8   -1742   0
10.10.10.157@o2ib  1     NA     -1    8    8    8    8   -1610   0
10.10.10.158@o2ib  1     NA     -1    8    8    8    8   -32916  0
10.10.10.162@tcp   3     up     173   8    8    -2   6   -111    1241232

cat /proc/sys/lnet/peers
nid               refs  state  last  max  rtr  min  tx  min     queue
172.16.32.14@tcp  1     up     25    8    8    -1   8   -7      0
172.16.32.15@tcp  1     up     148   8    8    -2   8   -110    0
172.16.32.16@tcp  4     up     73    8    8    -2   5   -119    2101464
172.16.32.17@tcp  1     up     5     8    8    -2   8   -110    0
172.16.32.18@tcp  1     up     43    8    8    -2   8   -116    0
172.16.32.19@tcp  3     up     173   8    8    -2   6   -111    1241232

Do those negative min values mean we need to tune "peer_buffer_credits"?
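
For context: a negative min in the rtr column means that peer's router buffer credits went negative at some point, and a negative tx min means its send credits were exhausted. A hedged sketch of where the related tunables would live (values are purely illustrative, not recommendations from this ticket):

# On the router nodes: router buffer pools and per-peer router buffer credits
options lnet tiny_router_buffers=2048 small_router_buffers=16384 large_router_buffers=1024
options ksocklnd peer_buffer_credits=32
options ko2iblnd peer_buffer_credits=32
# Per-peer and global send credits for the LNDs
options ksocklnd peer_credits=8 credits=1024
options ko2iblnd peer_credits=8 credits=1024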

Comment by Manish Patel (Inactive) [ 11/Feb/14 ]

Here are the router logs, in case they are required.

Comment by Peter Jones [ 18/Feb/14 ]

Isaac will help with this one

Comment by Isaac Huang (Inactive) [ 18/Feb/14 ]

I've checked the router log, and there are several errors that indicate possible configuration problems:

(lib-move.c:1957:lnet_parse_get()) 10.10.0.151@o2ib: Unable to send REPLY for GET from 12345-149.165.235.17@tcp: -22
No route to 149.165.235.17@tcp via from 10.10.0.151@o2ib
(lib-move.c:2272:lnet_parse()) 149.165.235.17@tcp, src 149.165.235.17@tcp: Bad dest nid 10.10.0.151@o2ib (it's my nid but on a different network)

These were caused either by bad routing configurations or by an admin doing the wrong thing, e.g. pinging the router's IB NID from a TCP client.

Oct 16 13:42:42 dc2xfer03 kernel: : Lustre: Added LNI 149.165.235.153@tcp [8/256/0/180]
Oct 16 13:43:39 dc2xfer03 kernel: : LustreError: 2703:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.153@tcp: Unable to send REPLY for GET from 12345-149.165.228.160@tcp: -113

So LNET was started at 13:42:42, and less than a minute later 149.165.228.160@tcp was already considered dead. This often indicates that peer_timeout was configured with too small a value.

I think the first troubleshooting step would be to make sure all LNET/LND options are configured properly. Please give me the module options for lnet, ksocklnd, and ko2iblnd on clients, routers, and servers, and I'll double-check all the configurations.
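
One way to gather these (a sketch; the modprobe conf file name below is only an example and varies by site) is to dump both the on-disk options and the parameters of the currently loaded modules:

cat /etc/modprobe.d/lustre.conf        # or wherever the lnet/LND options are kept on this site
for m in lnet ksocklnd ko2iblnd; do
    grep -H . /sys/module/$m/parameters/* 2>/dev/null
done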

Comment by Manish Patel (Inactive) [ 18/Feb/14 ]

Hi Isaac,

Thank you. Here are the details of the LNET/LND options you requested.

------------------
Lustre Servers
------------------

options lnet ip2nets="o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;"
options lnet routes="tcp0 10.10.0.[151-154]@o2ib0; o2ib0 149.165.235.[151-154]@tcp0; gni0 10.10.0.[51-72]@o2ib0; gni1 10.10.0.[100-101]@o2ib0;"
options lnet live_router_check_interval="60"
options lnet dead_router_check_interval="60"
options lnet check_routers_before_use="1"
options lnet auto_down="1"
options lnet avoid_asym_router_failure="1"
options lnet router_ping_timeout=50

options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048

------------------
Lustre Clients
------------------

options lnet networks="tcp0(eth1)"
options libcfs libcfs_panic_on_lbug=1
options lnet routes="o2ib 149.165.235.[151-154]@tcp0"
options lnet check_routers_before_use=1
options lnet router_ping_timeout=50
options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60

--------------------------------------
Router Nodes 'dc2xfer01 to dc2xfer04'
--------------------------------------

options lnet networks="tcp(eth7),o2ib(ib0)"

--------------------------------------
Router Nodes 'dc2xfer05 to dc2xfer08'
--------------------------------------

options lnet networks="o2ib0(ib0), tcp0(eth7)"
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048

Let me know if you need any more details or if anything is missing.

Thank You,
Manish

Comment by Manish Patel (Inactive) [ 21/Feb/14 ]

Hi Isaac,

Thank you. Here are the details of the LNET/LND options you requested.

------------------
Lustre Servers
------------------

options lnet ip2nets="o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;"
options lnet routes="tcp0 10.10.0.[151-154]@o2ib0; o2ib0 149.165.235.[151-154]@tcp0; gni0 10.10.0.[51-72]@o2ib0; gni1 10.10.0.[100-101]@o2ib0;"
options lnet live_router_check_interval="60"
options lnet dead_router_check_interval="60"
options lnet check_routers_before_use="1"
options lnet auto_down="1"
options lnet avoid_asym_router_failure="1"
options lnet router_ping_timeout=50

options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048

------------------
Lustre Clients
------------------

options lnet networks="tcp0(eth1)"
options libcfs libcfs_panic_on_lbug=1
options lnet routes="o2ib 149.165.235.[151-154]@tcp0"
options lnet check_routers_before_use=1
options lnet router_ping_timeout=50
options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60

--------------------------------------
Router Nodes 'dc2xfer01 to dc2xfer04'
--------------------------------------

options lnet networks="tcp(eth7),o2ib(ib0)"

--------------------------------------
Router Nodes 'dc2xfer05 to dc2xfer08'
--------------------------------------

options lnet networks="o2ib0(ib0), tcp0(eth7)"
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048

I have attached a new set of router logs. Also, IU reported that the incidents which occurred on 2/4 should be ignored, as they were related to a configuration error made on dc2mds01 when bringing up its new public interface. To correct the error, they modified dc2mds01's lnet options to exclude eth2, its new public interface.

From:

options lnet ip2nets="o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;"

To:

options lnet ip2nets="o2ib0 10.10.*.*;"

Also, please ignore any LNet routing issues reported between 10/31/2013 and 11/1/2013, as they too are related to a configuration error made on their side.

Let me know if you need any more details or if anything is missing.

Thank You,
Manish

Comment by Isaac Huang (Inactive) [ 21/Feb/14 ]

Several things not quite right in the configurations:

  • For avoid_asym_router_failure to work properly, routers need to know how often clients/servers are supposed to ping them. So on all routers please add:
    options lnet dead_router_check_interval=60
    options lnet live_router_check_interval=60
    The value for the check interval options on routers should be the smallest of the corresponding values on clients and servers, e.g. dead_router_check_interval(router) = min(dead_router_check_interval(clients), dead_router_check_interval(servers)).
  • The peer_timeout should ALWAYS be disabled on clients and servers. So on all clients and servers, please add:
    options ko2iblnd peer_timeout=0
    options ksocklnd peer_timeout=0
  • On router nodes 'dc2xfer05 to dc2xfer08', ko2iblnd peer_timeout=0 is set. Why was it disabled on these routers but not on 'dc2xfer01 to dc2xfer04'? And why disable it only for IB and not for TCP? It doesn't make much sense to me. (A consolidated sketch of these suggested settings follows below.)
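
Taken together, a sketch of the additions being suggested above (illustrative only; merge into each site's existing modprobe configuration):

# On all routers (dc2xfer01 to dc2xfer04):
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
# On all clients and servers:
options ko2iblnd peer_timeout=0
options ksocklnd peer_timeout=0
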
Comment by Isaac Huang (Inactive) [ 21/Feb/14 ]

Lots of errors like:
lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.229.156@tcp: -113

So the routers were dropping pings from the clients because they thought the clients were dead. The clients were apparently not dead, because pings had just arrived from them. Something could be wrong with the socklnd peer_timeout mechanism. To narrow it down, please disable peer timeout completely on all routers:
options ko2iblnd peer_timeout=0
options ksocklnd peer_timeout=0

Also, have you ever done a LNet selftest stress test between clients and servers?
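
For reference, a minimal lnet_selftest run between one client and one server could look like the sketch below (the NIDs are taken from this ticket purely as placeholders):

modprobe lnet_selftest
export LST_SESSION=$$
lst new_session rw_stress
lst add_group clients 149.165.226.203@tcp
lst add_group servers 10.10.0.6@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
lst stat clients servers & sleep 60; kill $!    # sample throughput and error counters for a minute
lst stop bulk_rw
lst end_session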

Comment by Manish Patel (Inactive) [ 24/Feb/14 ]

Hi Isaac,

I have checked with IU about the nodes 'dc2xfer05 to dc2xfer08', and they responded that they initially intended to use them as router nodes, but they were never configured or used as routers and are normal Lustre client nodes; that's why "peer_timeout" is disabled there. I have a few questions:

1. Do you still suggest that we disable "peer_timeout" on the router nodes too?
2. What is the recommended setting when clients need to mount both routed and non-routed Lustre file systems? In that scenario, what should the "peer_timeout" settings be?

Thank You,
Manish

Comment by Isaac Huang (Inactive) [ 25/Feb/14 ]

The only nodes where peer_timeout is supposed to work are routers. So it should be disabled everywhere except on routers. It was a common mistake to enable it on clients and servers, so some time ago we added a patch that'd simply ignore peer_timeout setting and always disable it except on routers. But I can't remember what Lustre version that patch was landed on. So it's a good idea to always explicitly disable peer_timeout on clients and servers.

The peer_timeout mechanism works on routers for certain error scenarios, and normally it should always be ON there. Here I'm suggesting disabling it on the routers because I suspect a bug in the peer_timeout mechanism that could cause routers to treat healthy clients/servers as dead and hence drop messages unnecessarily. Disabling peer_timeout on the routers would let us narrow that down. This is for troubleshooting only.

Whether it's routed or not, peer_timeout should be always off on both clients and servers.

Comment by Christopher Morrone [ 21/May/14 ]

peer_timeout was enabled by default in the socklnd and o2iblnd back in 2009. If that should not have been set, that was a bug in lustre/lnet, not a "common mistake" in configuration.

so some time ago we added a patch that'd simply ignore peer_timeout setting and always disable it except on routers. But I can't remember what Lustre version that patch was landed on.

Could you please track that down? It would be very helpful to note which commit addressed this.

Comment by Christopher Morrone [ 21/May/14 ]

I did some legwork and it looks to me like the peer_timeouts-should-only-be-used-on-router-nodes bug was addressed in the following commit:

commit 00243fccc1977e4dee8041f4c0f9854373598dc2
Author:     Isaac Huang <he.huang@intel.com>
AuthorDate: Tue Mar 19 13:20:53 2013 -0600
Commit:     Oleg Drokin <oleg.drokin@intel.com>
CommitDate: Fri Apr 12 21:31:56 2013 -0400

    LU-2133 lnet: wrong peer state reported

The commit description should have made this change far more clear. The state of the code surrounding peer_timeout is pretty subtle and not adequately documented either. We have some technical debt here.

But it looks to me like Lustre 2.4 and later do not require the work-around of manually setting peer_timeout=0 on non-router nodes. Do you agree?

Comment by Chris Horn [ 28/May/14 ]

Isaac Huang added a comment - 21/Feb/14 6:16 PM
The value for the check interval options on routers should be the smallest of the corresponding values on clients and servers, e.g. dead_router_check_interval(router) = min(dead_router_check_interval(clients), dead_router_check_interval(servers)).

Hi Isaac,

This advice for the dead/live router check interval is not consistent with what is written in the Lustre OPs manual. The OPs manual states that the maximum corresponding value on clients and servers should be used:

The following router checker parameters must be set to the maximum value of the corresponding setting for this option on any client or server:

dead_router_check_interval

live_router_check_interval

router_ping_timeout

Can you please clarify which is correct? I suspect your version is correct, in which case we should open a ticket to fix the OPs manual.

Note, also, that the OPs manual mentions setting router_ping_timeout on the router, but you didn't mention it in your earlier comment. Is it necessary to set the router_ping_timeout on a router?

Comment by Christopher Morrone [ 12/Jun/14 ]

But it looks to me like Lustre 2.4 and later do not require the work-around of manually setting peer_timeout=0 on non-router nodes. Do you agree?

We could use an answer to this question.

Comment by Isaac Huang (Inactive) [ 13/Jun/14 ]

The bug was LU-630, fixed by http://review.whamcloud.com/#/c/2646/, which was then backported to b2_1 (http://review.whamcloud.com/#/c/1797/) and landed for 2.1.2, and for 2.3 and later.

Comment by Isaac Huang (Inactive) [ 13/Jun/14 ]

The Ops manual was correct; my previous comment was wrong.

Comment by Peter Jones [ 18/Jul/14 ]

As per DDN ok to close
