[LU-4531] frequent evictions and timeouts on routed lnet Created: 23/Jan/14 Updated: 18/Jul/14 Resolved: 18/Jul/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Manish Patel (Inactive) | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12392 |
| Description |
|
IU's dc2 filesystem is primarily served out via IB, but there is also a TCP-connected cluster routed through four LNET routers (dc2xfer[01-04]). We've been getting a lot of client evictions recently and aren't sure of the best way to troubleshoot them. Here's an example of one of the evictions that happened on Jan 19:

Jan 19 03:33:40 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection to service dc2-OST0038 via nid 10.10.0.6@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 03:33:40 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Jan 19 03:33:43 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0038; in progress operations using this service will fail.
Jan 19 03:33:43 c3 kernel: Lustre: Server dc2-OST0038_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0038-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 19 03:33:43 c3 kernel: LustreError: 40645:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887793d31200 (12440460/0/0/0) (rc: 2)
Jan 19 03:33:43 c3 kernel: LustreError: 38780:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jan 19 03:33:43 c3 kernel: Lustre: dc2-OST0038-osc-ffff8860122a1000: Connection restored to service dc2-OST0038 using nid 10.10.0.6@o2ib.
Jan 19 03:33:43 c3 kernel: LustreError: 18821:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Jan 19 04:12:18 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection to service dc2-OST0054 via nid 10.10.0.9@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 04:12:21 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
Jan 19 04:12:21 c3 kernel: LustreError: 167-0: This client was evicted by dc2-OST0054; in progress operations using this service will fail.
Jan 19 04:12:21 c3 kernel: Lustre: Server dc2-OST0054_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:521:ldlm_namespace_cleanup()) Namespace dc2-OST0054-osc-ffff8860122a1000 resource refcount nonzero (2) after lock cleanup; forcing cleanup.
Jan 19 04:12:21 c3 kernel: LustreError: 46325:0:(ldlm_resource.c:526:ldlm_namespace_cleanup()) Resource: ffff887fd34beb40 (12159828/0/0/0) (rc: 2)
Jan 19 04:12:21 c3 kernel: Lustre: dc2-OST0054-osc-ffff8860122a1000: Connection restored to service dc2-OST0054 using nid 10.10.0.9@o2ib.
Jan 19 04:12:21 c3 kernel: LustreError: 46324:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Jan 19 04:12:22 c3 kernel: LustreError: 18678:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
Jan 19 08:30:58 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786163853766 sent from dc2-OST0095-osc-ffff8860122a1000 to NID 10.10.0.15@o2ib 22s ago has timed out (22s prior to deadline).
Jan 19 08:30:58 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection to service dc2-OST0095 via nid 10.10.0.15@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 08:31:00 c3 kernel: Lustre: dc2-OST0095-osc-ffff8860122a1000: Connection restored to service dc2-OST0095 using nid 10.10.0.15@o2ib.
Jan 19 08:31:00 c3 kernel: Lustre: Server dc2-OST0095_UUID version (2.1.6.0) is much newer than client version (1.8.9)
Jan 19 15:10:06 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984600 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 22s ago has timed out (22s prior to deadline).
Jan 19 15:10:06 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection to service dc2-OST003f via nid 10.10.0.7@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Jan 19 15:10:07 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984602 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 24s ago has timed out (22s prior to deadline).
Jan 19 15:10:09 c3 kernel: LustreError: 11-0: an error occurred while communicating with 10.10.0.7@o2ib. The ost_connect operation failed with -16
Jan 19 15:10:09 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984604 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 26s ago has timed out (22s prior to deadline).
Jan 19 15:10:11 c3 kernel: Lustre: 2949:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1456786164984607 sent from dc2-OST003f-osc-ffff8860122a1000 to NID 10.10.0.7@o2ib 28s ago has timed out (22s prior to deadline).
Jan 19 15:10:15 c3 kernel: Lustre: dc2-OST003f-osc-ffff8860122a1000: Connection restored to service dc2-OST003f using nid 10.10.0.7@o2ib.
Jan 19 15:10:15 c3 kernel: Lustre: Server dc2-OST003f_UUID version (2.1.6.0) is much newer than client version (1.8.9)

Router logs:
dc2xfer01: Jan 19 02:19:47 dc2xfer01 kernel: : LustreError: 3529:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.226.207@tcp: -113
dc2xfer01: Jan 19 02:20:46 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs for peer 149.165.226.205@tcp detected; the oldest(ffff880ffee94000) timed out 3 secs ago, resid: 0, wmem: 8020736
dc2xfer01: Jan 19 02:21:12 dc2xfer01 kernel: : LustreError: 3530:0:(lib-move.c:1957:lnet_parse_get()) 149.165.235.151@tcp: Unable to send REPLY for GET from 12345-149.165.226.205@tcp: -113
dc2xfer01: Jan 19 08:02:42 dc2xfer01 kernel: : LustreError: 3536:0:(socklnd_cb.c:2520:ksocknal_check_peer_timeouts()) Total 8 stale ZC_REQs for peer 149.165.226.205@tcp detected; the oldest(ffff880fcd326000) timed out 8 secs ago, resid: 0, wmem: 7883680

I'll attach the OSS logs. It looks like there are some Bulk IO errors, but does that indicate an issue between the OSS and the router, or between the router and the client? In the past when I've seen those errors, there hasn't been a router in between to confuse the issue. |
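For reference while troubleshooting, here is a minimal set of commands that can be run on an affected client (or on a router) while the timeouts are occurring, to snapshot LNet routing and peer state. This is only a sketch: the /proc paths are the usual ones for this Lustre 1.8/2.1 generation, and the NID is just an example copied from the log above.

lctl list_nids
lctl ping 10.10.0.6@o2ib
cat /proc/sys/lnet/routes
cat /proc/sys/lnet/peers

Comparing the min columns in /proc/sys/lnet/peers before and after an eviction shows whether send or router credits are being exhausted on the routed path. |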
| Comments |
| Comment by Peter Jones [ 24/Jan/14 ] |
|
Thanks for the submission |
| Comment by Bruno Faccini (Inactive) [ 24/Jan/14 ] |
|
Hello Kit, |
| Comment by Kit Westneat (Inactive) [ 27/Jan/14 ] |
|
They are all at 1.8.x. The majority are at 1.8.9 I think, but we have seen it on older versions as well. |
| Comment by Manish Patel (Inactive) [ 05/Feb/14 ] |
|
Hi Bruno, upon further investigation on the network side we are seeing negative min values in the rtr min and tx min columns for TCP connections in the output of cat /proc/sys/lnet/peers:

nid refs state last max rtr min tx min queue

Do those negative min values mean we need to tune "peer_buffer_credits"? |
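If tuning is needed, I assume the knobs would go in the routers' modprobe configuration along these lines; the values below are placeholders only, not what is configured here:

options lnet peer_buffer_credits=128
options ksocklnd credits=1024 peer_credits=16

My understanding is that a negative min simply records that the credit count dropped below zero at some point, i.e. messages had to queue waiting for credits, but please correct me if that is wrong. |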
| Comment by Manish Patel (Inactive) [ 11/Feb/14 ] |
|
Here are the router logs, in case they are required. |
| Comment by Peter Jones [ 18/Feb/14 ] |
|
Isaac will help with this one |
| Comment by Isaac Huang (Inactive) [ 18/Feb/14 ] |
|
I've checked the router log, and there are several errors that indicate possible configuration problems:

(lib-move.c:1957:lnet_parse_get()) 10.10.0.151@o2ib: Unable to send REPLY for GET from 12345-149.165.235.17@tcp: -22

These were caused either by bad routing configurations or by an admin doing the wrong thing, e.g. pinging the router's IB NID from a TCP client.

Oct 16 13:42:42 dc2xfer03 kernel: : Lustre: Added LNI 149.165.235.153@tcp [8/256/0/180]

So LNET was started at 13:42:42, and then less than one minute later 149.165.228.160@tcp was considered dead. This often indicates that peer_timeout was configured with a value that is too small.

I think the first step in troubleshooting would be to make sure all LNET/LND options are configured properly. Please give me the module options for lnet, ksocklnd, and ko2iblnd on clients, routers, and servers. I'll double check all the configurations. |
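If it is easier, something like the following can be used to collect them on each node class; the file locations are only an assumption, adjust to wherever your site keeps its modprobe configuration:

grep -h "^options" /etc/modprobe.conf /etc/modprobe.d/*.conf 2>/dev/null | grep -E "lnet|ksocklnd|ko2iblnd"
for m in lnet ksocklnd ko2iblnd; do grep -r . /sys/module/$m/parameters/ 2>/dev/null; done

The second command shows the values actually in effect on a running node, which is useful if the files on disk and the loaded modules have drifted apart. |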
| Comment by Manish Patel (Inactive) [ 18/Feb/14 ] |
|
Hi Isaac, Thank You, here are the details about the LNET/LND options configured which you requested.

------------------
options lnet ip2nets="o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;"
options lnet routes="tcp0 10.10.0.[151-154]@o2ib0; o2ib0 149.165.235.[151-154]@tcp0; gni0 10.10.0.[51-72]@o2ib0; gni1 10.10.0.[100-101]@o2ib0;"
options lnet live_router_check_interval="60"
options lnet dead_router_check_interval="60"
options lnet check_routers_before_use="1"
options lnet auto_down="1"
options lnet avoid_asym_router_failure="1"
options lnet router_ping_timeout=50
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
------------------
options lnet networks="tcp0(eth1)"
options libcfs libcfs_panic_on_lbug=1
options lnet routes="o2ib 149.165.235.[151-154]@tcp0"
options lnet check_routers_before_use=1
options lnet router_ping_timeout=50
options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
--------------------------------------
options lnet networks="tcp(eth7),o2ib(ib0)"
--------------------------------------
options lnet networks="o2ib0(ib0), tcp0(eth7)"
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048

Let me know if you need any more details on it or anything is missing. Thank You, |
| Comment by Manish Patel (Inactive) [ 21/Feb/14 ] |
|
Hi Isaac, Thank You, here are the details about the LNET/LND options configured which you requested.

------------------
options lnet ip2nets="o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;"
options lnet routes="tcp0 10.10.0.[151-154]@o2ib0; o2ib0 149.165.235.[151-154]@tcp0; gni0 10.10.0.[51-72]@o2ib0; gni1 10.10.0.[100-101]@o2ib0;"
options lnet live_router_check_interval="60"
options lnet dead_router_check_interval="60"
options lnet check_routers_before_use="1"
options lnet auto_down="1"
options lnet avoid_asym_router_failure="1"
options lnet router_ping_timeout=50
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
------------------
options lnet networks="tcp0(eth1)"
options libcfs libcfs_panic_on_lbug=1
options lnet routes="o2ib 149.165.235.[151-154]@tcp0"
options lnet check_routers_before_use=1
options lnet router_ping_timeout=50
options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
--------------------------------------
options lnet networks="tcp(eth7),o2ib(ib0)"
--------------------------------------
options lnet networks="o2ib0(ib0), tcp0(eth7)"
options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048

I have a new set of router logs, which I have attached here. Also, IU reported that we should ignore the incidents that occurred on 2/4, as they were related to a configuration error made on dc2mds01 when bringing up its new public interface. To correct the error they modified dc2mds01's lnet options to exclude eth2, which is its new public interface.

From: options lnet ip2nets="o2ib0 10.10.*.*; tcp0(eth2) 149.165.*.*;"
To: options lnet ip2nets="o2ib0 10.10.*.*;"

Also, please ignore any LNet routing issues reported between 10/31/2013

Let me know if you need any more details on it or anything is missing. Thank You, |
| Comment by Isaac Huang (Inactive) [ 21/Feb/14 ] |
|
There are several things that are not quite right in the configurations:
|
| Comment by Isaac Huang (Inactive) [ 21/Feb/14 ] |
|
Lots of errors like:

So the routers were dropping pings from clients because they thought the clients were dead. The clients were apparently not dead, because pings had just arrived from them. Something could be wrong with the socklnd peer_timeout mechanism. To narrow it down, please disable peer timeout completely on all routers:

Also, have you ever done an LNet selftest stress test between clients and servers? |
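Concretely, the router change is just the socklnd option below (the exact modprobe file name is site-specific), and a basic lnet_selftest run between one TCP client and one IB server would look roughly like the following; the NIDs are placeholders taken from earlier in this ticket, and the test size is only a starting point:

# On each router, then reload LNet:
options ksocklnd peer_timeout=0

# From any node, with the lnet_selftest module loaded on every node involved:
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session rw_stress
lst add_group clients 149.165.226.205@tcp
lst add_group servers 10.10.0.6@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
lst stat clients servers & sleep 60; kill $!
lst end_session

Running that for a while and watching whether the routers log the same ZC_REQ / dropped-ping errors would help separate a load problem from a bug in the timeout mechanism. |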
| Comment by Manish Patel (Inactive) [ 24/Feb/14 ] |
|
Hi Isaac, I have checked with IU about the nodes dc2xfer05 to dc2xfer08, and they responded that they initially intended to use them as router nodes, but they were never configured or used as routers; they are normal Lustre client nodes, which is why "peer_timeout" is disabled on them. Here are my questions: 1. Do you still suggest that we disable "peer_timeout" on the router nodes too? Thank You, |
| Comment by Isaac Huang (Inactive) [ 25/Feb/14 ] |
|
The only nodes where peer_timeout is supposed to work are routers, so it should be disabled everywhere except on routers. It was a common mistake to enable it on clients and servers, so some time ago we added a patch that simply ignores the peer_timeout setting and always disables it except on routers. But I can't remember which Lustre version that patch landed in, so it's a good idea to always explicitly disable peer_timeout on clients and servers.

peer_timeout exists on routers to handle certain error scenarios, and normally it should always be ON there. I'm suggesting disabling it on the routers here because I suspect there is a bug in the peer_timeout mechanism that could cause routers to treat good clients/servers as dead and hence drop messages unnecessarily. Disabling peer_timeout on the routers would let us narrow that down. This is for troubleshooting only.

Whether the setup is routed or not, peer_timeout should always be off on both clients and servers. |
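In modprobe terms, the steady-state recommendation amounts to the following on every client and server, for whichever of the two LNDs a node actually loads:

options ksocklnd peer_timeout=0
options ko2iblnd peer_timeout=0

Routers are the exception: they would normally keep the default, non-zero peer_timeout, and zeroing it there is only for this troubleshooting exercise. |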
| Comment by Christopher Morrone [ 21/May/14 ] |
|
peer_timeout was enabled by default in the socklnd and o2iblnd back in 2009. If that should not have been set, that was a bug in lustre/lnet, not a "common mistake" in configuration.
Could you please track that down? It would be very helpful to note which commit addressed this. |
| Comment by Christopher Morrone [ 21/May/14 ] |
|
I did some legwork and it looks to me like the peer_timeouts-should-only-be-used-on-router-nodes bug was addressed in the following commit:

commit 00243fccc1977e4dee8041f4c0f9854373598dc2
Author: Isaac Huang <he.huang@intel.com>
AuthorDate: Tue Mar 19 13:20:53 2013 -0600
Commit: Oleg Drokin <oleg.drokin@intel.com>
CommitDate: Fri Apr 12 21:31:56 2013 -0400
LU-2133 lnet: wrong peer state reported
The commit description should have made this change far clearer. The state of the code surrounding peer_timeout is pretty subtle and not adequately documented either; we have some technical debt here. But it looks to me like Lustre 2.4 and later do not require the work-around of manually setting peer_timeout=0 on non-router nodes. Do you agree? |
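For anyone who wants to verify which releases actually contain that commit, and therefore should not need the manual peer_timeout=0 workaround, the release tags carrying it can be listed from a Lustre source checkout:

git tag --contains 00243fccc1977e4dee8041f4c0f9854373598dc2

Any tag listed by that command should carry the fix. |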
| Comment by Chris Horn [ 28/May/14 ] |
Hi Isaac, This advice for the dead/live router check interval is not consistent with what is written in the Lustre OPs manual. The OPs manual states that the maximum corresponding value on clients and servers should be used:
Can you please clarify which is correct? I suspect your version is correct, in which case we should open a ticket to fix the OPs manual. Note, also, that the OPs manual mentions setting router_ping_timeout on the router, but you didn't mention it in your earlier comment. Is it necessary to set the router_ping_timeout on a router? |
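For reference, these are the client/server settings in question as they appear in the configs posted earlier in this ticket; the open point is whether router_ping_timeout, and which of the check intervals, also belong on the routers themselves:

options lnet router_ping_timeout=50
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
options lnet check_routers_before_use=1
options lnet avoid_asym_router_failure=1

These values are simply copied from the configurations above, not a recommendation. |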
| Comment by Christopher Morrone [ 12/Jun/14 ] |
We could use an answer to this question. |
| Comment by Isaac Huang (Inactive) [ 13/Jun/14 ] |
|
The bug was |
| Comment by Isaac Huang (Inactive) [ 13/Jun/14 ] |
|
The OPs manual was correct; my previous comment was wrong. |
| Comment by Peter Jones [ 18/Jul/14 ] |
|
As per DDN, OK to close. |