[LU-5758] enabling avoid_asym_router_failure prevents the bring-up of ORNL production systems Created: 16/Oct/14 Updated: 15/Jul/15 Resolved: 15/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3, Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Any 2.4/2.5 clients running against 2.4.3 or 2.5.3 servers. |
||
| Issue Links: |
|
||||||||||||
| Epic/Theme: | ORNL | ||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 16157 | ||||||||||||
| Description |
|
With the center-wide deployment of Lustre 2.5 at ORNL, we encountered problems bringing up the production system due to avoid_asym_router_failure |
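For reference, avoid_asym_router_failure is an LNet module parameter controlling the router checker's handling of asymmetric router failures. A minimal sketch of how such a setting is typically enabled (the file name and values are illustrative assumptions, not the ORNL configuration):

    # /etc/modprobe.d/lustre.conf - illustrative sketch, not the site configuration
    # enable detection of routers whose far-side NI has failed
    options lnet avoid_asym_router_failure=1

Settings in the modprobe configuration normally take effect the next time the lnet module is loaded.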
| Comments |
| Comment by James A Simmons [ 16/Oct/14 ] |
|
This appears to be related to |
| Comment by Peter Jones [ 16/Oct/14 ] |
|
Isaac/Liang Do you agree that this seems related to the work in Peter |
| Comment by Isaac Huang (Inactive) [ 16/Oct/14 ] |
|
I'd suspect yes, but with no debugging information I can't tell for sure. It'd be sufficient to have Lustre debug logs and /proc/sys/lnet/routers on a couple of servers (including the MDS/MGS) and some clients once the problem has happened. |
| Comment by James A Simmons [ 16/Oct/14 ] |
|
Don't worry I can easily get the data you want. I can reproduce at small scale. Tell me what data points you want and I can break the test system. |
| Comment by Isaac Huang (Inactive) [ 16/Oct/14 ] |
|
lctl dk dumps and /proc/sys/lnet/routers on a couple of servers (including the MDS/MGS) and some clients once the problem has happened, and any error messages showing up in dmesg. |
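A minimal sketch of collecting the data requested above (commands as referenced in this ticket; output file names are arbitrary):

    # dump the Lustre kernel debug log to a file
    lctl dk /tmp/lustre-debug.$(hostname).log
    # capture LNet router state as seen by this node
    cat /proc/sys/lnet/routers > /tmp/lnet-routers.$(hostname).txt
    # capture kernel messages for any LNet/LND errors
    dmesg > /tmp/dmesg.$(hostname).txt

Run on a couple of servers (including the MDS/MGS) and some clients after the problem has reproduced.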
| Comment by Andreas Dilger [ 02/Dec/14 ] |
|
Will the patches in |
| Comment by James A Simmons [ 02/Dec/14 ] |
|
Yes the patches for |
| Comment by James A Simmons [ 15/Dec/14 ] |
|
Today we tested the patch from |
| Comment by Liang Zhen (Inactive) [ 16/Dec/14 ] |
|
Hi James, could you please post the modprobe conf from these nodes, and also check this on the router with the unplugged cable (a few minutes after unplugging the cable):
And check these on a node on the functional side of that router:
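(A guess at the checks being requested, based on the files quoted in the follow-up comments; treat this as a sketch rather than Liang's exact list:)

    # on the router with the unplugged cable, and on a node on its functional side
    cat /proc/sys/lnet/nis        # status of this node's local NIs
    cat /proc/sys/lnet/routers    # known routers, including their down_ni counts
    cat /proc/sys/lnet/routes     # per-network route state (up/down)
    # plus the modprobe configuration in use, e.g.
    cat /etc/modprobe.d/*.conf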
|
| Comment by James A Simmons [ 30/Dec/14 ] |
|
Unplugged the router cable on the side facing the client. Client-facing interface: 10.38.144.12
jsimmons@client:/proc/sys/lnet$ cat nis
jsimmons@client:/proc/sys/lnet$ cat routers
jsimmons@client:/proc/sys/lnet$ cat routes
jsimmons@server:/proc/sys/lnet$ cat routers
jsimmons@server:/proc/sys/lnet$ cat routes |
| Comment by James A Simmons [ 30/Dec/14 ] |
|
This time we unplugged the IB interface facing the server:
jsimmons@client:/proc/sys/lnet$ cat nis
jsimmons@client:/proc/sys/lnet$ cat routers
jsimmons@client:/proc/sys/lnet$ cat routes
jsimmons@server:/proc/sys/lnet$ cat routers
jsimmons@server:/proc/sys/lnet$ cat routes |
| Comment by Liang Zhen (Inactive) [ 31/Dec/14 ] |
|
Hi James, thanks for the information, but the "lnet/nis" I asked for was from the router, not the client |
| Comment by James A Simmons [ 31/Dec/14 ] |
|
We ran with the |
| Comment by Liang Zhen (Inactive) [ 31/Dec/14 ] |
|
Looks like when unplugging the IB interface facing the client, the router status (on the server) is correct:

jsimmons@server:/proc/sys/lnet$ cat routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 5 up 4 1 NA 1 10.36.145.12@o2ib <<<<<<<<<<< it has 1 down_ni, which is correct
jsimmons@server:/proc/sys/lnet$ cat routes
Routing disabled
net hops priority state router
o2ib4 1 0 down 10.36.145.12@o2ib <<<<<<<<< status is down, which is correct

So these outputs look correct to me. However, when unplugging the IB interface facing the server, the router status (on the client) is incorrect:

jsimmons@client:/proc/sys/lnet$ cat routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 4 up 33 1 NA 0 10.38.144.12@o2ib4 <<<<<<< down_ni is 0, which is wrong
4 1 2 up 19 1 NA 0 10.38.144.13@o2ib4
jsimmons@client:/proc/sys/lnet$ cat routes
Routing disabled
net hops priority state router
o2ib 1 0 up 10.38.144.12@o2ib4 <<<<<<<<<<< status is up, which is wrong
o2ib 1 0 up 10.38.144.13@o2ib4

This is incorrect, and it is also very strange to me, because there is no server/client difference from LNet's point of view. I actually ran the same test in our lab and the router status was correct on both client and server.
|
| Comment by James A Simmons [ 31/Dec/14 ] |
|
Yep both clients and servers are running with a properly patched version ( |
| Comment by Liang Zhen (Inactive) [ 04/Jan/15 ] |
|
Hi James, because it normally takes about 2 minutes to mark a NI as "DOWN" (depending on router_ping_timeout and dead/live_router_check_interval), may I ask how long you waited after unplugging the interface, and what the values of those module parameters are? I checked your logs:

Facing the client, the last alive news is 42 seconds ago, which does not seem long enough:
nid status alive refs peer rtr max tx min
10.38.144.12@o2ib4 up 42 1 63 0 640 640 640

Facing the server, the last alive news is only 16 seconds ago:
nid status alive refs peer rtr max tx min
10.36.145.12@o2ib up 16 3 63 0 640 640 634

Because in the previous logs we saw inconsistent results between server and client (the server saw the correct NI status on the router, but the client didn't), I suspect this test should wait a little longer. |
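(For illustration: these router-checker settings are LNet module parameters, normally set in the modprobe configuration; the values below are hypothetical, not the ORNL settings:)

    # /etc/modprobe.d/lustre.conf - hypothetical values for illustration only
    # router_ping_timeout: seconds to wait for a reply to a router health ping
    # live/dead_router_check_interval: how often live/dead routers are pinged
    options lnet router_ping_timeout=50 live_router_check_interval=60 dead_router_check_interval=60

With settings like these, a failed NI would typically not be reported as down until roughly a check interval plus the ping timeout has elapsed, which is why the "alive" ages above (42 s and 16 s) look too short.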
| Comment by James A Simmons [ 04/Jan/15 ] |
|
If I remember right, for this set of tests we waited for 5 minutes, then another 5 minutes after that, and it remained in the same state. For the next round of tests I will ensure that the logs are taken after the timer expires. |
| Comment by Jian Yu [ 08/Jan/15 ] |
Here is the patch created by Liang on Lustre b2_5 branch: http://review.whamcloud.com/12435 |
| Comment by James A Simmons [ 08/Jan/15 ] |
|
The tests were done with 12435. It helped in that we can now mount Lustre with ARF enabled, but now ARF itself doesn't work. |
| Comment by Jian Yu [ 08/Jan/15 ] |
|
I see, thank you, James, for the clarification. Hi Liang, |
| Comment by Liang Zhen (Inactive) [ 09/Jan/15 ] |
|
Yujian, yes, we need more logs to find out the reason. James, in the next round of tests, could you please sample lnet/nis on the router every 10 seconds, recording it for a total of 300 seconds? Also, it would be helpful to let us know these values on the router: live_router_check_interval, live_router_check_interval, router_ping_timeout. And if possible, could you post your lnet source here so I can check all patches? Thanks in advance. |
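A simple way to take the requested samples (a sketch; the output path is arbitrary):

    # on the router: sample lnet/nis every 10 seconds for 300 seconds (30 samples)
    for i in $(seq 1 30); do
        date
        cat /proc/sys/lnet/nis
        sleep 10
    done > /tmp/lnet-nis-samples.log 2>&1

The module parameter values can be read back from /sys/module/lnet/parameters/ if they are exposed there, or taken from the modprobe configuration.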
| Comment by James A Simmons [ 13/Jan/15 ] |
|
Liang, I'm setting up the test. I should have results soon. I think you want something else besides the two identical live_router_check_intervals above |
| Comment by Liang Zhen (Inactive) [ 14/Jan/15 ] |
|
Hi James, because there has been a lot of discussion already, I think we should summarise where we are now; please check and correct me if I'm wrong or missed something:
I also have another question:
Isaac, do you have any advice or insight on this issue? |
| Comment by Jian Yu [ 05/Feb/15 ] |
|
Hi Liang,
Now, they have a question about ARF:
Hi James, |
| Comment by Liang Zhen (Inactive) [ 12/Feb/15 ] |
|
It's OK to mix them; in that case, the client/server just can't detect a failed remote NI on the routers. |
| Comment by Jian Yu [ 12/Mar/15 ] |
|
Hi James, |
| Comment by James A Simmons [ 15/Jul/15 ] |
|
No problems. We have been running ARF for months now. You can close this ticket. |
| Comment by Peter Jones [ 15/Jul/15 ] |
|
Great - thanks James |