[LU-5758] enabling avoid_asym_router_failure prevents the bring-up of ORNL production systems Created: 16/Oct/14  Updated: 15/Jul/15  Resolved: 15/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3, Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: soak
Environment:

Any 2.4/2.5 clients running against 2.4.3 or 2.5.3 servers.


Issue Links:
Related
is related to LU-6060 ARF doesn't detect lack of interface ... Resolved
is related to LU-5485 first mount always fail with avoid_as... Resolved
Epic/Theme: ORNL
Severity: 3
Rank (Obsolete): 16157

 Description   

With the deployment of Lustre 2.5 center-wide at ORNL, we encountered problems bringing up the production system due to avoid_asym_router_failure
being enabled by default. The LNET fabric would fail to come up when it was enabled; once it was turned off by default, everything returned to normal. This would be a useful feature to have for a system of this scale. The problem can be easily reproduced at smaller scale in both non-FGR and FGR setups.
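
For reference, the feature is controlled by an LNet module parameter, so as a workaround it can be turned off through the module options, e.g. with a line like the following in the site's modprobe configuration (the exact conf file name is site-specific):

options lnet avoid_asym_router_failure=0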



 Comments   
Comment by James A Simmons [ 16/Oct/14 ]

This appears to be related to LU-5485. Please create a link to this ticket. Thank you.

Comment by Peter Jones [ 16/Oct/14 ]

Isaac/Liang

Do you agree that this seems related to the work in LU-5485?

Peter

Comment by Isaac Huang (Inactive) [ 16/Oct/14 ]

I'd suspect yes, but with no debugging information I can't tell for sure. It'd be sufficient to have Lustre debug logs and /proc/sys/lnet/routers on a couple of servers (including the MDS/MGS) and some clients once the problem has happened.

Comment by James A Simmons [ 16/Oct/14 ]

Don't worry, I can easily get the data you want. I can reproduce it at small scale. Tell me what data points you want and I can break the test system.

Comment by Isaac Huang (Inactive) [ 16/Oct/14 ]

lctl dk dumps and /proc/sys/lnet/routers on a couple of servers (including the MDS/MGS) and some clients once the problem has happened, and any error messages showing up in dmesg.
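
For example, something along these lines on each node of interest once the problem has happened (the output file names below are just placeholders):

lctl dk /tmp/lustre-debug-$(hostname).log
cat /proc/sys/lnet/routers > /tmp/lnet-routers-$(hostname).txt
dmesg > /tmp/dmesg-$(hostname).txt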

Comment by Andreas Dilger [ 02/Dec/14 ]

Will the patches in LU-5485 fix this problem as well? This is on the 2.7.0 fix list, and we need to know whether someone needs to be working on this or whether those patches are enough.

Comment by James A Simmons [ 02/Dec/14 ]

Yes, the patches for LU-5485 do resolve this in my small-scale test. I haven't tested at full scale yet and will not be able to until after the start of January.

Comment by James A Simmons [ 15/Dec/14 ]

Today we tested the patch from LU-5485 with a 500-node cluster. The good news is that enabling avoid_asym_router_failure did not prevent the bring-up of the file system. The bad news is that ARF failed to work. We unplugged one side of the router and waited; on the functional side the router was still marked up, which it shouldn't have been.

Comment by Liang Zhen (Inactive) [ 16/Dec/14 ]

Hi James, could you please post the modprobe conf from these nodes, and also check this on the router with the unplugged cable (a few minutes after you unplug the cable):

  • cat /proc/sys/lnet/nis

And check these on a node on the functional side of that router:

  • cat /proc/sys/lnet/routes
  • cat /proc/sys/lnet/routers
Comment by James A Simmons [ 30/Dec/14 ]

Unplugged router side facing client.

client facing: 10.38.144.12
server facing: 10.36.145.12

jsimmons@client:/proc/sys/lnet$ cat nis
nid status alive refs peer rtr max tx min
0@lo up 0 2 0 0 0 0 0
0@lo up 0 0 0 0 0 0 0
0@lo up 0 0 0 0 0 0 0
0@lo up 0 0 0 0 0 0 0
10.38.146.45@o2ib4 up -1 1 63 0 640 640 640
10.38.146.45@o2ib4 up -1 0 63 0 640 640 640
10.38.146.45@o2ib4 up -1 1 63 0 640 640 622
10.38.146.45@o2ib4 up -1 1 63 0 640 639 623

jsimmons@client:/proc/sys/lnet$ cat routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
5 1 3 down 51 0 68 0 10.38.144.12@o2ib4
4 1 2 up 0 1 NA 0 10.38.144.13@o2ib4

jsimmons@client:/proc/sys/lnet$ cat routes
Routing disabled
net hops priority state router
o2ib 1 0 down 10.38.144.12@o2ib4
o2ib 1 0 up 10.38.144.13@o2ib4

jsimmons@server:/proc/sys/lnet$ cat routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 5 up 4 1 NA 1 10.36.145.12@o2ib

jsimmons@server:/proc/sys/lnet$cat routes
Routing disabled
net hops priority state router
o2ib4 1 0 down 10.36.145.12@o2ib

Comment by James A Simmons [ 30/Dec/14 ]

This time we unplugged the IB interface facing the server:

jsimmons@client:/proc/sys/lnet$ cat nis
nid status alive refs peer rtr max tx min
0@lo up 0 2 0 0 0 0 0
0@lo up 0 0 0 0 0 0 0
0@lo up 0 0 0 0 0 0 0
0@lo up 0 0 0 0 0 0 0
10.38.146.45@o2ib4 up -1 1 63 0 640 640 640
10.38.146.45@o2ib4 up -1 0 63 0 640 640 640
10.38.146.45@o2ib4 up -1 1 63 0 640 640 616
10.38.146.45@o2ib4 up -1 1 63 0 640 640 623

jsimmons@client:/proc/sys/lnet$ cat routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 4 up 33 1 NA 0 10.38.144.12@o2ib4
4 1 2 up 19 1 NA 0 10.38.144.13@o2ib4

jsimmons@client:/proc/sys/lnet$ cat routes
Routing disabled
net hops priority state router
o2ib 1 0 up 10.38.144.12@o2ib4
o2ib 1 0 up 10.38.144.13@o2ib4

jsimmons@server:/proc/sys/lnet$ cat routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
5 1 6 down 76 0 NA 0 10.36.145.12@o2ib

jsimmons@server:/proc/sys/lnet$ cat routes
Routing disabled
net hops priority state router
o2ib4 1 0 down 10.36.145.12@o2ib

Comment by Liang Zhen (Inactive) [ 31/Dec/14 ]

Hi James, thanks for the information, but the "lnet/nis" I asked for was from the router, not the client. I'm wondering if the unplugging here is the same as in LU-6060 (the LNet interface is also absent).

Comment by James A Simmons [ 31/Dec/14 ]

We ran with the LU-6060 patch as well.

Comment by Liang Zhen (Inactive) [ 31/Dec/14 ]

Looks like when unplugging the IB interface facing the client, the router status (on the server) is correct:

jsimmons@server:/proc/sys/lnet$ cat routers
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 5 up 4 1 NA 1 10.36.145.12@o2ib  <<<<<<<<<<< it has 1 down_ni, which is correct

jsimmons@server:/proc/sys/lnet$cat routes
Routing disabled
net hops priority state router
o2ib4 1 0 down 10.36.145.12@o2ib <<<<<<<<< status is down, which is correct

So these outputs look correct to me.

However, when unplugging the IB interface facing the server, the router status (on the client) is incorrect:

jsimmons@client:/proc/sys/lnet$ cat routers 
ref rtr_ref alive_cnt state last_ping ping_sent deadline down_ni router
4 1 4 up 33 1 NA 0 10.38.144.12@o2ib4  <<<<<<< down_ni is 0, which is wrong
4 1 2 up 19 1 NA 0 10.38.144.13@o2ib4
jsimmons@client:/proc/sys/lnet$ cat routes
Routing disabled
net hops priority state router
o2ib 1 0 up 10.38.144.12@o2ib4   <<<<<<<<<<< status is up, which is wrong
o2ib 1 0 up 10.38.144.13@o2ib4

This is incorrect, and it is also very strange to me, because there is no server/client difference from the point of view of LNet. I actually did the same test in our lab and the router status was correct on both client and server.
So I have a couple of questions:

  • Could you please confirm that both client and server are running with the same patches?
  • As I previously mentioned, could you please post the output of "cat /proc/sys/lnet/nis" on the router while running these tests?
Comment by James A Simmons [ 31/Dec/14 ]

Yep, both clients and servers are running with a properly patched version (LU-5485/LU-6060). I just uploaded the data to ftp.whamcloud.com/uploads/LU-5758/arf-test.tgz. The files are all taken from the router. The files with "client" in the name mean that the router interface facing the clients was unplugged; the files labeled "server" mean the router interfaces facing the servers were unplugged; "normal" means everything was plugged in. The reason for multiple client and server files is that the lctl dk dumps were taken a few times each, a few seconds apart.

Comment by Liang Zhen (Inactive) [ 04/Jan/15 ]

Hi James, because it normally takes about 2 minutes to mark an NI as "DOWN" (depending on router_ping_timeout and dead/live_router_check_interval), may I ask how long you waited after unplugging the interface, and what the values of those module parameters are?
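
For reference, those values can usually be read from the module parameters on the router, assuming they are exposed under /sys/module on your kernel (otherwise they will be in the modprobe conf if they were explicitly set), e.g.:

cat /sys/module/lnet/parameters/router_ping_timeout
cat /sys/module/lnet/parameters/live_router_check_interval
cat /sys/module/lnet/parameters/dead_router_check_interval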

I checked your logs:

Facing the client, the last alive news is 42 seconds old, which seems not long enough:

nid                                      status alive refs peer  rtr   max    tx   min
10.38.144.12@o2ib4           up    42    1   63    0   640   640   640

Facing the server, the last alive news is only 16 seconds old:

nid                                      status alive refs peer  rtr   max    tx   min
10.36.145.12@o2ib            up    16    3   63    0   640   640   634

Because in previous logs we saw inconsistent results from server and client (the server saw the correct NI status on the router, but the client didn't), I suspect this test should wait a little longer.
If this is true, I have some changes for LU-5485 which will avoid using a potentially dead router and help this case, but I need more time to make them ready for production.

Comment by James A Simmons [ 04/Jan/15 ]

If I remember right, for this set of tests we waited for 5 minutes, then waited another 5 minutes after that, and it remained in the same state. For the next round of tests I will ensure that the logs are taken after the timer expires.

Comment by Jian Yu [ 08/Jan/15 ]

If this is true, I have some changes for LU-5485 which will avoid using a potentially dead router and help this case, but I need more time to make them ready for production.

Here is the patch created by Liang on Lustre b2_5 branch: http://review.whamcloud.com/12435

Comment by James A Simmons [ 08/Jan/15 ]

The tests were done with 12435. It helped in that we can now mount Lustre with ARF enabled, but ARF itself still doesn't work.

Comment by Jian Yu [ 08/Jan/15 ]

I see, thank you James for the clarification.

Hi Liang,
Would you like James to gather more logs to help investigate the ARF issue?

Comment by Liang Zhen (Inactive) [ 09/Jan/15 ]

Yujian, yes we need more logs to find out the reason.

James, in the next round of tests, could you please sample lnet/nis on the router every 10 seconds and record it for a total of 300 seconds? A simple loop like the sketch below would do.

Also, it would be helpful to know these values on the router: live_router_check_interval, live_router_check_interval, router_ping_timeout. And if possible, could you post your lnet source here so I can check all the patches?
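
A minimal sampling sketch (the output file name is just an example):

# on the router: capture lnet/nis every 10 seconds for 300 seconds
for i in $(seq 1 30); do
    echo "=== sample $i: $(date) ===" >> /tmp/lnet-nis-samples.log
    cat /proc/sys/lnet/nis >> /tmp/lnet-nis-samples.log
    sleep 10
done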

Thanks in advance.

Comment by James A Simmons [ 13/Jan/15 ]

Liang, I'm setting up the test. I should have results soon. I think you meant some other parameter besides listing live_router_check_interval twice above.

Comment by Liang Zhen (Inactive) [ 14/Jan/15 ]

Hi James, because there have been many discussions already, I think we should summarise where we are now. Please check and correct me if I'm wrong or have missed something:

  • there are two LNet patches in your environment
  • both patches are applied to all nodes (client, server and router)
  • If there is any other LNet patch, could you post a link to the patches, or a tarball of lnet?
  • ARF can't detect unplugged IB interface on router with above changes
    • in the comment posted at 31/Dec/14 8:59 AM, we found the router status on the client is wrong when unplugging the IB interface facing the server, but the router status on the server is correct after unplugging the IB interface facing the client.
  • This problem could be either because the router can't set the NI status to "DOWN" for some unknown reason, or because the client/server record the wrong NI status for the router by mistake. To find out which is the real reason, we first need to check the NI status on the router after unplugging the NI. Because it will take a few minutes for LNet to detect and mark an NI as down, we need to sample /proc/sys/lnet/nis for 5 minutes, once per 10 seconds.
    • if the NI status stays UP forever, then we know the router has a defect and can't change the NI status.
    • if the NI status turns to DOWN, it's probably a bug on a non-router node; could you check /proc/sys/lnet/routes and /proc/sys/lnet/routers on the client and server?
  • we also need to know these values in your environment:
    • live_router_check_interval
    • dead_router_check_interval
    • router_ping_timeout

I also have another question:

  • The "unplugging IB interface" at here has any difference with the experiment for LU-6060?

Isaac, do you have any advice or insight on this issue?

Comment by Jian Yu [ 05/Feb/15 ]

Hi Liang,
With the two LNet patches you mentioned above, ORNL tested ARF on a small-scale cluster and found that it worked. They are going to test it on a large-scale cluster.

Now, they have a question about ARF:
"Does this need to be completely on or completely off, or is it possible to have some clusters have it enabled and not others?"

Hi James,
Could you please explain more in case I did not describe the question clearly? Thank you.

Comment by Liang Zhen (Inactive) [ 12/Feb/15 ]

It's OK to mix them; in that case the client/server just can't find out about a failed remote NI on the routers.

Comment by Jian Yu [ 12/Mar/15 ]

Hi James,
Does the ARF issue still exist? If no, can we close this ticket as resolved?

Comment by James A Simmons [ 15/Jul/15 ]

No problems. We have been running ARF for months now. You can close this ticket.

Comment by Peter Jones [ 15/Jul/15 ]

Great - thanks James
