[LU-6748] Excessive client reconnects to OSS servers under heavy I/O workload. Created: 19/Jun/15  Updated: 31/Aug/15  Resolved: 31/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: James A Simmons Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-6723 Setting map_on_demand for o2iblnd dri... Resolved
Related
is related to LU-6723 Setting map_on_demand for o2iblnd dri... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing the latest pre-2.8 code I noticed heavy client reconnects to the OSS servers. The errors on the client side were:

Lustre: sultan-OST0008-osc-ffff8803ea302800: Connection to sultan-OST0008 (at 10.37.248.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 55 previous similar messages
Lustre: 5355:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 61 previous similar messages
Lustre: 5350:0:(client.c:2009:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434742560/real 1434742560] req@ffff8803c23fb6c0 x1504421695570504/t0(0) o8->sultan-OST0023-osc-ffff8803ea302800@10.37.248.72@o2ib1:28/4 lens 520/544 e 0 to 1 dl 1434742568 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 5350:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection restored to sultan-OST0000 (at 10.37.248.69@o2ib1)
Lustre: Skipped 27 previous similar messages
Lustre: 5356:0:(client.c:2009:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1434742782/real 1434742782] req@ffff8803c1b639c0 x1504421695572244/t0(0) o400->sultan-OST0034-osc-ffff8803ea302800@10.37.248.69@o2ib1:28/4 lens 224/224 e 0 to 1 dl 1434742789 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection to sultan-OST0000 (at 10.37.248.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 41 previous similar messages
Lustre: 5356:0:(client.c:2009:ptlrpc_expire_one_request()) Skipped 73 previous similar messages
Lustre: sultan-OST0000-osc-ffff8803ea302800: Connection restored to sultan-OST0000 (at 10.37.248.69@o2ib1)
Lustre: Skipped 41 previous similar messages
Lustre: sultan-OST0003-osc-ffff8803ea302800: Connection restored to sultan-OST0003 (at 10.37.248.72@o2ib1)
Lustre: Skipped 27 previous similar messages

and the messages seen on the OSS side were:

[20639.820176] Lustre: sultan-OST0008: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20639.829910] Lustre: Skipped 20 previous similar messages
[20676.881745] Lustre: sultan-OST000c: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20676.891462] Lustre: Skipped 29 previous similar messages
[20868.910972] Lustre: sultan-OST0004: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20868.920682] Lustre: Skipped 23 previous similar messages
[20906.993360] Lustre: sultan-OST0000: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20906.993364] Lustre: sultan-OST0004: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting
[20906.993368] Lustre: Skipped 17 previous similar messages
[20907.018191] Lustre: Skipped 11 previous similar messages

This occurred when I ran a file-per-process IOR job across 20 nodes with 32 threads per client.
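
For context, a run of that shape could be reproduced with an IOR invocation along these lines (a minimal sketch only; the launcher, block/transfer sizes, and output path are assumptions, not the exact command used here):

# 20 nodes x 32 tasks per node = 640 MPI tasks, one file per process (-F)
mpirun -np 640 ior -a POSIX -F -w -r -b 1g -t 1m -o /lustre/sultan/ior_test/ior_file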



 Comments   
Comment by James A Simmons [ 19/Jun/15 ]

Uploaded logs to ftp.whamcloud.com/uploads/LU-6748/*

Comment by Jinshan Xiong (Inactive) [ 19/Jun/15 ]

The client was experiencing slow replies, yet the OSS never even sent early replies.

Here are some suspicious messages:

<node_health:5.1> APID:831 (xtcheckhealth) WARNING: Advanced_features and anyapid check are both configured on. Application test could falsely mark nodes unhealthy.

<node_health:5.1> RESID:3043 (xtcheckhealth) WARNING: Advanced_features and anyapid check are both configured on. Application test could falsely mark nodes unhealthy.

I believe they are from the GNI LND. Do these messages imply anything?

Comment by James A Simmons [ 20/Jun/15 ]

That is the Cray health check warning that the file system is sick. The message comes from a user-land utility.

Comment by Jinshan Xiong (Inactive) [ 20/Jun/15 ]

When did you first notice the regression? Can you identify which patches caused it, or a range of dates when the patches landed?

From your description, it looks like this is related to ldlm lock handling. Please check the LRU size on the client side and how many locks are cached in the LRU when the problem is reproduced. If possible, please dump the locks so that we can investigate further.
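
For example, the cached lock counts and LRU limits can be read on a client with lctl (a sketch using standard ldlm tunables; the exact namespace names differ per target, and whether dump_namespaces is exposed depends on the build):

# per-OSC ldlm namespace: cached lock count and current LRU size limit
lctl get_param ldlm.namespaces.*osc*.lock_count
lctl get_param ldlm.namespaces.*osc*.lru_size
# dump all ldlm namespaces into the kernel debug log, then save the log for upload
lctl set_param ldlm.dump_namespaces=1
lctl dk /tmp/ldlm_dump.txt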

Comment by James A Simmons [ 22/Jun/15 ]

Just as I thought: I found the source of the regression. I had map_on_demand set on the OSS servers. I unmounted the file system and restarted the o2iblnd layer without map_on_demand set, and the reconnect issues stopped. So map_on_demand is a problem on normal systems as well.
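
For anyone else chasing the same symptom, the tunable in question is the ko2iblnd module parameter; a rough way to check and clear it is sketched below (following normal modprobe conventions, not the exact steps used here; the config file name and example value are assumptions):

# current value on a running server
cat /sys/module/ko2iblnd/parameters/map_on_demand
# drop the setting from the o2iblnd options, e.g. in /etc/modprobe.d/lustre.conf:
#   options ko2iblnd map_on_demand=32    <-- remove this
# then unmount the targets and unload/reload the LNet stack so the new options take effect
lustre_rmmod
modprobe lnet
lctl network up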

Comment by James A Simmons [ 27/Aug/15 ]

We can close this as a duplicate of LU-6723.
