[LU-6748] excessive client reconnects to OSS servers under heavy IO workload. Created: 19/Jun/15 Updated: 31/Aug/15 Resolved: 31/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
While testing the last pre-2.8 code I noticed heavy client reconnects to OSS servers. The error on the client side was:

Lustre: sultan-OST0008-osc-ffff8803ea302800: Connection to sultan-OST0008 (at 10.37.248.69@o2ib1) was lost; in progress operations using this service will wait for recovery to complete

and the messages seen on the OSS side were:

[20639.820176] Lustre: sultan-OST0008: Client 57c62113-31f1-f463-ffeb-9d0c7541279d (at 26@gni1) reconnecting

This occurred when I ran a file-per-process IOR job across 20 nodes with 32 threads per client. |
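For reference, a minimal sketch (not part of the original report) of how the reconnect storm can be quantified from a saved client or server console log. The search patterns are taken from the messages quoted above; the script itself and its usage are illustrative assumptions.

```python
#!/usr/bin/env python3
# Sketch: count Lustre "connection lost" and client reconnect events per
# target in a saved dmesg/console log, to gauge how heavy the reconnects are.
# The regexes match the message formats quoted in this ticket.
import re
import sys
from collections import Counter

LOST = re.compile(r"Connection to (\S+) \(at \S+\) was lost")
RECONNECT = re.compile(r"Lustre: (\S+): Client \S+ \(at \S+\) reconnecting")

def count_events(path):
    lost, reconnects = Counter(), Counter()
    with open(path, errors="replace") as log:
        for line in log:
            m = LOST.search(line)
            if m:
                lost[m.group(1)] += 1
            m = RECONNECT.search(line)
            if m:
                reconnects[m.group(1)] += 1
    return lost, reconnects

if __name__ == "__main__":
    lost, reconnects = count_events(sys.argv[1])
    for target, n in sorted(lost.items()):
        print(f"{target}: {n} 'connection lost' events")
    for target, n in sorted(reconnects.items()):
        print(f"{target}: {n} client reconnects")
```

Usage would be something like `python3 count_reconnects.py /var/log/messages` on the client or OSS.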
| Comments |
| Comment by James A Simmons [ 19/Jun/15 ] |
|
Uploaded logs to ftp.whamcloud.com/uploads/ |
| Comment by Jinshan Xiong (Inactive) [ 19/Jun/15 ] |
|
The client was experiencing slow replies, even though the OSS had sent early replies. Here are some suspicious messages:

<node_health:5.1> APID:831 (xtcheckhealth) WARNING: Advanced_features and anyapid check are both configured on. Application test could falsely mark nodes unhealthy.
<node_health:5.1> RESID:3043 (xtcheckhealth) WARNING: Advanced_features and anyapid check are both configured on. Application test could falsely mark nodes unhealthy.

I believe these come from the GNI LND; do they imply anything? |
| Comment by James A Simmons [ 20/Jun/15 ] |
|
That is the Cray health check warning that the file system is sick. The message comes from a user-land utility. |
| Comment by Jinshan Xiong (Inactive) [ 20/Jun/15 ] |
|
When did you notice the regression? Can you identify which patches caused it, or the range of dates when those patches landed? From your description, it looks like this is related to ldlm lock handling. Please check the LRU size on the client side and how many locks are cached in the LRU when the problem is reproduced. If possible, please dump the locks so that we can investigate further. |
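A minimal sketch of collecting the requested numbers on a client, assuming the usual ldlm namespace layout under /proc/fs/lustre/ldlm/namespaces/ with per-namespace lru_size and lock_count files (the location can differ between Lustre releases; `lctl get_param ldlm.namespaces.*osc*.lock_count` is the equivalent supported interface).

```python
#!/usr/bin/env python3
# Sketch: print the configured LRU size and the number of cached ldlm locks
# per OSC namespace on a Lustre client.
# Assumption: namespaces are exposed under /proc/fs/lustre/ldlm/namespaces/.
import glob
import os

def read_value(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"

for ns in sorted(glob.glob("/proc/fs/lustre/ldlm/namespaces/*osc*")):
    name = os.path.basename(ns)
    lru_size = read_value(os.path.join(ns, "lru_size"))
    lock_count = read_value(os.path.join(ns, "lock_count"))
    print(f"{name}: lru_size={lru_size} lock_count={lock_count}")
```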
| Comment by James A Simmons [ 22/Jun/15 ] |
|
Just as I thought: I found the source of the regression. I had map_on_demand set on the OSS servers. I unmounted the file system and restarted the o2iblnd layer without map_on_demand set, and I stopped seeing the reconnect issues. So map_on_demand is a problem on normal systems as well. |
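For anyone checking whether a server is running with this setting, map_on_demand is a ko2iblnd module parameter, so it can be read back from sysfs once the module is loaded. A minimal sketch, assuming the standard /sys/module/<name>/parameters/ layout:

```python
#!/usr/bin/env python3
# Sketch: report the o2iblnd map_on_demand setting on this node.
# Assumption: ko2iblnd is loaded and exposes its parameters under
# /sys/module/ko2iblnd/parameters/ (standard sysfs behaviour for module params).
from pathlib import Path

param = Path("/sys/module/ko2iblnd/parameters/map_on_demand")
if param.exists():
    print(f"map_on_demand = {param.read_text().strip()}")
else:
    print("ko2iblnd not loaded or map_on_demand parameter not exposed")
```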
| Comment by James A Simmons [ 27/Aug/15 ] |
|
We can close this as a duplicate of |