[LU-828] Lustre Client Unstable Created: 09/Nov/11  Updated: 27/Jan/12  Resolved: 27/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: None

Type: Task Priority: Critical
Reporter: Chakravarthy N Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: o2iblnd
Environment:

RHEL 5.5
2* MDS servers, 6*OSS servers handling 3OST each (HW config: Dual Intel Westmere processor with 6 cores each, 24GB Memory)
no. of clients: 368


Attachments: File HOME_MESSAGES.rar     File SCRATCH_MESSAGES.rar     File cn363-computenode_messages.rar     Zip Archive performance.zip    
Epic: client
Rank (Obsolete): 10203

 Description   

Hi ,

We have Lustre 2.0 deployed at our site, with 2 MDS servers and 6 OSS servers.

We are facing an issue where the "lfs check servers" output varies from node to node.

Please find below the output from two different clients, taken at the same time.

We would appreciate your help in resolving this issue.

Also attached is the /var/log/messages output from the node showing this error.

[root@cn367 ~]# lfs check servers

scratch-OST0011-osc-ffff810c01a9c000 active.

[root@mgmt00 ~]# lfs check servers

error: check 'scratch-OST0011-osc-ffff810c056c4000' Resource temporarily unavailable

Thanks & Regards,
N.Chakravarthy.



 Comments   
Comment by Cliff White (Inactive) [ 09/Nov/11 ]

Check the system logs for the two nodes. It is quite possible for one client to be happy while a second client is unable to reach a particular server; this can be caused by network issues or other errors. Check the system logs and dmesg for both clients, and for the OSS involved; there should be some additional errors that will help you sort out the situation.
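(As a rough sketch of the kind of checks meant here, using standard commands; hostnames and device names are examples only:)

# on each client and on the OSS serving scratch-OST0011
dmesg | grep -i -e lustre -e lnet
grep -i -e lustre -e lnet /var/log/messages | tail -n 200

# list the Lustre devices and their state as seen from the client
lctl dl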

Comment by Chakravarthy N [ 09/Nov/11 ]

Thanks for your reply...

I can see errors in the logs as below... Does this mean it is due to multicast packet drops in the InfiniBand network?

ADDRCONF(NETDEV_UP): ib0: link is not ready
ib0: enabling connected mode will cause multicast packet drops
ib0: mtu > 4092 will cause multicast packet drops.
ib0: mtu > 4092 will cause multicast packet drops.
ib1: enabling connected mode will cause multicast packet drops
ib1: mtu > 4092 will cause multicast packet drops.
ib1: mtu > 4092 will cause multicast packet drops.

LustreError: 11-0: an error occurred while communicating with 10.2.2.187@o2ib. The mds_getxattr operation failed with -95
LustreError: 11-0: an error occurred while communicating with 10.2.2.187@o2ib. The mds_getxattr operation failed with -95
LustreError: 11-0: an error occurred while communicating with 10.2.2.187@o2ib. The mds_getxattr operation failed with -95
LustreError: 11-0: an error occurred while communicating with 10.2.2.187@o2ib. The mds_getxattr operation failed with -95
LustreError: 11-0: an error occurred while communicating with 10.2.2.187@o2ib. The mds_getxattr operation failed with -95
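(Purely as a reference sketch: basic LNet reachability to the NID in those messages can be checked from the affected client with something like the following; a failure here would point at the IB/LNet layer rather than at Lustre itself. The client NID below is a placeholder.)

# from the client reporting the errors
lctl ping 10.2.2.187@o2ib
# show this node's own NIDs, then ping back from the MDS/OSS side
lctl list_nids
lctl ping <client-nid>@o2ib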

Comment by Chakravarthy N [ 09/Nov/11 ]

The only difference I can see is that this node has two IB interfaces configured in load-balancing mode; the LNet entry is "options lnet networks=o2ib0". I hope that is OK.
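(As a hedged side note, not a confirmed fix: when a node has more than one IB port, LNet is often pinned to a specific interface so there is no ambiguity about which port o2ib0 uses. A sketch of what that could look like, assuming ib0 is the active interface:)

# /etc/modprobe.conf (or /etc/modprobe.d/lustre.conf)
options lnet networks="o2ib0(ib0)"

A change like this only takes effect after the Lustre/LNet modules are unloaded and reloaded, or after a reboot.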

Comment by Chakravarthy N [ 09/Nov/11 ]

Just an update on this issue: the error only shows up when we do an "ls" on the Lustre filesystem. We have checked the same on the non-problematic node as well, and the issue remains there too...

Please suggest...

Comment by Chakravarthy N [ 14/Nov/11 ]

We would appreciate your help as soon as possible, since the entire production system is affected...

Please do the needful.

Comment by Raghavendra Badiger [ 15/Nov/11 ]

Sample compute node syslog messages from a node where the issue is noticed.

Comment by Raghavendra Badiger [ 15/Nov/11 ]

Syslog messages from the Lustre MGS, MDS, and OSS server nodes for the /home Lustre filesystem.

Comment by Raghavendra Badiger [ 15/Nov/11 ]

Syslog messages from the Lustre MGS, MDS, and OSS server nodes for the /scratch Lustre filesystem.

Comment by Raghavendra Badiger [ 15/Nov/11 ]

Hi Cliff,

I have uploaded the following 3 archive files containing the syslog messages of the client node where the problem is noticed, and the MGS, MDS, and OSS server node messages for both the /home and /scratch Lustre filesystems. Could you please take a quick look through the logs to see what is causing the issue and the unexpected symptoms (such as "Resource temporarily unavailable" and ls/cat commands hanging) reported here?

cn363-computenode_messages.rar
HOME_MESSAGES.rar
SCRATCH_MESSAGES.rar

[root@cn367 ~]# lfs check servers
scratch-OST0011-osc-ffff810c01a9c000 active.

[root@mgmt00 ~]# lfs check servers
error: check 'scratch-OST0011-osc-ffff810c056c4000' Resource temporarily unavailable

Additional symptoms are:
When we try to ls the directory /scratch/qgr/R8, it hangs and the resource becomes unavailable.
With "ls --color=none" we are able to list the files.
But when we do "ls -l --color=none" it hangs again.
We are not able to cat the files on which "ls -l" hangs; the other files work fine.
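(A reference sketch for narrowing this down: a plain "ls" only needs the directory listing from the MDS, while "ls -l" and "cat" also need the OSTs, so the hangs are consistent with one or more OSTs/OSCs being unreachable. The commands below can help map the hanging files to specific OSTs; paths are the ones from this report.)

# which OSTs hold the objects for the problematic directory/files
lfs getstripe /scratch/qgr/R8
# per-OST space/state view from this client; unavailable OSTs usually stand out here
lfs df /scratch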

Thanks & Regards
-Raghu

Comment by Cliff White (Inactive) [ 15/Nov/11 ]

If your InfiniBand network is dropping packets, that would cause this issue.
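(A sketch of common InfiniBand-side checks, using standard OFED/infiniband-diags tools; availability depends on the installed OFED stack:)

# link state, rate, and physical state of each local HCA port
ibstat
# local port error counters (symbol errors, link downed, receive errors, etc.)
perfquery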

Comment by Chakravarthy N [ 28/Nov/11 ]

Cliff,

Just an update on this issue...

We took a downtime of the entire Lustre filesystem, and "ls", "du", and everything else started working...

To my understanding, the recovery, together with the open files and client-side caches being cleared, is what solved the issue.

Could you please suggest a permanent solution for this issue, such as clearing the cache and closing open files automatically, without having to run lfsck?
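(For context, a sketch of the client-side commands that are normally used to drop cached locks and data without a remount; whether they would have been sufficient in this case is not established:)

# drop this client's cached Lustre DLM locks
lctl set_param ldlm.namespaces.*.lru_size=clear
# drop the Linux page/dentry/inode caches on the client
echo 3 > /proc/sys/vm/drop_caches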

We would appreciate your early help on this...

Comment by Chakravarthy N [ 29/Nov/11 ]

Cliff,

We would appreciate your suggestions on this, since we are in bad shape... Please do the needful.

Comment by Andreas Dilger [ 27/Jan/12 ]

Since you are an unsupported customer, the only thing I can suggest is that you upgrade to the latest Lustre 2.1.0 release to determine whether it fixes your problem. To get Whamcloud support for your system, please contact info@whamcloud.com for more information.
