[LU-828] Lustre Client Unstable Created: 09/Nov/11 Updated: 27/Jan/12 Resolved: 27/Jan/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.0.0 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Critical |
| Reporter: | Chakravarthy N | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | o2iblnd | ||
| Environment: |
RHEL 5.5 |
||
| Attachments: |
|
| Epic: | client |
| Rank (Obsolete): | 10203 |
| Description |
|
Hi,

We are running Lustre 2.0 with 2 MDS servers and 6 OSS servers. We are facing an issue where the output of "lfs check servers" varies from node to node. Below are the outputs from two different clients, taken at the same time. We request your help in solving this issue. The /var/log/messages file from the node showing this error is also attached.

[root@cn367 ~]# lfs check servers
scratch-OST0011-osc-ffff810c01a9c000 active.

[root@mgmt00 ~]# lfs check servers
error: check 'scratch-OST0011-osc-ffff810c056c4000' Resource temporarily unavailable

Thanks & Regards, |
| Comments |
| Comment by Cliff White (Inactive) [ 09/Nov/11 ] |
|
Check the system logs for the two nodes. It is quite possible for one client to be happy while a second client is unable to reach a particular server. This can be caused by network issues or other errors. Check the system logs and dmesg on both clients and on the OSS involved; there should be some additional errors that will help you sort out the situation. |
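To make the per-client difference described above explicit, the two "lfs check servers" outputs can be compared target by target. The following is a minimal Python sketch, not a Lustre tool. It assumes the output lines look exactly like the two quoted in the description, and that the trailing 16-hex-digit suffix of each device name is a per-client instance id that can be stripped for comparison. The function names parse_check_servers and diff_clients are hypothetical.

```python
import re

# Line shapes taken from the two outputs quoted in this ticket.
ACTIVE_RE = re.compile(r"^(?P<dev>\S+) active\.$")
ERROR_RE = re.compile(r"^error: check '(?P<dev>[^']+)' (?P<msg>.+)$")
# Assumption: the trailing 16 hex digits differ per client and can be dropped.
INSTANCE_RE = re.compile(r"-[0-9a-f]{16}$")


def parse_check_servers(text):
    """Return {target: status}; status is 'active' or the error message."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        m = ACTIVE_RE.match(line)
        if m:
            status[INSTANCE_RE.sub("", m.group("dev"))] = "active"
            continue
        m = ERROR_RE.match(line)
        if m:
            status[INSTANCE_RE.sub("", m.group("dev"))] = m.group("msg")
    return status


def diff_clients(a, b):
    """Return {target: (status_a, status_b)} for targets whose status differs."""
    return {
        t: (a.get(t), b.get(t))
        for t in sorted(set(a) | set(b))
        if a.get(t) != b.get(t)
    }


# The two outputs from the description, one per client.
cn367 = parse_check_servers("scratch-OST0011-osc-ffff810c01a9c000 active.")
mgmt00 = parse_check_servers(
    "error: check 'scratch-OST0011-osc-ffff810c056c4000' "
    "Resource temporarily unavailable"
)
print(diff_clients(cn367, mgmt00))
```

Running the same comparison across more clients would show whether a target is unreachable from one client only or from many, which helps narrow down which node's logs to read first.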
| Comment by Chakravarthy N [ 09/Nov/11 ] |
|
Thanks for your reply. I can see the errors below in the logs. Does this mean that the problem is due to multicast packet drops in the InfiniBand network?

ADDRCONF(NETDEV_UP): ib0: link is not ready
LustreError: 11-0: an error occurred while communicating with 10.2.2.187@o2ib. The mds_getxattr operation failed with -95 |
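The number at the end of the LustreError line above is a negated Linux errno value, so decoding it is a useful first step when reading such messages. The sketch below is illustrative only. It uses a small hard-coded table with the two Linux errno values that appear in this ticket (11 and 95) instead of the platform's own errno module, so the result does not depend on where it runs. The function name decode_failure is hypothetical. Whether the decoded error is related to the connectivity problem is not established in this ticket.

```python
import re

# Linux errno values seen in this ticket (names and texts as on Linux).
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    95: ("EOPNOTSUPP", "Operation not supported"),
}

# Matches the tail of the LustreError line quoted in the comment above.
FAILED_RE = re.compile(
    r"The (?P<op>\w+) operation failed with -(?P<code>\d+)"
)


def decode_failure(line):
    """Return (operation, errno, name, text), or None if the line does not match."""
    m = FAILED_RE.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this small table"))
    return m.group("op"), code, name, text


log_line = (
    "LustreError: 11-0: an error occurred while communicating with "
    "10.2.2.187@o2ib. The mds_getxattr operation failed with -95"
)
print(decode_failure(log_line))
```

For the quoted line this yields EOPNOTSUPP ("Operation not supported") for mds_getxattr, whereas the "Resource temporarily unavailable" text printed by "lfs check servers" corresponds to EAGAIN.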
| Comment by Chakravarthy N [ 09/Nov/11 ] |
|
The only difference I can see is that this node has two IB interfaces configured in load-balancing mode; the lnet entry is "options lnet networks=o2ib0". I hope that is OK. |
| Comment by Chakravarthy N [ 09/Nov/11 ] |
|
An update on this issue: the error appears only when we run "ls" on the Lustre filesystem. We checked the same on the non-problematic node as well, and the issue remains. Please suggest how to proceed. |
| Comment by Chakravarthy N [ 14/Nov/11 ] |
|
We would appreciate your help as soon as possible, since the entire production environment is at stake. Please do the needful. |
| Comment by Raghavendra Badiger [ 15/Nov/11 ] |
|
Sample syslog messages from a compute node where the issue is noticed |
| Comment by Raghavendra Badiger [ 15/Nov/11 ] |
|
Syslog messages from the Lustre MGS, MDS, and OSS server nodes for the /home Lustre filesystem |
| Comment by Raghavendra Badiger [ 15/Nov/11 ] |
|
Syslog messages from the Lustre MGS, MDS, and OSS server nodes for the /scratch Lustre filesystem |
| Comment by Raghavendra Badiger [ 15/Nov/11 ] |
|
Hi Cliff,

I have uploaded the following 3 archive files containing syslog messages from the client node where this problem is noticed, and from the MGS, MDS, and OSS server nodes for both the /home and /scratch Lustre filesystems. Could you please glance through the logs to see what is causing the issue and the unexpected symptoms reported here (such as "Resource temporarily unavailable" and hangs of the ls and cat commands)?

cn363-computenode_messages.rar

[root@cn367 ~]# lfs check servers
[root@mgmt00 ~]# lfs check servers

Additional symptoms are,

Thanks & Regards |
| Comment by Cliff White (Inactive) [ 15/Nov/11 ] |
|
If your infiniband network is dropping packets, that would cause this issue. |
| Comment by Chakravarthy N [ 28/Nov/11 ] |
|
Cliff, an update on this issue: we took downtime for the entire Lustre filesystem, and afterwards "ls", "du", and everything else started working. To my understanding, the issue was related to recovery, open files, and caching on the client, and the downtime resolved it. Could you please suggest a permanent solution for this issue, such as clearing the cache or closing open files automatically, without running lfsck? We would appreciate your early help on this. |
| Comment by Chakravarthy N [ 29/Nov/11 ] |
|
Cliff, we would appreciate your suggestions on this, since we are in bad shape. Please do the needful. |
| Comment by Andreas Dilger [ 27/Jan/12 ] |
|
Since you are an unsupported customer, the only thing I can suggest is that you upgrade to the latest Lustre 2.1.0 release to determine whether it fixes your problem. To get Whamcloud support for your system, please contact info@whamcloud.com for more information. |