[LU-1344] Evicted Clients Created: 26/Apr/12  Updated: 19/Oct/12  Resolved: 19/Oct/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Dennis Nelson Assignee: Liang Zhen (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CentOS 5.5 on Lustre servers
RHEL 6.1 on clients


Attachments: Text File lustre-client-errors.txt     Text File lustre-mds-errors.txt     Text File lustre-server-errors.txt    
Severity: 3
Rank (Obsolete): 4030

 Description   

Customer reported that a number of clients were evicted. All of the affected clients had difficulty communicating with OSTs on a single OSS. Johann has looked at the client logs, but I did not have server logs at the time. I now have the server logs and have attached them to this ticket. I need recommendations on how to prevent this from happening in the future.

Should I consider changing the OBD timeout from the default 100s?

Should I consider reducing the number of OST service threads (default 256)?
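
For reference, a minimal sketch (Python) of how the current values could be checked on an OSS. It assumes the stock /proc/sys/lustre/timeout file and that OST service threads show up as kernel threads whose names start with ll_ost; both assumptions should be verified on this system.

    #!/usr/bin/env python
    # Sketch: report the configured obd timeout and count running OST
    # service threads on an OSS. Assumes /proc/sys/lustre/timeout exists
    # and that OST service threads are named ll_ost* (verify locally).
    import os

    def read_obd_timeout(path="/proc/sys/lustre/timeout"):
        with open(path) as f:
            return int(f.read().strip())

    def count_ost_threads(prefix="ll_ost"):
        count = 0
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                # First line of /proc/<pid>/status is "Name:<tab><comm>".
                with open("/proc/%s/status" % pid) as f:
                    name = f.readline().split()[1]
            except (IOError, IndexError):
                continue
            if name.startswith(prefix):
                count += 1
        return count

    if __name__ == "__main__":
        print("obd_timeout: %ds" % read_obd_timeout())
        print("OST service threads: %d" % count_ost_threads())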



 Comments   
Comment by Johann Lombardi (Inactive) [ 26/Apr/12 ]

The server logs confirm that there are network problems.
Liang (our lnet expert) is having another look at the logs just in case.

Comment by Cliff White (Inactive) [ 26/Apr/12 ]

The obd timeout is now automatically adjusted, so there should be no need to change it.
You can check the 'timeouts' files - there is one per service under /proc/fs/lustre.
Each file provides a history and will show whether there are timeout issues.
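
A minimal sketch (Python) of that check, assuming the stock 1.8 layout where each service exposes a 'timeouts' file somewhere under /proc/fs/lustre:

    #!/usr/bin/env python
    # Sketch: walk /proc/fs/lustre and print every per-service 'timeouts'
    # history file so timeout spikes are easy to spot.
    import os

    ROOT = "/proc/fs/lustre"

    for dirpath, dirnames, filenames in os.walk(ROOT):
        if "timeouts" in filenames:
            path = os.path.join(dirpath, "timeouts")
            print("== %s ==" % path)
            try:
                f = open(path)
                print(f.read().rstrip())
                f.close()
            except IOError as e:
                print("  (unreadable: %s)" % e)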

The errors appear to show a network or other external failure rather than a server overload; the client evictions track back to errors on the server.
There are only LustreErrors in the server logs. Are there any indications of a network failure?
What was the load on the server when the clients dropped connections?
I would suggest upgrading to Lustre 1.8.7 as there are improvements in that release.

Comment by Dennis Nelson [ 26/Apr/12 ]

So, I understand that obd timeout is mostly deprecated with the introduction of adaptive timeouts; that is the reason it is still set to the default. A coworker pointed out the following passage from the manual and suggested that it might be helpful to increase it:

In previous Lustre versions, the static obd_timeout (/proc/sys/lustre/timeout) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client's timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period.

I found that some sar data is being collected, and CPU idle time never went below 90%. The reason I think this might be load related is the following server log entry:

Apr 20 20:05:14 lfs-oss-1-13 kernel: Lustre: Service thread pid 32236 completed after 278.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
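
A minimal sketch (Python) that scans a kernel log for these 'completed after' warnings and summarizes the durations, to show whether the 278s event was an outlier or part of a pattern. It assumes the same message format as the entry above and the /var/log/kern.log path mentioned below.

    #!/usr/bin/env python
    # Sketch: summarize "Service thread pid N completed after Xs" warnings
    # from a kernel log.
    import re
    import sys

    PATTERN = re.compile(r"Service thread pid (\d+) completed after ([\d.]+)s")

    def summarize(path):
        durations = []
        with open(path) as f:
            for line in f:
                m = PATTERN.search(line)
                if m:
                    durations.append(float(m.group(2)))
        if not durations:
            print("no slow service-thread completions found in %s" % path)
            return
        durations.sort()
        print("count=%d min=%.0fs median=%.0fs max=%.0fs" % (
            len(durations), durations[0],
            durations[len(durations) // 2], durations[-1]))

    if __name__ == "__main__":
        summarize(sys.argv[1] if len(sys.argv) > 1 else "/var/log/kern.log")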

Also, the clients did not have any issues with OSTs on any of the other servers during this time.

All kernel errors are directed to /var/log/kern.log, and you got that entire file. /var/log/messages contains no messages at all since the last boot 12 days ago. I have not seen any logs on the server that indicate there was a network issue at the time.

Comment by Peter Jones [ 27/Apr/12 ]

Liang is reviewing the logs

Comment by Dennis Nelson [ 03/May/12 ]

Any update on the review of the logs?

Comment by Kit Westneat (Inactive) [ 07/Sep/12 ]

We haven't seen this since; it can be closed.
