[LU-1344] Evicted Clients Created: 26/Apr/12 Updated: 19/Oct/12 Resolved: 19/Oct/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Dennis Nelson | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 5.5 on Lustre servers |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4030 |
| Description |
|
Customer reported a number of clients were evicted. All clients had difficulties communicating with OSTs on a single OSS. Johann has looked at the client logs but I did not have server logs at the time. I now have the server logs. I have attached them to this ticket. I need recommendations on how to prevent this from happening in the future. Should I consider changing the OBD timeout from the default 100s? Should I consider reducing the number of OST service threads (default 256)? |
| Comments |
| Comment by Johann Lombardi (Inactive) [ 26/Apr/12 ] |
|
The server logs confirm that there are network problems. |
| Comment by Cliff White (Inactive) [ 26/Apr/12 ] |
|
The obd timeout is now adjusted automatically, so there should be no need to change it. The errors appear to show a network or other failure rather than a server overload. The client evictions |
| Comment by Dennis Nelson [ 26/Apr/12 ] |
|
So, I understand that the obd timeout is mostly deprecated with the introduction of adaptive timeouts; that is why it is still set to the default. A coworker pointed out the following passage from the manual and suggested that it might be helpful to increase it: |
|
"In previous Lustre versions, the static obd_timeout (/proc/sys/lustre/timeout) value was used as the maximum completion time for all RPCs; this value also affected the client-server ping interval and initial recovery timer. Now, with adaptive timeouts, obd_timeout is only used for the ping interval and initial recovery estimate. When a client reconnects during recovery, the server uses the client's timeout value to reset the recovery wait period; i.e., the server learns how long the client had been willing to wait, and takes this into account when adjusting the recovery period." |
|
I found that some sar data is being collected, and CPU idle time never went below 90%. The reason I think this might be load related is the following server log entry: |
|
Apr 20 20:05:14 lfs-oss-1-13 kernel: Lustre: Service thread pid 32236 completed after 278.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). |
|
Also, the clients did not have any issues with OSTs on any of the other servers during this time. All kernel errors are directed to /var/log/kern.log, and you got that entire file. /var/log/messages contains no messages at all since the last boot 12 days ago. I have not seen any logs on the server that indicate there was a network issue at the time. |
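|
For reference, the timeout and OST service-thread settings under discussion can be inspected with lctl. This is an illustrative sketch only, assuming a server where these parameters are exposed; exact parameter names and defaults vary between Lustre versions (on 1.8, OSS thread counts are often set via the oss_num_threads module option instead), and the threads_max value shown is a hypothetical example, not a recommendation: |
|
```shell
# Static obd timeout (seconds); with adaptive timeouts enabled this only
# governs the ping interval and the initial recovery estimate.
lctl get_param timeout          # equivalent to: cat /proc/sys/lustre/timeout

# Adaptive-timeout bounds currently in effect.
lctl get_param at_min at_max

# On the OSS: current and maximum OST service thread counts.
lctl get_param ost.OSS.ost.threads_started ost.OSS.ost.threads_max

# Example only: cap the OST service threads (value 128 is hypothetical).
lctl set_param ost.OSS.ost.threads_max=128
```
|
Changes made with lctl set_param do not persist across a reboot; a persistent change would need the corresponding module option or boot-time configuration. |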
| Comment by Peter Jones [ 27/Apr/12 ] |
|
Liang is reviewing the logs |
| Comment by Dennis Nelson [ 03/May/12 ] |
|
Any update on the review of the logs? |
| Comment by Kit Westneat (Inactive) [ 07/Sep/12 ] |
|
We haven't seen this since, so we can close it. |