[LU-5617] MDS hang and would like to know the cause Created: 12/Sep/14 Updated: 21/Nov/16 Resolved: 21/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | Haisong Cai (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Linux monkey-mds-10-3.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux |
||
| Attachments: |
|
| Rank (Obsolete): | 15712 |
| Description |
|
Our MDS hung this morning, to the point where logins to the server were denied. We had to power cycle the server to regain access to it. I am uploading "/var/log/messages" and two kernel trace dumps. Thanks. |
| Comments |
| Comment by Cliff White (Inactive) [ 12/Sep/14 ] |
|
It appears from the MDS log that something was happening prior to the start of the log you attached. Can we get the log from the MDS for the 24 hours prior to the first log? |
| Comment by Haisong Cai (Inactive) [ 12/Sep/14 ] |
|
Hi Cliff, I am attaching /var/log/messages from the MDS covering the period since the last log rotation (Sept 7). |
| Comment by Peter Jones [ 13/Sep/14 ] |
|
Niu, is there anything that you can determine from the information provided? Peter |
| Comment by Niu Yawei (Inactive) [ 15/Sep/14 ] |
|
I saw lots of page allocation failures; it looks like your system is running short of memory. The reason is unclear to me, but it can be alleviated by tuning the vm parameters:
|
| Comment by Haisong Cai (Inactive) [ 15/Sep/14 ] |
|
Hi Niu, Can you recommend a value to set for vm.min_free_kbytes? thanks, |
| Comment by Niu Yawei (Inactive) [ 15/Sep/14 ] |
What's the current value? I don't have experience tuning these values; I think you need to try a larger value and see how it works (but don't set it too large, probably less than 5% of total memory?). |
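As a rough illustration of the "less than 5% of total memory" guideline above, the following sketch computes that ceiling from `MemTotal` in `/proc/meminfo` and compares it with the current `vm.min_free_kbytes`. The helper names (`parse_meminfo_kb`, `min_free_kbytes_ceiling`, `report`) are hypothetical, and the 5% figure is only the tentative upper bound suggested in the comment, not a tested recommendation for this MDS.

```python
def parse_meminfo_kb(text, key="MemTotal"):
    """Return the value in kB for `key` from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            # Lines look like "MemTotal:       65843520 kB"
            return int(rest.split()[0])
    raise KeyError(key)


def min_free_kbytes_ceiling(mem_total_kb, percent=5):
    """Upper bound for vm.min_free_kbytes as a percentage of total memory.

    The 5% default is an assumption taken from the suggestion in this
    ticket; integer arithmetic avoids floating-point rounding.
    """
    return mem_total_kb * percent // 100


def report():
    """Read the live values on a Linux host and print them side by side."""
    with open("/proc/meminfo") as f:
        total_kb = parse_meminfo_kb(f.read())
    with open("/proc/sys/vm/min_free_kbytes") as f:
        current_kb = int(f.read().strip())
    ceiling_kb = min_free_kbytes_ceiling(total_kb)
    print(f"MemTotal:               {total_kb} kB")
    print(f"current min_free_kbytes: {current_kb} kB")
    print(f"5% ceiling:             {ceiling_kb} kB")
```

Calling `report()` on the MDS prints the current setting next to the ceiling; any new value would still need to be chosen by trial, as the comment notes, and applied with `sysctl -w vm.min_free_kbytes=<value>`.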
| Comment by Niu Yawei (Inactive) [ 21/Nov/16 ] |
|
It was because the system was running out of memory; it could be caused by