[LU-5617] MDS hang and would like to know the cause Created: 12/Sep/14  Updated: 21/Nov/16  Resolved: 21/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Haisong Cai (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Linux monkey-mds-10-3.local 2.6.32-358.23.2.el6_lustre.x86_64 #1 SMP Thu Dec 19 19:57:45 PST 2013 x86_64 x86_64 x86_64 GNU/Linux


Attachments: Text File lustre-log.1410510357.3995.gz     Text File lustre-log.1410524785.3872.gz     File monkey-mds-10-3.messages.gz     File monkey-mds-10-3_messages_all.gz    
Rank (Obsolete): 15712

 Description   

Our MDS hung this morning, to the point where logging in to the server was denied. We had to power cycle the server to regain access to it.

I am uploading "/var/log/messages" and 2 kernel trace dumps.
I need your help interpreting these logs to determine
where the problem most likely lies: Lustre, networking, client overload, etc.

thanks



 Comments   
Comment by Cliff White (Inactive) [ 12/Sep/14 ]

It appears from the MDS log that something was happening prior to the start of the log you attached. Can we get the log from the MDS for the 24 hours prior to the first log?
Were there any indications of network errors?

Comment by Haisong Cai (Inactive) [ 12/Sep/14 ]

Hi Cliff,

I am including /var/log/messages for the MDS since the last log rotation (Sept 7) here.
For both Sept 10 and Sept 11, nothing was logged.

Comment by Peter Jones [ 13/Sep/14 ]

Niu

Is there anything that you can determine from the information provided?

Peter

Comment by Niu Yawei (Inactive) [ 15/Sep/14 ]

I saw lots of page allocation failures; it looks like your system is running short of memory. The reason is unclear to me, but it can be alleviated by tuning the vm parameters:

  • Increase vm.min_free_kbytes;
  • Set vm.zone_reclaim_mode to 1.
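
Before changing either parameter it may help to record the current values. The following is a minimal sketch, assuming a Linux host that exposes both settings as files under /proc/sys/vm; the function name read_vm_settings and its base argument are illustrative and do not come from this ticket.

```python
from pathlib import Path

# The two vm parameters named in the comment above.
VM_SETTINGS = ("min_free_kbytes", "zone_reclaim_mode")


def read_vm_settings(base="/proc/sys/vm"):
    """Return {setting name: integer value}, or None where a value cannot be read.

    `base` is an assumption: the usual procfs location on Linux. A setting
    that is missing or unreadable (for example zone_reclaim_mode on a kernel
    built without NUMA support) is reported as None instead of raising.
    """
    values = {}
    for name in VM_SETTINGS:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_settings().items():
        print(f"vm.{name} = {value}")
```

Running this on the MDS before and after a change gives a record of what was actually in effect, which is also the information asked for later in this thread.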

Comment by Haisong Cai (Inactive) [ 15/Sep/14 ]

Hi Niu,

Can you recommend a value to set for vm.min_free_kbytes?
Our MDS has 24GB RAM.

thanks,
Haisong

Comment by Niu Yawei (Inactive) [ 15/Sep/14 ]

> Can you recommend a value to set for vm.min_free_kbytes?
> Our MDS has 24GB RAM.

What's the current value? I don't have experience tuning these values; I think you need to try a larger value and see how it works (but don't set it too large, probably less than 5% of total memory?).
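
To put the "less than 5% of total memory" guess into the unit that vm.min_free_kbytes uses, the arithmetic can be sketched as below. This assumes the 24GB figure means 24 GiB; the memory the kernel actually reports is usually somewhat lower, so the result is a rough ceiling and not a recommended setting. The function name min_free_kbytes_ceiling is illustrative.

```python
def min_free_kbytes_ceiling(total_ram_gib, percent=5):
    """Return `percent` of total RAM, expressed in KiB.

    Integer arithmetic only: 1 GiB = 1024 * 1024 KiB.
    """
    total_kib = total_ram_gib * 1024 * 1024
    return total_kib * percent // 100


# 24 GiB of RAM, 5% ceiling suggested in the comment above.
print(min_free_kbytes_ceiling(24))  # prints 1258291 (about 1.2 GiB)
```

Any value tried on the MDS would then sit somewhere between the current setting and this ceiling, increased in steps while watching for further page allocation failures.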

Comment by Niu Yawei (Inactive) [ 21/Nov/16 ]

This happened because the system was running out of memory, which could be caused by LU-5726. Closing as a duplicate of LU-5726.
