[LU-10203] Stonith device "Monitor" operation reports "Time Out" in Lustre HA cluster with Pacemaker Created: 07/Nov/17  Updated: 27/Feb/18  Resolved: 27/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Critical
Reporter: sebg-crd-pm (Inactive) Assignee: Brian Murrell (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Environment:

RHEL 7.3, Lustre 2.10.0


Attachments: PNG File Selection_017.png     PNG File Selection_018.png    

 Description   

Hi All,

We have a problem using a stonith device in a Lustre HA (MGS/MDS) cluster with Pacemaker.

Our stonith device often reports "Time out" errors in its "Monitor" operation and then fails to start.

It eventually ends up in the "Stopped" state.

Please kindly give us suggestions for debugging this issue.

Thanks!

 Comments   
Comment by Andreas Dilger [ 07/Nov/17 ]

Since the Lustre MDS code runs in the kernel, the HA threads on the server can be starved under high load, in which case the current timeout may not be long enough. You might consider increasing the token timeout in the /etc/corosync/corosync.conf file.
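
For illustration, a minimal sketch of what such a change might look like, assuming a corosync 2.x style configuration as shipped with RHEL 7. The value of 10000 is a hypothetical example chosen for this sketch, not a figure recommended in this ticket; the token timeout is given in milliseconds, and all other settings in the existing totem section would stay as they are.

```
totem {
    # Existing settings in this section (version, cluster_name,
    # transport, ...) are left unchanged.

    # Token timeout in milliseconds. 10000 is an assumed example
    # value; choose one that suits the load on the servers.
    token: 10000
}
```

The file generally needs to be identical on every cluster node, and corosync has to reload or restart before a new value takes effect.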

Comment by Peter Jones [ 09/Nov/17 ]

Brian

Any additional advice to provide here?

Peter

Comment by Brian Murrell (Inactive) [ 09/Nov/17 ]

Is this a Pacemaker configuration that IML constructed or one that you built up yourself?

In any case, it looks like your fencing devices are not functioning properly.  It could be a configuration problem or unfortunately just par for the course for IPMI fencing devices.

Comment by Brian Murrell (Inactive) [ 20/Nov/17 ]

Hi sebg-crd-pm.  Do you have any more information you can add to this ticket or shall I close it?
