[LU-10203] A stonith device 'Monitor' operation report "Time Out" in Lustre HA cluster with pacemaker Created: 07/Nov/17 Updated: 27/Feb/18 Resolved: 27/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Critical |
| Reporter: | sebg-crd-pm (Inactive) | Assignee: | Brian Murrell (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 7.3, Lustre 2.10.0 |
||
| Attachments: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi All, we have a problem using a stonith device in a Lustre HA (MGS/MDS) cluster with Pacemaker. The "Monitor" operation of our stonith device often reports "Time out" errors, after which the device fails to start and ends up in the "Stopped" state. Please give us suggestions for debugging this issue. Thanks!
|
| Comments |
| Comment by Andreas Dilger [ 07/Nov/17 ] |
|
Since the Lustre MDS code runs in the kernel, the HA threads running on the server can be starved under high load, in which case the current timeout is not long enough. You might consider increasing the token timeout in the /etc/corosync/corosync.conf file. |
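For illustration, the token timeout mentioned above lives in the totem section of /etc/corosync/corosync.conf. The fragment below is only a sketch: the value 10000 is an assumed example, not a figure recommended anywhere in this ticket, and the rest of the totem section (cluster name, transport, and so on) should be left as it already is on the cluster.

```
totem {
    version: 2
    # token is given in milliseconds. 10000 (10 s) is an assumed
    # example value; choose one that fits the load on the MDS nodes.
    token: 10000
}
```

The file needs to be identical on every cluster node, and corosync has to reload or restart before a new value takes effect.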
| Comment by Peter Jones [ 09/Nov/17 ] |
|
Brian, any additional advice to provide here? Peter |
| Comment by Brian Murrell (Inactive) [ 09/Nov/17 ] |
|
Is this a Pacemaker configuration that IML constructed or one that you built up yourself? In any case, it looks like your fencing devices are not functioning properly. It could be a configuration problem or, unfortunately, just par for the course for IPMI fencing devices. |
| Comment by Brian Murrell (Inactive) [ 20/Nov/17 ] |
|
Hi sebg-crd-pm. Do you have any more information you can add to this ticket or shall I close it? |