[LU-767] OSS takes over all resources and STONITHs its cluster partner without any warning Created: 17/Oct/11  Updated: 16/Feb/12  Resolved: 16/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: David Martin (Inactive) Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 1.8.0.1 running on Red Hat Enterprise Linux 5.3 using Linux High Availability Heartbeat version 2.1.4-4.1


Severity: 2
Rank (Obsolete): 6542

 Description   

We experienced an issue where an OSS took over the resources of its cluster partner and then initiated a STONITH event, resetting the partner without any warning that the event was going to happen. According to the log files for the network, the MDS, and the OSSs, there appeared to be nothing wrong with the OSS that was reset. We are trying to determine why this may have happened and would appreciate any assistance you can provide. The ha.cf file is configured as follows, sending unicast packets only between the cluster partners:

keepalive 6
warntime 30
deadtime 90
initdead 180

The keepalives are sent over two interfaces: one through an InfiniBand switch, the other over a direct Ethernet connection using a crossover cable between the two nodes.
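
For reference, a minimal ha.cf along those lines might look like the sketch below; the interface names (ib0, eth1), peer addresses, and node names are placeholders for illustration, not our actual values:

keepalive 6
warntime 30
deadtime 90
initdead 180
# unicast heartbeats only, one path per interface (placeholder devices and peer IPs)
ucast ib0 192.168.10.2
ucast eth1 10.0.0.2
# cluster members (placeholder host names)
node oss01
node oss02
logfile /var/log/ha-log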

Would upgrading to the most current Linux-HA version possibly remediate this issue, or do you think it would cause other problems given the Lustre and Linux OS versions we are currently using?

Any assistance you can provide would be greatly appreciated. Thank you.



 Comments   
Comment by Peter Jones [ 17/Oct/11 ]

Cliff

Could you please advise on this one?

Thanks

Peter

Comment by David Martin (Inactive) [ 17/Oct/11 ]

Thanks Peter!

Comment by Cliff White (Inactive) [ 17/Oct/11 ]

The current Linux-HA should be fine. Lustre requires nothing especially fancy from failover; we work with anything that can manage a Filesystem resource.

  • Did you have any other monitoring setup beyond the Heartbeat pinger?
  • Did you check the linux-ha logs? There should be an indication there as to what happened.
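
To be clear on what I mean by a Filesystem resource: in an R1-style (haresources) Heartbeat configuration it is just an entry per OST along the lines of the example below, where the node name, device, and mount point are made up for illustration.

oss01 Filesystem::/dev/mapper/ost0000::/mnt/lustre/ost0000::lustre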
Comment by David Martin (Inactive) [ 18/Oct/11 ]

Cliff,

We do not have any other monitoring setup beyond Heartbeat.

We did check the linux-ha logs and there was no indication of what happened. However, I'm not sure our logging level is set high enough to produce a sufficiently verbose log. Do you have a recommendation as to what level we should be logging at?

Dave

Comment by Cliff White (Inactive) [ 19/Oct/11 ]

I used to use level 3 as the default for debug logs; basically I would tail -f the log file and adjust according to how much goop is spewed.
How often is this failover occurring? Any other events going on at that time? Any indication of a network hiccup from other nodes?
If the Heartbeat pinger is the only monitoring, then either the pinger failed (which should put something in the logs) or somehow a takeover was ordered (which should also show in the logs).
Have you checked the system log on the node issuing the STONITH? There should be something there.
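
As a rough sketch only (these paths assume default file-based logging; adjust to match your ha.cf):

# in ha.cf on both nodes, raise the debug level and write debug output to a file
debug 3
debugfile /var/log/ha-debug
logfile /var/log/ha-log

# watch heartbeat activity live while the cluster is running
tail -f /var/log/ha-log /var/log/ha-debug

# on the node that issued the reset, look for STONITH and takeover messages
grep -iE 'stonith|takeover' /var/log/ha-log /var/log/messages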

Comment by Cliff White (Inactive) [ 25/Jan/12 ]

What is your current state? Is this still an issue?

Comment by David Martin (Inactive) [ 16/Feb/12 ]

Cliff,

We have not had a recurrence of this issue. We believe it may be a stability issue with Lustre 1.8.0.1 and Red Hat 5.3. We are looking at upgrading to 1.8.6.

Thanks.
Dave

Comment by Cliff White (Inactive) [ 16/Feb/12 ]

Okay, I will close this issue; please re-open if you have further information.
