Details
-
Bug
-
Resolution: Fixed
-
Minor
-
None
-
None
-
None
-
Lustre 1.8.0.1 running on Red Hat Enterprise Linux 5.3 using Linux-HA Heartbeat version 2.1.4-4.1
-
2
-
6542
Description
We experienced an issue in which an OSS took over the resources of its cluster partner, then initiated a STONITH event and reset the partner, with no prior notification that this was going to happen. According to our log files for the network, the MDS, and the OSSs, nothing appeared to be wrong with the OSS that was reset. We are trying to determine why this happened and would appreciate any assistance you can provide. Our ha.cf file is configured as follows, with unicast packets sent only between the cluster partners:
keepalive 6
warntime 30
deadtime 90
initdead 180
The keepalives are sent over two interfaces: one goes through an InfiniBand switch, and the other is a direct Ethernet connection using a crossover cable between the two nodes.
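For reference, the timer values above can be expressed as a number of keepalive intervals. The following is only a minimal sketch in Python, assuming all four values are plain seconds as written; the `parse_timers` and `missed_keepalives` helpers are hypothetical names used for illustration and are not part of Heartbeat.

```python
# Sketch only: relate the ha.cf timers quoted above to the keepalive interval.
# Assumption: every value is a bare integer number of seconds.
HA_CF = """\
keepalive 6
warntime 30
deadtime 90
initdead 180
"""


def parse_timers(text):
    """Parse 'name seconds' lines into a dict of integer seconds."""
    timers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, value = line.split()
        timers[name] = int(value)
    return timers


def missed_keepalives(timers):
    """Return how many keepalive intervals fit into each threshold."""
    interval = timers["keepalive"]
    return {
        name: seconds // interval
        for name, seconds in timers.items()
        if name != "keepalive"
    }


timers = parse_timers(HA_CF)
intervals = missed_keepalives(timers)
print(intervals)  # {'warntime': 5, 'deadtime': 15, 'initdead': 30}
```

On the usual reading of these directives, this means roughly 15 consecutive keepalives would have to go unanswered before the partner is declared dead, which may help narrow down the window to examine in the logs.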
Would upgrading to the most current Linux-HA version be likely to remediate this issue, or do you think it would cause other problems given the Lustre and Linux OS versions we are currently using?
Any assistance you can provide would be greatly appreciated. Thank you.
Issue Links
- Trackbacks
-
Lustre 1.8.x known issues tracker
While testing against the Lustre b18 branch, we would hit known bugs that had already been reported in the Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA