[LU-767] OSS takes over all resources and STONITHs its cluster partner without any warning Created: 17/Oct/11 Updated: 16/Feb/12 Resolved: 16/Feb/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | David Martin (Inactive) | Assignee: | Cliff White (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre 1.8.0.1 running on Red Hat Enterprise Linux 5.3 using Linux-HA Heartbeat version 2.1.4-4.1 |
||
| Severity: | 2 |
| Rank (Obsolete): | 6542 |
| Description |
|
We experienced an issue where an OSS took over the resources of its cluster partner and then initiated a STONITH event, resetting the partner without any notification that the event was going to happen. According to our log files for the network, the MDS, and the OSSs, there appeared to be nothing wrong with the OSS that was reset. We are trying to determine why this may have happened and would like to request any assistance you may be able to provide.
We have the ha.cf file configured to send unicast packets only between the cluster partners, with "keepalive 6". The keepalives are sent through two interfaces: one goes through an InfiniBand switch and the other is a direct Ethernet connection using a crossover cable between the two nodes (see the sketch below).
Would upgrading to the most current Linux-HA version possibly remediate this issue, or do you think it would cause other issues given the current Lustre and Linux OS versions we are using? Any assistance you can provide would be greatly appreciated. Thank you.
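A minimal ha.cf sketch of that layout (interface names, addresses, node names, and timeouts are hypothetical; only the keepalive value is from our actual file):
# illustrative only; real addresses, node names, and timeouts are site-specific
keepalive 6
deadtime 60
warntime 30
initdead 120
ucast eth1 192.168.1.2    # crossover Ethernet link to the partner
ucast ib0 10.0.0.2        # IPoIB path through the InfiniBand switch
auto_failback off
node oss01
node oss02
|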
| Comments |
| Comment by Peter Jones [ 17/Oct/11 ] |
|
Cliff, could you please advise on this one? Thanks, Peter |
| Comment by David Martin (Inactive) [ 17/Oct/11 ] |
|
Thanks Peter! |
| Comment by Cliff White (Inactive) [ 17/Oct/11 ] |
|
The current Linux-HA should be fine. Lustre requires nothing especially fancy from failover; we work with anything that can manage a Filesystem resource.
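For instance, under Heartbeat's classic haresources mode, a Lustre OST can be handed to the stock Filesystem agent; the node name, device, and mount point below are hypothetical:
# hypothetical haresources entry: node-name Filesystem::device::mountpoint::fstype
oss01 Filesystem::/dev/mapper/ost0::/mnt/lustre/ost0::lustre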
|
| Comment by David Martin (Inactive) [ 18/Oct/11 ] |
|
Cliff, we do not have any other monitoring set up beyond Heartbeat. We did check the Linux-HA logs, and there was no indication of what happened. However, I'm not sure our logging level is set high enough to produce a sufficiently verbose log. Do you have a recommendation as to what level we should be logging at? Dave |
| Comment by Cliff White (Inactive) [ 19/Oct/11 ] |
|
I used to use level 3 as the default for debug logs; basically I would tail -f the log file and adjust according to how much goop is spewed.
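A sketch of the corresponding ha.cf logging directives (the level and paths below are conventional defaults, not taken from this site's configuration):
# illustrative logging settings; raise or lower the level based on how noisy the output is
debug 3
debugfile /var/log/ha-debug
logfile /var/log/ha-log
Then watch the output with "tail -f /var/log/ha-debug".
|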
| Comment by Cliff White (Inactive) [ 25/Jan/12 ] |
|
What is your current state? Is this still an issue? |
| Comment by David Martin (Inactive) [ 16/Feb/12 ] |
|
Cliff, we have not had a recurrence of this issue. We believe it may be a stability issue with 1.8.0.1 and Red Hat 5.3. We are looking at upgrading to 1.8.6. Thanks. |
| Comment by Cliff White (Inactive) [ 16/Feb/12 ] |
|
Okay, I will close this issue; please re-open if you have further information. |