Lustre / LU-767

OSS takes over all resources and STONITHs its cluster partner without any warning

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • None
    • None
    • Lustre 1.8.0.1 running on Red Hat Enterprise Linux 5.3 with Linux High Availability Heartbeat version 2.1.4-4.1
    • 2
    • 6542

    Description

      We experienced an issue in which an OSS took over the resources of its cluster partner, then initiated a STONITH event and reset the partner, without any prior warning. According to the log files for the network, the MDS, and the OSSs, nothing appeared to be wrong with the OSS that was reset. We are trying to determine why this happened and would appreciate any assistance you can provide. Our ha.cf file is configured as follows, with unicast heartbeat packets sent only between the cluster partners:

      keepalive 6
      warntime 30
      deadtime 90
      initdead 180

      The keepalives are sent over two interfaces: one through an InfiniBand switch, and the other over a direct Ethernet connection using a crossover cable between the two nodes.
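
      To make the timing in the ha.cf excerpt concrete, here is a minimal Python sketch that converts each threshold into a count of keepalive intervals. The values are copied from the excerpt; the names HA_TIMERS, missed_keepalives, and summarize are hypothetical helpers for illustration and are not part of Heartbeat. It assumes the usual meaning of these directives: warntime is when a late-heartbeat warning is logged, deadtime is when the peer is declared dead, and initdead is the dead time applied at startup.

```python
# Timing values copied from the ha.cf excerpt in the description (seconds).
HA_TIMERS = {"keepalive": 6, "warntime": 30, "deadtime": 90, "initdead": 180}


def missed_keepalives(threshold_s, keepalive_s):
    """Return how many whole keepalive intervals fit into a threshold."""
    return threshold_s // keepalive_s


def summarize(timers):
    """Map each threshold (everything except keepalive) to an interval count."""
    keepalive = timers["keepalive"]
    return {
        name: missed_keepalives(value, keepalive)
        for name, value in timers.items()
        if name != "keepalive"
    }


if __name__ == "__main__":
    for name, count in summarize(HA_TIMERS).items():
        print(f"{name}: {HA_TIMERS[name]} s = {count} keepalive intervals")
```

      Under these assumptions, the peer would have to be silent for about 15 consecutive keepalive intervals before being declared dead, and a warning would normally be expected after about 5 of them, which may be worth checking against the Heartbeat logs on the surviving node.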

      Would upgrading to the most current Linux-HA version be likely to fix this issue, or could it cause other problems with the Lustre and Linux OS versions we are currently using?

      Any assistance you can provide would be greatly appreciated. Thank you.

            People

              cliffw Cliff White (Inactive)
              martindw1 David Martin (Inactive)