  Lustre / LU-1504

The /lustre filesystem was unusable for an extended period due to a single OST dropping out of service

Details

    • Task
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.0.0, Lustre 1.8.6
    • Clustering
    • 4043

    Description

      Hello Support,

      One of our customers, at the University of Delaware, has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service, with the following message logged:

      Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

      The hang was so bad in one instance (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while waiting for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself; sometimes it does not, and we need to reboot one or more OSS nodes.
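
      To gauge how often each OST refuses reconnections, the syslogs can be scanned for the message quoted above. The following is a minimal sketch, assuming every relevant line follows the exact format of that one sample; the names REFUSE_RE, parse_refusal and count_refusals are hypothetical and are not part of Lustre or its tooling.

```python
import re
from collections import Counter

# Assumption: the pattern is derived only from the single log line quoted in
# this ticket; other Lustre versions may word the message differently.
REFUSE_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+Lustre:.*?"
    r"target_handle_connect\(\)\)\s+(?P<target>\S+):\s+"
    r"refuse reconnection from\s+(?P<client>\S+)\s+to\s+\S+;\s+"
    r"still busy with\s+(?P<rpcs>\d+)\s+active RPCs"
)


def parse_refusal(line):
    """Return a dict of fields for a 'refuse reconnection' line, else None."""
    match = REFUSE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["rpcs"] = int(fields["rpcs"])
    return fields


def count_refusals(lines):
    """Count refused reconnections per (OSS host, OST target) pair."""
    counts = Counter()
    for line in lines:
        record = parse_refusal(line)
        if record is not None:
            counts[(record["host"], record["target"])] += 1
    return counts
```

      Passing the lines of an unpacked file such as oss4-messages to count_refusals would show whether the refusals cluster on one OST (lustre-OST0016 in the sample above) or are spread across several.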

      "Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

      You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

      "https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

      Full information for the drop-off:

      Claim ID: vuAFoSBUoReVuaje
      Claim Passcode: RfTmXJZFVdUGzbLk
      Date of Drop-Off: 2012-06-11 12:23:20-0400

      Please review the attached log files and advise on the next course of action, since this is a very critical issue that is impacting their environment. Also, please let me know if you need any further information.

      Thanks
      Terry
      Penguin Tech Support
      Ph: 415-954-2833

      Attachments

        1. lustre-failure-120619-1.gz
          21 kB
        2. mds0a-messages.gz
          5 kB
        3. oss4-messages.gz
          52 kB
        4. headnode-messages.gz
          623 kB
        5. oss3-vmstat.log
          9 kB

            People

              cliffw Cliff White (Inactive)
              adizon Archie Dizon