Lustre / LU-1394

lctl ping fails with EIO on freshly restarted node

Details

    • Bug
    • Resolution: Low Priority
    • Major
    • None
    • Lustre 2.3.0, Lustre 2.1.1
    • None
    • 3
    • 4035

    Description

      I have two nodes, "oss1" and "mds1". lctl ping from oss1 to mds1 works without problems, which shows that the basic network setup is correct.

      If I kill mds1 (i.e. pull the power cord), boot it back up, and start LNET on it, the first lctl ping from oss1 to mds1 fails with EIO. A subsequent lctl ping succeeds.
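
      Until LNET handles this case itself, a caller could paper over the first failure by retrying. The following is only a minimal sketch of that workaround, not a fix: the function names, the retry count, the delay, and the NID "mds1@tcp" are illustrative assumptions, and only the `lctl ping <nid>` invocation itself comes from this report.

```python
import subprocess
import time


def lctl_ping(nid, run=subprocess.run):
    """Run `lctl ping <nid>` once and report whether it exited with status 0."""
    result = run(["lctl", "ping", nid], capture_output=True, text=True)
    return result.returncode == 0


def ping_with_retry(nid, attempts=2, delay=1.0, run=subprocess.run, sleep=time.sleep):
    """Return the 1-based number of the attempt that succeeded, or 0 if none did.

    `attempts` and `delay` are arbitrary illustrative defaults; `run` and
    `sleep` are injectable so the logic can be exercised without a Lustre node.
    """
    for attempt in range(1, attempts + 1):
        if lctl_ping(nid, run=run):
            return attempt
        if attempt < attempts:
            sleep(delay)  # brief pause before trying again
    return 0
```

      On oss1, `ping_with_retry("mds1@tcp")` (hypothetical NID) would be expected to return 2 in the scenario described here, since the first ping after the restart fails and the second succeeds.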

      In terms of network traffic, three TCP sessions from oss1 to mds1 are established one after another when this happens: the first is opened and then closed, the second is opened and then closed, and the third is opened and left open. The first two exchange a few packets each and close down gracefully. The third is much longer lived: it carries both the failed and the subsequent successful lctl ping, and it stays open afterwards.

      This problem is clearly not an artefact of TCP or IP but of LNET itself, so LNET ought to be able to handle this situation more gracefully.

          People

            doug Doug Oucharek (Inactive)
            brian Brian Murrell (Inactive)