Lustre / LU-8528

MDT lock callback timer expiration and evictions under light load

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • None
    • None
    • 3

    Description

      Running a 1000-node user job (call it "ben") on the jade cluster produces 'lock callback timer expired' messages on the MDS console. Transactions then begin taking a very long time, or fail entirely when the client is evicted.

      The first lock timeouts are seen within 5 minutes of starting the job.

      Even after the MDS stops responding, it is still up and debug logs can be dumped; I'll attach some.

      There is no evidence of network issues: the fabric in the compute cluster appears clean, the router nodes and compute nodes report no peers down, and initially the clients report good connections to the server. Network monitoring tools also indicate no network issues.
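
For triage, the symptom described above can be located in a saved console log with a short script. This is only a minimal sketch: it assumes the console output has been saved as plain text and matches only the substring quoted in this report ('lock callback timer expired'). The function names and the sample lines are hypothetical, and nothing is assumed about the rest of the console line format.

```python
# Substring quoted in the report; no other part of the line format is assumed.
MARKER = "lock callback timer expired"


def count_lock_timeouts(console_text: str) -> int:
    """Return how many lines of the console text contain the marker."""
    return sum(1 for line in console_text.splitlines() if MARKER in line)


def first_timeout_line(console_text: str):
    """Return the 1-based line number of the first matching line, or None."""
    for number, line in enumerate(console_text.splitlines(), start=1):
        if MARKER in line:
            return number
    return None


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the attached logs.
    sample = (
        "unrelated console line\n"
        "prefix lock callback timer expired suffix\n"
        "another unrelated line\n"
        "lock callback timer expired\n"
    )
    print(count_lock_timeouts(sample))
    print(first_timeout_line(sample))
```

Running it against a real console file would mean reading the file contents into a string first, for example with `open(path).read()`.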

      Attachments

        1. 08-24.for_intel.tgz
          1.01 MB
          Olaf Faaland
        2. cider-mds1.console.1471978512
          15 kB
          Olaf Faaland
        3. console.jade2074
          18 kB
          Olaf Faaland
        4. dk.jade2119.1471973342
          88 kB
          Olaf Faaland
        5. ps.ef.jade2119.1471973574
          115 kB
          Olaf Faaland
        6. stacks.cider-mds1.1471973508
          455 kB
          Olaf Faaland

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: ofaaland Olaf Faaland