[LU-8528] MDT lock callback timer expiration and evictions under light load

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical

    Description

      Running a 1000-node user job (call it "ben") on the jade cluster results in 'lock callback timer expired' messages on the MDS console, and transactions begin taking a very long time or fail entirely once the client is evicted.

      The first lock timeouts are seen within 5 minutes of starting the job.

      Even after the MDS stops responding, it is still up and debug logs can be dumped; I'll attach some.

      There is no evidence of network issues: the fabric in the compute cluster appears clean, the router nodes and compute nodes report no peers down, and initially the clients report good connections to the server. Network monitoring tools also indicate no network issues.
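      For reference, capturing the debug logs mentioned above on the MDS would look roughly like this (a sketch; the extra debug flags and output path are illustrative, not necessarily what was used on cider-mds1):

      # Sketch only: widen the Lustre debug mask, reproduce the timeouts,
      # then dump the kernel debug buffer to a file.
      lctl set_param debug=+dlmtrace              # add DLM lock tracing
      lctl set_param debug=+rpctrace              # add RPC tracing
      lctl set_param debug_mb=256                 # enlarge the debug buffer
      # ... reproduce the 'lock callback timer expired' messages ...
      lctl dk /tmp/dk.$(hostname).$(date +%s)     # dump and clear the debug log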

      Attachments

        1. 08-24.for_intel.tgz
          1.01 MB
        2. cider-mds1.console.1471978512
          15 kB
        3. console.jade2074
          18 kB
        4. dk.jade2119.1471973342
          88 kB
        5. ps.ef.jade2119.1471973574
          115 kB
        6. stacks.cider-mds1.1471973508
          455 kB

        Activity


          yong.fan nasf (Inactive) added a comment -

          That is related to SELinux.
          ofaaland Olaf Faaland added a comment -

          John, Fan,
          Yes, after disabling SELinux on jade, the job runs successfully, repeatedly. Sorry for the delay in responding; I was out Friday. You can mark this issue resolved.

          yong.fan nasf (Inactive) added a comment - edited

          Olaf, have you tried running with SELinux disabled, as John suggested? SELinux is suspected of causing your trouble. Even if there are other reasons, we still suggest disabling SELinux on jade, because your current system (Lustre 2.5.5 based) does not support SELinux. Please refer to LU-5560 for details.

          jhammond John Hammond added a comment -

          Hi Olaf, would it be possible to disable SELinux on jade and run again?

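          For reference, disabling SELinux across the jade clients would look roughly like this (a sketch; the pdsh host range is illustrative):

          # Sketch only: switch SELinux to permissive immediately, then make the
          # change persistent. The jade[...] node range is a placeholder.
          pdsh -w jade[1-1000] 'setenforce 0'
          pdsh -w jade[1-1000] "sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config"
          # A reboot is required for SELINUX=disabled to take full effect.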
          ofaaland Olaf Faaland added a comment -

          We have another compute cluster, catalyst, which mounts the same filesystem (cider/lsf/lscratchf).

          On catalyst I do not reproduce the problem even at 100 nodes x 8 processes per node:
          srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script

          There are many differences between catalyst and jade, but two notable ones are (a quick check is sketched below):
          OS: jade runs RHEL 7.2, catalyst runs RHEL 6.8
          SELinux: jade has SELinux enforcing, catalyst has it disabled
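          One way to confirm the OS and SELinux difference on a node from each cluster (a sketch; catalyst1 is a placeholder hostname):

          # Sketch only: report the distribution release and SELinux mode.
          for node in jade2074 catalyst1; do
              ssh "$node" 'cat /etc/redhat-release; getenforce'
          done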

          ofaaland Olaf Faaland added a comment - - edited

          Hello Fan,

          I have a probable reproducer:

          $ cat ~/projects/toss-3380/mkdir_script
          #!/usr/bin/bash
          mkdir -p /p/lscratchf/faaland1/mkdirp/${SLURM_JOBID}/a/b/c/d/e/f/g/$(hostname)/$$
          

          Run this way on jade (the compute cluster described above), it reproduces the symptoms of the problem, except that after the compute nodes are rebooted the server recovers on its own:
          srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script

          Run this way, it does not produce any symptoms; the directories are created quickly and the job succeeds:
          srun -ppbatch -N 100 -n 100 ~/projects/toss-3380/mkdir_script
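          One way to bracket the task count at which the symptoms appear would be a simple sweep (a sketch, reusing the script and partition above):

          #!/bin/bash
          # Sketch only: run the mkdir -p reproducer at increasing task counts
          # on a fixed 100-node allocation and time each run.
          for ntasks in 100 200 400 800; do
              echo "=== -N 100 -n ${ntasks} ==="
              time srun -ppbatch -N 100 -n "${ntasks}" ~/projects/toss-3380/mkdir_script
          done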

          ofaaland Olaf Faaland added a comment -

          The user reports that he has successfully run the same job at 1024 nodes on Sierra several times within the last few months. The Sierra cluster is at lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64. Sierra is running a RHEL 6.7 derivative (the same one as the Lustre servers, cider*).

          ofaaland Olaf Faaland added a comment -

          Hello Fan,
          There's a repo visible to Intel folk that Chris pushes to (I believe it is hosted by Intel). Peter and Andreas know the details; others might as well.
          thanks,
          Olaf

          yong.fan nasf (Inactive) added a comment -

          Where can I get the source code that you are testing, lustre-2.5.5-8chaos and lustre-2.5.5-9chaos? https://github.com/LLNL/lustre ? I cannot find the related tags.

          ofaaland Olaf Faaland added a comment -

          This same job run on 100 nodes x 18 cores/node produces the same warning messages on the MDS console, but does not cause the file system to hang (at least not consistently). At 1000 nodes x 18 cores/node it consistently causes the file system to hang.

          ofaaland Olaf Faaland added a comment -

          I attached data from the Aug 24 run I mentioned. It includes job_stats data from the MDS for the job, as well as consoles, stacks, and Lustre debug logs. It's called 08-24.for_intel.tgz.
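          For reference, the job_stats and stack data in that tarball can be gathered on the MDS roughly like this (a sketch; output paths are illustrative):

          # Sketch only: snapshot per-job metadata operation counts and dump
          # all task stacks (requires kernel.sysrq to be enabled).
          lctl get_param "mdt.*.job_stats" > /tmp/job_stats.$(hostname).$(date +%s)
          echo t > /proc/sysrq-trigger
          dmesg > /tmp/stacks.$(hostname).$(date +%s)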


          People

            yong.fan nasf (Inactive)
            ofaaland Olaf Faaland
            Votes: 0
            Watchers: 5
