[LU-8528] MDT lock callback timer expiration and evictions under light load

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical

    Description

      Running a 1000-node user job (call it "ben") on the jade cluster results in 'lock callback timer expired' messages on the MDS console, and transactions begin taking a very long time or fail entirely once the client is evicted.

      The first lock timeouts are seen within 5 minutes of starting the job.

      Even after the MDS stops responding, it is still up and debug logs can be dumped; I'll attach some.

      There is no evidence of network issues: the fabric in the compute cluster appears clean, the router nodes and compute nodes report no peers down, and initially the clients report good connections to the server. Network monitoring tools also indicate no network issues.
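      For reference, capturing the debug logs mentioned above on the MDS would look roughly like this (a sketch; the extra debug flags and output path are illustrative, not necessarily what was used on cider-mds1):

      # Sketch only: widen the Lustre debug mask, reproduce the timeouts,
      # then dump the kernel debug buffer to a file.
      lctl set_param debug=+dlmtrace              # add DLM lock tracing
      lctl set_param debug=+rpctrace              # add RPC tracing
      lctl set_param debug_mb=256                 # enlarge the debug buffer
      # ... reproduce the 'lock callback timer expired' messages ...
      lctl dk /tmp/dk.$(hostname).$(date +%s)     # dump and clear the debug log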

      Attachments

        1. 08-24.for_intel.tgz
          1.01 MB
        2. cider-mds1.console.1471978512
          15 kB
        3. console.jade2074
          18 kB
        4. dk.jade2119.1471973342
          88 kB
        5. ps.ef.jade2119.1471973574
          115 kB
        6. stacks.cider-mds1.1471973508
          455 kB

        Activity


          yong.fan nasf (Inactive) added a comment -

          That is related to SELinux.
          ofaaland Olaf Faaland added a comment -

          John, Fan,
          Yes, after disabling SELinux on jade, the job runs successfully, repeatedly. Sorry for the delay in responding; I was out Friday. You can mark this issue resolved.

          yong.fan nasf (Inactive) added a comment - edited

          Olaf, have you tried running with SELinux disabled, as John suggested? SELinux is suspected of causing your trouble. Even if there are other reasons, we still suggest disabling SELinux on jade, because your current system (Lustre 2.5.5 based) does not support SELinux. Please refer to LU-5560 for details.

          jhammond John Hammond added a comment -

          Hi Olaf, would it be possible to disable SELinux on jade and run again?

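          For reference, disabling SELinux across the jade clients would look roughly like this (a sketch; the pdsh host range is illustrative):

          # Sketch only: switch SELinux to permissive immediately, then make the
          # change persistent. The jade[...] node range is a placeholder.
          pdsh -w jade[1-1000] 'setenforce 0'
          pdsh -w jade[1-1000] "sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config"
          # A reboot is required for SELINUX=disabled to take full effect.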
          ofaaland Olaf Faaland added a comment -

          We have another compute cluster, catalyst, which mounts the same filesystem (cider/lsf/lscratchf).

          On catalyst I do not reproduce the problem even at 100 nodes x 8 processes per node:
          srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script

          There are many differences between catalyst and jade, but two notable ones are (a quick check is sketched below):
          OS: jade runs RHEL 7.2, catalyst runs RHEL 6.8
          SELinux: jade has SELinux enforcing, catalyst has it disabled
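          One way to confirm the OS and SELinux difference on a node from each cluster (a sketch; catalyst1 is a placeholder hostname):

          # Sketch only: report the distribution release and SELinux mode.
          for node in jade2074 catalyst1; do
              ssh "$node" 'cat /etc/redhat-release; getenforce'
          done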

          ofaaland Olaf Faaland added a comment - - edited

          Hello Fan,

          I have a probable reproducer:

          $ cat ~/projects/toss-3380/mkdir_script
          #!/usr/bin/bash
          mkdir -p /p/lscratchf/faaland1/mkdirp/${SLURM_JOBID}/a/b/c/d/e/f/g/$(hostname)/$$
          

          Run this way on jade (the compute cluster described above), it reproduces the symptoms of the problem, except that after the compute nodes are rebooted the server recovers on its own:
          srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script

          Run this way, it does not produce any symptoms; the directories are created quickly and the job succeeds:
          srun -ppbatch -N 100 -n 100 ~/projects/toss-3380/mkdir_script
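          One way to bracket the task count at which the symptoms appear would be a simple sweep (a sketch, reusing the script and partition above):

          #!/bin/bash
          # Sketch only: run the mkdir -p reproducer at increasing task counts
          # on a fixed 100-node allocation and time each run.
          for ntasks in 100 200 400 800; do
              echo "=== -N 100 -n ${ntasks} ==="
              time srun -ppbatch -N 100 -n "${ntasks}" ~/projects/toss-3380/mkdir_script
          done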

          ofaaland Olaf Faaland added a comment -

          The user reports that he has successfully run the same job at 1024 nodes on Sierra several times within the last few months. The Sierra cluster is at lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64. Sierra is running a RHEL 6.7 derivative (the same one as the Lustre servers, cider*).

          ofaaland Olaf Faaland added a comment -

          Hello Fan,
          There's a repo visible to Intel folk that Chris pushes to (I believe it is hosted by Intel). Peter and Andreas know the details; others might as well.
          thanks,
          Olaf

          yong.fan nasf (Inactive) added a comment -

          Where can I get the source code that you are testing, lustre-2.5.5-8chaos and lustre-2.5.5-9chaos? https://github.com/LLNL/lustre ? I cannot find the related tags.

          ofaaland Olaf Faaland added a comment -

          This same job run on 100 nodes x 18 cores/node produces the same warning messages on the MDS console, but does not cause the file system to hang (at least not consistently). At 1000 nodes x 18 cores/node it consistently causes the file system to hang.

          ofaaland Olaf Faaland added a comment -

          I attached data from the Aug 24 run I mentioned. It includes job_stats data from the MDS for the job, as well as consoles, stacks, and Lustre debug logs. It's called 08-24.for_intel.tgz.
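          For reference, the job_stats and stack data in that tarball can be gathered on the MDS roughly like this (a sketch; output paths are illustrative):

          # Sketch only: snapshot per-job metadata operation counts and dump
          # all task stacks (requires kernel.sysrq to be enabled).
          lctl get_param "mdt.*.job_stats" > /tmp/job_stats.$(hostname).$(date +%s)
          echo t > /proc/sysrq-trigger
          dmesg > /tmp/stacks.$(hostname).$(date +%s)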


          People

            yong.fan nasf (Inactive)
            ofaaland Olaf Faaland
            Votes: 0
            Watchers: 5
