[LU-8528] MDT lock callback timer expiration and evictions under light load Created: 23/Aug/16  Updated: 29/Nov/16  Resolved: 29/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: nasf (Inactive)
Resolution: Not a Bug Votes: 0
Labels: llnl
Environment:

servers: cider cluster
rhel 6.8 derivative
lustre-2.5.5-8chaos_2.6.32_642.3.1.1chaos.ch5.5.x86_64.x86_64

clients: jade cluster
rhel 7.2 derivative
kernel-3.10.0-327.28.2.1chaos.ch6.x86_64
lustre-2.5.5-9chaos.2.ch6.x86_64


Attachments: File 08-24.for_intel.tgz     File cider-mds1.console.1471978512     File console.jade2074     File dk.jade2119.1471973342     File ps.ef.jade2119.1471973574     File stacks.cider-mds1.1471973508    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running a 1000-node user job (call it "ben") on the jade cluster results in 'lock callback timer expired' messages on the MDS console; transactions begin taking a very long time, or fail entirely once the client is evicted.

The first lock timeouts are seen within 5 minutes of starting the job.

Even after the MDS stops responding, it is still up and debug logs can be dumped; I'll attach some.

There is no evidence of network issues: the fabric in the compute cluster appears clean, the router nodes and compute nodes report no peers down, and initially the clients report good connections to the server. Network monitoring tools likewise show nothing wrong.



 Comments   
Comment by Olaf Faaland [ 23/Aug/16 ]

I overlooked setting the priority. This is critical; it is preventing us from doing early user testing of a new, large cluster.

Comment by Olaf Faaland [ 23/Aug/16 ]

Running mdtest at this scale, and larger scale, does not trigger this issue.

Comment by Olaf Faaland [ 23/Aug/16 ]

For reference when looking at the attached logs: the job was started on 8/23 at 10:26am.
I'll attach Lustre debug logs from a few randomly selected clients and from the MDS. I'll also attach stacks from the selected clients and from the MDS.
I don't know the details of what I/O the job attempts to do; we'll look into that.

Comment by Olaf Faaland [ 23/Aug/16 ]

From another cluster which also mounts lscratchf, 'df' started taking >3 minutes to return at 10:30am and only started responding normally again after the jade compute nodes had been powered off and the MDS had been crashed and recovered. That test runs every 5 minutes; at 10:25am 'df' still returned in under 3 minutes.
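
For context, that check is essentially a timed 'df' against the mount point, run every 5 minutes; a minimal sketch of the idea (the threshold, mount path, and script itself are illustrative, not our exact monitoring code):

#!/bin/bash
# Illustrative version of the periodic health check: time a df on the
# Lustre mount and report it if it takes longer than 3 minutes.
MOUNT=/p/lscratchf
START=$(date +%s)
df "$MOUNT" > /dev/null 2>&1
ELAPSED=$(( $(date +%s) - START ))
[ "$ELAPSED" -gt 180 ] && echo "$(date): df on $MOUNT took ${ELAPSED}s"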

Comment by Olaf Faaland [ 23/Aug/16 ]

jade2074 is the node with nid 192.168.136.42@o2ib24, which is called out in the MDS console log.

I've attached the console log from jade2074.

Unfortunately I don't have stacks or a lustre debug log from jade2074, and the node has been rebooted.

Comment by Olaf Faaland [ 23/Aug/16 ]

This is reproducible on this cluster/filesystem. We're working on reproducing the issue on a smaller cluster where we can leave the nodes in the bad state as long as necessary to gather data.

Comment by Olaf Faaland [ 24/Aug/16 ]

Hi, can someone take a look at this?
thanks,
Olaf

Comment by Peter Jones [ 24/Aug/16 ]

Fan Yong

Can you please assist with this ticket?

Thanks

Peter

Comment by Olaf Faaland [ 24/Aug/16 ]

Fan,

I'm about to run another test at a different scale, and I expect it will produce the same symptoms. Is there anything in particular you want me to gather? Right now I'm planning on collecting the following (rough commands are sketched after the list):

  • console logs from clients and servers that report anything Lustre-related after the job starts
  • lctl dk with +rpctrace, and stacks, from:
    • the MDS
    • the first client the MDT names in a 'lock callback timer expired' message
    • the first client that complains in its console log about anything Lustre-related after the job starts
  • list of downed peers from the router nodes
  • contents of /proc/fs/lustre/mdc/*/state from the clients that report anything in their console logs
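
A rough sketch of the collection commands I have in mind (the debug mask, dump paths, and the sysrq stack dump are my assumptions about how we'll gather this on Lustre 2.5; file names are placeholders):

# On the MDS and the selected clients: enable RPC tracing, then dump the debug log
lctl set_param debug=+rpctrace
lctl dk /tmp/lustre-dk.$(hostname).$(date +%s)

# Kernel stacks from each node of interest (lands in dmesg / on the console)
echo t > /proc/sysrq-trigger

# Import state from the affected clients
cat /proc/fs/lustre/mdc/*/state

# Peer status as seen from the router nodes
cat /proc/sys/lnet/peers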

The MDT Lustre debug logs will likely be too big to send you, so you'll need to tell me what you would like me to extract.

I can gather different information - let me know what is helpful.

thanks,
Olaf

Comment by Olaf Faaland [ 24/Aug/16 ]

I forgot to mention above: the client nodes and router nodes appear to be stable the entire time. I haven't observed any nodes OOM or crash, so I see no reason the MDT wouldn't be receiving responses to the requests it sends to the clients.

Comment by Olaf Faaland [ 25/Aug/16 ]

Tested the job at a slightly different scale on 8/24 and got the same symptoms.
Looking at the stacks on the MDS, I notice it has many threads in this stack:

[<ffffffffa0cf3dd5>] ptlrpc_wait_event+0x2d5/0x2e0 [ptlrpc]
[<ffffffffa0cfe31f>] ptlrpc_main+0x94f/0x1af0 [ptlrpc]
[<ffffffff810a6cbe>] kthread+0x9e/0xc0
[<ffffffff8100c2ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

which looks to me like they are idle, waiting for requests to arrive. Am I mistaken?
This confuses me because metadata operations such as statfs, or a stat of the filesystem root, from any client mounting the filesystem (not just those participating in the ben job), return after long delays (several minutes) or possibly never (meaning I gave up waiting). I had thought this was because the service threads were busy, but that seems not to be the case.
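
For reference, here is roughly how I'm tallying the threads from the attached stacks file (assuming the dump uses the one-frame-per-line format shown above; the file name is the attachment on this ticket):

# Count threads whose stack includes ptlrpc_wait_event (idle service threads)
grep -c ptlrpc_wait_event stacks.cider-mds1.1471973508

# Rough histogram of the functions appearing in the dump, to spot where the
# remaining threads are stuck
grep -o '\] [A-Za-z0-9_]*+' stacks.cider-mds1.1471973508 | sort | uniq -c | sort -rn | head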

Comment by Olaf Faaland [ 25/Aug/16 ]

I attached data from the Aug 24 run I mentioned. It includes job_stats data from the MDS for the job, as well as console logs, stacks, and Lustre debug logs. It's called 08-24.for_intel.tgz.

Comment by Olaf Faaland [ 25/Aug/16 ]

This same job run on 100 nodes x 18 cores/node produces the same warning messages on the console of the MDS, but does not cause the file system to hang (at least not consistently). At 1000 nodes x 18 cores/node it does consistently cause the file system to hang.

Comment by nasf (Inactive) [ 25/Aug/16 ]

Where can I get the source code you are testing (lustre-2.5.5-8chaos and lustre-2.5.5-9chaos)? https://github.com/LLNL/lustre? I cannot find the related tags there.

Comment by Olaf Faaland [ 25/Aug/16 ]

Hello Fan,
There's a repo visible to Intel folks that Chris pushes to (I believe it is hosted by Intel). Peter and Andreas know the details; others might as well.
thanks,
Olaf

Comment by Olaf Faaland [ 25/Aug/16 ]

The user reports he has successfully run the same job at 1024 nodes on Sierra, several times within the last few months. The Sierra cluster is at lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64. Sierra is running a RHEL 6.7 derivative (the same one as the Lustre servers, cider*).

Comment by Olaf Faaland [ 26/Aug/16 ]

Hello Fan,

I have a probable reproducer:

$ cat ~/projects/toss-3380/mkdir_script
#!/usr/bin/bash
mkdir -p /p/lscratchf/faaland1/mkdirp/${SLURM_JOBID}/a/b/c/d/e/f/g/$(hostname)/$$

Run this way on jade (the compute cluster described above), it reproduces the symptoms of the problem, except that after the compute nodes are rebooted the server recovers on its own:
srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script

Run this way on jade, it does not produce any symptoms; the directories are created quickly and the job succeeds:
srun -ppbatch -N 100 -n 100 ~/projects/toss-3380/mkdir_script

Comment by Olaf Faaland [ 26/Aug/16 ]

We have another compute cluster, catalyst, which mounts the same filesystem (cider/lsf/lscratchf).

On catalyst I do not reproduce the problem even at 100 nodes x 8 processes per node:
srun -ppbatch -N 100 -n 800 ~/projects/toss-3380/mkdir_script

There are many differences between catalyst and jade, but two of them are:
OS: jade runs RHEL 7.2, catalyst runs RHEL 6.8
SELinux: jade has SELinux enforcing, catalyst has it disabled

Comment by John Hammond [ 27/Aug/16 ]

Hi Olaf, would it be possible to disable SELinux on jade and run again?

Comment by nasf (Inactive) [ 29/Aug/16 ]

Olaf, have you tried with SELinux disabled, as John suggested? We suspect that SELinux is causing your trouble. Even if there are other reasons, we still suggest disabling SELinux on jade, because your current system (Lustre 2.5.5 based) does not support SELinux. Please refer to LU-5560 for details.
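
For reference, a minimal sketch of disabling SELinux on the RHEL 7.2 clients (setenforce only switches to permissive mode at runtime; the config change needs a reboot to fully disable it):

# Check the current SELinux mode
getenforce

# Switch to permissive immediately (does not persist across reboot)
setenforce 0

# Disable persistently, then reboot the node
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config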

Comment by Olaf Faaland [ 29/Aug/16 ]

John, Fan,
Yes, after disabling SELinux on jade, the job runs successfully, repeatedly. Sorry for the delay in responding; I was out Friday. You can mark this issue resolved.

Comment by nasf (Inactive) [ 29/Aug/16 ]

So this is related to SELinux.
