[LU-8528] MDT lock callback timer expiration and evictions under light load Created: 23/Aug/16 Updated: 29/Nov/16 Resolved: 29/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | nasf (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | llnl |
| Environment: | servers: cider cluster; clients: jade cluster |
| Attachments: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running a 1000-node user job (call it "ben") on the jade cluster results in 'lock callback timer expired' messages on the MDS console, and transactions begin taking a very long time or failing entirely when the client is evicted. The first lock timeouts are seen within 5 minutes of starting the job, after which the MDS stops responding. The MDS is still up and debug logs can be dumped; I'll attach some. There is no evidence of network issues; the fabric in the compute cluster appears clean, the router nodes and compute nodes report no peers down, and initially the clients report good connections to the server. Network monitoring tools also indicate no network issues. |
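For anyone triaging a similar report, a minimal sketch of how this pattern shows up on the server side; the captured console log path is an assumption for illustration:
# Look for ldlm lock callback timeouts and the client evictions that follow.
dmesg | grep -E 'lock callback timer expired|evicting client'
# Or search a saved MDS console log (path is illustrative).
grep -E 'lock callback timer expired|evicting client' /var/log/conman/cider-mds.log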
| Comments |
| Comment by Olaf Faaland [ 23/Aug/16 ] |
|
I overlooked setting the priority. This is critical; it is preventing us from doing early user testing of a new, large cluster. |
| Comment by Olaf Faaland [ 23/Aug/16 ] |
|
Running mdtest at this scale, and at larger scales, does not trigger this issue. |
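For reference, a sketch of the kind of mdtest sweep run at this scale; the node count, test directory, and mdtest parameters below are illustrative assumptions, not the exact command used:
# Hypothetical mdtest run: 1000 nodes x 18 tasks/node against lscratchf.
# -d test directory, -n items per task, -i iterations, -u unique dir per task.
srun -N 1000 --ntasks-per-node=18 \
    mdtest -d /p/lscratchf/faaland1/mdtest -n 1000 -i 3 -u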
| Comment by Olaf Faaland [ 23/Aug/16 ] |
|
For reference when looking at the attached logs: the job was started on 8/23 at 10:26am. |
| Comment by Olaf Faaland [ 23/Aug/16 ] |
|
From another cluster which also mounts lscratchf, 'df' started taking >3 minutes to return at 10:30am, and it started responding normally only after the jade compute nodes had been powered off and the MDS had been crashed and recovered. That test runs every 5 minutes; at 10:25am 'df' took < 3 minutes. |
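A minimal sketch of the kind of periodic responsiveness probe described; the mount point is taken from the reproducer script later in this ticket, and the threshold and cadence are assumptions:
#!/usr/bin/bash
# Time a statfs on the Lustre mount and flag anything slower than 3 minutes.
# Intended to run from cron every 5 minutes (assumption).
start=$(date +%s)
timeout 600 df /p/lscratchf > /dev/null
elapsed=$(( $(date +%s) - start ))
if [ "$elapsed" -gt 180 ]; then
    logger "df on /p/lscratchf took ${elapsed}s (> 3 minutes)"
fi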
| Comment by Olaf Faaland [ 23/Aug/16 ] |
|
jade2074 is the node with nid 192.168.136.42@o2ib24 which is called out in the MDS console log. I've attached the console log from jade2074. Unfortunately I don't have stacks or a lustre debug log from jade2074, and the node has been rebooted. |
| Comment by Olaf Faaland [ 23/Aug/16 ] |
|
This is reproducible on this cluster/filesystem. We're working on reproducing the issue on a smaller cluster where we can leave the nodes in the bad state as long as necessary to gather data. |
| Comment by Olaf Faaland [ 24/Aug/16 ] |
|
Hi, can someone take a look at this? |
| Comment by Peter Jones [ 24/Aug/16 ] |
|
Fan Yong, can you please assist with this ticket? Thanks, Peter |
| Comment by Olaf Faaland [ 24/Aug/16 ] |
|
Fan, I'm about to run another test at a different scale, but I expect it will likely produce the same symptoms. Is there anything in particular you want me to gather? Right now, I'm planning on collecting this:
The MDT Lustre debug logs will likely be too big to send you, so you'll need to tell me what you would like me to extract. I can gather different information; let me know what is helpful. Thanks, |
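For context, a minimal sketch of the server-side collection on the MDS, assuming standard Lustre 2.5 lctl parameters; the output paths are illustrative:
# Widen the debug mask to include lock and RPC tracing.
lctl set_param debug=+dlmtrace
lctl set_param debug=+rpctrace
# Dump all thread stacks to the console/dmesg.
echo t > /proc/sysrq-trigger
# Dump the Lustre kernel debug buffer.
lctl dk > /tmp/lustre-debug.$(hostname).log
# Grab per-job stats for the MDT.
lctl get_param mdt.*.job_stats > /tmp/job_stats.$(hostname).txt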
| Comment by Olaf Faaland [ 24/Aug/16 ] |
|
I forgot to specify above, but the client nodes and router nodes appear to be stable the entire time. I haven't observed any nodes OOM or crash, so I see no reason the MDT wouldn't be receiving responses to the requests it sends to the clients. |
| Comment by Olaf Faaland [ 25/Aug/16 ] |
|
Tested the job at a slightly different scale on 8/24 and got the same symptoms. The MDS service threads have stacks like this:
[<ffffffffa0cf3dd5>] ptlrpc_wait_event+0x2d5/0x2e0 [ptlrpc]
[<ffffffffa0cfe31f>] ptlrpc_main+0x94f/0x1af0 [ptlrpc]
[<ffffffff810a6cbe>] kthread+0x9e/0xc0
[<ffffffff8100c2ca>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
which looks to me like they are idle, waiting for requests to arrive. Am I mistaken? |
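As a rough check, the idle-vs-total ratio can be pulled from a saved sysrq-t dump; the file name below is an assumption:
# Threads parked in the request wait loop (idle service threads).
grep -c 'ptlrpc_wait_event' /tmp/sysrq-t.mds.txt
# Total ptlrpc service threads present in the dump.
grep -c 'ptlrpc_main' /tmp/sysrq-t.mds.txt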
| Comment by Olaf Faaland [ 25/Aug/16 ] |
|
I attached data from the Aug 24 run I mentioned. It includes job_stats data from the MDS for the job, as well as consoles, stacks, and lustre debug logs. It's called 08-24.for_intel.tgz |
| Comment by Olaf Faaland [ 25/Aug/16 ] |
|
This same job run on 100 nodes x 18 cores/node produces the same warning messages on the console of the MDS, but does not cause the file system to hang (at least not consistently). At 1000 nodes x 18 cores/node it does consistently cause the file system to hang. |
| Comment by nasf (Inactive) [ 25/Aug/16 ] |
|
Where can I get the source code that you are testing (lustre-2.5.5-8chaos and lustre-2.5.5-9chaos)? https://github.com/LLNL/lustre ? I cannot find the related tags. |
| Comment by Olaf Faaland [ 25/Aug/16 ] |
|
Hello Fan, |
| Comment by Olaf Faaland [ 25/Aug/16 ] |
|
The user reports he has successfully run the same job at 1024 nodes on Sierra, several times within the last few months. The Sierra cluster is at lustre-2.5.5-6chaos_2.6.32_573.26.1.1chaos.ch5.4.x86_64.x86_64. Sierra is running a RHEL 6.7 derivative (the same one as the Lustre servers cider*). |
| Comment by Olaf Faaland [ 26/Aug/16 ] |
|
Hello Fan, I have a probable reproducer:
$ cat ~/projects/toss-3380/mkdir_script
#!/usr/bin/bash
mkdir -p /p/lscratchf/faaland1/mkdirp/${SLURM_JOBID}/a/b/c/d/e/f/g/$(hostname)/$$
Run this way on jade (the compute cluster described above), it reproduces the symptoms of the problem, except that after the compute nodes are rebooted the server recovers on its own:

Run this way on jade, it does not produce any symptoms; the directories are created quickly and the job succeeds: |
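For context, a hypothetical Slurm launch of the script above; the node count and tasks-per-node are illustrative only and are not the actual invocations referred to above:
# Hypothetical launch of the mkdir script across the jade compute nodes.
# 1000 nodes x 18 tasks/node mirrors the job scale discussed earlier, but this
# is not the exact command from the runs above.
srun -N 1000 --ntasks-per-node=18 ~/projects/toss-3380/mkdir_script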
| Comment by Olaf Faaland [ 26/Aug/16 ] |
|
We have another compute cluster, catalyst, which mounts the same filesystem (cider/lsf/lscratchf). On catalyst I do not reproduce the problem even at 100 nodes x 8 processes per node:

There are many differences between catalyst and jade, but two differences are: |
| Comment by John Hammond [ 27/Aug/16 ] |
|
Hi Olaf, would it be possible to disable SELinux on jade and run again? |
| Comment by nasf (Inactive) [ 29/Aug/16 ] |
|
Olaf, have you tried with SELinux disabled, as John suggested? It is suspected that SELinux caused your trouble. Even if there might be other reasons, we still suggest you disable SELinux on Jade, because your current system (Lustre 2.5.5 based) does not support SELinux. Please refer to |
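For reference, a minimal sketch of checking and disabling SELinux on a client node using standard RHEL tooling; the persistent change only takes effect after a reboot:
# Check the current SELinux mode on a client node.
getenforce
# Switch to permissive immediately (does not survive a reboot).
setenforce 0
# Disable persistently, then reboot the node.
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config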
| Comment by Olaf Faaland [ 29/Aug/16 ] |
|
John, Fan, |
| Comment by nasf (Inactive) [ 29/Aug/16 ] |
|
That is related to SELinux. |