[LU-11092] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222] Created: 20/Jun/18 Updated: 21/Jan/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Julien Wallior | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
SLES12SP3 – 4.4.132-94.33-default client – lustre 2.10.4 |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I'm running a robinhood scan on a Lustre filesystem with 80M files. As the system comes under memory pressure (80M files at ~1kB per lustre_inode_cache entry adds up) and tries to shrink the slabs, I get these errors in /var/log/messages:

2018-06-20T10:49:34.165875-04:00 btsotbal3000 kernel: [ 4889.674359] Lustre: lustre-MDT0000-mdc-ffff883fd53bb000: Connection to lustre-MDT0000 (at 10.11.201.11@o2ib) was lost; in progress operations using this service will wait for recov

At that point, all the CPUs go to 100% and it seems like we are no longer making progress. I can reproduce by running robinhood for a few minutes and then dropping the cache (echo 2 > /proc/sys/vm/drop_caches). I managed to work around it by limiting lru_max_age to 10s (ldlm.namespaces.*.lru_max_age).
Overall, it feels like when the slab shrinker runs, if there are too many locks, reclaiming those locks takes a while, prevents the RPC stack from making progress, and then we hit a race condition in LNetMDUnlink.
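To spell out the repro/workaround commands above (a sketch; lru_max_age is assumed to be in milliseconds, consistent with the default of 3900000 quoted later in this ticket):

  # run the robinhood scan for a few minutes, then force slab reclaim on the client
  echo 2 > /proc/sys/vm/drop_caches

  # workaround: age cached DLM locks out after 10s instead of the default ~65min
  lctl set_param ldlm.namespaces.*.lru_max_age=10000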
|
| Comments |
| Comment by Andreas Dilger [ 08/Aug/18 ] |
|
There is a patch under development that might help with this - it reduces the number of client RPCs in flight when the client is not getting timely responses from the servers. This helps reduce the load on the servers and avoids timeouts on the clients. |
| Comment by Andreas Dilger [ 13/Aug/18 ] |
|
There are patches for master and b2_10 in |
| Comment by Julien Wallior [ 14/Aug/18 ] |
|
We tried with the patch from that ticket. Overall, since we lowered lru_max_age, we haven't had any issues related to this bug. |
| Comment by Andreas Dilger [ 15/Aug/18 ] |
|
When you run without the reduced lru_max_age, how many locks accumulate on the clients/MDS? This can be seen on the clients and servers with lctl get_param ldlm.namespaces.*MDT*.lock_count. |
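For example (a sketch; run on both the robinhood client and the MDS):

  # number of cached MDT DLM locks per namespace
  lctl get_param ldlm.namespaces.*MDT*.lock_count

  # optionally, all namespaces plus the current LRU limits
  lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size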
| Comment by Andreas Dilger [ 15/Aug/18 ] |
|
Is it the MDS or the client that is running short of memory? It looks like the client is getting timeouts on the MDS RPCs, so finding out where the memory pressure is will determine what to investigate/fix. In addition to the lock_count, please provide the output of /proc/meminfo and /proc/slabinfo on the RBH client and MDS. |
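A simple way to capture those (a sketch; the output file names are just examples, and /proc/slabinfo requires root):

  # run on both the RBH client and the MDS, then attach the resulting files
  cat /proc/meminfo  > /tmp/$(hostname)-meminfo.txt
  cat /proc/slabinfo > /tmp/$(hostname)-slabinfo.txt
  lctl get_param ldlm.namespaces.*.lock_count > /tmp/$(hostname)-lock_count.txt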
| Comment by Julien Wallior [ 15/Aug/18 ] |
|
When I revert lru_max_age to the default (3900000), I only do it on the client, and I can reproduce the problem (this is a live production filesystem, so I can only break it so much). Regarding who's running out of memory, I think it's the client: I can either wait for it to run out of memory, or trigger the issue earlier by dropping the cache on the client. The client has 256GB of RAM, so even for an 80M-file filesystem, I would hope that is enough to run rbh. Finally, regarding the timeouts, they say "@@@ Request sent has timed out for sent delay". I was thinking that meant the client was stuck on something (clearing memory) and didn't get to send the request in time, but maybe it means the client can't put the request on the wire because the server told it to wait (I'm not sure if that's possible). The client was freshly rebooted before starting the run. After running rbh --scan for ~20min, I got the following results. I can get the data on the OSSes too if you need them. mds832.txt |
| Comment by Andreas Dilger [ 16/Aug/18 ] |
|
It might be that, with so much RAM on the client, it isn't trying very hard to cancel locks. Setting the lock max age to 10s is probably too short to be useful for other locks on the system (e.g. directories that are being accessed repeatedly). Another option would be to set a limit on the number of locks on the client and increase the max age to something more reasonable, like 10 minutes, e.g. lctl set_param ldlm.namespaces.*.lru_size=100000 ldlm.namespaces.*.lru_max_age=600000 or similar.

If we assumed all of the RAM in use on the client was used by Lustre, the 1M inodes cached would be using about 13KB/inode (including 5M locks), which means it would consume at most 13MB per OST for the inode/lock cache. It may be that you are seeing more locks than inodes if, e.g., you have a default stripe count of 4 (one MDT lock and 4 OST locks per file).

It probably makes sense to put some reasonable upper limit on the LRU size, since I've seen similar problems with too many cached locks on other systems with a lot of RAM. Even if there is enough free RAM, it just slows things down when many millions of files are cached for no real reason. We might also need to take a look at the kernel slab cache shrinker to see if it is working effectively to shrink the DLM cache size under memory pressure. |
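As a concrete sketch of that suggestion (values as above; lru_max_age is in milliseconds, so 600000 = 10 minutes):

  # cap the client lock LRU and age locks out after 10 minutes
  lctl set_param ldlm.namespaces.*.lru_size=100000 ldlm.namespaces.*.lru_max_age=600000

  # verify the settings and watch the effect on the cached lock count
  lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age
  lctl get_param ldlm.namespaces.*.lock_count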
| Comment by Julien Wallior [ 17/Aug/18 ] |
|
Yes, we use stripe count > 1 in different directory. I tried lru_size = 100k and lru_max_age = 600k and that works fine too. |