[LU-6077] MDS OOM Created: 31/Dec/14 Updated: 01/Jun/15 Resolved: 01/Jun/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16909 |
| Description |
|
We have had a number of crashes with the MDS OOMing, with the ldlm_locks slab using most of the memory. Attached you'll find console logs and a back trace. An excerpt from the attached crash session:

TOTAL SWAP  500013  1.9 GB
----
crash> kmem -s |
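For reference, the same slab usage can also be checked on a live MDS without the crash utility. A minimal sketch using the standard procfs interface (exact column layout varies by kernel):

# Show the ldlm_locks slab cache (objects in use / total, object size, pages)
grep ldlm_locks /proc/slabinfo

# Or dump the slab cache statistics once and show the top entries
slabtop -o | head -20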
| Comments |
| Comment by John Fuchs-Chesney (Inactive) [ 01/Jan/15 ] |
|
Niu, Thanks, |
| Comment by Niu Yawei (Inactive) [ 04/Jan/15 ] |
|
1. I see lots of network errors in the log:

<4>LNet: 2036:0:(o2iblnd_cb.c:2348:kiblnd_passive_connect()) Conn stale 10.151.28.220@o2ib [old ver: 12, new ver: 12]
<4>LNet: 2036:0:(o2iblnd_cb.c:2348:kiblnd_passive_connect()) Conn stale 10.151.49.230@o2ib [old ver: 12, new ver: 12]
<4>LNet: 2036:0:(o2iblnd_cb.c:2348:kiblnd_passive_connect()) Skipped 1 previous similar message
<4>LNet: 2036:0:(o2iblnd_cb.c:2348:kiblnd_passive_connect()) Conn stale 10.151.49.233@o2ib [old ver: 12, new ver: 12]
<4>LNet: 2036:0:(o2iblnd_cb.c:2348:kiblnd_passive_connect()) Skipped 2 previous similar messages
<4>Lustre: MGS: haven't heard from client 115bc340-65eb-e4c8-5212-3d07e8fe9c9b (at 10.151.46.238@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff880432122c00, cur 1419472730 expire 1419472580 last 1419472503

You should probably check that the network is working properly first.

2. Do you have any special patches applied on 2.4.3?

3. I'm afraid the ldlm pool shrink mechanism can't work well under a heavy workload. Could you try disabling lru_resize to see if the OOM can be resolved? (See Lustre manual 32.8, Configuring Locking.) |
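For illustration, a basic network check from the MDS along the lines suggested above could look like the following. This is only a sketch: the NIDs are taken from the log lines quoted in the comment, and /proc/sys/lnet is the 2.4-era interface.

# Verify LNet reachability to a few of the client NIDs seen in the log
lctl ping 10.151.28.220@o2ib
lctl ping 10.151.46.238@o2ib

# List known LNet peers and their state
cat /proc/sys/lnet/peers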
| Comment by Peter Jones [ 05/Jan/15 ] |
|
Niu, the NASA tree is on github - https://github.com/jlan/lustre-nas. NASA will have to advise as to the exact version in use. Peter |
| Comment by Jay Lan (Inactive) [ 05/Jan/15 ] |
|
Service160 was running 2.4.3-8nasS. The tag corresponds to |
| Comment by Mahmoud Hanafi [ 05/Jan/15 ] |
|
The network errors you pointed out are normal; we see those all the time. We have a large number of nodes that are sometimes rebooted after a job. The documentation is not very clear: do we run this on every client? If we have clients with different numbers of CPUs, how do we deal with that? What are the side effects of disabling lru_resize? |
| Comment by Niu Yawei (Inactive) [ 06/Jan/15 ] |
|
Thank you, Jay. I didn't see any suspicious commit in the log.
Yes, you have to run this on every client. You can use a script to get the NR_CPU on each client and then set the lru_size accordingly, or you can just use an average value for all clients.
When lru_resize is enabled, each client has a dynamic ldlm cache size; the number of cached locks for each client depends on the workload and on the memory of the client/server (an active client can cache more locks, an idle client caches fewer). When lru_resize is disabled, each client can cache at most lru_size (NR_CPU * 100) ldlm locks. |
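A sketch of the per-client step, following the NR_CPU * 100 sizing described above and the lru_size tunable from the manual section cited earlier (run on each client; the namespace glob can be narrowed, e.g. to *osc*, if desired):

# Disable dynamic LRU resizing by pinning a static ldlm LRU size on this client
NR_CPU=$(grep -c ^processor /proc/cpuinfo)
lctl set_param ldlm.namespaces.*.lru_size=$((NR_CPU * 100))

# Verify the new setting
lctl get_param ldlm.namespaces.*.lru_size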
| Comment by Peter Jones [ 15/Jan/15 ] |
|
Niu Could this be related to Peter |
| Comment by Niu Yawei (Inactive) [ 16/Jan/15 ] |
|
I think they are different issues: in this ticket, the ldlm lock cache grew very large and consumed lots of memory, whereas in |
| Comment by Niu Yawei (Inactive) [ 01/Jun/15 ] |
|
I think this is the same problem as |
| Comment by Niu Yawei (Inactive) [ 01/Jun/15 ] |
|
dup of |