[LU-5801] client 2.5.2 - cgroups compatibility problem Created: 23/Oct/14  Updated: 24/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.2
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Lukasz Flis
Assignee: WC Triage
Resolution: Unresolved
Votes: 1
Labels: None
Environment: Scientific Linux 6.5
Attachments: File cgroup-killer.log    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 16267

 Description   

During internal tests we found that a dd process gets killed by the OOM killer when it belongs to a cgroup with a memory limit enforced.

In our scenario we use dd to write a 10G file:
dd if=/dev/zero of=./testing bs=1M count=10000
The parent shell belongs to a cgroup with its memory limit set to 1.5GB.
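
A minimal sketch of this setup follows. It assumes the cgroup v1 memory controller, and the mount point /cgroup/memory and the group name lustre-test are hypothetical, not taken from our configuration. It also assumes that "1.5GB" means 1536 MiB. It must be run as root, with the target directory on the Lustre mount.

```shell
#!/bin/sh
# Assumptions (not from the report): cgroup v1, hypothetical mount point
# and group name. Override via the environment if they differ.
CG_ROOT="${CG_ROOT:-/cgroup/memory}"
CG_NAME="${CG_NAME:-lustre-test}"

# The 1.5GB limit expressed in bytes, read here as 1536 MiB.
limit_bytes() {
    echo $((1536 * 1024 * 1024))
}

# Create the group, set its limit, move the current shell into it,
# then run the dd command from the description in the given directory.
run_repro() {
    target_dir="$1"
    mkdir -p "$CG_ROOT/$CG_NAME" || return 1
    limit_bytes > "$CG_ROOT/$CG_NAME/memory.limit_in_bytes" || return 1
    echo $$ > "$CG_ROOT/$CG_NAME/tasks" || return 1
    (cd "$target_dir" && dd if=/dev/zero of=./testing bs=1M count=10000)
}

# Nothing is touched unless the script is invoked as: script run <dir>
if [ "${1:-}" = "run" ]; then
    run_repro "${2:-.}"
fi
```

Because the shell itself is placed in the group, dd inherits the limit as a child process, which matches the "parent shell belongs to a cgroup" arrangement described above.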

Our tests on local filesystems and GPFS were fine: no OOM killer invocations were seen. Only Lustre operations tend to fail because the memory limit is exceeded.

I'm attaching the OOM details from dmesg.


Lukasz Flis
ACC Cyfronet



 Comments   
Comment by Andreas Dilger [ 24/Oct/14 ]

As yet, we haven't done any testing with Lustre and cgroups. I suspect this relates to the way that clients determine the cache limits based on the total RAM and not on the memory available to the cgroup. I suspect that this needs fixing in several places (max_dirty_mb, max_cached_mb, lu_cache limits, DLM pool limits, etc) to check the cgroup available memory instead of totalram_pages. That said, I don't know enough about cgroups yet to know what to check...
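
To illustrate the mismatch from user space, here is a rough sketch that compares total RAM with a cgroup's limit and takes the smaller of the two. This is not Lustre code. It assumes the cgroup v1 memory controller, and the group path /cgroup/memory/lustre-test is hypothetical.

```shell
#!/bin/sh
# Assumption: cgroup v1 memory controller; this group path is hypothetical.
CG_DIR="${CG_DIR:-/cgroup/memory/lustre-test}"

# Total system RAM in kB, as reported by /proc/meminfo.
total_ram_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Memory limit of the cgroup in bytes (a very large value means unlimited).
cgroup_limit_bytes() {
    cat "$CG_DIR/memory.limit_in_bytes"
}

# effective_mem_kb TOTAL_KB LIMIT_BYTES
# Prints the smaller of total RAM and the cgroup limit, in kB.
effective_mem_kb() {
    total_kb="$1"
    limit_kb=$(( $2 / 1024 ))
    if [ "$limit_kb" -lt "$total_kb" ]; then
        echo "$limit_kb"
    else
        echo "$total_kb"
    fi
}
```

On a client, `effective_mem_kb "$(total_ram_kb)" "$(cgroup_limit_bytes)"` would print the figure a cgroup-aware cache limit could be derived from, whereas limits based on totalram_pages only ever see the first argument.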

Comment by Lukasz Flis [ 24/Oct/14 ]

I was looking for a workaround and tried setting lustre.max_dirty_mb below the cgroup limit. Unfortunately, no luck this time.
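
The exact command used is not recorded in this ticket. As a hedged sketch of this kind of tuning: on a client, max_dirty_mb is an lctl parameter set per OSC, so the total dirty cache scales with the number of OSCs. The helper names below are hypothetical, and the sketch assumes the budget is split evenly across OSCs.

```shell
#!/bin/sh
# per_osc_dirty_mb BUDGET_MB OSC_COUNT
# Splits a total dirty-cache budget evenly across OSCs (integer division).
per_osc_dirty_mb() {
    budget_mb="$1"
    osc_count="$2"
    [ "$osc_count" -gt 0 ] || return 1
    echo $(( budget_mb / osc_count ))
}

# Number of OSC devices on this client, counted from parameter names.
count_oscs() {
    lctl get_param -N 'osc.*.max_dirty_mb' | wc -l
}

# apply_dirty_limit BUDGET_MB
# Sets the same per-OSC value on every OSC (requires root).
apply_dirty_limit() {
    value=$(per_osc_dirty_mb "$1" "$(count_oscs)") || return 1
    lctl set_param "osc.*.max_dirty_mb=$value"
}
```

For example, `apply_dirty_limit 1024` would keep the combined dirty limit at or below 1024 MB, which is under the 1.5GB cgroup limit. As noted above, lowering max_dirty_mb did not prevent the OOM kill in our tests.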

Comment by Lukasz Flis [ 13/Nov/14 ]

Hello

Since cgroup memory limiting is quite important functionality for us, I'd like to ask whether the fix is likely to appear in the 2.5.x line.

Best Regards

Lukasz Flis

Comment by Andreas Dilger [ 14/Nov/14 ]

All new features will appear in master before possibly being backported to a maintenance branch. Since master is currently in feature freeze for 2.7.0 it is unlikely that this feature would be implemented in the next few months.

Comment by Lukasz Flis [ 12/Jun/15 ]

Hello Andreas,

Is there any news regarding memory cgroup support in newer Lustre client versions?

Comment by Tyson Whitehead [ 08/Jan/18 ]

Just adding another voice here for this bug to get some TLC, as we (Compute Canada) have now run into it in a couple of cases with our users. It would be great if it could get addressed.

Thanks! -Tyson

Comment by Peter Jones [ 26/Jan/18 ]

Has this issue been seen running on a current 2.10.x LTS release?

Comment by Patrick Farrell (Inactive) [ 26/Jan/18 ]

Peter,

You can check with Andreas, but I'm pretty confident the problem still exists. It's both a design issue and a design question. (I recall discussing it with Andreas in a JIRA at one point, but either my memory is bad or it was another JIRA.) We may not want to solve the problem in the way the reporter requested, but we haven't changed things here yet either.
