[LU-13594] register OOM callback in Lustre - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.14.0
Labels:
- debug
- easy

Description

It would be useful to register an OOM callback in Lustre using register_oom_notifier() (and deregister it at shutdown with unregister_oom_notifier()), first in libcfs and obdclass, to print the current libcfs_kmemory and memused_show()/memused_max_show() values, and potentially to try shrinking caches (e.g. the number of LNet message buffers, debug logs, etc.) before a userspace process is killed.
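The register/callback/unregister lifecycle described above can be sketched as a small userspace model. This is not Lustre or kernel code. In the kernel, the real entry points are register_oom_notifier() and unregister_oom_notifier(), which take a struct notifier_block. Every name prefixed with model_ below is hypothetical, and the counters are made-up stand-ins for Lustre's memory accounting. The sketch assumes the callback reports reclaimed pages through the data pointer, as the kernel's OOM notifier chain does.

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified userspace stand-in for struct notifier_block (hypothetical). */
struct model_notifier {
	int (*call)(struct model_notifier *nb, unsigned long action, void *data);
	struct model_notifier *next;
};

static struct model_notifier *model_oom_chain;

/* Models register_oom_notifier(): push the block onto the chain. */
int model_register_oom_notifier(struct model_notifier *nb)
{
	nb->next = model_oom_chain;
	model_oom_chain = nb;
	return 0;
}

/* Models unregister_oom_notifier(): unlink the block, -1 if not found. */
int model_unregister_oom_notifier(struct model_notifier *nb)
{
	struct model_notifier **pp;

	for (pp = &model_oom_chain; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == nb) {
			*pp = nb->next;
			nb->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Models the OOM path: run every callback and total what they freed. */
unsigned long model_out_of_memory(void)
{
	unsigned long freed = 0;
	struct model_notifier *nb;

	for (nb = model_oom_chain; nb != NULL; nb = nb->next)
		nb->call(nb, 0, &freed);
	return freed;
}

/* Made-up state standing in for Lustre memory counters and a cache. */
static unsigned long long model_mem_current = 200471595ULL;
static unsigned long model_cache_pages = 128;

/* Callback: report usage, then drop the (pretend) cache. */
int model_lustre_oom_cb(struct model_notifier *nb, unsigned long action,
			void *data)
{
	unsigned long *freed = data;

	(void)nb;
	(void)action;
	fprintf(stderr, "Lustre: OOM handler: obd_memory current: %llu\n",
		model_mem_current);
	*freed += model_cache_pages;	/* tell the caller what was released */
	model_cache_pages = 0;
	return 0;
}

struct model_notifier model_lustre_oom_nb = { .call = model_lustre_oom_cb };
```

A module would register model_lustre_oom_nb once at init time and unregister it at shutdown. A second OOM event in this model frees nothing, because the pretend cache is already empty.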

Issue Links

is related to

- LU-15963 sanityn test_56b: OSS OOM with ZFS (Reopened)

is related to

- LU-12830 RHEL8.3 and ZFS: oom on OSS (Resolved)
- LU-14456 ost-pools test_23b: ll_ost_io00_007 invoked oom-killer (Resolved)

Activity

Andreas Dilger added a comment - 22/Jun/22 6:08 PM

I've seen this callback fire a few times recently while running sanityn test_56 on ZFS, but the current code just prints a brief message without much context and does nothing else:

[16506.409968] obd_memory max: 200336259, obd_memory current: 200336259
[16506.975974] obd_memory max: 200416739, obd_memory current: 200416739
[16507.013294] obd_memory max: 200416739, obd_memory current: 200416739
[16507.020553] obd_memory max: 200416739, obd_memory current: 200416739
[16507.035227] obd_memory max: 200416739, obd_memory current: 200416739
[16507.218562] obd_memory max: 200471595, obd_memory current: 200471595
[16507.224060] obd_memory max: 200471595, obd_memory current: 200471595
[16507.226494] obd_memory max: 200471595, obd_memory current: 200471595
[16507.229583] obd_memory max: 200471595, obd_memory current: 200471595
[16507.231476] obd_memory max: 200471595, obd_memory current: 200471595

It would be better if this message were prefixed with "Lustre: OOM handler: " to give some context about what it means.

That said, having the handler at all does provide some minimal information (Lustre memory usage is about 200 MB in this case, on a 3 GB VM, not including LNet memory usage, which should also be printed).
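As an illustration of the suggested prefix, here is a minimal userspace sketch of how such a line could be formatted. The function name model_format_oom_report(), the macro MODEL_OOM_PREFIX, and the trailing megabyte figure are assumptions made for this example and are not taken from the landed patch. Kernel code would use its own logging macros rather than snprintf().

```c
#include <stddef.h>
#include <stdio.h>

/* Prefix proposed in the comment above. */
#define MODEL_OOM_PREFIX "Lustre: OOM handler: "

/*
 * Format one report line into buf. Returns the snprintf() result, which is
 * the length the full line needs, excluding the terminating NUL.
 * The MB figure uses decimal megabytes (bytes / 1000000), truncated.
 */
int model_format_oom_report(char *buf, size_t len,
			    unsigned long long mem_max,
			    unsigned long long mem_cur)
{
	return snprintf(buf, len,
			MODEL_OOM_PREFIX
			"obd_memory max: %llu, obd_memory current: %llu (%llu MB)",
			mem_max, mem_cur, mem_cur / 1000000ULL);
}
```

Called with the last values from the log excerpt (200471595 for both arguments), the line ends in "(200 MB)", which matches the roughly 200 MB figure quoted in the comment.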

It would be better if this callback actually tried to do something useful under memory pressure. Possible candidates would be:

- reduce number of server threads to free per-thread allocations
- cancel DLM locks on server (see LU-6529 and related tickets)
- cancel DLM locks on client (drop LRU completely, if not already done)
- drop cached pages on client

Peter Jones added a comment - 26/Jan/22 2:24 PM

Landed for 2.15

Gerrit Updater added a comment - 26/Jan/22 5:14 AM

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/42121/
Subject: LU-13594 obdclass: Add OOM handler for obdclass
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 54d4cca6cb0c92a09b364974438d91d4331a036f

Gerrit Updater added a comment - 21/Mar/21 9:52 AM

Arshad Hussain (arshad.hussain@aeoncomputing.com) uploaded a new patch: https://review.whamcloud.com/42121
Subject: LU-13594 obdclass: Add OOM handler for obdclass
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2e61cf0c6c84608ea583ce342270746c84de7b69
