[LU-13594] register OOM callback in Lustre Created: 22/May/20  Updated: 22/Jun/22

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Etienne Aujames
Resolution: Unresolved Votes: 0
Labels: easy

Issue Links:
Related
is related to LU-12830 RHEL8.3 and ZFS: oom on OSS Resolved
is related to LU-14456 ost-pools test_23b: ll_ost_io00_007 i... Resolved
is related to LU-15963 sanityn test_56b: OSS OOM with ZFS In Progress

 Description   

It would be useful to register an OOM callback in Lustre using register_oom_notifier() (and deregister it at shutdown with unregister_oom_notifier()), first in libcfs and obdclass, to print the current libcfs_kmemory and memused_show()/memused_max_show() values, and potentially to try shrinking caches (e.g. the number of LNet message buffers, debug logs, etc.) before a userspace process is killed.
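A minimal sketch of the register/notify/unregister lifecycle described above. So that it builds outside the kernel, it defines small userspace stand-ins for struct notifier_block, NOTIFY_OK, and the two registration calls; in-kernel these come from <linux/notifier.h> and <linux/oom.h>. Every demo_* name and the counter values are hypothetical and are not taken from the Lustre tree or from the landed patch.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Userspace stand-ins so this sketch builds outside the kernel. In-kernel,
 * struct notifier_block and NOTIFY_OK come from <linux/notifier.h>, and
 * register_oom_notifier()/unregister_oom_notifier() from <linux/oom.h>.
 */
#define NOTIFY_OK 0x0001

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
};

static struct notifier_block *oom_chain;	/* stand-in for the kernel's list */

static int register_oom_notifier(struct notifier_block *nb)
{
	nb->next = oom_chain;
	oom_chain = nb;
	return 0;
}

static int unregister_oom_notifier(struct notifier_block *nb)
{
	struct notifier_block **p;

	for (p = &oom_chain; *p != NULL; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			return 0;
		}
	}
	return -ENOENT;
}

/* Stand-in for the kernel walking the chain just before the OOM killer. */
static unsigned long demo_simulate_oom(void)
{
	unsigned long freed = 0;
	struct notifier_block *nb;

	for (nb = oom_chain; nb != NULL; nb = nb->next)
		nb->notifier_call(nb, 0, &freed);
	return freed;
}

/* Hypothetical counters; not Lustre's real obd_memory accounting. */
static unsigned long long demo_mem_max = 200336259ULL;
static unsigned long long demo_mem_cur = 200336259ULL;
static int demo_oom_calls;

/* Callback: report memory usage; a real handler could also shrink caches. */
static int demo_oom_notify(struct notifier_block *nb,
			   unsigned long action, void *data)
{
	unsigned long *freed = data;	/* pages released by this callback */

	(void)nb;
	(void)action;
	demo_oom_calls++;
	printf("obd_memory max: %llu, obd_memory current: %llu\n",
	       demo_mem_max, demo_mem_cur);
	*freed += 0;	/* nothing is shrunk in this sketch */
	return NOTIFY_OK;
}

static struct notifier_block demo_oom_nb = {
	.notifier_call = demo_oom_notify,
};
```

In a module, register_oom_notifier(&demo_oom_nb) would typically be called from the init path and unregister_oom_notifier(&demo_oom_nb) from the exit path. A handler that actually frees memory would add the number of pages it released to *freed, which lets the kernel skip killing a process if enough was reclaimed.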



 Comments   
Comment by Gerrit Updater [ 21/Mar/21 ]

Arshad Hussain (arshad.hussain@aeoncomputing.com) uploaded a new patch: https://review.whamcloud.com/42121
Subject: LU-13594 obdclass: Add OOM handler for obdclass
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2e61cf0c6c84608ea583ce342270746c84de7b69

Comment by Gerrit Updater [ 26/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/42121/
Subject: LU-13594 obdclass: Add OOM handler for obdclass
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 54d4cca6cb0c92a09b364974438d91d4331a036f

Comment by Peter Jones [ 26/Jan/22 ]

Landed for 2.15

Comment by Andreas Dilger [ 22/Jun/22 ]

I've seen this callback fire a few times recently while running sanityn test_56 on ZFS, but the current code just prints a brief message without much context and does nothing else:

[16506.409968] obd_memory max: 200336259, obd_memory current: 200336259
[16506.975974] obd_memory max: 200416739, obd_memory current: 200416739
[16507.013294] obd_memory max: 200416739, obd_memory current: 200416739
[16507.020553] obd_memory max: 200416739, obd_memory current: 200416739
[16507.035227] obd_memory max: 200416739, obd_memory current: 200416739
[16507.218562] obd_memory max: 200471595, obd_memory current: 200471595
[16507.224060] obd_memory max: 200471595, obd_memory current: 200471595
[16507.226494] obd_memory max: 200471595, obd_memory current: 200471595
[16507.229583] obd_memory max: 200471595, obd_memory current: 200471595
[16507.231476] obd_memory max: 200471595, obd_memory current: 200471595

It would be better if this message were prefixed with "Lustre: OOM handler: " to give some context about what it means.

Secondly, the handler does at least provide some minimal information (Lustre memory usage is about 200MB in this case, on a 3GB VM), but this does not include LNet memory usage, which should also be printed.
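As an illustration of the prefixed message, here is a small sketch that formats the same two counters with the proposed "Lustre: OOM handler: " prefix and a decimal-megabyte figure next to each byte count. The helper name demo_format_oom_msg() and the exact layout are assumptions made for this example, not the format used by the landed patch.

```c
#include <stdio.h>

/* Hypothetical helper: formats the OOM report with the proposed prefix. */
static int demo_format_oom_msg(char *buf, size_t len,
			       unsigned long long max_bytes,
			       unsigned long long cur_bytes)
{
	/* Integer division by 10^6 gives whole decimal megabytes. */
	return snprintf(buf, len,
			"Lustre: OOM handler: obd_memory max: %llu (%llu MB), "
			"obd_memory current: %llu (%llu MB)",
			max_bytes, max_bytes / 1000000ULL,
			cur_bytes, cur_bytes / 1000000ULL);
}
```

With the first values from the log above (200336259 bytes for both counters), this produces a line reporting 200 MB for each, matching the figure quoted in the comment.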

It would be better if this callback actually tried to do something useful under memory pressure. Possible candidates would be:

  • reduce number of server threads to free per-thread allocations
  • cancel DLM locks on server (see LU-6529 and related tickets)
  • cancel DLM locks on client (drop LRU completely, if not already done)
  • drop cached pages on client