Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0

    Description

      It would be useful to register an OOM callback in Lustre using register_oom_notifier() (and deregister it at shutdown with unregister_oom_notifier()), initially in libcfs and obdclass, to print the current libcfs_kmemory and memused_show()/memused_max_show() values, and potentially to try shrinking caches (e.g. the number of LNet message buffers, debug logs, etc.) before a userspace process is killed.
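      As a rough illustration of the register/notify/unregister lifecycle described above, the following is a userspace model, not kernel code. The struct and the two register functions are simplified stand-ins that mirror the kernel's names and callback shape so the sketch compiles on its own; the demo_* names and counter values are hypothetical. It assumes, as in mainline kernels, that the callback receives a pointer to a count of freed pages and that a nonzero total lets the kernel skip killing a process.

```c
#include <stdio.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct notifier_block. */
struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
};

#define NOTIFY_OK 1	/* stand-in for the kernel constant of the same name */

static struct notifier_block *oom_chain;	/* head of the mock chain */

/* Mock of register_oom_notifier(): push the block onto the chain. */
static int register_oom_notifier(struct notifier_block *nb)
{
	nb->next = oom_chain;
	oom_chain = nb;
	return 0;
}

/* Mock of unregister_oom_notifier(): unlink the block, -1 if absent. */
static int unregister_oom_notifier(struct notifier_block *nb)
{
	struct notifier_block **p;

	for (p = &oom_chain; *p != NULL; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			return 0;
		}
	}
	return -1;
}

/* Model of the OOM path: run every callback, return total pages freed. */
static unsigned long simulate_oom(void)
{
	unsigned long freed = 0;
	struct notifier_block *nb;

	for (nb = oom_chain; nb != NULL; nb = nb->next)
		nb->notifier_call(nb, 0, &freed);
	return freed;
}

/* Hypothetical counters standing in for Lustre's memory accounting. */
static unsigned long long demo_obd_memory_max = 200336259ULL;
static unsigned long long demo_obd_memory_cur = 200336259ULL;
static int demo_reports;	/* how many times the callback has run */

/* Report-only callback: prints usage and frees nothing. */
static int demo_oom_notify(struct notifier_block *nb,
			   unsigned long action, void *data)
{
	unsigned long *freed = data;

	(void)nb;
	(void)action;
	printf("obd_memory max: %llu, obd_memory current: %llu\n",
	       demo_obd_memory_max, demo_obd_memory_cur);
	demo_reports++;
	*freed += 0;	/* a shrinking handler would add pages released here */
	return NOTIFY_OK;
}

static struct notifier_block demo_oom_nb = {
	.notifier_call = demo_oom_notify,
};
```

      In a real module, register_oom_notifier(&demo_oom_nb) would be called from the module init path and unregister_oom_notifier(&demo_oom_nb) from the exit path, so the callback never outlives the code that implements it.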

          Activity

            [LU-13594] register OOM callback in Lustre

            adilger Andreas Dilger added a comment -

            I've seen this callback triggered a few times recently while running sanityn test_56 on ZFS, but the current code just prints a brief message without much context and does nothing else:

            [16506.409968] obd_memory max: 200336259, obd_memory current: 200336259
            [16506.975974] obd_memory max: 200416739, obd_memory current: 200416739
            [16507.013294] obd_memory max: 200416739, obd_memory current: 200416739
            [16507.020553] obd_memory max: 200416739, obd_memory current: 200416739
            [16507.035227] obd_memory max: 200416739, obd_memory current: 200416739
            [16507.218562] obd_memory max: 200471595, obd_memory current: 200471595
            [16507.224060] obd_memory max: 200471595, obd_memory current: 200471595
            [16507.226494] obd_memory max: 200471595, obd_memory current: 200471595
            [16507.229583] obd_memory max: 200471595, obd_memory current: 200471595
            [16507.231476] obd_memory max: 200471595, obd_memory current: 200471595

            It would be better if this message were prefixed with "Lustre: OOM handler: " to give some context for what it means.

            Having the handler at all at least provides some minimal information (Lustre memory usage is 200MB in this case, on a 3GB VM). That figure does not include LNet memory usage, which should also be printed.

            It would also be better if this callback actually tried to do something useful under memory pressure. Possible candidates would be:

            • reduce number of server threads to free per-thread allocations
            • cancel DLM locks on server (see LU-6529 and related tickets)
            • cancel DLM locks on client (drop LRU completely, if not already done)
            • drop cached pages on client

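            The two suggestions in the comment above (a prefixed report line that also covers LNet, and a handler that tries reclaim candidates in turn) could be sketched roughly as follows. This is a userspace illustration under stated assumptions, not the landed patch: every demo_* name, the "lnet_memory" label, the ordering of the steps, and the fixed page counts returned by the hooks are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format one report line with the suggested "Lustre: OOM handler: " prefix.
 * The lnet_memory field is a hypothetical addition. Returns snprintf()'s
 * result, i.e. the length the full line needs.
 */
static int demo_format_oom_report(char *buf, size_t len,
				  unsigned long long obd_max,
				  unsigned long long obd_cur,
				  unsigned long long lnet_cur)
{
	return snprintf(buf, len,
			"Lustre: OOM handler: obd_memory max: %llu, "
			"obd_memory current: %llu, lnet_memory current: %llu",
			obd_max, obd_cur, lnet_cur);
}

/* One reclaim candidate: a label plus a hook returning pages it freed. */
struct demo_reclaim_step {
	const char *name;
	unsigned long (*reclaim)(void);
};

/* Hypothetical hooks with fixed results, standing in for real reclaim. */
static unsigned long demo_cancel_client_lru(void)
{
	return 128;
}

static unsigned long demo_drop_client_pages(void)
{
	return 4096;
}

static unsigned long demo_trim_server_threads(void)
{
	return 0;
}

/* Candidate order is an assumption: cheapest action first. */
static const struct demo_reclaim_step demo_steps[] = {
	{ "cancel client DLM LRU",    demo_cancel_client_lru },
	{ "drop cached client pages", demo_drop_client_pages },
	{ "reduce server threads",    demo_trim_server_threads },
};

/*
 * Run the steps in order and stop as soon as at least 'target' pages
 * have been freed. Returns the total number of pages freed.
 */
static unsigned long demo_run_reclaim(const struct demo_reclaim_step *steps,
				      size_t nsteps, unsigned long target)
{
	unsigned long freed = 0;
	size_t i;

	for (i = 0; i < nsteps && freed < target; i++)
		freed += steps[i].reclaim();
	return freed;
}
```

            In an OOM notifier, the total returned by demo_run_reclaim() would be added to the freed-pages counter passed to the callback. Stopping once the target is met keeps the handler from discarding more cache than the situation requires.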
            pjones Peter Jones added a comment -

            Landed for 2.15


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/42121/
            Subject: LU-13594 obdclass: Add OOM handler for obdclass
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 54d4cca6cb0c92a09b364974438d91d4331a036f

            gerrit Gerrit Updater added a comment -

            Arshad Hussain (arshad.hussain@aeoncomputing.com) uploaded a new patch: https://review.whamcloud.com/42121
            Subject: LU-13594 obdclass: Add OOM handler for obdclass
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2e61cf0c6c84608ea583ce342270746c84de7b69

            People

              eaujames Etienne Aujames
              adilger Andreas Dilger
              Votes: 0
              Watchers: 4
