LU-13212: Lustre client hangs machine under memory pressure

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.10.3
    • Labels: None
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Hello,

      When a userspace process goes crazy with memory allocation, the OOM killer sometimes does not manage to kick in because Lustre is still trying to free its memory.
      I am not sure whether it deadlocked or whether there are simply too many locks it is trying to free, but it had been in this state for more than 12 hours before the machine was manually crashed.
      This is CentOS 7.4 with kernel 3.10.0-693.5.2.el7.x86_64.
      The machine still responds to pings while it is in this state.
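
      To be clear about what "goes crazy with memory allocation" means here: the process keeps allocating and touching far more anonymous memory than the node has RAM, so the node ends up swapping and faulting as in the stack below. A minimal sketch of that kind of allocation pattern (a stand-in for illustration, not our actual application) would be:
      {noformat}
      /* Toy memory hog: allocate and touch memory until reclaim/OOM pushes back. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      int main(void)
      {
              const size_t chunk = 1UL << 30;   /* 1 GiB per allocation */

              for (;;) {
                      char *p = malloc(chunk);

                      if (p == NULL) {
                              perror("malloc");
                              return 1;
                      }
                      /* Touch every byte so the kernel must actually back the pages. */
                      memset(p, 0x5a, chunk);
                      /* Intentionally never freed: keep growing until the VM intervenes. */
              }
      }
      {noformat}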

      Here is one of the kernel task stacks:

      [223483.032862]  [<ffffffff81196b27>] ? putback_inactive_pages+0x117/0x2d0
      [223483.050260]  [<ffffffff81196f0a>] ? shrink_inactive_list+0x22a/0x5d0
      [223483.062319]  [<ffffffff811979a5>] shrink_lruvec+0x385/0x730
      [223483.073571]  [<ffffffffc085ee07>] ? ldlm_cli_pool_shrink+0x67/0x100 [ptlrpc]
      [223483.086214]  [<ffffffff81197dc6>] shrink_zone+0x76/0x1a0
      [223483.096773]  [<ffffffff811982d0>] do_try_to_free_pages+0xf0/0x4e0
      [223483.108086]  [<ffffffff811987bc>] try_to_free_pages+0xfc/0x180
      [223483.119023]  [<ffffffff8169fbcb>] __alloc_pages_slowpath+0x457/0x724
      [223483.130417]  [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
      [223483.141673]  [<ffffffff811d081a>] alloc_page_interleave+0x3a/0xa0
      [223483.152526]  [<ffffffff811d4133>] alloc_pages_vma+0x143/0x200
      [223483.162848]  [<ffffffff811c37a0>] ? end_swap_bio_write+0x80/0x80
      [223483.173345]  [<ffffffff811c44ad>] read_swap_cache_async+0xed/0x160
      [223483.183938]  [<ffffffff811c45c8>] swapin_readahead+0xa8/0x110
      [223483.193933]  [<ffffffff811b22cb>] handle_mm_fault+0xadb/0xfa0
      [223483.203823]  [<ffffffff816b00b4>] __do_page_fault+0x154/0x450
      [223483.213621]  [<ffffffff816b03e5>] do_page_fault+0x35/0x90
      [223483.222983]  [<ffffffff816ac608>] page_fault+0x28/0x30
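
      The ldlm_cli_pool_shrink frame suggests the Lustre LDLM lock-pool shrinker is (or recently was) on this reclaim path. For background, a kernel shrinker in the count_objects/scan_objects form (used by mainline kernels since v3.12) hooks into direct reclaim roughly as in the toy sketch below; this is purely illustrative, not Lustre's actual ldlm_pools code, and every name in it is made up:
      {noformat}
      /*
       * Illustrative toy shrinker (all names invented, NOT Lustre code):
       * shows the interface through which direct reclaim calls back into
       * a module such as ptlrpc to release cached objects.
       */
      #include <linux/kernel.h>
      #include <linux/module.h>
      #include <linux/shrinker.h>
      #include <linux/atomic.h>

      static atomic_long_t demo_cached;   /* stand-in for "cached LDLM locks" */

      /* Tell the VM how many objects could be freed right now. */
      static unsigned long demo_count_objects(struct shrinker *sh,
                                              struct shrink_control *sc)
      {
              return atomic_long_read(&demo_cached);
      }

      /*
       * Runs synchronously in the allocating task, called from the direct
       * reclaim path (do_try_to_free_pages() and friends via shrink_slab()).
       * If freeing an object requires blocking work, the task stalls here.
       */
      static unsigned long demo_scan_objects(struct shrinker *sh,
                                             struct shrink_control *sc)
      {
              unsigned long nr = min_t(unsigned long, sc->nr_to_scan,
                                       atomic_long_read(&demo_cached));

              atomic_long_sub(nr, &demo_cached);
              return nr;   /* number of objects actually freed */
      }

      static struct shrinker demo_shrinker = {
              .count_objects = demo_count_objects,
              .scan_objects  = demo_scan_objects,
              .seeks         = DEFAULT_SEEKS,
      };

      static int __init demo_init(void)
      {
              atomic_long_set(&demo_cached, 1024);
              return register_shrinker(&demo_shrinker);
      }

      static void __exit demo_exit(void)
      {
              unregister_shrinker(&demo_shrinker);
      }

      module_init(demo_init);
      module_exit(demo_exit);
      MODULE_LICENSE("GPL");
      {noformat}
      The relevant property is that the scan callback runs synchronously in the context of the task doing the allocation, so if it blocks or takes a very long time, __alloc_pages_slowpath() never gets to the point of declaring reclaim failed and invoking the OOM killer, which would fit the behaviour described above.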
      

      Please let me know if you need more information.
      Regards.
      Jacek Tomaka

      Attachments

        Issue Links

          Activity

            [LU-13212] Lustre client hangs machine under memory pressure
            simmonsja James A Simmons made changes -
            Link New: This issue is related to LU-15058 [ LU-15058 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.15.0 [ 14791 ]
            Assignee Original: WC Triage [ wc-triage ] New: Andreas Dilger [ adilger ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            jhammond John Hammond made changes -
            Link New: This issue is related to EX-3004 [ EX-3004 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-12241 [ LU-12241 ]
            Tomaka Jacek Tomaka (Inactive) made changes -
            Description Original: "Hello, When a userspace process goes crazy, sometimes OOM killer does not manage to kick in because Lustre is still trying to free its memory. [...]"
            New: "Hello, When a userspace process goes crazy with memory allocation, sometimes OOM killer does not manage to kick in because Lustre is still trying to free its memory. [...]" (remainder of the description unchanged)
            Tomaka Jacek Tomaka (Inactive) created issue -

            People

              Assignee: Andreas Dilger (adilger)
              Reporter: Jacek Tomaka (Inactive)
              Votes: 1
              Watchers: 5
