Lustre / LU-15427

Lustre client hangs under memory pressure


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.8
    • Component/s: None
    • Environment: RHEL 7 - Kernel 3.10.0-1160.42.2.el7.x86_64
    • Severity: 2

    Description

      Our compute nodes have 384GB of memory and 192GB of swap space. When an application uses a lot of memory (all of the 384GB and some of the 192GB of swap), many processes reading from or writing to Lustre enter D state, hang, and never recover. We see the following in syslog. Note: swap never gets full.

      Jan  3 18:26:03 spool0121 kernel: INFO: task kswapd0:510 blocked for more than 120 seconds.
      Jan  3 18:26:03 spool0121 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jan  3 18:26:03 spool0121 kernel: kswapd0         D ffffa0f83caf85e0     0   510      2 0x00000000
      Jan  3 18:26:03 spool0121 kernel: Call Trace:
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3387480>] ? bit_wait+0x50/0x50
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3389179>] schedule+0x29/0x70
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3386e41>] schedule_timeout+0x221/0x2d0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffc152111c>] ? cl_io_slice_add+0x5c/0x190 [obdclass]
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2d06992>] ? ktime_get_ts64+0x52/0xf0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3387480>] ? bit_wait+0x50/0x50
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3388a2d>] io_schedule_timeout+0xad/0x130
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3388ac8>] io_schedule+0x18/0x20
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3387491>] bit_wait_io+0x11/0x50
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3386fb7>] __wait_on_bit+0x67/0x90
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dbd3c1>] wait_on_page_bit+0x81/0xa0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc7010>] ? wake_bit_function+0x40/0x40
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd380b>] shrink_page_list+0x9eb/0xc30
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd2853>] ? isolate_lru_pages.isra.47+0xd3/0x190
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd4066>] shrink_inactive_list+0x1b6/0x5c0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dcd77e>] ? release_pages+0x24e/0x430
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd4b45>] shrink_lruvec+0x375/0x730
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cd2c60>] ? task_rq_unlock+0x20/0x20
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd5d36>] mem_cgroup_shrink_node_zone+0xa6/0x170
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2e428c3>] mem_cgroup_soft_limit_reclaim+0x1e3/0x4b0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd60e0>] balance_pgdat+0x2e0/0x5e0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd6553>] kswapd+0x173/0x430
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc6f50>] ? wake_up_atomic_t+0x30/0x30
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2dd63e0>] ? balance_pgdat+0x5e0/0x5e0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc5e61>] kthread+0xd1/0xe0
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc5d90>] ? insert_kthread_work+0x40/0x40
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa3395ddd>] ret_from_fork_nospec_begin+0x7/0x21
      Jan  3 18:26:03 spool0121 kernel: [<ffffffffa2cc5d90>] ? insert_kthread_work+0x40/0x40
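
      For reference, a rough sketch of the kind of workload that produces this pressure: a program that dirties more anonymous memory than the node has RAM while streaming writes to a file on the Lustre mount. This is only an approximation; the file path, allocation size, and chunk size below are illustrative assumptions, not taken from the actual application.

      /*
       * Sketch of a memory-pressure + Lustre-I/O workload.
       * The path and sizes are illustrative, not from the original report.
       */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <fcntl.h>
      #include <unistd.h>

      #define GiB (1024UL * 1024UL * 1024UL)

      int main(int argc, char **argv)
      {
          /* Assumed Lustre path; replace with a file on the affected mount. */
          const char *path = (argc > 1) ? argv[1] : "/lustre/scratch/pressure.dat";
          /* Assumed to exceed RAM on a 384GB node so reclaim/swap kicks in. */
          size_t anon_bytes = 400UL * GiB;
          size_t chunk = 1UL * GiB;
          size_t allocated = 0;

          int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
          if (fd < 0) {
              perror("open");
              return 1;
          }

          char *iobuf = malloc(chunk);
          if (!iobuf) {
              perror("malloc iobuf");
              return 1;
          }
          memset(iobuf, 0xab, chunk);

          /* Alternate between dirtying anonymous memory and writing to Lustre,
           * so reclaim (kswapd) and the writer compete for pages. */
          while (allocated < anon_bytes) {
              char *p = malloc(chunk);
              if (!p)
                  break;                 /* keep going with whatever we got */
              memset(p, 0x5a, chunk);    /* touch every page so it is resident */
              allocated += chunk;

              /* Short writes are possible; ignored here for brevity. */
              if (write(fd, iobuf, chunk) < 0) {
                  perror("write");
                  break;
              }
              fprintf(stderr, "dirtied %zu GiB anon, wrote %zu GiB to %s\n",
                      allocated / GiB, allocated / GiB, path);
          }

          fsync(fd);
          close(fd);
          /* Anonymous memory is intentionally not freed; process exit reclaims it. */
          return 0;
      }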

      People

        Assignee: WC Triage (wc-triage)
        Reporter: Joe Frith (raot)
