Lustre / LU-10133

Multi-page allocation failures in mlx4/mlx5

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.11.0
    • Soak cluster - lustre-master build 3654 lustre version=2.10.54_13_g84f690e
    • 3

    Description

      I am seeing multiple page allocation failures on the soak clients. The failures appear to be semi-random; there is no obvious pattern to when they occur.
      Example from soak-17:

      Oct 17 02:20:07 soak-17 kernel: kworker/u480:1: page allocation failure: order:8, mode:0x80d0
      Oct 17 02:20:07 soak-17 kernel: CPU: 9 PID: 58714 Comm: kworker/u480:1 Tainted: G           OE  ------------   3.10.0-693.2.2.el7.x86_64 #1
      Oct 17 02:20:07 soak-17 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      Oct 17 02:20:08 soak-17 kernel: Workqueue: rdma_cm cma_work_handler [rdma_cm]
      Oct 17 02:20:08 soak-17 kernel: 00000000000080d0 00000000a9e78c95 ffff8803ee9bf848 ffffffff816a3db1
      Oct 17 02:20:08 soak-17 kernel: ffff8803ee9bf8d8 ffffffff81188810 0000000000000000 ffff88043ffdb000
      Oct 17 02:20:08 soak-17 kernel: 0000000000000008 00000000000080d0 ffff8803ee9bf8d8 00000000a9e78c95
      Oct 17 02:20:08 soak-17 kernel: Call Trace:
      Oct 17 02:20:08 soak-17 kernel: [<ffffffff816a3db1>] dump_stack+0x19/0x1b
      Oct 17 02:20:08 soak-17 kernel: [<ffffffff81188810>] warn_alloc_failed+0x110/0x180
      Oct 17 02:20:08 soak-17 kernel: [<ffffffff8169fd8a>] __alloc_pages_slowpath+0x6b6/0x724
      Oct 17 02:20:08 soak-17 kernel: [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
      Oct 17 02:20:08 soak-17 kernel: [<ffffffff81030f8f>] dma_generic_alloc_coherent+0x8f/0x140
      Oct 17 02:20:08 soak-17 kernel: [<ffffffff81064341>] x86_swiotlb_alloc_coherent+0x21/0x50
      Oct 17 02:20:08 soak-17 kernel: [<ffffffffc02914d3>] mlx4_buf_direct_alloc.isra.6+0xd3/0x1a0 [mlx4_core]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc029176b>] mlx4_buf_alloc+0x1cb/0x240 [mlx4_core]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc02940d0>] ? __mlx4_cmd+0x560/0x920 [mlx4_core]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc061085e>] create_qp_common.isra.31+0x62e/0x10d0 [mlx4_ib]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc061144e>] mlx4_ib_create_qp+0x14e/0x480 [mlx4_ib]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc03c9c3a>] ib_create_qp+0x7a/0x2f0 [ib_core]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc04f66d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc0bd8539>] kiblnd_create_conn+0xbf9/0x1960 [ko2iblnd]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc0be8649>] kiblnd_cm_callback+0x1429/0x2300 [ko2iblnd]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffffc04fa57c>] cma_work_handler+0x6c/0xa0 [rdma_cm]
      Oct 17 02:20:09 soak-17 kernel: [<ffffffff810a881a>] process_one_work+0x17a/0x440
      Oct 17 02:20:09 soak-17 kernel: [<ffffffff810a94e6>] worker_thread+0x126/0x3c0
      Oct 17 02:20:09 soak-17 kernel: [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
      Oct 17 02:20:09 soak-17 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Oct 17 02:20:09 soak-17 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      Oct 17 02:20:10 soak-17 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Oct 17 02:20:10 soak-17 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      Oct 17 02:20:10 soak-17 kernel: Mem-Info:
      Oct 17 02:20:10 soak-17 kernel: active_anon:36658 inactive_anon:27590 isolated_anon:6#012 active_file:2710466 inactive_file:345768 isolated_file:10#012 unevictable:0 dirty:14 writeback:0 unstable:0#012 slab_reclaimable:30971 slab_unreclaimable:3983583#012 mapped:10108 shmem:6384 pagetables:3086 bounce:0#012 free:776253 free_pcp:359 free_cma:0
      Oct 17 02:20:11 soak-17 kernel: Node 0 DMA free:15784kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15932kB managed:15848kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Oct 17 02:20:11 soak-17 kernel: lowmem_reserve[]: 0 2580 15620 15620
      Oct 17 02:20:11 soak-17 kernel: Node 0 DMA32 free:132736kB min:7320kB low:9148kB high:10980kB active_anon:6472kB inactive_anon:8768kB active_file:1063620kB inactive_file:27644kB unevictable:0kB isolated(anon):24kB isolated(file):40kB present:3051628kB managed:2643828kB mlocked:0kB dirty:8kB writeback:0kB mapped:2140kB shmem:116kB slab_reclaimable:9352kB slab_unreclaimable:1306892kB kernel_stack:1152kB pagetables:1196kB unstable:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Oct 17 02:20:11 soak-17 kernel: lowmem_reserve[]: 0 0 13040 13040
      Oct 17 02:20:11 soak-17 kernel: Node 0 Normal free:1149812kB min:37012kB low:46264kB high:55516kB active_anon:69848kB inactive_anon:32420kB active_file:4495364kB inactive_file:737992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13353036kB mlocked:0kB dirty:24kB writeback:0kB mapped:9156kB shmem:248kB slab_reclaimable:54264kB slab_unreclaimable:6303688kB kernel_stack:7248kB pagetables:5096kB unstable:0kB bounce:0kB free_pcp:860kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Oct 17 02:20:12 soak-17 kernel: lowmem_reserve[]: 0 0 0 0
      Oct 17 02:20:12 soak-17 kernel: Node 1 Normal free:1805688kB min:45728kB low:57160kB high:68592kB active_anon:70700kB inactive_anon:69172kB active_file:5282880kB inactive_file:617436kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:0kB dirty:24kB writeback:0kB mapped:29136kB shmem:25172kB slab_reclaimable:60268kB slab_unreclaimable:8323752kB kernel_stack:5568kB pagetables:6052kB unstable:0kB bounce:0kB free_pcp:1468kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Oct 17 02:20:13 soak-17 kernel: lowmem_reserve[]: 0 0 0 0
      Oct 17 02:20:13 soak-17 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15784kB
      Oct 17 02:20:13 soak-17 kernel: Node 0 DMA32: 2018*4kB (UEM) 1070*8kB (UEM) 670*16kB (UEM) 685*32kB (UEM) 594*64kB (UEM) 199*128kB (UEM) 80*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133240kB
      Oct 17 02:20:13 soak-17 kernel: Node 0 Normal: 8492*4kB (UEM) 5207*8kB (UEM) 3978*16kB (UEM) 8657*32kB (UEM) 8319*64kB (EM) 1594*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1152744kB
      Oct 17 02:20:13 soak-17 kernel: Node 1 Normal: 14583*4kB (UEM) 8566*8kB (UEM) 5482*16kB (UEM) 13112*32kB (UEM) 11765*64kB (UEM) 2443*128kB (UM) 418*256kB (UM) 5*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 1809388kB
      Oct 17 02:20:13 soak-17 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      Oct 17 02:20:13 soak-17 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      Oct 17 02:20:13 soak-17 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      Oct 17 02:20:14 soak-17 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      Oct 17 02:20:14 soak-17 kernel: 3062619 total pagecache pages
      Oct 17 02:20:14 soak-17 kernel: 6 pages in swap cache
      Oct 17 02:20:14 soak-17 kernel: Swap cache stats: add 13, delete 7, find 0/0
      Oct 17 02:20:14 soak-17 kernel: Free swap  = 16319432kB
      Oct 17 02:20:14 soak-17 kernel: Total swap = 16319484kB
      Oct 17 02:20:14 soak-17 kernel: 8369066 pages RAM
      Oct 17 02:20:14 soak-17 kernel: 0 pages HighMem/MovableOnly
      Oct 17 02:20:14 soak-17 kernel: 241261 pages reserved
      Oct 17 02:20:15 soak-17 kernel: kworker/u480:1: page allocation failure: order:8, mode:0x80d0
      Oct 17 02:20:15 soak-17 kernel: CPU: 9 PID: 58714 Comm: kworker/u480:1 Tainted: G           OE  ------------   3.10.0-693.2.2.el7.x86_64 #1
      Oct 17 02:20:15 soak-17 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      Oct 17 02:20:15 soak-17 kernel: Workqueue: rdma_cm cma_work_handler [rdma_cm]
      

      The systems appear to recover and continue running. A Lustre log dump from soak-17, taken after the most recent failure, is attached.
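
A rough way to read the trace above: an order:8 request needs 2^8 physically contiguous 4kB pages, i.e. a single 1024kB block, so total free memory matters less than the per-zone buddy free lists. The sketch below is only an illustration, not part of the original report. The helper names (`order_to_kb`, `parse_buddy_line`, `can_satisfy`) are hypothetical, the 4kB page size is assumed for x86_64, and the free-list strings are copied from the Mem-Info dump above.

```python
import re

PAGE_KB = 4  # assumption: 4kB base pages (x86_64)

# Buddy free lists copied from the soak-17 Mem-Info dump.
FREE_LISTS = {
    "Node 0 DMA": "0*4kB 1*8kB (U) 0*16kB 1*32kB (U) 0*64kB 1*128kB (U) "
                  "1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15784kB",
    "Node 0 DMA32": "2018*4kB (UEM) 1070*8kB (UEM) 670*16kB (UEM) 685*32kB (UEM) "
                    "594*64kB (UEM) 199*128kB (UEM) 80*256kB (M) 0*512kB 0*1024kB "
                    "0*2048kB 0*4096kB = 133240kB",
    "Node 0 Normal": "8492*4kB (UEM) 5207*8kB (UEM) 3978*16kB (UEM) 8657*32kB (UEM) "
                     "8319*64kB (EM) 1594*128kB (M) 0*256kB 0*512kB 0*1024kB "
                     "0*2048kB 0*4096kB = 1152744kB",
    "Node 1 Normal": "14583*4kB (UEM) 8566*8kB (UEM) 5482*16kB (UEM) 13112*32kB (UEM) "
                     "11765*64kB (UEM) 2443*128kB (UM) 418*256kB (UM) 5*512kB (M) "
                     "0*1024kB 0*2048kB 0*4096kB = 1809388kB",
}


def order_to_kb(order):
    """Size in kB of one contiguous block of the given allocation order."""
    return PAGE_KB << order


def parse_buddy_line(line):
    """Map block size in kB to the number of free blocks of that size."""
    # Matches 'count*sizekB'; the trailing '= NNNkB' total has no '*' and is skipped.
    return {int(kb): int(count) for count, kb in re.findall(r"(\d+)\*(\d+)kB", line)}


def can_satisfy(free, order):
    """True if at least one free block is large enough for the requested order."""
    need_kb = order_to_kb(order)
    return any(count > 0 for kb, count in free.items() if kb >= need_kb)


if __name__ == "__main__":
    print("order 8 needs a contiguous block of", order_to_kb(8), "kB")
    for zone, line in FREE_LISTS.items():
        print(zone, "has a block for order 8:", can_satisfy(parse_buddy_line(line), 8))
```

For the lists above, this reports no free block of 1024kB or larger in DMA32 or in either Normal zone. Only the 16MB Node 0 DMA zone still holds blocks of that size, and the dump's lowmem_reserve values suggest that zone is held back from ordinary allocations. That would be consistent with memory fragmentation rather than memory exhaustion, though the check ignores watermarks and reclaim and is not a full model of the allocator.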

            People

              ashehata Amir Shehata (Inactive)
              cliffw Cliff White (Inactive)