Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10322

OSS - memory issues, page allocation failures, crash

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.10.2
    • soak stress cluster
    • 3
    • 9223372036854775807

    Description

      System was somewhat wedged, as MDS was stuck in umount after failover test.
      Some job were running.
      soak-6 had been seeing hangs in ll_ost* processes:

      Dec  3 17:03:22 soak-6 kernel: Lustre: 5090:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1512320601/real 1512320602]  req@ffff880807505100 x1585614571743184/t0(0) o38->soaked-MDT0002-lwp-OST0016@192.168.1.110@o2ib:12/10 lens 520/544 e 0 to 1 dl 1512320656 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      Dec  3 17:03:22 soak-6 kernel: Lustre: 5090:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 26 previous similar messages
      Dec  3 17:04:31 soak-6 kernel: Lustre: soaked-OST0016: Connection restored to  (at 192.168.1.123@o2ib)
      Dec  3 17:04:31 soak-6 kernel: Lustre: Skipped 34 previous similar messages
      Dec  3 17:06:44 soak-6 kernel: Lustre: 226541:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (4/4), not sending early reply#012  req@ffff88080ba09050 x1585626137492400/t0(0) o4->1c674fc7-c4ea-661f-8bc3-4139e5a9579f@192.168.1.143@o2ib:693/0 lens 608/0 e 0 to 0 dl 1512320808 ref 2 fl New:/2/ffffffff rc 0/-1
      Dec  3 17:06:44 soak-6 kernel: Lustre: 226541:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 337 previous similar messages
      Dec  3 17:08:43 soak-6 kernel: Lustre: soaked-OST0016: Client 1dc01503-bd29-4e1d-18f1-96cbe7715c08 (at 192.168.1.120@o2ib) reconnecting
      Dec  3 17:08:43 soak-6 kernel: Lustre: Skipped 34 previous similar messages
      Dec  3 17:09:26 soak-6 kernel: LNet: Service thread pid 34743 was inactive for 1201.32s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Dec  3 17:09:26 soak-6 kernel: LNet: Skipped 1 previous similar message
      Dec  3 17:09:26 soak-6 kernel: Pid: 34743, comm: ll_ost01_015
      Dec  3 17:09:26 soak-6 kernel: #012Call Trace:
      Dec  3 17:09:26 soak-6 kernel: [<ffffffff816aa489>] schedule_preempt_disabled+0x29/0x70
      Dec  3 17:09:26 soak-6 kernel: [<ffffffff816a83b7>] __mutex_lock_slowpath+0xc7/0x1d0
      Dec  3 17:09:26 soak-6 kernel: [<ffffffff816a77cf>] mutex_lock+0x1f/0x2f
      Dec  3 17:09:26 soak-6 kernel: [<ffffffffc107c7db>] ofd_create_hdl+0xdcb/0x2090 [ofd]
      Dec  3 17:09:26 soak-6 kernel: [<ffffffffc0d65e37>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
      Dec  3 17:09:26 soak-6 kernel: [<ffffffffc0d6618f>] ? lustre_pack_reply_v2+0x14f/0x280 [ptlrpc]
      Dec  3 17:09:27 soak-6 kernel: [<ffffffffc0d664b1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      Dec  3 17:09:27 soak-6 kernel: [<ffffffffc0dc90d5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      Dec  3 17:09:27 soak-6 kernel: [<ffffffffc0d71f16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Dec  3 17:09:27 soak-6 kernel: [<ffffffffc0d6e548>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Dec  3 17:09:27 soak-6 kernel: [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
      Dec  3 17:09:27 soak-6 kernel: [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
      Dec  3 17:09:27 soak-6 kernel: [<ffffffffc0d75652>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Dec  3 17:09:27 soak-6 kernel: [<ffffffffc0d74bc0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Dec  3 17:09:27 soak-6 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
      Dec  3 17:09:27 soak-6 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      Dec  3 17:09:27 soak-6 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
      Dec  3 17:09:27 soak-6 kernel: [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      

      We start seeing memory errors:

      Dec  3 17:16:45 soak-6 kernel: Lustre: 226541:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012  req@ffff8806c017e050 x1585626137492400/t0(0) o4->1c674fc7-c4ea-661f-8bc3-4139e5a9579f@192.168.1.143@o2ib:540/0 lens 608/0 e 0 to 0 dl 1512321410 ref 2 fl New:/2/ffffffff rc 0/-1
      Dec  3 17:16:45 soak-6 kernel: Lustre: 226541:0:(service.c:1346:ptlrpc_at_send_early_reply()) Skipped 332 previous similar messages
      Dec  3 17:16:53 soak-6 kernel: kworker/u480:0: page allocation failure: order:8, mode:0x80d0
      Dec  3 17:16:53 soak-6 kernel: CPU: 18 PID: 67308 Comm: kworker/u480:0 Tainted: P           OE  ------------   3.10.0-693.5.2.el7_lustre.x86_64 #1
      Dec  3 17:16:53 soak-6 kernel: Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
      Dec  3 17:16:53 soak-6 kernel: Workqueue: rdma_cm cma_work_handler [rdma_cm]
      Dec  3 17:16:53 soak-6 kernel: 00000000000080d0 000000007cdee040 ffff8803fe0f3810 ffffffff816a3e2d
      Dec  3 17:16:53 soak-6 kernel: ffff8803fe0f38a0 ffffffff81188820 0000000000000000 ffff88043ffdb000
      Dec  3 17:16:53 soak-6 kernel: 0000000000000008 00000000000080d0 ffff8803fe0f38a0 000000007cdee040
      Dec  3 17:16:53 soak-6 kernel: Call Trace:
      Dec  3 17:16:53 soak-6 kernel: [<ffffffff816a3e2d>] dump_stack+0x19/0x1b
      Dec  3 17:16:53 soak-6 kernel: [<ffffffff81188820>] warn_alloc_failed+0x110/0x180
      Dec  3 17:16:53 soak-6 kernel: [<ffffffff8169fe2a>] __alloc_pages_slowpath+0x6b6/0x724
      Dec  3 17:16:53 soak-6 kernel: [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
      Dec  3 17:16:53 soak-6 kernel: [<ffffffff81030f8f>] dma_generic_alloc_coherent+0x8f/0x140
      Dec  3 17:16:53 soak-6 kernel: [<ffffffff81064341>] x86_swiotlb_alloc_coherent+0x21/0x50
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc00c94c4>] mlx4_buf_direct_alloc.isra.7+0xc4/0x180 [mlx4_core]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc00c973b>] mlx4_buf_alloc+0x1bb/0x260 [mlx4_core]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc033f4a6>] create_qp_common+0x536/0x1000 [mlx4_ib]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc03403d1>] mlx4_ib_create_qp+0x3b1/0xdc0 [mlx4_ib]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc032bbc2>] ? mlx4_ib_create_cq+0x2d2/0x430 [mlx4_ib]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc034bf30>] mlx4_ib_create_qp_wrp+0x10/0x20 [mlx4_ib]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc02ed52a>] ib_create_qp+0x7a/0x2f0 [ib_core]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc05f85d4>] rdma_create_qp+0x34/0xb0 [rdma_cm]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc07ae5c9>] kiblnd_create_conn+0xbf9/0x1960 [ko2iblnd]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc07be7af>] kiblnd_cm_callback+0x145f/0x2380 [ko2iblnd]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffffc05fc11c>] cma_work_handler+0x6c/0xa0 [rdma_cm]
      Dec  3 17:16:53 soak-6 kernel: [<ffffffff810a882a>] process_one_work+0x17a/0x440
      Dec  3 17:16:54 soak-6 kernel: [<ffffffff810a94f6>] worker_thread+0x126/0x3c0
      Dec  3 17:16:54 soak-6 kernel: [<ffffffff810a93d0>] ? manage_workers.isra.24+0x2a0/0x2a0
      Dec  3 17:16:54 soak-6 kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
      Dec  3 17:16:54 soak-6 kernel: [<ffffffff8108ddfb>] ? do_exit+0x6bb/0xa40
      Dec  3 17:16:54 soak-6 kernel: [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
      Dec  3 17:16:54 soak-6 kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
      Dec  3 17:16:54 soak-6 kernel: [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
      Dec  3 17:16:54 soak-6 kernel: Mem-Info:
      Dec  3 17:16:54 soak-6 kernel: active_anon:8063 inactive_anon:8739 isolated_anon:0#012 active_file:3019 inactive_file:3850 isolated_file:0#012 unevictable:6774 dirty:6 writeback:0 unstable:0#012 slab_reclaimable:11316 slab_unreclaimable:556713#012 mapped:4011 shmem:2145 pagetables:1609 bounce:0#012 free:73696 free_pcp:74 free_cma:0
      Dec  3 17:16:54 soak-6 kernel: Node 0 DMA free:15580kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15920kB managed:15836kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Dec  3 17:16:54 soak-6 kernel: lowmem_reserve[]: 0 2580 15620 15620
      Dec  3 17:16:54 soak-6 kernel: Node 0 DMA32 free:60408kB min:7320kB low:9148kB high:10980kB active_anon:324kB inactive_anon:324kB active_file:40kB inactive_file:36kB unevictable:332kB isolated(anon):0kB isolated(file):0kB present:3051628kB managed:2643828kB mlocked:332kB dirty:4kB writeback:0kB mapped:540kB shmem:200kB slab_reclaimable:2868kB slab_unreclaimable:193528kB kernel_stack:1584kB pagetables:300kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Dec  3 17:16:54 soak-6 kernel: lowmem_reserve[]: 0 0 13040 13040
      Dec  3 17:16:54 soak-6 kernel: Node 0 Normal free:155400kB min:37012kB low:46264kB high:55516kB active_anon:13344kB inactive_anon:15952kB active_file:11660kB inactive_file:15000kB unevictable:1684kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13353036kB mlocked:1684kB dirty:20kB writeback:0kB mapped:9972kB shmem:6996kB slab_reclaimable:18636kB slab_unreclaimable:1076332kB kernel_stack:14128kB pagetables:2328kB unstable:0kB bounce:0kB free_pcp:440kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      Dec  3 17:16:54 soak-6 kernel: lowmem_reserve[]: 0 0 0 0
      Dec  3 17:16:54 soak-6 kernel: Node 1 Normal free:68272kB min:45728kB low:57160kB high:68592kB active_anon:12848kB inactive_anon:12652kB active_file:376kB inactive_file:364kB unevictable:25080kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16498508kB mlocked:25080kB dirty:0kB writeback:0kB mapped:5532kB shmem:1384kB slab_reclaimable:23760kB slab_unreclaimable:956992kB kernel_stack:16720kB pagetables:3808kB unstable:0kB bounce:0kB free_pcp:988kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:23 all_unreclaimable? no
      Dec  3 17:16:54 soak-6 kernel: lowmem_reserve[]: 0 0 0 0
      Dec  3 17:16:54 soak-6 kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15580kB
      Dec  3 17:16:55 soak-6 kernel: Node 0 DMA32: 1848*4kB (UEM) 1589*8kB (UEM) 943*16kB (UEM) 292*32kB (UEM) 84*64kB (UEM) 14*128kB (UM) 14*256kB (UM) 2*512kB (M) 2*1024kB (UM) 1*2048kB (M) 0*4096kB = 60408kB
      Dec  3 17:16:55 soak-6 kernel: Node 0 Normal: 4858*4kB (UEM) 5599*8kB (UEM) 2971*16kB (UEM) 967*32kB (UEM) 176*64kB (UEM) 2*128kB (E) 1*256kB (E) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 154480kB
      Dec  3 17:16:55 soak-6 kernel: Node 1 Normal: 2109*4kB (UEM) 1421*8kB (UEM) 631*16kB (UEM) 352*32kB (UEM) 184*64kB (UEM) 40*128kB (UEM) 20*256kB (UEM) 8*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 67276kB
      Dec  3 17:16:55 soak-6 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      Dec  3 17:16:55 soak-6 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      Dec  3 17:16:55 soak-6 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      Dec  3 17:16:55 soak-6 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      Dec  3 17:16:55 soak-6 kernel: 11507 total pagecache pages
      Dec  3 17:16:55 soak-6 kernel: 738 pages in swap cache
      Dec  3 17:16:55 soak-6 kernel: Swap cache stats: add 7503, delete 6766, find 268/355
      Dec  3 17:16:55 soak-6 kernel: Free swap  = 16292804kB
      Dec  3 17:16:55 soak-6 kernel: Total swap = 16319484kB
      Dec  3 17:16:55 soak-6 kernel: 8369063 pages RAM
      Dec  3 17:16:55 soak-6 kernel: 0 pages HighMem/MovableOnly
      Dec  3 17:16:55 soak-6 kernel: 241261 pages reserved
      

      These failures continue, and lead to a crash/dump
      oom-killer is invoked.

      [165704.893998] ll_ost_out01_00 invoked oom-killer: gfp_mask=0x82d2, order=0, oom_score_adj=0^M
      [165704.894000] ll_ost_out01_00 cpuset=/ mems_allowed=1^M
      [165704.894003] CPU: 24 PID: 183717 Comm: ll_ost_out01_00 Tainted: P           OE  ------------   3.10.0-693.5.2.el7_lustre.x86_64 #1^M
      [165704.894004] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013^M
      [165704.894007]  ffff8803dd56eeb0 0000000085bdad45 ffff8803966af998 ffffffff816a3e2d^M
      [165704.894008]  ffff8803966afa28 ffffffff8169f246 ffffffff810e93ac ffff8807f1e228a0^M
      [165704.894009]  ffff8807f1e228b8 0000000000000202 0000000000000202 ffff8803966afa18^M
      [165704.894010] Call Trace:^M
      

      console, syslog, and vmcore-dmesg attached. Core dump available on Spirit.

      Attachments

        1. soak-6.console.log.gz
          850 kB
        2. soak-6.syslog.txt.gz
          82 kB
        3. vmcore-dmesg.txt
          1019 kB

        Issue Links

          Activity

            People

              wc-triage WC Triage
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: