LU-5483

recovery-mds-scale test failover_mds: oom failure on client

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • None
    • Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.5.4
    • Lustre build: https://build.hpdd.intel.com/job/lustre-b2_5/77/
      Distro/Arch: RHEL6.5/x86_64
      Test group: failover
    • 3
    • 15301

    Description

      While testing the test script patch http://review.whamcloud.com/11425 on the Lustre b2_5 branch, the recovery-mds-scale test failover_mds hit an OOM failure on one of the clients:

      15:04:21:Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
      15:04:22:Lustre: 2207:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1407966652/real 1407966652]  req@ffff88005ddeb400 x1476359923718516/t0(0) o250->MGC10.2.4.104@tcp@10.2.4.104@tcp:26/25 lens 400/544 e 0 to 1 dl 1407966663 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      15:04:22:Lustre: 2207:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      15:04:22:Lustre: Evicted from MGS (at 10.2.4.108@tcp) after server handle changed from 0x3d7499e48b9b2ab6 to 0x46ffbf3b8b30002
      15:04:22:Lustre: MGC10.2.4.104@tcp: Connection restored to MGS (at 10.2.4.108@tcp)
      15:04:22:LustreError: 2207:0:(client.c:2795:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88007a684400 x1476359922647520/t4294967302(4294967302) o101->lustre-MDT0000-mdc-ffff88007a58c000@10.2.4.108@tcp:12/10 lens 704/544 e 0 to 0 dl 1407966698 ref 2 fl Interpret:RP/4/0 rc 301/301
      15:04:22:dd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
      15:04:22:dd cpuset=/ mems_allowed=0
      15:04:22:Pid: 3997, comm: dd Not tainted 2.6.32-431.17.1.el6.x86_64 #1
      15:04:22:Call Trace:
      15:04:23: [<ffffffff810d0211>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      15:04:23: [<ffffffff811225c0>] ? dump_header+0x90/0x1b0
      15:04:23: [<ffffffff8122761c>] ? security_real_capable_noaudit+0x3c/0x70
      15:04:23: [<ffffffff81122a42>] ? oom_kill_process+0x82/0x2a0
      15:04:23: [<ffffffff81122981>] ? select_bad_process+0xe1/0x120
      15:04:23: [<ffffffff81122e80>] ? out_of_memory+0x220/0x3c0
      15:04:23: [<ffffffff8112f79f>] ? __alloc_pages_nodemask+0x89f/0x8d0
      15:04:23: [<ffffffff8116769a>] ? alloc_pages_current+0xaa/0x110
      15:04:24: [<ffffffff8111f9b7>] ? __page_cache_alloc+0x87/0x90
      15:04:24: [<ffffffff811206ce>] ? grab_cache_page_write_begin+0x8e/0xc0
      15:04:24: [<ffffffffa0a05f58>] ? ll_write_begin+0x58/0x1a0 [lustre]
      15:04:24: [<ffffffff8111ff33>] ? generic_file_buffered_write+0x123/0x2e0
      15:04:24: [<ffffffff81078f37>] ? current_fs_time+0x27/0x30
      15:04:24: [<ffffffff81121990>] ? __generic_file_aio_write+0x260/0x490
      15:04:24: [<ffffffffa056793c>] ? cl_lock_trace0+0x11c/0x130 [obdclass]
      15:04:24: [<ffffffffa056793c>] ? cl_lock_trace0+0x11c/0x130 [obdclass]
      15:04:24: [<ffffffff81121c48>] ? generic_file_aio_write+0x88/0x100
      15:04:24: [<ffffffffa0a1acc7>] ? vvp_io_write_start+0x137/0x2a0 [lustre]
      15:04:25: [<ffffffffa056de3a>] ? cl_io_start+0x6a/0x140 [obdclass]
      15:04:25: [<ffffffffa0572544>] ? cl_io_loop+0xb4/0x1b0 [obdclass]
      15:04:25: [<ffffffffa09bd4c0>] ? ll_file_io_generic+0x460/0x610 [lustre]
      15:04:25: [<ffffffffa09be2c2>] ? ll_file_aio_write+0x142/0x2c0 [lustre]
      15:04:25: [<ffffffffa09be5ac>] ? ll_file_write+0x16c/0x2a0 [lustre]
      15:04:25: [<ffffffff81188c38>] ? vfs_write+0xb8/0x1a0
      15:04:25: [<ffffffff81189531>] ? sys_write+0x51/0x90
      15:04:25: [<ffffffff810e1abe>] ? __audit_syscall_exit+0x25e/0x290
      15:04:25: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      15:04:25:Mem-Info:
      15:04:26:Node 0 DMA per-cpu:
      15:04:26:CPU    0: hi:    0, btch:   1 usd:   0
      15:04:26:CPU    1: hi:    0, btch:   1 usd:   0
      15:04:26:Node 0 DMA32 per-cpu:
      15:04:26:CPU    0: hi:  186, btch:  31 usd: 133
      15:04:26:CPU    1: hi:  186, btch:  31 usd:  63
      15:04:26:active_anon:1286 inactive_anon:1284 isolated_anon:0
      15:04:26: active_file:171721 inactive_file:173031 isolated_file:32
      15:04:26: unevictable:0 dirty:0 writeback:38535 unstable:0
      15:04:26: free:15144 slab_reclaimable:4187 slab_unreclaimable:99907
      15:04:27: mapped:4 shmem:1 pagetables:1115 bounce:0
      15:04:27:Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:5248kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:5328kB mapped:0kB shmem:0kB slab_reclaimable:32kB slab_unreclaimable:2032kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:7904 all_unreclaimable? yes
      15:04:27:lowmem_reserve[]: 0 2004 2004 2004
      15:04:27:Node 0 DMA32 free:52224kB min:44720kB low:55900kB high:67080kB active_anon:5144kB inactive_anon:5136kB active_file:686828kB inactive_file:687060kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:148812kB mapped:16kB shmem:4kB slab_reclaimable:16716kB slab_unreclaimable:397596kB kernel_stack:1408kB pagetables:4460kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:612151 all_unreclaimable? no
      15:04:27:lowmem_reserve[]: 0 0 0 0
      15:04:27:Node 0 DMA: 11*4kB 1*8kB 3*16kB 7*32kB 6*64kB 4*128kB 2*256kB 3*512kB 1*1024kB 2*2048kB 0*4096kB = 8388kB
      15:04:27:Node 0 DMA32: 806*4kB 275*8kB 215*16kB 85*32kB 23*64kB 4*128kB 3*256kB 4*512kB 5*1024kB 1*2048kB 7*4096kB = 52224kB
      15:04:27:346103 total pagecache pages
      15:04:27:1306 pages in swap cache
      15:04:27:Swap cache stats: add 4759, delete 3453, find 0/0
      15:04:27:Free swap  = 2706844kB
      15:04:27:Total swap = 2725880kB
      15:04:28:524284 pages RAM
      15:04:28:43693 pages reserved
      15:04:28:668392 pages shared
      15:04:28:115112 pages non-shared
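
As a reading aid for the Mem-Info dump above (not part of the original report), here is a minimal sketch that converts the global page counters into MiB and a share of total RAM. It assumes 4 KiB pages, which is the x86_64 default; the helper names `parse_counters` and `pages_to_mib` are made up for illustration, and the counters are copied from the log with the timestamps stripped.

```python
import re

PAGE_KIB = 4  # assumption: 4 KiB pages (x86_64 default)

# Global counters copied from the Mem-Info section of the oom-killer report.
MEM_INFO = """
active_anon:1286 inactive_anon:1284 isolated_anon:0
 active_file:171721 inactive_file:173031 isolated_file:32
 unevictable:0 dirty:0 writeback:38535 unstable:0
 free:15144 slab_reclaimable:4187 slab_unreclaimable:99907
 mapped:4 shmem:1 pagetables:1115 bounce:0
"""
TOTAL_RAM_PAGES = 524284  # from the "pages RAM" line


def parse_counters(text):
    """Return {counter_name: pages} for every 'name:value' pair in text."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", text)}


def pages_to_mib(pages):
    """Convert a page count to MiB under the PAGE_KIB assumption."""
    return pages * PAGE_KIB / 1024


counters = parse_counters(MEM_INFO)

# Group the counters that matter most when reading this report.
summary = {
    "file_cache": counters["active_file"] + counters["inactive_file"],
    "writeback": counters["writeback"],
    "slab_unreclaimable": counters["slab_unreclaimable"],
    "anon": counters["active_anon"] + counters["inactive_anon"],
    "free": counters["free"],
}

for name, pages in summary.items():
    share = 100.0 * pages / TOTAL_RAM_PAGES
    print(f"{name:20s} {pages:8d} pages {pages_to_mib(pages):8.1f} MiB {share:5.1f}%")
```

Under that assumption, file pagecache and unreclaimable slab account for most of the node's RAM, anonymous memory is small, and swap is almost entirely free. The sketch only restates the numbers in the log; it does not by itself identify the cause of the OOM.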
      

      Maloo report: https://testing.hpdd.intel.com/test_sets/1d84cc0e-2339-11e4-b8ac-5254006e85c2

            People

              hongchao.zhang Hongchao Zhang
              yujian Jian Yu