Details
Type: Bug
Resolution: Unresolved
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.5.4
Environment:
Lustre build: https://build.hpdd.intel.com/job/lustre-b2_5/77/
Distro/Arch: RHEL6.5/x86_64
Test group: failover
Severity: 3
Rank: 15301
Description
While testing test script patch http://review.whamcloud.com/11425 on the Lustre b2_5 branch, the recovery-mds-scale test failover_mds hit an OOM failure on one of the clients:
15:04:21:Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
15:04:22:Lustre: 2207:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1407966652/real 1407966652] req@ffff88005ddeb400 x1476359923718516/t0(0) o250->MGC10.2.4.104@tcp@10.2.4.104@tcp:26/25 lens 400/544 e 0 to 1 dl 1407966663 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
15:04:22:Lustre: 2207:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
15:04:22:Lustre: Evicted from MGS (at 10.2.4.108@tcp) after server handle changed from 0x3d7499e48b9b2ab6 to 0x46ffbf3b8b30002
15:04:22:Lustre: MGC10.2.4.104@tcp: Connection restored to MGS (at 10.2.4.108@tcp)
15:04:22:LustreError: 2207:0:(client.c:2795:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff88007a684400 x1476359922647520/t4294967302(4294967302) o101->lustre-MDT0000-mdc-ffff88007a58c000@10.2.4.108@tcp:12/10 lens 704/544 e 0 to 0 dl 1407966698 ref 2 fl Interpret:RP/4/0 rc 301/301
15:04:22:dd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
15:04:22:dd cpuset=/ mems_allowed=0
15:04:22:Pid: 3997, comm: dd Not tainted 2.6.32-431.17.1.el6.x86_64 #1
15:04:22:Call Trace:
15:04:23: [<ffffffff810d0211>] ? cpuset_print_task_mems_allowed+0x91/0xb0
15:04:23: [<ffffffff811225c0>] ? dump_header+0x90/0x1b0
15:04:23: [<ffffffff8122761c>] ? security_real_capable_noaudit+0x3c/0x70
15:04:23: [<ffffffff81122a42>] ? oom_kill_process+0x82/0x2a0
15:04:23: [<ffffffff81122981>] ? select_bad_process+0xe1/0x120
15:04:23: [<ffffffff81122e80>] ? out_of_memory+0x220/0x3c0
15:04:23: [<ffffffff8112f79f>] ? __alloc_pages_nodemask+0x89f/0x8d0
15:04:23: [<ffffffff8116769a>] ? alloc_pages_current+0xaa/0x110
15:04:24: [<ffffffff8111f9b7>] ? __page_cache_alloc+0x87/0x90
15:04:24: [<ffffffff811206ce>] ? grab_cache_page_write_begin+0x8e/0xc0
15:04:24: [<ffffffffa0a05f58>] ? ll_write_begin+0x58/0x1a0 [lustre]
15:04:24: [<ffffffff8111ff33>] ? generic_file_buffered_write+0x123/0x2e0
15:04:24: [<ffffffff81078f37>] ? current_fs_time+0x27/0x30
15:04:24: [<ffffffff81121990>] ? __generic_file_aio_write+0x260/0x490
15:04:24: [<ffffffffa056793c>] ? cl_lock_trace0+0x11c/0x130 [obdclass]
15:04:24: [<ffffffffa056793c>] ? cl_lock_trace0+0x11c/0x130 [obdclass]
15:04:24: [<ffffffff81121c48>] ? generic_file_aio_write+0x88/0x100
15:04:24: [<ffffffffa0a1acc7>] ? vvp_io_write_start+0x137/0x2a0 [lustre]
15:04:25: [<ffffffffa056de3a>] ? cl_io_start+0x6a/0x140 [obdclass]
15:04:25: [<ffffffffa0572544>] ? cl_io_loop+0xb4/0x1b0 [obdclass]
15:04:25: [<ffffffffa09bd4c0>] ? ll_file_io_generic+0x460/0x610 [lustre]
15:04:25: [<ffffffffa09be2c2>] ? ll_file_aio_write+0x142/0x2c0 [lustre]
15:04:25: [<ffffffffa09be5ac>] ? ll_file_write+0x16c/0x2a0 [lustre]
15:04:25: [<ffffffff81188c38>] ? vfs_write+0xb8/0x1a0
15:04:25: [<ffffffff81189531>] ? sys_write+0x51/0x90
15:04:25: [<ffffffff810e1abe>] ? __audit_syscall_exit+0x25e/0x290
15:04:25: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
15:04:25:Mem-Info:
15:04:26:Node 0 DMA per-cpu:
15:04:26:CPU 0: hi: 0, btch: 1 usd: 0
15:04:26:CPU 1: hi: 0, btch: 1 usd: 0
15:04:26:Node 0 DMA32 per-cpu:
15:04:26:CPU 0: hi: 186, btch: 31 usd: 133
15:04:26:CPU 1: hi: 186, btch: 31 usd: 63
15:04:26:active_anon:1286 inactive_anon:1284 isolated_anon:0
15:04:26: active_file:171721 inactive_file:173031 isolated_file:32
15:04:26: unevictable:0 dirty:0 writeback:38535 unstable:0
15:04:26: free:15144 slab_reclaimable:4187 slab_unreclaimable:99907
15:04:27: mapped:4 shmem:1 pagetables:1115 bounce:0
15:04:27:Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:5248kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:5328kB mapped:0kB shmem:0kB slab_reclaimable:32kB slab_unreclaimable:2032kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:7904 all_unreclaimable? yes
15:04:27:lowmem_reserve[]: 0 2004 2004 2004
15:04:27:Node 0 DMA32 free:52224kB min:44720kB low:55900kB high:67080kB active_anon:5144kB inactive_anon:5136kB active_file:686828kB inactive_file:687060kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:148812kB mapped:16kB shmem:4kB slab_reclaimable:16716kB slab_unreclaimable:397596kB kernel_stack:1408kB pagetables:4460kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:612151 all_unreclaimable? no
15:04:27:lowmem_reserve[]: 0 0 0 0
15:04:27:Node 0 DMA: 11*4kB 1*8kB 3*16kB 7*32kB 6*64kB 4*128kB 2*256kB 3*512kB 1*1024kB 2*2048kB 0*4096kB = 8388kB
15:04:27:Node 0 DMA32: 806*4kB 275*8kB 215*16kB 85*32kB 23*64kB 4*128kB 3*256kB 4*512kB 5*1024kB 1*2048kB 7*4096kB = 52224kB
15:04:27:346103 total pagecache pages
15:04:27:1306 pages in swap cache
15:04:27:Swap cache stats: add 4759, delete 3453, find 0/0
15:04:27:Free swap = 2706844kB
15:04:27:Total swap = 2725880kB
15:04:28:524284 pages RAM
15:04:28:43693 pages reserved
15:04:28:668392 pages shared
15:04:28:115112 pages non-shared
Maloo report: https://testing.hpdd.intel.com/test_sets/1d84cc0e-2339-11e4-b8ac-5254006e85c2
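To make the Mem-Info dump above easier to read, the following is a minimal Python sketch that converts the global page counters from the log into MiB and into a share of total RAM. The counter values are copied from the log. The 4 KiB page size is an assumption (the x86_64 default, not stated in the log), and the helper names `pages_to_mib` and `summarize` are hypothetical, introduced only for this illustration.

```python
# Global page counters copied from the Mem-Info section of the client log.
MEM_INFO_PAGES = {
    "active_anon": 1286,
    "inactive_anon": 1284,
    "active_file": 171721,
    "inactive_file": 173031,
    "writeback": 38535,
    "free": 15144,
    "slab_reclaimable": 4187,
    "slab_unreclaimable": 99907,
}

TOTAL_RAM_PAGES = 524284  # "524284 pages RAM" in the log
PAGE_KIB = 4              # assumption: 4 KiB pages (x86_64 default)


def pages_to_mib(pages, page_kib=PAGE_KIB):
    """Convert a page count to MiB."""
    return pages * page_kib / 1024


def summarize(counters, total_pages):
    """Return (name, pages, MiB, percent of RAM) rows, largest first."""
    rows = []
    for name, pages in sorted(counters.items(), key=lambda kv: kv[1], reverse=True):
        rows.append((name, pages, pages_to_mib(pages), 100.0 * pages / total_pages))
    return rows


if __name__ == "__main__":
    print(f"total RAM: {pages_to_mib(TOTAL_RAM_PAGES):.0f} MiB")
    for name, pages, mib, pct in summarize(MEM_INFO_PAGES, TOTAL_RAM_PAGES):
        print(f"{name:20s} {pages:8d} pages {mib:8.1f} MiB {pct:5.1f}% of RAM")
```

Under these assumptions, the file page cache (active_file plus inactive_file) and unreclaimable slab account for most of the client's memory at the time the oom-killer was invoked, while anonymous memory is negligible.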
Attachments
Issue Links
- is duplicated by
  - LU-5574 Hard Failover failure on test suite recovery-mds-scale test_failover_mds: client OOM (Closed)
  - LU-5944 Failover recovery-mds-scale test_failover_mds: client OOM (Closed)
- is related to
  - LU-6200 Failover recovery-mds-scale test_failover_ost: test_failover_ost returned 1 (Resolved)
- is related to
  - LU-2139 Tracking unstable pages (Resolved)