[LU-7329] sanity test_60a timeouts with “* invoking oom-killer” Created: 22/Oct/15  Updated: 24/Oct/17  Resolved: 30/Nov/15

| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | autotest |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We’ve seen four recent hangs on the MDS during sanity test 60a due to OOM. For https://testing.hpdd.intel.com/test_sets/383ea288-787c-11e5-b04c-5254006e85c2, we see the following from the console on the MDS: 23:58:40:Lustre: 1823:0:(llog_test.c:962:llog_test_7_sub()) 7_sub: records are not aligned, written 64452 from 64767 23:58:40:Lustre: 1823:0:(llog_test.c:1079:llog_test_7()) 7g: test llog_gen_rec 23:58:40:hald invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0 23:58:40:hald cpuset=/ mems_allowed=0 23:58:40:Pid: 1967, comm: hald Tainted: P -- ------------ 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 23:58:40:Call Trace: 23:58:40: [<ffffffff810d71a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0 23:58:40: [<ffffffff8112a9a0>] ? dump_header+0x90/0x1b0 23:58:40: [<ffffffff8112ab0e>] ? check_panic_on_oom+0x4e/0x80 23:58:40: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 23:58:40: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 23:58:40: [<ffffffff815395e2>] ? io_schedule+0x92/0xc0 23:58:40: [<ffffffff811279c0>] ? sync_page_killable+0x0/0x40 23:58:40: [<ffffffff8117074a>] ? alloc_pages_vma+0x9a/0x150 23:58:40: [<ffffffff81163c72>] ? read_swap_cache_async+0xf2/0x160 23:58:40: [<ffffffff811647c9>] ? valid_swaphandles+0x69/0x160 23:58:40: [<ffffffff81163d67>] ? swapin_readahead+0x87/0xc0 23:58:40: [<ffffffff81152ead>] ? handle_pte_fault+0x6dd/0xb20 23:58:40: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 23:58:40: [<ffffffff81155fa6>] ? find_vma+0x46/0x80 23:58:40: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 23:58:40: [<ffffffff81043f28>] ? pvclock_clocksource_read+0x58/0xd0 23:58:40: [<ffffffff81042fbc>] ? kvm_clock_read+0x1c/0x20 23:58:40: [<ffffffff81042fc9>] ? kvm_clock_get_cycles+0x9/0x10 23:58:40: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 23:58:40: [<ffffffff8153c785>] ? page_fault+0x25/0x30 23:58:40:Mem-Info: 23:58:40:Node 0 DMA per-cpu: 23:58:40:CPU 0: hi: 0, btch: 1 usd: 0 23:58:40:CPU 1: hi: 0, btch: 1 usd: 0 23:58:40:Node 0 DMA32 per-cpu: 23:58:40:CPU 0: hi: 186, btch: 31 usd: 30 23:58:40:CPU 1: hi: 186, btch: 31 usd: 4 23:58:40:active_anon:97 inactive_anon:877 isolated_anon:0 23:58:40: active_file:3 inactive_file:55 isolated_file:0 23:58:40: unevictable:0 dirty:0 writeback:876 unstable:0 23:58:40: free:13242 slab_reclaimable:1788 slab_unreclaimable:442860 23:58:40: mapped:1 shmem:1 pagetables:681 bounce:0 23:58:40:Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7388kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes 23:58:40:lowmem_reserve[]: 0 2004 2004 2004 23:58:40:Node 0 DMA32 free:44612kB min:44720kB low:55900kB high:67080kB active_anon:388kB inactive_anon:3508kB active_file:12kB inactive_file:220kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:3504kB mapped:4kB shmem:4kB slab_reclaimable:7152kB slab_unreclaimable:1764052kB kernel_stack:4112kB pagetables:2724kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:137468 all_unreclaimable? 
yes 23:58:40:lowmem_reserve[]: 0 0 0 0 23:58:40:Node 0 DMA: 5*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB 23:58:40:Node 0 DMA32: 527*4kB 271*8kB 147*16kB 109*32kB 69*64kB 47*128kB 20*256kB 11*512kB 7*1024kB 1*2048kB 1*4096kB = 44612kB 23:58:40:957 total pagecache pages 23:58:40:898 pages in swap cache 23:58:40:Swap cache stats: add 5911, delete 5013, find 54/85 23:58:40:Free swap = 4106092kB 23:58:40:Total swap = 4128764kB 23:58:40:524284 pages RAM 23:58:40:43736 pages reserved 23:58:40:88 pages shared 23:58:40:438370 pages non-shared 23:58:40:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name 23:58:40:[ 520] 0 520 2709 0 0 -17 -1000 udevd 23:58:40:[ 1706] 0 1706 6902 1 0 -17 -1000 auditd 23:58:40:[ 1740] 0 1740 62272 13 0 0 0 rsyslogd 23:58:40:[ 1774] 0 1774 4561 18 0 0 0 irqbalance 23:58:40:[ 1792] 32 1792 4744 1 0 0 0 rpcbind 23:58:40:[ 1814] 29 1814 5837 1 1 0 0 rpc.statd 23:58:40:[ 1848] 81 1848 6418 1 1 0 0 dbus-daemon 23:58:40:[ 1868] 0 1868 53919 1 1 0 0 ypbind 23:58:40:[ 1898] 0 1898 47233 1 1 0 0 cupsd 23:58:40:[ 1955] 0 1955 1020 0 1 0 0 acpid 23:58:40:[ 1967] 68 1967 10517 37 1 0 0 hald 23:58:40:[ 1968] 0 1968 5099 1 1 0 0 hald-runner 23:58:40:[ 2000] 0 2000 5629 1 1 0 0 hald-addon-inpu 23:58:40:[ 2010] 68 2010 4501 1 1 0 0 hald-addon-acpi 23:58:40:[ 2029] 0 2029 7646 1 1 0 0 zed 23:58:40:[ 2067] 0 2067 26827 0 0 0 0 rpc.rquotad 23:58:40:[ 2072] 0 2072 5417 0 0 0 0 rpc.mountd 23:58:40:[ 2116] 0 2116 6292 1 0 0 0 rpc.idmapd 23:58:40:[ 2163] 498 2163 57325 1 0 0 0 munged 23:58:40:[ 2274] 0 2274 16555 0 0 -17 -1000 sshd 23:58:40:[ 2285] 0 2285 5429 1 1 0 0 xinetd 23:58:40:[ 2317] 0 2317 22211 33 1 0 0 sendmail 23:58:40:[ 2326] 51 2326 20074 0 0 0 0 sendmail 23:58:40:[ 2354] 0 2354 29216 1 0 0 0 crond 23:58:40:[ 2369] 0 2369 5276 0 0 0 0 atd 23:58:40:[ 2382] 0 2382 1020 1 1 0 0 agetty 23:58:40:[ 2384] 0 2384 1016 1 0 0 0 mingetty 23:58:40:[ 2386] 0 2386 1016 1 1 0 0 mingetty 23:58:40:[ 2388] 0 2388 1016 1 1 0 0 mingetty 23:58:40:[ 2390] 0 2390 1016 1 1 0 0 mingetty 23:58:40:[ 2392] 0 2392 1016 1 1 0 0 mingetty 23:58:40:[ 2394] 0 2394 1016 1 1 0 0 mingetty 23:58:40:[ 2396] 0 2396 2664 0 0 -17 -1000 udevd 23:58:40:[ 2397] 0 2397 2662 0 0 -17 -1000 udevd 23:58:40:[ 2963] 38 2963 7689 1 1 0 0 ntpd 23:58:40:[ 1713] 0 1713 14747 1 1 0 0 in.mrshd 23:58:40:[ 1714] 0 1714 26515 1 1 0 0 bash 23:58:40:[ 1744] 0 1744 26515 0 0 0 0 bash 23:58:40:[ 1745] 0 1745 27594 1 1 0 0 sh 23:58:40:[ 1823] 0 1823 28024 1 0 0 0 lctl 23:58:40:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled 23:58:40: 23:58:40:Pid: 1967, comm: hald Tainted: P -- ------------ 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 23:58:40:Call Trace: 23:58:40: [<ffffffff815386d4>] ? panic+0xa7/0x16f 23:58:40: [<ffffffff8112aaa1>] ? dump_header+0x191/0x1b0 23:58:40: [<ffffffff8112ab3c>] ? check_panic_on_oom+0x7c/0x80 23:58:40: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 23:58:40: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 23:58:40: [<ffffffff815395e2>] ? io_schedule+0x92/0xc0 23:58:40: [<ffffffff811279c0>] ? sync_page_killable+0x0/0x40 23:58:40: [<ffffffff8117074a>] ? alloc_pages_vma+0x9a/0x150 23:58:40: [<ffffffff81163c72>] ? read_swap_cache_async+0xf2/0x160 23:58:40: [<ffffffff811647c9>] ? valid_swaphandles+0x69/0x160 23:58:40: [<ffffffff81163d67>] ? swapin_readahead+0x87/0xc0 23:58:40: [<ffffffff81152ead>] ? handle_pte_fault+0x6dd/0xb20 23:58:40: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 23:58:40: [<ffffffff81155fa6>] ? 
find_vma+0x46/0x80 23:58:40: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 23:58:40: [<ffffffff81043f28>] ? pvclock_clocksource_read+0x58/0xd0 23:58:40: [<ffffffff81042fbc>] ? kvm_clock_read+0x1c/0x20 23:58:40: [<ffffffff81042fc9>] ? kvm_clock_get_cycles+0x9/0x10 23:58:40: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 23:58:40: [<ffffffff8153c785>] ? page_fault+0x25/0x30 For https://testing.hpdd.intel.com/test_sets/0ad70b9c-7858-11e5-9a01-5254006e85c2, we see the following from the console on the MDS: 17:40:08:Lustre: 20451:0:(llog_test.c:1692:llog_test_10()) 10g: Cancel 65536 records, see one log zapped 17:40:08:irqbalance invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0 17:40:08:irqbalance cpuset=/ mems_allowed=0 17:40:08:Pid: 1349, comm: irqbalance Not tainted 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 17:40:08:Call Trace: 17:40:08: [<ffffffff810d71a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0 17:40:08: [<ffffffff8112a9a0>] ? dump_header+0x90/0x1b0 17:40:08: [<ffffffff8112ab0e>] ? check_panic_on_oom+0x4e/0x80 17:40:08: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 17:40:08: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 17:40:08: [<ffffffff8117064a>] ? alloc_pages_current+0xaa/0x110 17:40:08: [<ffffffff81127d97>] ? __page_cache_alloc+0x87/0x90 17:40:08: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0 17:40:08: [<ffffffff81128d37>] ? filemap_fault+0x1a7/0x500 17:40:08: [<ffffffff811522f4>] ? __do_fault+0x54/0x530 17:40:08: [<ffffffff811528c7>] ? handle_pte_fault+0xf7/0xb20 17:40:08: [<ffffffff81066053>] ? perf_event_task_sched_out+0x33/0x70 17:40:08: [<ffffffff81043f28>] ? pvclock_clocksource_read+0x58/0xd0 17:40:08: [<ffffffff810098a5>] ? __switch_to+0x285/0x340 17:40:08: [<ffffffff81042fbc>] ? kvm_clock_read+0x1c/0x20 17:40:11: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 17:40:11: [<ffffffff81538dce>] ? thread_return+0x4e/0x7d0 17:40:11: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 17:40:11: [<ffffffff810a677d>] ? hrtimer_try_to_cancel+0x3d/0xd0 17:40:12: [<ffffffff810a6832>] ? hrtimer_cancel+0x22/0x30 17:40:12: [<ffffffff8153b463>] ? do_nanosleep+0x93/0xc0 17:40:12: [<ffffffff810a6904>] ? hrtimer_nanosleep+0xc4/0x180 17:40:12: [<ffffffff810a5730>] ? hrtimer_wakeup+0x0/0x30 17:40:12: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 17:40:12: [<ffffffff8153c785>] ? page_fault+0x25/0x30 17:40:12:Mem-Info: 17:40:12:Node 0 DMA per-cpu: 17:40:12:CPU 0: hi: 0, btch: 1 usd: 0 17:40:12:CPU 1: hi: 0, btch: 1 usd: 0 17:40:12:Node 0 DMA32 per-cpu: 17:40:12:CPU 0: hi: 186, btch: 31 usd: 13 17:40:12:CPU 1: hi: 186, btch: 31 usd: 55 17:40:12:active_anon:1008 inactive_anon:1027 isolated_anon:0 17:40:12: active_file:862 inactive_file:834 isolated_file:0 17:40:12: unevictable:0 dirty:0 writeback:0 unstable:0 17:40:12: free:13265 slab_reclaimable:2472 slab_unreclaimable:437874 17:40:12: mapped:1 shmem:1 pagetables:709 bounce:0 17:40:12:Node 0 DMA free:8336kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:128kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:7276kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:157 all_unreclaimable? 
yes 17:40:12:lowmem_reserve[]: 0 2004 2004 2004 17:40:12:Node 0 DMA32 free:44724kB min:44720kB low:55900kB high:67080kB active_anon:4032kB inactive_anon:4108kB active_file:3320kB inactive_file:3336kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:4kB slab_reclaimable:9884kB slab_unreclaimable:1744220kB kernel_stack:3808kB pagetables:2836kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:430063 all_unreclaimable? yes 17:40:12:lowmem_reserve[]: 0 0 0 0 17:40:12:Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 1*64kB 2*128kB 3*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8336kB 17:40:13:Node 0 DMA32: 1993*4kB 954*8kB 510*16kB 271*32kB 72*64kB 8*128kB 4*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 44724kB 17:40:13:2723 total pagecache pages 17:40:13:1026 pages in swap cache 17:40:13:Swap cache stats: add 3678, delete 2652, find 0/0 17:40:13:Free swap = 4114052kB 17:40:13:Total swap = 4128764kB 17:40:13:524284 pages RAM 17:40:13:43736 pages reserved 17:40:13:1750 pages shared 17:40:13:438911 pages non-shared 17:40:13:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name 17:40:13:[ 509] 0 509 2663 0 0 -17 -1000 udevd 17:40:13:[ 1281] 0 1281 6899 1 0 -17 -1000 auditd 17:40:13:[ 1315] 0 1315 62273 54 1 0 0 rsyslogd 17:40:13:[ 1349] 0 1349 4561 1 0 0 0 irqbalance 17:40:13:[ 1367] 32 1367 4744 1 1 0 0 rpcbind 17:40:13:[ 1389] 29 1389 5837 1 1 0 0 rpc.statd 17:40:13:[ 1423] 81 1423 6418 1 1 0 0 dbus-daemon 17:40:13:[ 1443] 0 1443 53919 1 1 0 0 ypbind 17:40:13:[ 1477] 0 1477 47233 1 1 0 0 cupsd 17:40:13:[ 1534] 0 1534 1020 0 1 0 0 acpid 17:40:13:[ 1546] 68 1546 10521 1 1 0 0 hald 17:40:13:[ 1547] 0 1547 5099 1 1 0 0 hald-runner 17:40:13:[ 1579] 0 1579 5629 1 1 0 0 hald-addon-inpu 17:40:13:[ 1586] 68 1586 4501 1 1 0 0 hald-addon-acpi 17:40:13:[ 1635] 0 1635 26827 0 0 0 0 rpc.rquotad 17:40:13:[ 1640] 0 1640 5417 0 0 0 0 rpc.mountd 17:40:13:[ 1684] 0 1684 6292 1 0 0 0 rpc.idmapd 17:40:13:[ 1721] 498 1721 57325 1 0 0 0 munged 17:40:13:[ 1832] 0 1832 16555 0 0 -17 -1000 sshd 17:40:13:[ 1843] 0 1843 5429 16 1 0 0 xinetd 17:40:13:[ 1875] 0 1875 22211 6 0 0 0 sendmail 17:40:13:[ 1884] 51 1884 20074 0 0 0 0 sendmail 17:40:13:[ 1912] 0 1912 29215 1 0 0 0 crond 17:40:13:[ 1927] 0 1927 5276 0 0 0 0 atd 17:40:13:[ 1940] 0 1940 1020 1 1 0 0 agetty 17:40:13:[ 1942] 0 1942 1016 1 0 0 0 mingetty 17:40:13:[ 1944] 0 1944 1016 1 0 0 0 mingetty 17:40:13:[ 1946] 0 1946 1016 1 0 0 0 mingetty 17:40:13:[ 1948] 0 1948 1016 1 1 0 0 mingetty 17:40:13:[ 1950] 0 1950 2730 0 0 -17 -1000 udevd 17:40:13:[ 1951] 0 1951 2728 0 0 -17 -1000 udevd 17:40:13:[ 1952] 0 1952 1016 1 0 0 0 mingetty 17:40:13:[ 1954] 0 1954 1016 1 0 0 0 mingetty 17:40:13:[ 2706] 38 2706 7689 1 1 0 0 ntpd 17:40:13:[11506] 0 11506 4763 1 1 0 0 anacron 17:40:13:[20349] 0 20349 14749 161 1 0 0 in.mrshd 17:40:13:[20350] 0 20350 26524 58 0 0 0 bash 17:40:13:[20375] 0 20375 26524 58 1 0 0 bash 17:40:13:[20376] 0 20376 27603 617 1 0 0 sh 17:40:13:[20451] 0 20451 28026 64 0 0 0 lctl 17:40:13:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled 17:40:14: 17:40:14:Pid: 1349, comm: irqbalance Not tainted 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 17:40:14:Call Trace: 17:40:14: [<ffffffff815386d4>] ? panic+0xa7/0x16f 17:40:14: [<ffffffff8112aaa1>] ? dump_header+0x191/0x1b0 17:40:14: [<ffffffff8112ab3c>] ? check_panic_on_oom+0x7c/0x80 17:40:14: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 17:40:14: [<ffffffff81137c3c>] ? 
__alloc_pages_nodemask+0x93c/0x950 17:40:14: [<ffffffff8117064a>] ? alloc_pages_current+0xaa/0x110 17:40:14: [<ffffffff81127d97>] ? __page_cache_alloc+0x87/0x90 17:40:14: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0 17:40:14: [<ffffffff81128d37>] ? filemap_fault+0x1a7/0x500 17:40:14: [<ffffffff811522f4>] ? __do_fault+0x54/0x530 17:40:14: [<ffffffff811528c7>] ? handle_pte_fault+0xf7/0xb20 17:40:14: [<ffffffff81066053>] ? perf_event_task_sched_out+0x33/0x70 17:40:14: [<ffffffff81043f28>] ? pvclock_clocksource_read+0x58/0xd0 17:40:14: [<ffffffff810098a5>] ? __switch_to+0x285/0x340 17:40:14: [<ffffffff81042fbc>] ? kvm_clock_read+0x1c/0x20 17:40:14: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 17:40:14: [<ffffffff81538dce>] ? thread_return+0x4e/0x7d0 17:40:14: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 17:40:14: [<ffffffff810a677d>] ? hrtimer_try_to_cancel+0x3d/0xd0 17:40:14: [<ffffffff810a6832>] ? hrtimer_cancel+0x22/0x30 17:40:14: [<ffffffff8153b463>] ? do_nanosleep+0x93/0xc0 17:40:14: [<ffffffff810a6904>] ? hrtimer_nanosleep+0xc4/0x180 17:40:14: [<ffffffff810a5730>] ? hrtimer_wakeup+0x0/0x30 17:40:14: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 17:40:14: [<ffffffff8153c785>] ? page_fault+0x25/0x30 For https://testing.hpdd.intel.com/test_sets/aee14c08-769a-11e5-b71e-5254006e85c2, we see the following from the console on the MDS: 13:02:02:Lustre: 20522:0:(llog_test.c:1443:llog_test_10()) 10c: write 131072 more log records 13:02:02:sendmail invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0 13:02:02:sendmail cpuset=/ mems_allowed=0 13:02:02:Pid: 1827, comm: sendmail Not tainted 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 13:02:02:Call Trace: 13:02:02: [<ffffffff810d71a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0 13:02:02: [<ffffffff8112a9a0>] ? dump_header+0x90/0x1b0 13:02:02: [<ffffffff8112ab0e>] ? check_panic_on_oom+0x4e/0x80 13:02:02: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 13:02:02: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 13:02:02: [<ffffffff8117064a>] ? alloc_pages_current+0xaa/0x110 13:02:02: [<ffffffff81127d97>] ? __page_cache_alloc+0x87/0x90 13:02:02: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0 13:02:02: [<ffffffff81128d37>] ? filemap_fault+0x1a7/0x500 13:02:02: [<ffffffff811522f4>] ? __do_fault+0x54/0x530 13:02:02: [<ffffffff811528c7>] ? handle_pte_fault+0xf7/0xb20 13:02:02: [<ffffffff811b4901>] ? save_mount_options+0x21/0x40 13:02:02: [<ffffffff811b7b22>] ? seq_vprintf+0x32/0x60 13:02:02: [<ffffffff81012bbe>] ? copy_user_generic+0xe/0x20 13:02:02: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 13:02:02: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 13:02:02: [<ffffffff81042fc9>] ? kvm_clock_get_cycles+0x9/0x10 13:02:02: [<ffffffff810ad40f>] ? ktime_get_ts+0xbf/0x100 13:02:02: [<ffffffff811a9778>] ? poll_select_copy_remaining+0xf8/0x150 13:02:02: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 13:02:02: [<ffffffff8153c785>] ? 
page_fault+0x25/0x30 13:02:02:Mem-Info: 13:02:02:Node 0 DMA per-cpu: 13:02:02:CPU 0: hi: 0, btch: 1 usd: 0 13:02:02:Node 0 DMA32 per-cpu: 13:02:02:CPU 0: hi: 186, btch: 31 usd: 115 13:02:02:active_anon:1685 inactive_anon:1752 isolated_anon:0 13:02:02: active_file:1244 inactive_file:1252 isolated_file:0 13:02:02: unevictable:0 dirty:0 writeback:0 unstable:0 13:02:02: free:13240 slab_reclaimable:2343 slab_unreclaimable:436182 13:02:02: mapped:1 shmem:1 pagetables:676 bounce:0 13:02:02:Node 0 DMA free:8352kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:7384kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:3 all_unreclaimable? yes 13:02:02:lowmem_reserve[]: 0 2004 2004 2004 13:02:02:Node 0 DMA32 free:44608kB min:44720kB low:55900kB high:67080kB active_anon:6740kB inactive_anon:7008kB active_file:4976kB inactive_file:5004kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:4kB slab_reclaimable:9368kB slab_unreclaimable:1737344kB kernel_stack:3200kB pagetables:2704kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:374310 all_unreclaimable? yes 13:02:02:lowmem_reserve[]: 0 0 0 0 13:02:02:Node 0 DMA: 2*4kB 3*8kB 2*16kB 3*32kB 2*64kB 3*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8352kB 13:02:02:Node 0 DMA32: 2062*4kB 1079*8kB 497*16kB 232*32kB 79*64kB 11*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 44608kB 13:02:02:4249 total pagecache pages 13:02:02:1752 pages in swap cache 13:02:02:Swap cache stats: add 2908, delete 1156, find 0/0 13:02:02:Free swap = 4117132kB 13:02:02:Total swap = 4128764kB 13:02:02:524284 pages RAM 13:02:02:43736 pages reserved 13:02:02:2548 pages shared 13:02:02:438391 pages non-shared 13:02:02:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name 13:02:02:[ 482] 0 482 2716 23 0 -17 -1000 udevd 13:02:02:[ 1235] 0 1235 6902 14 0 -17 -1000 auditd 13:02:02:[ 1269] 0 1269 62273 201 0 0 0 rsyslogd 13:02:02:[ 1320] 32 1320 4744 1 0 0 0 rpcbind 13:02:02:[ 1342] 29 1342 5837 1 0 0 0 rpc.statd 13:02:02:[ 1376] 81 1376 6444 1 0 0 0 dbus-daemon 13:02:02:[ 1398] 0 1398 53919 1 0 0 0 ypbind 13:02:02:[ 1430] 0 1430 47233 1 0 0 0 cupsd 13:02:02:[ 1486] 0 1486 1020 0 0 0 0 acpid 13:02:02:[ 1498] 68 1498 10523 1 0 0 0 hald 13:02:02:[ 1499] 0 1499 5099 1 0 0 0 hald-runner 13:02:02:[ 1531] 0 1531 5629 1 0 0 0 hald-addon-inpu 13:02:02:[ 1541] 68 1541 4501 1 0 0 0 hald-addon-acpi 13:02:02:[ 1587] 0 1587 26827 0 0 0 0 rpc.rquotad 13:02:02:[ 1592] 0 1592 5417 0 0 0 0 rpc.mountd 13:02:02:[ 1636] 0 1636 6292 1 0 0 0 rpc.idmapd 13:02:02:[ 1673] 498 1673 57325 139 0 0 0 munged 13:02:02:[ 1784] 0 1784 16555 0 0 -17 -1000 sshd 13:02:02:[ 1795] 0 1795 5429 18 0 0 0 xinetd 13:02:02:[ 1827] 0 1827 22210 23 0 0 0 sendmail 13:02:02:[ 1836] 51 1836 20074 13 0 0 0 sendmail 13:02:02:[ 1864] 0 1864 29216 10 0 0 0 crond 13:02:02:[ 1879] 0 1879 5276 0 0 0 0 atd 13:02:02:[ 1893] 0 1893 1020 2 0 0 0 agetty 13:02:02:[ 1894] 0 1894 1016 2 0 0 0 mingetty 13:02:02:[ 1896] 0 1896 1016 12 0 0 0 mingetty 13:02:02:[ 1898] 0 1898 1016 21 0 0 0 mingetty 13:02:02:[ 1900] 0 1900 1016 22 0 0 0 mingetty 13:02:02:[ 1902] 0 1902 2715 24 0 -17 -1000 udevd 13:02:02:[ 1903] 0 1903 2714 31 0 -17 -1000 udevd 13:02:02:[ 1904] 0 1904 1016 21 0 0 0 mingetty 13:02:02:[ 1906] 0 1906 
1016 22 0 0 0 mingetty 13:02:02:[ 2658] 38 2658 7689 154 0 0 0 ntpd 13:02:02:[20420] 0 20420 14749 161 0 0 0 in.mrshd 13:02:02:[20421] 0 20421 26524 59 0 0 0 bash 13:02:02:[20446] 0 20446 26524 59 0 0 0 bash 13:02:02:[20447] 0 20447 27603 617 0 0 0 sh 13:02:02:[20522] 0 20522 28026 65 0 0 0 lctl 13:02:02:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled 13:02:02: 13:02:02:Pid: 1827, comm: sendmail Not tainted 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 13:02:02:Call Trace: 13:02:02: [<ffffffff815386d4>] ? panic+0xa7/0x16f 13:02:02: [<ffffffff8112aaa1>] ? dump_header+0x191/0x1b0 13:02:02: [<ffffffff8112ab3c>] ? check_panic_on_oom+0x7c/0x80 13:02:02: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 13:02:02: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 13:02:02: [<ffffffff8117064a>] ? alloc_pages_current+0xaa/0x110 13:02:02: [<ffffffff81127d97>] ? __page_cache_alloc+0x87/0x90 13:02:02: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0 13:02:02: [<ffffffff81128d37>] ? filemap_fault+0x1a7/0x500 13:02:02: [<ffffffff811522f4>] ? __do_fault+0x54/0x530 13:02:02: [<ffffffff811528c7>] ? handle_pte_fault+0xf7/0xb20 13:02:02: [<ffffffff811b4901>] ? save_mount_options+0x21/0x40 13:02:02: [<ffffffff811b7b22>] ? seq_vprintf+0x32/0x60 13:02:02: [<ffffffff81012bbe>] ? copy_user_generic+0xe/0x20 13:02:02: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 13:02:02: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 13:02:02: [<ffffffff81042fc9>] ? kvm_clock_get_cycles+0x9/0x10 13:02:02: [<ffffffff810ad40f>] ? ktime_get_ts+0xbf/0x100 13:02:02: [<ffffffff811a9778>] ? poll_select_copy_remaining+0xf8/0x150 13:02:02: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 13:02:02: [<ffffffff8153c785>] ? page_fault+0x25/0x30 For https://testing.hpdd.intel.com/test_sets/4790f2ba-767c-11e5-86b7-5254006e85c2, we see the following from the console on the MDS: 08:00:56:Lustre: 20445:0:(llog_test.c:1443:llog_test_10()) 10c: write 131072 more log records 08:00:56:sendmail invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0 08:00:56:sendmail cpuset=/ mems_allowed=0 08:00:56:Pid: 1875, comm: sendmail Not tainted 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 08:00:56:Call Trace: 08:00:56: [<ffffffff810d71a1>] ? cpuset_print_task_mems_allowed+0x91/0xb0 08:00:56: [<ffffffff8112a9a0>] ? dump_header+0x90/0x1b0 08:00:56: [<ffffffff8112ab0e>] ? check_panic_on_oom+0x4e/0x80 08:00:56: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 08:00:56: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 08:00:56: [<ffffffff8117064a>] ? alloc_pages_current+0xaa/0x110 08:00:56: [<ffffffff81127d97>] ? __page_cache_alloc+0x87/0x90 08:00:56: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0 08:00:56: [<ffffffff81128d37>] ? filemap_fault+0x1a7/0x500 08:00:56: [<ffffffff811522f4>] ? __do_fault+0x54/0x530 08:00:56: [<ffffffff811528c7>] ? handle_pte_fault+0xf7/0xb20 08:00:56: [<ffffffff811b4901>] ? save_mount_options+0x21/0x40 08:00:56: [<ffffffff811b7b22>] ? seq_vprintf+0x32/0x60 08:00:56: [<ffffffff81012bbe>] ? copy_user_generic+0xe/0x20 08:00:56: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 08:00:56: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 08:00:56: [<ffffffff81042fc9>] ? kvm_clock_get_cycles+0x9/0x10 08:00:56: [<ffffffff810ad40f>] ? ktime_get_ts+0xbf/0x100 08:00:56: [<ffffffff811a9778>] ? poll_select_copy_remaining+0xf8/0x150 08:00:56: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 08:00:56: [<ffffffff8153c785>] ? 
page_fault+0x25/0x30 08:00:56:Mem-Info: 08:00:56:Node 0 DMA per-cpu: 08:00:56:CPU 0: hi: 0, btch: 1 usd: 0 08:00:56:CPU 1: hi: 0, btch: 1 usd: 0 08:00:56:Node 0 DMA32 per-cpu: 08:00:56:CPU 0: hi: 186, btch: 31 usd: 94 08:00:56:CPU 1: hi: 186, btch: 31 usd: 0 08:00:56:active_anon:2321 inactive_anon:2319 isolated_anon:0 08:00:56: active_file:1224 inactive_file:1294 isolated_file:0 08:00:56: unevictable:0 dirty:13 writeback:2015 unstable:0 08:00:56: free:13240 slab_reclaimable:2534 slab_unreclaimable:434564 08:00:56: mapped:18 shmem:14 pagetables:709 bounce:0 08:00:56:Node 0 DMA free:8336kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:7328kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1 all_unreclaimable? yes 08:00:56:lowmem_reserve[]: 0 2004 2004 2004 08:00:56:Node 0 DMA32 free:44624kB min:44720kB low:55900kB high:67080kB active_anon:9284kB inactive_anon:9276kB active_file:4896kB inactive_file:5176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:52kB writeback:8060kB mapped:72kB shmem:56kB slab_reclaimable:10132kB slab_unreclaimable:1730928kB kernel_stack:3680kB pagetables:2836kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:300165 all_unreclaimable? yes 08:00:56:lowmem_reserve[]: 0 0 0 0 08:00:56:Node 0 DMA: 2*4kB 2*8kB 2*16kB 2*32kB 3*64kB 3*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8376kB 08:00:56:Node 0 DMA32: 2124*4kB 1102*8kB 573*16kB 241*32kB 57*64kB 7*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 44624kB 08:00:56:4866 total pagecache pages 08:00:56:2340 pages in swap cache 08:00:56:Swap cache stats: add 2340, delete 0, find 0/0 08:00:56:Free swap = 4119404kB 08:00:56:Total swap = 4128764kB 08:00:56:524284 pages RAM 08:00:56:43736 pages reserved 08:00:56:2566 pages shared 08:00:56:438172 pages non-shared 08:00:56:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name 08:00:56:[ 506] 0 506 2708 25 0 -17 -1000 udevd 08:00:56:[ 1281] 0 1281 6902 13 0 -17 -1000 auditd 08:00:56:[ 1315] 0 1315 62272 176 1 0 0 rsyslogd 08:00:56:[ 1349] 0 1349 4561 1 1 0 0 irqbalance 08:00:56:[ 1367] 32 1367 4744 1 0 0 0 rpcbind 08:00:56:[ 1389] 29 1389 5837 1 1 0 0 rpc.statd 08:00:56:[ 1423] 81 1423 6418 1 0 0 0 dbus-daemon 08:00:56:[ 1443] 0 1443 53919 1 1 0 0 ypbind 08:00:56:[ 1477] 0 1477 47233 1 1 0 0 cupsd 08:00:56:[ 1534] 0 1534 1020 0 1 0 0 acpid 08:00:56:[ 1546] 68 1546 10517 1 1 0 0 hald 08:00:56:[ 1547] 0 1547 5099 1 0 0 0 hald-runner 08:00:56:[ 1579] 0 1579 5629 1 1 0 0 hald-addon-inpu 08:00:56:[ 1589] 68 1589 4501 1 1 0 0 hald-addon-acpi 08:00:56:[ 1635] 0 1635 26827 0 0 0 0 rpc.rquotad 08:00:56:[ 1640] 0 1640 5417 0 0 0 0 rpc.mountd 08:00:56:[ 1684] 0 1684 6292 1 0 0 0 rpc.idmapd 08:00:56:[ 1721] 498 1721 57325 139 0 0 0 munged 08:00:56:[ 1832] 0 1832 16555 0 0 -17 -1000 sshd 08:00:56:[ 1843] 0 1843 5429 17 1 0 0 xinetd 08:00:56:[ 1875] 0 1875 22211 31 0 0 0 sendmail 08:00:56:[ 1884] 51 1884 20074 365 0 0 0 sendmail 08:00:56:[ 1912] 0 1912 29215 151 1 0 0 crond 08:00:56:[ 1927] 0 1927 5276 45 0 0 0 atd 08:00:56:[ 1940] 0 1940 1020 22 1 0 0 agetty 08:00:56:[ 1942] 0 1942 1016 21 1 0 0 mingetty 08:00:56:[ 1944] 0 1944 1016 21 1 0 0 mingetty 08:00:56:[ 1946] 0 1946 1016 21 1 0 0 mingetty 08:00:56:[ 1948] 0 1948 1016 22 1 0 0 mingetty 08:00:56:[ 1951] 0 1951 2710 35 
0 -17 -1000 udevd 08:00:56:[ 1952] 0 1952 2708 32 0 -17 -1000 udevd 08:00:56:[ 1953] 0 1953 1016 21 0 0 0 mingetty 08:00:56:[ 1955] 0 1955 1016 21 0 0 0 mingetty 08:00:56:[ 2711] 38 2711 7689 156 1 0 0 ntpd 08:00:56:[21148] 0 21148 4763 55 1 0 0 anacron 08:00:56:[20343] 0 20343 14749 161 0 0 0 in.mrshd 08:00:56:[20344] 0 20344 26524 58 0 0 0 bash 08:00:56:[20369] 0 20369 26524 58 1 0 0 bash 08:00:56:[20370] 0 20370 27603 619 0 0 0 sh 08:00:56:[20445] 0 20445 28026 65 1 0 0 lctl 08:00:56:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled 08:00:56: 08:00:56:Pid: 1875, comm: sendmail Not tainted 2.6.32-573.7.1.el6_lustre.gd359461.x86_64 #1 08:00:56:Call Trace: 08:00:56: [<ffffffff815386d4>] ? panic+0xa7/0x16f 08:00:56: [<ffffffff8112aaa1>] ? dump_header+0x191/0x1b0 08:00:56: [<ffffffff8112ab3c>] ? check_panic_on_oom+0x7c/0x80 08:00:56: [<ffffffff8112b1fb>] ? out_of_memory+0x1bb/0x3c0 08:00:56: [<ffffffff81137c3c>] ? __alloc_pages_nodemask+0x93c/0x950 08:00:56: [<ffffffff8117064a>] ? alloc_pages_current+0xaa/0x110 08:00:56: [<ffffffff81127d97>] ? __page_cache_alloc+0x87/0x90 08:00:56: [<ffffffff8112777e>] ? find_get_page+0x1e/0xa0 08:00:56: [<ffffffff81128d37>] ? filemap_fault+0x1a7/0x500 08:00:56: [<ffffffff811522f4>] ? __do_fault+0x54/0x530 08:00:56: [<ffffffff811528c7>] ? handle_pte_fault+0xf7/0xb20 08:00:56: [<ffffffff811b4901>] ? save_mount_options+0x21/0x40 08:00:56: [<ffffffff811b7b22>] ? seq_vprintf+0x32/0x60 08:00:56: [<ffffffff81012bbe>] ? copy_user_generic+0xe/0x20 08:00:56: [<ffffffff81153589>] ? handle_mm_fault+0x299/0x3d0 08:00:56: [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500 08:00:56: [<ffffffff81042fc9>] ? kvm_clock_get_cycles+0x9/0x10 08:00:56: [<ffffffff810ad40f>] ? ktime_get_ts+0xbf/0x100 08:00:56: [<ffffffff811a9778>] ? poll_select_copy_remaining+0xf8/0x150 08:00:56: [<ffffffff8153f3de>] ? do_page_fault+0x3e/0xa0 08:00:56: [<ffffffff8153c785>] ? page_fault+0x25/0x30 |
| Comments |
| Comment by James Nunez (Inactive) [ 22/Oct/15 ] |
|
Another failure (ppc): https://testing.hpdd.intel.com/test_sets/82b87298-78f0-11e5-9aa5-5254006e85c2 |
| Comment by Andreas Dilger [ 22/Oct/15 ] |
|
I don't think the actual process being OOM-killed matters, since there are relatively few user-space processes on the MDS and one will be selected at random to kill, since Lustre kernel threads cannot be killed. I suspect this is related to Bruno's recent patch to add llog wrap testing to llog_test.c, http://review.whamcloud.com/14912 |
| Comment by Bruno Faccini (Inactive) [ 23/Oct/15 ] |
|
What is strange is that the whole of memory seems to be consumed as slab_unreclaimable!
| Comment by Bob Glossman (Inactive) [ 23/Oct/15 ] |
|
another seen with sles11sp3 on master: |
| Comment by Bruno Faccini (Inactive) [ 23/Oct/15 ] |
|
Ok, found a crash-dump from the https://testing.hpdd.intel.com/test_sets/4790f2ba-767c-11e5-86b7-5254006e85c2 failed session reported by James.

CACHE              NAME      OBJSIZE  ALLOCATED    TOTAL   SLABS  SSIZE
ffff88007f8b02c0   size-512      512    2415216  2415272  301909     4k

What is interesting is what the 2 active threads on the 2 virtual cores are doing:

PID: 1875   TASK: ffff880037497520  CPU: 0   COMMAND: "sendmail"
#0 [ffff88007dbf77a0] machine_kexec at ffffffff8103d1ab
#1 [ffff88007dbf7800] crash_kexec at ffffffff810cc8d2
#2 [ffff88007dbf78d0] panic at ffffffff815386db
#3 [ffff88007dbf7950] check_panic_on_oom at ffffffff8112ab3c
#4 [ffff88007dbf7980] out_of_memory at ffffffff8112b1fb
#5 [ffff88007dbf7a00] __alloc_pages_nodemask at ffffffff81137c3c
#6 [ffff88007dbf7b50] alloc_pages_current at ffffffff8117064a
#7 [ffff88007dbf7b80] __page_cache_alloc at ffffffff81127d97
#8 [ffff88007dbf7bb0] filemap_fault at ffffffff81128d37
#9 [ffff88007dbf7c20] __do_fault at ffffffff811522f4
#10 [ffff88007dbf7cb0] handle_pte_fault at ffffffff811528c7
#11 [ffff88007dbf7d90] handle_mm_fault at ffffffff81153589
#12 [ffff88007dbf7e00] __do_page_fault at ffffffff8104f156
#13 [ffff88007dbf7f20] do_page_fault at ffffffff8153f3de
#14 [ffff88007dbf7f50] page_fault at ffffffff8153c785
RIP: 00007f1d296db393 RSP: 00007ffe4a334528 RFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f1d296db393
RDX: 0000000000000000 RSI: 00007ffe4a334620 RDI: 0000000000000005
RBP: 0000000000000001 R8: 00007ffe4a3346a0 R9: 00007ffe4a338efc
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f1d2b9b9fb0
R13: 0000000000000001 R14: 0000000000000004 R15: 00007f1d2b977b28
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
PID: 20445 TASK: ffff880077389520 CPU: 1 COMMAND: "lctl"
#0 [ffff880002306e90] crash_nmi_callback at ffffffff81033cf6
#1 [ffff880002306ea0] notifier_call_chain at ffffffff8153f495
#2 [ffff880002306ee0] atomic_notifier_call_chain at ffffffff8153f4fa
#3 [ffff880002306ef0] notify_die at ffffffff810a783e
#4 [ffff880002306f20] do_nmi at ffffffff8153d153
#5 [ffff880002306f50] nmi at ffffffff8153ca10
[exception RIP: idr_get_next+108]
RIP: ffffffff812911bc RSP: ffff8800656af388 RFLAGS: 00000206
RAX: 0000000000000000 RBX: ffffffff81c1ab68 RCX: 0000000000000000
RDX: ffff8800656af390 RSI: ffff8800656af3e4 RDI: 0000000000000040
RBP: ffff8800656af3b8 R8: 0000000000000012 R9: 00000000ffffffff
R10: 0000000000000001 R11: 0000000000000012 R12: 0000000000000000
R13: ffff8800656af3e4 R14: ffffffff81c1ab48 R15: 0000000000000001
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- <DOUBLEFAULT exception stack> ---
#6 [ffff8800656af388] idr_get_next at ffffffff812911bc
#7 [ffff8800656af3c0] css_get_next at ffffffff810d2077
#8 [ffff8800656af420] mem_cgroup_iter at ffffffff811837ad
#9 [ffff8800656af470] shrink_zone at ffffffff81142bc8
#10 [ffff8800656af4f0] do_try_to_free_pages at ffffffff81142de5
#11 [ffff8800656af590] try_to_free_pages at ffffffff811434a2
#12 [ffff8800656af630] __alloc_pages_nodemask at ffffffff811377cd
#13 [ffff8800656af780] kmem_getpages at ffffffff81177282
#14 [ffff8800656af7b0] fallback_alloc at ffffffff81177e9a
#15 [ffff8800656af830] ____cache_alloc_node at ffffffff81177c19
#16 [ffff8800656af890] kmem_cache_alloc_trace at ffffffff81178517
#17 [ffff8800656af8e0] lu_context_init at ffffffffa05f1efe [obdclass]
#18 [ffff8800656af910] osd_trans_start at ffffffffa0d108ee [osd_ldiskfs]
#19 [ffff8800656af950] llog_cat_add at ffffffffa05abb2d [obdclass]
#20 [ffff8800656af9a0] llog_test_10 at ffffffffa024910a [llog_test]
#21 [ffff8800656afa90] llog_run_tests at ffffffffa024da62 [llog_test]
#22 [ffff8800656afb10] llog_test_setup at ffffffffa024e499 [llog_test]
#23 [ffff8800656afbd0] obd_setup at ffffffffa05d9f8b [obdclass]
#24 [ffff8800656afc90] class_setup at ffffffffa05da2d9 [obdclass]
#25 [ffff8800656afce0] class_process_config at ffffffffa05e1481 [obdclass]
#26 [ffff8800656afd90] class_handle_ioctl at ffffffffa05b9cf2 [obdclass]
#27 [ffff8800656afe40] obd_class_ioctl at ffffffffa059f2ab [obdclass]
#28 [ffff8800656afe60] vfs_ioctl at ffffffff811a77a2
#29 [ffff8800656afea0] do_vfs_ioctl at ffffffff811a7944
#30 [ffff8800656aff30] sys_ioctl at ffffffff811a7ec1
#31 [ffff8800656aff80] system_call_fastpath at ffffffff8100b0d2
RIP: 00000036680e0907 RSP: 00007ffd31d15f58 RFLAGS: 00010246
RAX: 0000000000000010 RBX: ffffffff8100b0d2 RCX: 0000000000000000
RDX: 00007ffd31d13c40 RSI: 00000000c00866b8 RDI: 0000000000000003
RBP: 0000000000000001 R8: 0000000000000240 R9: 0000000000000240
R10: 0000000000000040 R11: 0000000000000246 R12: 00000000c00866b8
R13: 00007ffd31d13c40 R14: 000000000064f980 R15: 0000000000000003
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
One of these is the lctl command that started the in-kernel LLOG tests; it is trying to allocate a new object in the "size-512" slab (requested size is 320 bytes), and the code concerned is:

static int keys_init(struct lu_context *ctx)
{
>>>	OBD_ALLOC(ctx->lc_value, ARRAY_SIZE(lu_keys) * sizeof ctx->lc_value[0]);
	if (likely(ctx->lc_value != NULL))
		return keys_fill(ctx);
	return -ENOMEM;
}
These allocations should be freed in the keys_fini() routine, which is called by lu_context_fini(), itself called by osd_trans_commit_cb(). This last one is run as a callback posted via ldiskfs_journal_callback_add() in osd_trans_stop(). So it seems that something has gone wrong or been delayed in the journaling process here, and I wonder if this could be due to too much work being done in a row, and in kernel space, by the llog_test() code, aggravated by my recent add-on of the test_10() sub-test.

What is unclear for me now is how/when these callbacks are executed by [ext4,ldiskfs]_journal_commit_callback()?? |
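To make the failure mode concrete, here is a small self-contained userspace model of the lifecycle described above (it is not Lustre or JBD2 code; all names are illustrative): each transaction allocates a context up front, the free is only queued as a commit callback at transaction stop, and the memory is released only when the commit pass finally runs, so a tight in-kernel loop pins a growing pile of allocations in between commits, matching the slab_unreclaimable growth in the crash dump.

/*
 * Self-contained model (not Lustre/JBD2 code) of the pattern described
 * above: trans_start() allocates a per-transaction context, trans_stop()
 * only queues its release as a commit callback, and the memory is freed
 * when commit_thread_run() finally executes, as kjournald2 does roughly
 * every 5s.  A tight loop of transactions therefore pins many contexts
 * at once.
 */
#include <stdio.h>
#include <stdlib.h>

struct tx_ctx {
	void          *values;   /* stands in for the lu_context lc_value[] array */
	struct tx_ctx *next;     /* pending commit-callback list */
};

static struct tx_ctx *pending;   /* callbacks waiting for the "commit thread" */
static unsigned long npending;

static struct tx_ctx *trans_start(void)
{
	struct tx_ctx *ctx = malloc(sizeof(*ctx));

	ctx->values = malloc(320);   /* keys_init(): lands in the size-512 slab */
	return ctx;
}

static void trans_stop(struct tx_ctx *ctx)
{
	/* analogue of posting osd_trans_commit_cb() via ldiskfs_journal_callback_add() */
	ctx->next = pending;
	pending = ctx;
	npending++;
}

static void commit_thread_run(void)
{
	/* analogue of kjournald2 running the callbacks: only now is memory freed */
	while (pending) {
		struct tx_ctx *ctx = pending;

		pending = ctx->next;
		free(ctx->values);
		free(ctx);
		npending--;
	}
}

int main(void)
{
	unsigned long i;

	/* llog_test-style tight loop: many transactions before any commit runs */
	for (i = 0; i < 131072; i++)
		trans_stop(trans_start());
	printf("contexts pinned until commit: %lu\n", npending);
	commit_thread_run();
	return 0;
}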
| Comment by Alex Zhuravlev [ 23/Oct/15 ] |
|
This might be due to lots of transactions in the transaction groups, though 241K/sec looks very serious. I've got a patch (still under development) which makes it possible to get rid of the per-tx context: http://review.whamcloud.com/#/c/4307/ |
| Comment by Alex Zhuravlev [ 23/Oct/15 ] |
|
> What is unclear for me now, is how/when these callbacks are executed by [ext4,ldiskfs]_journal_commit_callback() ??

It's called when the transaction group holding this transaction is committed. That's done roughly every 5s. |
| Comment by Bruno Faccini (Inactive) [ 23/Oct/15 ] |
|
Ok, it is the kjournald2 thread's job to execute these callbacks, via [ext4,ldiskfs]_journal_commit_callback() set in the j_commit_callback field and called by jbd2_journal_commit_transaction(). But according to the crash-dump content, the kjournald2 thread associated with the MGS/MDT device has not been scheduled for about 11s, so maybe we just need to add some delays in the llog_test() code to allow the thread to be scheduled and run the callbacks? |
| Comment by Alex Zhuravlev [ 23/Oct/15 ] |
|
No, this shouldn't be needed - JBD does this internally in journal_start() unless there was already an open transaction. If the transaction group becomes very large, or when its running time is over (5s by default), journal_start() blocks and essentially passes control to the commit thread. |
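For illustration, a toy model of that behaviour (not JBD2 code; names and thresholds are made-up stand-ins): journal_start() lets new handles join the running transaction, but once the transaction is too large or older than the commit interval it forces a commit, so the commit path gets control without any extra scheduling help from the caller.

/* Toy model (not JBD2 code) of the journal_start() behaviour described
 * above; MAX_HANDLES and COMMIT_INTERVAL are illustrative stand-ins for
 * the real transaction-size and age limits. */
#include <stdio.h>
#include <time.h>

#define MAX_HANDLES     100000   /* "transaction group becomes very large" */
#define COMMIT_INTERVAL 5        /* seconds; JBD2 commits roughly every 5s */

struct toy_journal {
	unsigned long handles_in_tx;   /* size of the running transaction */
	time_t        tx_start;        /* when the running transaction began */
	unsigned long commits;
};

static void commit_transaction(struct toy_journal *j)
{
	/* the real commit thread would also run the commit callbacks here */
	j->commits++;
	j->handles_in_tx = 0;
	j->tx_start = time(NULL);
}

static void journal_start(struct toy_journal *j)
{
	if (j->handles_in_tx >= MAX_HANDLES ||
	    time(NULL) - j->tx_start >= COMMIT_INTERVAL)
		commit_transaction(j);   /* caller effectively blocks while this runs */
	j->handles_in_tx++;              /* the new handle joins the running tx */
}

int main(void)
{
	struct toy_journal j = { 0, time(NULL), 0 };
	unsigned long i;

	for (i = 0; i < 500000; i++)
		journal_start(&j);
	printf("commits forced from journal_start: %lu\n", j.commits);
	return 0;
}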
| Comment by Bruno Faccini (Inactive) [ 23/Oct/15 ] |
|
Ok, but then how can we explain that the kjournald2/commit thread has not run for about 11s after calling schedule()? |
| Comment by nasf (Inactive) [ 25/Oct/15 ] |
|
Another failure instance on master: |
| Comment by Alex Zhuravlev [ 25/Oct/15 ] |
|
We do not control the commit thread directly. llog_test() generates new transactions, and on some of them JBD will block everything and the commit thread gets control. Plus we're doing lots of allocations where the kernel can schedule (especially when we're low on memory). IOW, I doubt schedule() can help here. |
| Comment by James Nunez (Inactive) [ 26/Oct/15 ] |
|
Another failure at https://testing.hpdd.intel.com/test_sets/04224472-7b2a-11e5-9650-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 26/Oct/15 ] |
|
another on master: |
| Comment by nasf (Inactive) [ 28/Oct/15 ] |
|
More failures on master: |
| Comment by James Nunez (Inactive) [ 28/Oct/15 ] |
|
More failures on master: |
| Comment by James Nunez (Inactive) [ 29/Oct/15 ] |
|
This test is failing about twice a day. Raised priority to Critical. More failures: |
| Comment by nasf (Inactive) [ 03/Nov/15 ] |
|
Another failure instance on master: |
| Comment by Bob Glossman (Inactive) [ 03/Nov/15 ] |
|
another on master: |
| Comment by Gerrit Updater [ 05/Nov/15 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/17052 |
| Comment by Bruno Faccini (Inactive) [ 05/Nov/15 ] |
|
Patch #17052 is an attempt to sync the LLOG device frequently in order to force flushing of the journal commit callbacks and avoid OOM situations on the auto-test VMs. |
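As a rough, illustrative-only sketch of that idea (not the actual patch), the huge in-kernel record-writing loops can be broken into chunks with a device sync between chunks, so that journal commits, and therefore the deferred frees, happen while the test is still running; write_test_record() and sync_llog_device() below are hypothetical stand-ins for the real llog/osd calls.

/* Illustrative-only sketch of the idea behind patch 17052 (not the actual
 * change): split the record-writing loops of llog_test.c into chunks and
 * sync the llog device between chunks, so that JBD2 commits run their
 * callbacks and release the per-transaction contexts before the VM runs
 * out of memory.  The two helpers below are hypothetical stand-ins. */
static int write_test_record(void) { return 0; }   /* stands in for one llog write */
static int sync_llog_device(void)  { return 0; }   /* stands in for a device sync  */

static int write_records_in_chunks(int total)
{
	enum { CHUNK = 4096 };   /* illustrative sync frequency */
	int i, rc = 0;

	for (i = 0; i < total && rc == 0; i++) {
		rc = write_test_record();
		if (rc == 0 && (i + 1) % CHUNK == 0)
			/* force a journal commit so deferred frees run now */
			rc = sync_llog_device();
	}
	return rc;
}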
| Comment by Alex Zhuravlev [ 05/Nov/15 ] |
|
I still don't understand.. I'm pretty sure that we don't hit anything like this running something silly like:

while true; do mkdir /mnt/lustre/d0; rmdir /mnt/lustre/d0; done |
| Comment by Jian Yu [ 05/Nov/15 ] |
|
One more instance on master branch: |
| Comment by James Nunez (Inactive) [ 10/Nov/15 ] |
|
Another failure on master |
| Comment by Gerrit Updater [ 30/Nov/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17052/ |
| Comment by Peter Jones [ 30/Nov/15 ] |
|
Landed for 2.8 |
| Comment by Gerrit Updater [ 18/Dec/15 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/17670 |
| Comment by Gerrit Updater [ 25/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17670/ |