[LU-6122] DLC: system crash when setting a too large value for large_buffers Created: 14/Jan/15 Updated: 18/Aug/15 Resolved: 18/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Sarah Liu | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB | ||
| Environment: | lustre-master build # 2808 |
||
| Severity: | 3 |
| Rank (Obsolete): | 17055 |
| Description |
|
According to the DLC test plan, setting large_buffers to a "too large" value should NOT crash the system, but I hit the following out-of-memory failure when I tried one:
[root@eagle-54vm5 modprobe.d]# lnetctl routing show
routing:
    - cpt[0]:
        tiny:
            npages: 0
            nbuffers: 4096
            credits: 4096
            mincredits: 4096
        small:
            npages: 1
            nbuffers: 4096
            credits: 16384
            mincredits: 16384
        large:
            npages: 256
            nbuffers: 1024
            credits: 1024
            mincredits: 1024
    - enable: 1
[root@eagle-54vm5 modprobe.d]# lnetctl set large_buffers 4096
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
[<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
[<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
[<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
[<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
[<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
[<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
[<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
[<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
[<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
[<ffffffff811a1571>] ? sys_poll+0x71/0x100
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:11 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:13209 slab_reclaimable:1286 slab_unreclaimable:8786
mapped:1 shmem:0 pagetables:725 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44480kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2900kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:72820 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 624*4kB 267*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44520kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3541, delete 3541, find 0/1
Free swap = 4114600kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
50 pages shared
462636 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 354] 0 354 2733 0 0 -17 -1000 udevd
[ 630] 0 630 2732 0 0 -17 -1000 udevd
[ 965] 0 965 2280 1 0 0 0 dhclient
[ 1078] 0 1078 2280 1 0 0 0 dhclient
[ 1191] 0 1191 2280 1 0 0 0 dhclient
[ 1304] 0 1304 2280 1 0 0 0 dhclient
[ 1354] 0 1354 6910 1 0 -17 -1000 auditd
[ 1370] 0 1370 62271 1 0 0 0 rsyslogd
[ 1412] 32 1412 4744 1 0 0 0 rpcbind
[ 1430] 29 1430 5837 1 0 0 0 rpc.statd
[ 1543] 81 1543 5387 1 0 0 0 dbus-daemon
[ 1581] 0 1581 1020 0 0 0 0 acpid
[ 1590] 68 1590 9408 1 0 0 0 hald
[ 1591] 0 1591 5082 1 0 0 0 hald-runner
[ 1623] 0 1623 5612 1 0 0 0 hald-addon-inpu
[ 1634] 68 1634 4484 1 0 0 0 hald-addon-acpi
[ 1652] 0 1652 96433 1 0 0 0 automount
[ 1672] 0 1672 16656 0 0 -17 -1000 sshd
[ 1748] 0 1748 20326 1 0 0 0 master
[ 1757] 89 1757 20389 1 0 0 0 qmgr
[ 1772] 0 1772 27580 1 0 0 0 abrtd
[ 1780] 0 1780 29324 1 0 0 0 crond
[ 1791] 0 1791 5385 0 0 0 0 atd
[ 1804] 0 1804 15590 0 0 0 0 certmonger
[ 1817] 0 1817 1016 1 0 0 0 mingetty
[ 1819] 0 1819 1016 1 0 0 0 mingetty
[ 1821] 0 1821 19853 1 0 0 0 login
[ 1822] 0 1822 1016 1 0 0 0 mingetty
[ 1824] 0 1824 1016 1 0 0 0 mingetty
[ 1826] 0 1826 1016 1 0 0 0 mingetty
[ 1828] 0 1828 1016 1 0 0 0 mingetty
[ 2244] 0 2244 144390 1 0 0 0 console-kit-dae
[ 2310] 0 2310 27084 1 0 0 0 bash
[ 8491] 89 8491 20346 1 0 0 0 pickup
[ 8769] 0 8769 4903 1 0 0 0 lnetctl
Out of memory: Kill process 965 (dhclient) score 1 or sacrifice child
Killed process 965, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
[<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
[<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
[<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
[<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
[<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
[<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
[<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
[<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
[<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
[<ffffffff811a1571>] ? sys_poll+0x71/0x100
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:0 inactive_file:20 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:13223 slab_reclaimable:1286 slab_unreclaimable:8786
mapped:1 shmem:0 pagetables:725 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44536kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2900kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:220 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 636*4kB 267*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44568kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3549, delete 3549, find 2/6
Free swap = 4115092kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
49 pages shared
462624 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 354] 0 354 2733 0 0 -17 -1000 udevd
[ 630] 0 630 2732 0 0 -17 -1000 udevd
[ 1078] 0 1078 2280 1 0 0 0 dhclient
[ 1191] 0 1191 2280 1 0 0 0 dhclient
[ 1304] 0 1304 2280 1 0 0 0 dhclient
[ 1354] 0 1354 6910 1 0 -17 -1000 auditd
[ 1370] 0 1370 62271 1 0 0 0 rsyslogd
[ 1412] 32 1412 4744 1 0 0 0 rpcbind
[ 1430] 29 1430 5837 1 0 0 0 rpc.statd
[ 1543] 81 1543 5387 1 0 0 0 dbus-daemon
[ 1581] 0 1581 1020 0 0 0 0 acpid
[ 1590] 68 1590 9408 1 0 0 0 hald
[ 1591] 0 1591 5082 1 0 0 0 hald-runner
[ 1623] 0 1623 5612 1 0 0 0 hald-addon-inpu
[ 1634] 68 1634 4484 1 0 0 0 hald-addon-acpi
[ 1652] 0 1652 96433 1 0 0 0 automount
[ 1672] 0 1672 16656 0 0 -17 -1000 sshd
[ 1748] 0 1748 20326 1 0 0 0 master
[ 1757] 89 1757 20389 1 0 0 0 qmgr
[ 1772] 0 1772 27580 1 0 0 0 abrtd
[ 1780] 0 1780 29324 1 0 0 0 crond
[ 1791] 0 1791 5385 0 0 0 0 atd
[ 1804] 0 1804 15590 0 0 0 0 certmonger
[ 1817] 0 1817 1016 1 0 0 0 mingetty
[ 1819] 0 1819 1016 1 0 0 0 mingetty
[ 1821] 0 1821 19853 1 0 0 0 login
[ 1822] 0 1822 1016 1 0 0 0 mingetty
[ 1824] 0 1824 1016 1 0 0 0 mingetty
[ 1826] 0 1826 1016 1 0 0 0 mingetty
[ 1828] 0 1828 1016 1 0 0 0 mingetty
[ 2244] 0 2244 144390 1 0 0 0 console-kit-dae
[ 2310] 0 2310 27084 1 0 0 0 bash
[ 8491] 89 8491 20346 1 0 0 0 pickup
[ 8769] 0 8769 4903 1 0 0 0 lnetctl
Out of memory: Kill process 1078 (dhclient) score 1 or sacrifice child
Killed process 1078, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
[<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
[<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
[<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
[<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
[<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
[<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
[<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
[<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
[<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
[<ffffffff811a1571>] ? sys_poll+0x71/0x100
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:11 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:13236 slab_reclaimable:1286 slab_unreclaimable:8786
mapped:1 shmem:0 pagetables:709 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44588kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2836kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:320 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 649*4kB 267*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44620kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3557, delete 3557, find 4/9
Free swap = 4115592kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
48 pages shared
462611 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 354] 0 354 2733 0 0 -17 -1000 udevd
[ 630] 0 630 2732 0 0 -17 -1000 udevd
[ 1191] 0 1191 2280 1 0 0 0 dhclient
[ 1304] 0 1304 2280 1 0 0 0 dhclient
[ 1354] 0 1354 6910 1 0 -17 -1000 auditd
[ 1370] 0 1370 62271 1 0 0 0 rsyslogd
[ 1412] 32 1412 4744 1 0 0 0 rpcbind
[ 1430] 29 1430 5837 1 0 0 0 rpc.statd
[ 1543] 81 1543 5387 1 0 0 0 dbus-daemon
[ 1581] 0 1581 1020 0 0 0 0 acpid
[ 1590] 68 1590 9408 1 0 0 0 hald
[ 1591] 0 1591 5082 1 0 0 0 hald-runner
[ 1623] 0 1623 5612 1 0 0 0 hald-addon-inpu
[ 1634] 68 1634 4484 1 0 0 0 hald-addon-acpi
[ 1652] 0 1652 96433 1 0 0 0 automount
[ 1672] 0 1672 16656 0 0 -17 -1000 sshd
[ 1748] 0 1748 20326 1 0 0 0 master
[ 1757] 89 1757 20389 1 0 0 0 qmgr
[ 1772] 0 1772 27580 1 0 0 0 abrtd
[ 1780] 0 1780 29324 1 0 0 0 crond
[ 1791] 0 1791 5385 0 0 0 0 atd
[ 1804] 0 1804 15590 0 0 0 0 certmonger
[ 1817] 0 1817 1016 1 0 0 0 mingetty
[ 1819] 0 1819 1016 1 0 0 0 mingetty
[ 1821] 0 1821 19853 1 0 0 0 login
[ 1822] 0 1822 1016 1 0 0 0 mingetty
[ 1824] 0 1824 1016 1 0 0 0 mingetty
[ 1826] 0 1826 1016 1 0 0 0 mingetty
[ 1828] 0 1828 1016 1 0 0 0 mingetty
[ 2244] 0 2244 144390 1 0 0 0 console-kit-dae
[ 2310] 0 2310 27084 1 0 0 0 bash
[ 8491] 89 8491 20346 1 0 0 0 pickup
[ 8769] 0 8769 4903 1 0 0 0 lnetctl
Out of memory: Kill process 1191 (dhclient) score 1 or sacrifice child
Killed process 1191, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
[<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
[<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
[<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
[<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
[<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
[<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
[<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
[<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
[<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
[<ffffffff811a1571>] ? sys_poll+0x71/0x100
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:11 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:13250 slab_reclaimable:1286 slab_unreclaimable:8786
mapped:1 shmem:0 pagetables:709 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44644kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2836kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1520 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 655*4kB 271*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44676kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3565, delete 3565, find 6/12
Free swap = 4116088kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
47 pages shared
462597 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 354] 0 354 2733 0 0 -17 -1000 udevd
[ 630] 0 630 2732 0 0 -17 -1000 udevd
[ 1304] 0 1304 2280 1 0 0 0 dhclient
[ 1354] 0 1354 6910 1 0 -17 -1000 auditd
[ 1370] 0 1370 62271 1 0 0 0 rsyslogd
[ 1412] 32 1412 4744 1 0 0 0 rpcbind
[ 1430] 29 1430 5837 1 0 0 0 rpc.statd
[ 1543] 81 1543 5387 1 0 0 0 dbus-daemon
[ 1581] 0 1581 1020 0 0 0 0 acpid
[ 1590] 68 1590 9408 1 0 0 0 hald
[ 1591] 0 1591 5082 1 0 0 0 hald-runner
[ 1623] 0 1623 5612 1 0 0 0 hald-addon-inpu
[ 1634] 68 1634 4484 1 0 0 0 hald-addon-acpi
[ 1652] 0 1652 96433 1 0 0 0 automount
[ 1672] 0 1672 16656 0 0 -17 -1000 sshd
[ 1748] 0 1748 20326 1 0 0 0 master
[ 1757] 89 1757 20389 1 0 0 0 qmgr
[ 1772] 0 1772 27580 1 0 0 0 abrtd
[ 1780] 0 1780 29324 1 0 0 0 crond
[ 1791] 0 1791 5385 0 0 0 0 atd
[ 1804] 0 1804 15590 0 0 0 0 certmonger
[ 1817] 0 1817 1016 1 0 0 0 mingetty
[ 1819] 0 1819 1016 1 0 0 0 mingetty
[ 1821] 0 1821 19853 1 0 0 0 login
[ 1822] 0 1822 1016 1 0 0 0 mingetty
[ 1824] 0 1824 1016 1 0 0 0 mingetty
[ 1826] 0 1826 1016 1 0 0 0 mingetty
[ 1828] 0 1828 1016 1 0 0 0 mingetty
[ 2244] 0 2244 144390 1 0 0 0 console-kit-dae
[ 2310] 0 2310 27084 1 0 0 0 bash
[ 8491] 89 8491 20346 1 0 0 0 pickup
[ 8769] 0 8769 4903 1 0 0 0 lnetctl
Out of memory: Kill process 1304 (dhclient) score 1 or sacrifice child
Killed process 1304, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
[<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
[<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
[<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
[<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
[<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
[<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
[<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
[<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811a0d00>] ? pollwake+0x0/0x60
[<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
[<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
[<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
[<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
[<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
[<ffffffff811a1571>] ? sys_poll+0x71/0x100
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
active_anon:0 inactive_anon:0 isolated_anon:0
active_file:11 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:13270 slab_reclaimable:1286 slab_unreclaimable:8786
mapped:1 shmem:0 pagetables:692 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44724kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2768kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:62180 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 667*4kB 271*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44724kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3573, delete 3573, find 8/15
Free swap = 4116588kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
46 pages shared
462585 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 354] 0 354 2733 0 0 -17 -1000 udevd
[ 630] 0 630 2732 0 0 -17 -1000 udevd
[ 1354] 0 1354 6910 1 0 -17 -1000 auditd
[ 1370] 0 1370 62271 1 0 0 0 rsyslogd
[ 1412] 32 1412 4744 1 0 0 0 rpcbind
[ 1430] 29 1430 5837 1 0 0 0 rpc.statd
[ 1543] 81 1543 5387 1 0 0 0 dbus-daemon
[ 1581] 0 1581 1020 0 0 0 0 acpid
[ 1590] 68 1590 9408 1 0 0 0 hald
[ 1591] 0 1591 5082 1 0 0 0 hald-runner
[ 1623] 0 1623 5612 1 0 0 0 hald-addon-inpu
[ 1634] 68 1634 4484 1 0 0 0 hald-addon-acpi
[ 1652] 0 1652 96433 1 0 0 0 automount
[ 1672] 0 1672 16656 0 0 -17 -1000 sshd
[ 1748] 0 1748 20326 1 0 0 0 master
[ 1757] 89 1757 20389 1 0 0 0 qmgr
[ 1772] 0 1772 27580 1 0 0 0 abrtd
[ 1780] 0 1780 29324 1 0 0 0 crond
[ 1791] 0 1791 5385 0 0 0 0 atd
[ 1804] 0 1804 15590 0 0 0 0 certmonger
[ 1817] 0 1817 1016 1 0 0 0 mingetty
[ 1819] 0 1819 1016 1 0 0 0 mingetty
[ 1821] 0 1821 19853 1 0 0 0 login
[ 1822] 0 1822 1016 1 0 0 0 mingetty
[ 1824] 0 1824 1016 1 0 0 0 mingetty
[ 1826] 0 1826 1016 1 0 0 0 mingetty
[ 1828] 0 1828 1016 1 0 0 0 mingetty
[ 2244] 0 2244 144390 1 0 0 0 console-kit-dae
[ 2310] 0 2310 27084 1 0 0 0 bash
[ 8491] 89 8491 20346 1 0 0 0 pickup
[ 8769] 0 8769 4903 1 0 0 0 lnetctl
Out of memory: Kill process 1370 (rsyslogd) score 1 or sacrifice child
Killed process 1370, UID 0, (rsyslogd) total-vm:249084kB, anon-rss:0kB, file-rss:4kB
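A rough back-of-the-envelope check suggests why this value cannot fit. The Python sketch below multiplies the requested large_buffers count by the large pool's npages (256, from the routing output above) and compares the result with the "524284 pages RAM" and "43654 pages reserved" lines of the OOM report. It assumes 4 KiB pages and a single CPT, as on this VM, and ignores per-buffer descriptor overhead, so it is an estimate rather than LNet's own accounting; the helper names are illustrative only.

```python
# Rough footprint estimate for an LNet router-buffer pool.
# Assumptions: 4 KiB pages (x86_64), one CPT as on this VM, and per-buffer
# descriptor overhead ignored. Inputs are copied from the report above.

PAGE_KIB = 4  # assumed page size in KiB


def pool_pages(nbuffers: int, npages: int) -> int:
    """Pages consumed by a pool of nbuffers buffers of npages pages each."""
    return nbuffers * npages


def pages_to_mib(pages: int) -> int:
    """Convert a page count to whole MiB (rounded down)."""
    return pages * PAGE_KIB // 1024


large_npages = 256       # "large: npages: 256"
requested = 4096         # "lnetctl set large_buffers 4096"
ram_pages = 524284       # "524284 pages RAM"
reserved_pages = 43654   # "43654 pages reserved"

need = pool_pages(requested, large_npages)
usable = ram_pages - reserved_pages

print(f"large pool needs {need} pages ({pages_to_mib(need)} MiB)")
print(f"non-reserved RAM is {usable} pages ({pages_to_mib(usable)} MiB)")
print("request exceeds non-reserved RAM:", need > usable)
```

On these assumptions the request needs about 4096 MiB of pages against roughly 1877 MiB of non-reserved RAM, which is consistent with the repeated OOM kills above.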
|
| Comments |
| Comment by Sarah Liu [ 14/Jan/15 ] |
|
The system hit OOM again after I set large_buffers to valid values:
[root@eagle-54vm5 ~]# lnetctl routing show
routing:
    - cpt[0]:
        tiny:
            npages: 0
            nbuffers: 2048
            credits: 3072
            mincredits: 3072
        small:
            npages: 1
            nbuffers: 16384
            credits: 28672
            mincredits: 28672
        large:
            npages: 256
            nbuffers: 256
            credits: 256
            mincredits: 256
    - enable: 1
[root@eagle-54vm5 ~]# lnetctl set large_buffers 0
[root@eagle-54vm5 ~]# lnetctl routing show
routing:
    - cpt[0]:
        tiny:
            npages: 0
            nbuffers: 2048
            credits: 3072
            mincredits: 3072
        small:
            npages: 1
            nbuffers: 16384
            credits: 28672
            mincredits: 28672
        large:
            npages: 256
            nbuffers: 1024
            credits: 1024
            mincredits: 1024
    - enable: 1
[root@eagle-54vm5 ~]# lnetctl set large_buffers 257
[root@eagle-54vm5 ~]# lnetctl routing show
routing:
    - cpt[0]:
        tiny:
            npages: 0
            nbuffers: 2048
            credits: 3072
            mincredits: 3072
        small:
            npages: 1
            nbuffers: 16384
            credits: 28672
            mincredits: 28672
        large:
            npages: 256
            nbuffers: 257
            credits: 1024
            mincredits: 1024
    - enable: 1
[root@eagle-54vm5 ~]# lnetctl set large_buffers 0
master invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
master cpuset=/ mems_allowed=0
Pid: 1748, comm: master Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167cca>] ? alloc_pages_current+0xaa/0x110
[<ffffffff8111ff57>] ? __page_cache_alloc+0x87/0x90
[<ffffffff8111f93e>] ? find_get_page+0x1e/0xa0
[<ffffffff81120ef7>] ? filemap_fault+0x1a7/0x500
[<ffffffff8114a234>] ? __do_fault+0x54/0x530
[<ffffffff81069973>] ? dequeue_entity+0x113/0x2e0
[<ffffffff8114a807>] ? handle_pte_fault+0xf7/0xb00
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff811d27c6>] ? ep_poll+0x306/0x330
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 4
active_anon:213 inactive_anon:242 isolated_anon:0
active_file:0 inactive_file:20 isolated_file:0
unevictable:0 dirty:0 writeback:4 unstable:0
free:13041 slab_reclaimable:1396 slab_unreclaimable:8898
mapped:0 shmem:8 pagetables:720 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:140kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:43808kB min:44720kB low:55900kB high:67080kB active_anon:852kB inactive_anon:968kB active_file:0kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:16kB mapped:0kB shmem:32kB slab_reclaimable:5584kB slab_unreclaimable:35452kB kernel_stack:1288kB pagetables:2880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:73080 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 344*4kB 154*8kB 61*16kB 45*32kB 28*64kB 21*128kB 16*256kB 11*512kB 8*1024kB 4*2048kB 2*4096kB = 43808kB
271 total pagecache pages
252 pages in swap cache
Swap cache stats: add 3333, delete 3081, find 0/0
Free swap = 4115432kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
50 pages shared
462857 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 353] 0 353 2733 0 0 -17 -1000 udevd
[ 965] 0 965 2280 1 0 0 0 dhclient
[ 1078] 0 1078 2280 1 0 0 0 dhclient
[ 1191] 0 1191 2280 1 0 0 0 dhclient
[ 1304] 0 1304 2280 1 0 0 0 dhclient
[ 1354] 0 1354 6910 24 0 -17 -1000 auditd
[ 1370] 0 1370 62272 1 0 0 0 rsyslogd
[ 1412] 32 1412 4744 1 0 0 0 rpcbind
[ 1430] 29 1430 5837 1 0 0 0 rpc.statd
[ 1543] 81 1543 5387 4 0 0 0 dbus-daemon
[ 1581] 0 1581 1020 0 0 0 0 acpid
[ 1590] 68 1590 9408 1 0 0 0 hald
[ 1591] 0 1591 5082 1 0 0 0 hald-runner
[ 1623] 0 1623 5612 1 0 0 0 hald-addon-inpu
[ 1637] 68 1637 4484 1 0 0 0 hald-addon-acpi
[ 1652] 0 1652 96435 1 0 0 0 automount
[ 1672] 0 1672 16656 0 0 -17 -1000 sshd
[ 1748] 0 1748 20326 1 0 0 0 master
[ 1756] 89 1756 20346 1 0 0 0 pickup
[ 1757] 89 1757 20389 2 0 0 0 qmgr
[ 1772] 0 1772 27580 1 0 0 0 abrtd
[ 1780] 0 1780 29325 9 0 0 0 crond
[ 1791] 0 1791 5385 0 0 0 0 atd
[ 1805] 0 1805 15590 0 0 0 0 certmonger
[ 1817] 0 1817 1016 1 0 0 0 mingetty
[ 1819] 0 1819 19853 12 0 0 0 login
[ 1820] 0 1820 1016 1 0 0 0 mingetty
[ 1822] 0 1822 1016 1 0 0 0 mingetty
[ 1824] 0 1824 1016 1 0 0 0 mingetty
[ 1826] 0 1826 2732 1 0 -17 -1000 udevd
[ 1827] 0 1827 2732 0 0 -17 -1000 udevd
[ 1828] 0 1828 1016 1 0 0 0 mingetty
[ 1830] 0 1830 1016 1 0 0 0 mingetty
[ 1848] 0 1848 144390 48 0 0 0 console-kit-dae
[ 1914] 0 1914 27084 90 0 0 0 bash
[ 2051] 0 2051 4903 31 0 0 0 lnetctl
Out of memory: Kill process 965 (dhclient) score 1 or sacrifice child
Killed process 965, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
master invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
master cpuset=/ mems_allowed=0
Pid: 1748, comm: master Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167cca>] ? alloc_pages_current+0xaa/0x110
[<ffffffff8111ff57>] ? __page_cache_alloc+0x87/0x90
[<ffffffff8111f93e>] ? find_get_page+0x1e/0xa0
[<ffffffff81120ef7>] ? filemap_fault+0x1a7/0x500
[<ffffffff8114a234>] ? __do_fault+0x54/0x530
[<ffffffff81069973>] ? dequeue_entity+0x113/0x2e0
[<ffffffff8114a807>] ? handle_pte_fault+0xf7/0xb00
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff811d27c6>] ? ep_poll+0x306/0x330
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
active_anon:85 inactive_anon:118 isolated_anon:0
active_file:11 inactive_file:0 isolated_file:0
unevictable:0 dirty:0 writeback:4 unstable:0
free:13053 slab_reclaimable:1396 slab_unreclaimable:8898
mapped:0 shmem:8 pagetables:720 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:140kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:43856kB min:44720kB low:55900kB high:67080kB active_anon:340kB inactive_anon:472kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:16kB mapped:0kB shmem:32kB slab_reclaimable:5584kB slab_unreclaimable:35452kB kernel_stack:1288kB pagetables:2880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2724 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 326*4kB 151*8kB 56*16kB 46*32kB 29*64kB 20*128kB 17*256kB 11*512kB 8*1024kB 6*2048kB 1*4096kB = 43856kB
143 total pagecache pages
128 pages in swap cache
Swap cache stats: add 3469, delete 3341, find 2/5
Free swap = 4115416kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
51 pages shared
462846 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 353] 0 353 2733 0 0 -17 -1000 udevd
[ 1078] 0 1078 2280 1 0 0 0 dhclient
[ 1191] 0 1191 2280 1 0 0 0 dhclient
[ 1304] 0 1304 2280 1 0 0 0 dhclient
[ 1354] 0 1354 6910 24 0 -17 -1000 auditd
[ 1370] 0 1370 62272 1 0 0 0 rsyslogd
[ 1412] 32 1412 4744 1 0 0 0 rpcbind
[ 1430] 29 1430 5837 1 0 0 0 rpc.statd
[ 1543] 81 1543 5387 1 0 0 0 dbus-daemon
[ 1581] 0 1581 1020 0 0 0 0 acpid
[ 1590] 68 1590 9408 1 0 0 0 hald
[ 1591] 0 1591 5082 1 0 0 0 hald-runner
[ 1623] 0 1623 5612 1 0 0 0 hald-addon-inpu
[ 1637] 68 1637 4484 1 0 0 0 hald-addon-acpi
[ 1652] 0 1652 96435 1 0 0 0 automount
[ 1672] 0 1672 16656 0 0 -17 -1000 sshd
[ 1748] 0 1748 20326 1 0 0 0 master
[ 1756] 89 1756 20346 1 0 0 0 pickup
[ 1757] 89 1757 20389 1 0 0 0 qmgr
[ 1772] 0 1772 27580 1 0 0 0 abrtd
[ 1780] 0 1780 29325 9 0 0 0 crond
[ 1791] 0 1791 5385 0 0 0 0 atd
[ 1805] 0 1805 15590 0 0 0 0 certmonger
[ 1817] 0 1817 1016 1 0 0 0 mingetty
[ 1819] 0 1819 19853 1 0 0 0 login
[ 1820] 0 1820 1016 1 0 0 0 mingetty
[ 1822] 0 1822 1016 1 0 0 0 mingetty
[ 1824] 0 1824 1016 1 0 0 0 mingetty
[ 1826] 0 1826 2732 0 0 -17 -1000 udevd
[ 1827] 0 1827 2732 0 0 -17 -1000 udevd
[ 1828] 0 1828 1016 1 0 0 0 mingetty
[ 1830] 0 1830 1016 1 0 0 0 mingetty
[ 1848] 0 1848 144390 1 0 0 0 console-kit-dae
[ 1914] 0 1914 27084 25 0 0 0 bash
[ 2051] 0 2051 4903 31 0 0 0 lnetctl
Out of memory: Kill process 1078 (dhclient) score 1 or sacrifice child
Killed process 1078, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
master invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
master cpuset=/ mems_allowed=0
Pid: 1748, comm: master Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
[<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b60>] ? dump_header+0x90/0x1b0
[<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
[<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167cca>] ? alloc_pages_current+0xaa/0x110
[<ffffffff8111ff57>] ? __page_cache_alloc+0x87/0x90
[<ffffffff8111f93e>] ? find_get_page+0x1e/0xa0
[<ffffffff81120ef7>] ? filemap_fault+0x1a7/0x500
[<ffffffff8114a234>] ? __do_fault+0x54/0x530
[<ffffffff81069973>] ? dequeue_entity+0x113/0x2e0
[<ffffffff8114a807>] ? handle_pte_fault+0xf7/0xb00
[<ffffffff815296ee>] ? thread_return+0x4e/0x770
[<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
[<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
[<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
[<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff811d27c6>] ? ep_poll+0x306/0x330
[<ffffffff81061d00>] ? default_wake_function+0x0/0x20
[<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c5f5>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
|
| Comment by Amir Shehata (Inactive) [ 17/Jan/15 ] |
|
There are two issues here: 1) why we're running out of memory, and 2) why we're crashing. I believe there is an issue that causes LNet to consume more memory than it needs to.
The way it works is that buffer pools are allocated and put on a list. When a pool is adjusted upward, more buffers are allocated. When it is adjusted downward, only the buffer count is lowered; the surplus buffers are freed only when they are used and returned to the pool. If the system is idle, which I believe is the case in this test, and you increase the number of buffers, more buffers are allocated but none are in use. When the pool is then decreased, only the count goes down and the buffers remain allocated on the linked list. When it is increased again, still more buffers are allocated even though unused buffers are already on the list, so more memory is used than needed.
This could be the culprit in the OOM case. As for 2), the cause of the crash needs more investigation. |
| Comment by Amir Shehata (Inactive) [ 21/Jan/15 ] |
|
I believe the crash is due to TEI-2286. That leaves the other part of the issue (the excess memory consumption), which I'm addressing. However, I believe this issue can drop in priority if need be. |
| Comment by Gerrit Updater [ 23/Jan/15 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/13519 |
| Comment by Gerrit Updater [ 18/Aug/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13519/ |
| Comment by Peter Jones [ 18/Aug/15 ] |
|
Landed for 2.8 |