[LU-6122] DLC: system crash when setting a too large value for large_buffers
Created: 14/Jan/15  Updated: 18/Aug/15  Resolved: 18/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Critical
Reporter: Sarah Liu
Assignee: Amir Shehata (Inactive)
Resolution: Fixed
Votes: 0
Labels: HB
Environment: lustre-master build # 2808
Severity: 3
Rank (Obsolete): 17055

 Description   

According to the DLC test plan, setting large_buffers to a "too large" value should NOT crash the system. However, I hit the crash below when I tried a "too large" value:

[root@eagle-54vm5 modprobe.d]# lnetctl routing show
routing:
    - cpt[0]:
          tiny:
              npages: 0
              nbuffers: 4096
              credits: 4096
              mincredits: 4096
          small:
              npages: 1
              nbuffers: 4096
              credits: 16384
              mincredits: 16384
          large:
              npages: 256
              nbuffers: 1024
              credits: 1024
              mincredits: 1024
    - enable: 1
[root@eagle-54vm5 modprobe.d]# lnetctl set large_buffers 4096

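As a rough sanity check on why this value exhausts memory, the sketch below multiplies the requested buffer count by the npages value reported for the large pool and compares it with the "pages RAM" figure from the Mem-Info dump. It assumes 4 KiB pages (typical for x86_64, not stated in this report) and that the requested count applies to the single CPT shown; the function and variable names are illustrative only and are not part of lnetctl or LNet.

```python
# Back-of-the-envelope estimate of the memory the large router buffer pool needs.
# Assumption: 4 KiB pages (x86_64 default); the report does not state the page size.
PAGE_SIZE_KIB = 4


def pool_size_kib(nbuffers, npages, page_kib=PAGE_SIZE_KIB):
    """Return the pool size in KiB for nbuffers buffers of npages pages each."""
    return nbuffers * npages * page_kib


# Values taken from the report: large pool npages = 256, nbuffers 1024 -> 4096.
current_kib = pool_size_kib(1024, 256)
requested_kib = pool_size_kib(4096, 256)

# "524284 pages RAM" from the Mem-Info section of the OOM dump.
ram_kib = 524284 * PAGE_SIZE_KIB

print(f"current large pool:   {current_kib} KiB")
print(f"requested large pool: {requested_kib} KiB")
print(f"node RAM:             {ram_kib} KiB")
print(f"request exceeds RAM:  {requested_kib > ram_kib}")
```

Under those assumptions the request works out to about 4 GiB of large buffers on a node that reports about 2 GiB of RAM, which is consistent with the OOM-killer output that follows.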
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
 [<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
 [<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
 [<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
 [<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
 [<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
 [<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
 [<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
 [<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
 [<ffffffff811a1571>] ? sys_poll+0x71/0x100
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:11 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:13209 slab_reclaimable:1286 slab_unreclaimable:8786
 mapped:1 shmem:0 pagetables:725 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44480kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2900kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:72820 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 624*4kB 267*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44520kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3541, delete 3541, find 0/1
Free swap  = 4114600kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
50 pages shared
462636 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  354]     0   354     2733        0   0     -17         -1000 udevd
[  630]     0   630     2732        0   0     -17         -1000 udevd
[  965]     0   965     2280        1   0       0             0 dhclient
[ 1078]     0  1078     2280        1   0       0             0 dhclient
[ 1191]     0  1191     2280        1   0       0             0 dhclient
[ 1304]     0  1304     2280        1   0       0             0 dhclient
[ 1354]     0  1354     6910        1   0     -17         -1000 auditd
[ 1370]     0  1370    62271        1   0       0             0 rsyslogd
[ 1412]    32  1412     4744        1   0       0             0 rpcbind
[ 1430]    29  1430     5837        1   0       0             0 rpc.statd
[ 1543]    81  1543     5387        1   0       0             0 dbus-daemon
[ 1581]     0  1581     1020        0   0       0             0 acpid
[ 1590]    68  1590     9408        1   0       0             0 hald
[ 1591]     0  1591     5082        1   0       0             0 hald-runner
[ 1623]     0  1623     5612        1   0       0             0 hald-addon-inpu
[ 1634]    68  1634     4484        1   0       0             0 hald-addon-acpi
[ 1652]     0  1652    96433        1   0       0             0 automount
[ 1672]     0  1672    16656        0   0     -17         -1000 sshd
[ 1748]     0  1748    20326        1   0       0             0 master
[ 1757]    89  1757    20389        1   0       0             0 qmgr
[ 1772]     0  1772    27580        1   0       0             0 abrtd
[ 1780]     0  1780    29324        1   0       0             0 crond
[ 1791]     0  1791     5385        0   0       0             0 atd
[ 1804]     0  1804    15590        0   0       0             0 certmonger
[ 1817]     0  1817     1016        1   0       0             0 mingetty
[ 1819]     0  1819     1016        1   0       0             0 mingetty
[ 1821]     0  1821    19853        1   0       0             0 login
[ 1822]     0  1822     1016        1   0       0             0 mingetty
[ 1824]     0  1824     1016        1   0       0             0 mingetty
[ 1826]     0  1826     1016        1   0       0             0 mingetty
[ 1828]     0  1828     1016        1   0       0             0 mingetty
[ 2244]     0  2244   144390        1   0       0             0 console-kit-dae
[ 2310]     0  2310    27084        1   0       0             0 bash
[ 8491]    89  8491    20346        1   0       0             0 pickup
[ 8769]     0  8769     4903        1   0       0             0 lnetctl
Out of memory: Kill process 965 (dhclient) score 1 or sacrifice child
Killed process 965, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
 [<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
 [<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
 [<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
 [<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
 [<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
 [<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
 [<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
 [<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
 [<ffffffff811a1571>] ? sys_poll+0x71/0x100
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:0 inactive_file:20 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:13223 slab_reclaimable:1286 slab_unreclaimable:8786
 mapped:1 shmem:0 pagetables:725 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44536kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2900kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:220 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 636*4kB 267*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44568kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3549, delete 3549, find 2/6
Free swap  = 4115092kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
49 pages shared
462624 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  354]     0   354     2733        0   0     -17         -1000 udevd
[  630]     0   630     2732        0   0     -17         -1000 udevd
[ 1078]     0  1078     2280        1   0       0             0 dhclient
[ 1191]     0  1191     2280        1   0       0             0 dhclient
[ 1304]     0  1304     2280        1   0       0             0 dhclient
[ 1354]     0  1354     6910        1   0     -17         -1000 auditd
[ 1370]     0  1370    62271        1   0       0             0 rsyslogd
[ 1412]    32  1412     4744        1   0       0             0 rpcbind
[ 1430]    29  1430     5837        1   0       0             0 rpc.statd
[ 1543]    81  1543     5387        1   0       0             0 dbus-daemon
[ 1581]     0  1581     1020        0   0       0             0 acpid
[ 1590]    68  1590     9408        1   0       0             0 hald
[ 1591]     0  1591     5082        1   0       0             0 hald-runner
[ 1623]     0  1623     5612        1   0       0             0 hald-addon-inpu
[ 1634]    68  1634     4484        1   0       0             0 hald-addon-acpi
[ 1652]     0  1652    96433        1   0       0             0 automount
[ 1672]     0  1672    16656        0   0     -17         -1000 sshd
[ 1748]     0  1748    20326        1   0       0             0 master
[ 1757]    89  1757    20389        1   0       0             0 qmgr
[ 1772]     0  1772    27580        1   0       0             0 abrtd
[ 1780]     0  1780    29324        1   0       0             0 crond
[ 1791]     0  1791     5385        0   0       0             0 atd
[ 1804]     0  1804    15590        0   0       0             0 certmonger
[ 1817]     0  1817     1016        1   0       0             0 mingetty
[ 1819]     0  1819     1016        1   0       0             0 mingetty
[ 1821]     0  1821    19853        1   0       0             0 login
[ 1822]     0  1822     1016        1   0       0             0 mingetty
[ 1824]     0  1824     1016        1   0       0             0 mingetty
[ 1826]     0  1826     1016        1   0       0             0 mingetty
[ 1828]     0  1828     1016        1   0       0             0 mingetty
[ 2244]     0  2244   144390        1   0       0             0 console-kit-dae
[ 2310]     0  2310    27084        1   0       0             0 bash
[ 8491]    89  8491    20346        1   0       0             0 pickup
[ 8769]     0  8769     4903        1   0       0             0 lnetctl
Out of memory: Kill process 1078 (dhclient) score 1 or sacrifice child
Killed process 1078, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
 [<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
 [<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
 [<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
 [<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
 [<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
 [<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
 [<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
 [<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
 [<ffffffff811a1571>] ? sys_poll+0x71/0x100
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:11 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:13236 slab_reclaimable:1286 slab_unreclaimable:8786
 mapped:1 shmem:0 pagetables:709 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44588kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2836kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:320 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 649*4kB 267*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44620kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3557, delete 3557, find 4/9
Free swap  = 4115592kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
48 pages shared
462611 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  354]     0   354     2733        0   0     -17         -1000 udevd
[  630]     0   630     2732        0   0     -17         -1000 udevd
[ 1191]     0  1191     2280        1   0       0             0 dhclient
[ 1304]     0  1304     2280        1   0       0             0 dhclient
[ 1354]     0  1354     6910        1   0     -17         -1000 auditd
[ 1370]     0  1370    62271        1   0       0             0 rsyslogd
[ 1412]    32  1412     4744        1   0       0             0 rpcbind
[ 1430]    29  1430     5837        1   0       0             0 rpc.statd
[ 1543]    81  1543     5387        1   0       0             0 dbus-daemon
[ 1581]     0  1581     1020        0   0       0             0 acpid
[ 1590]    68  1590     9408        1   0       0             0 hald
[ 1591]     0  1591     5082        1   0       0             0 hald-runner
[ 1623]     0  1623     5612        1   0       0             0 hald-addon-inpu
[ 1634]    68  1634     4484        1   0       0             0 hald-addon-acpi
[ 1652]     0  1652    96433        1   0       0             0 automount
[ 1672]     0  1672    16656        0   0     -17         -1000 sshd
[ 1748]     0  1748    20326        1   0       0             0 master
[ 1757]    89  1757    20389        1   0       0             0 qmgr
[ 1772]     0  1772    27580        1   0       0             0 abrtd
[ 1780]     0  1780    29324        1   0       0             0 crond
[ 1791]     0  1791     5385        0   0       0             0 atd
[ 1804]     0  1804    15590        0   0       0             0 certmonger
[ 1817]     0  1817     1016        1   0       0             0 mingetty
[ 1819]     0  1819     1016        1   0       0             0 mingetty
[ 1821]     0  1821    19853        1   0       0             0 login
[ 1822]     0  1822     1016        1   0       0             0 mingetty
[ 1824]     0  1824     1016        1   0       0             0 mingetty
[ 1826]     0  1826     1016        1   0       0             0 mingetty
[ 1828]     0  1828     1016        1   0       0             0 mingetty
[ 2244]     0  2244   144390        1   0       0             0 console-kit-dae
[ 2310]     0  2310    27084        1   0       0             0 bash
[ 8491]    89  8491    20346        1   0       0             0 pickup
[ 8769]     0  8769     4903        1   0       0             0 lnetctl
Out of memory: Kill process 1191 (dhclient) score 1 or sacrifice child
Killed process 1191, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
 [<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
 [<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
 [<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
 [<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
 [<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
 [<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
 [<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
 [<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
 [<ffffffff811a1571>] ? sys_poll+0x71/0x100
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:11 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:13250 slab_reclaimable:1286 slab_unreclaimable:8786
 mapped:1 shmem:0 pagetables:709 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44644kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2836kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1520 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 655*4kB 271*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44676kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3565, delete 3565, find 6/12
Free swap  = 4116088kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
47 pages shared
462597 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  354]     0   354     2733        0   0     -17         -1000 udevd
[  630]     0   630     2732        0   0     -17         -1000 udevd
[ 1304]     0  1304     2280        1   0       0             0 dhclient
[ 1354]     0  1354     6910        1   0     -17         -1000 auditd
[ 1370]     0  1370    62271        1   0       0             0 rsyslogd
[ 1412]    32  1412     4744        1   0       0             0 rpcbind
[ 1430]    29  1430     5837        1   0       0             0 rpc.statd
[ 1543]    81  1543     5387        1   0       0             0 dbus-daemon
[ 1581]     0  1581     1020        0   0       0             0 acpid
[ 1590]    68  1590     9408        1   0       0             0 hald
[ 1591]     0  1591     5082        1   0       0             0 hald-runner
[ 1623]     0  1623     5612        1   0       0             0 hald-addon-inpu
[ 1634]    68  1634     4484        1   0       0             0 hald-addon-acpi
[ 1652]     0  1652    96433        1   0       0             0 automount
[ 1672]     0  1672    16656        0   0     -17         -1000 sshd
[ 1748]     0  1748    20326        1   0       0             0 master
[ 1757]    89  1757    20389        1   0       0             0 qmgr
[ 1772]     0  1772    27580        1   0       0             0 abrtd
[ 1780]     0  1780    29324        1   0       0             0 crond
[ 1791]     0  1791     5385        0   0       0             0 atd
[ 1804]     0  1804    15590        0   0       0             0 certmonger
[ 1817]     0  1817     1016        1   0       0             0 mingetty
[ 1819]     0  1819     1016        1   0       0             0 mingetty
[ 1821]     0  1821    19853        1   0       0             0 login
[ 1822]     0  1822     1016        1   0       0             0 mingetty
[ 1824]     0  1824     1016        1   0       0             0 mingetty
[ 1826]     0  1826     1016        1   0       0             0 mingetty
[ 1828]     0  1828     1016        1   0       0             0 mingetty
[ 2244]     0  2244   144390        1   0       0             0 console-kit-dae
[ 2310]     0  2310    27084        1   0       0             0 bash
[ 8491]    89  8491    20346        1   0       0             0 pickup
[ 8769]     0  8769     4903        1   0       0             0 lnetctl
Out of memory: Kill process 1304 (dhclient) score 1 or sacrifice child
Killed process 1304, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
rpcbind invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
rpcbind cpuset=/ mems_allowed=0
Pid: 1412, comm: rpcbind Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167dca>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8115ba52>] ? read_swap_cache_async+0xf2/0x160
 [<ffffffff8115c579>] ? valid_swaphandles+0x69/0x150
 [<ffffffff8115bb47>] ? swapin_readahead+0x87/0xc0
 [<ffffffff8114aded>] ? handle_pte_fault+0x6dd/0xb00
 [<ffffffffa013c675>] ? inet6_fill_link_af+0x25/0x30 [ipv6]
 [<ffffffff8146e4d6>] ? rtnl_fill_ifinfo+0x946/0xcb0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
 [<ffffffff811a11a9>] ? do_sys_poll+0x349/0x520
 [<ffffffff811a1191>] ? do_sys_poll+0x331/0x520
 [<ffffffff811a0c10>] ? __pollwait+0x0/0xf0
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811a0d00>] ? pollwake+0x0/0x60
 [<ffffffff811cdd52>] ? fsnotify_clear_marks_by_inode+0x32/0xf0
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
 [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff811a0ac5>] ? poll_select_set_timeout+0x95/0xb0
 [<ffffffff811a1571>] ? sys_poll+0x71/0x100
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:0 inactive_anon:0 isolated_anon:0
 active_file:11 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:13270 slab_reclaimable:1286 slab_unreclaimable:8786
 mapped:1 shmem:0 pagetables:692 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:136kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:44724kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:5144kB slab_unreclaimable:35008kB kernel_stack:1280kB pagetables:2768kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:62180 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 667*4kB 271*8kB 115*16kB 45*32kB 22*64kB 13*128kB 15*256kB 6*512kB 4*1024kB 7*2048kB 2*4096kB = 44724kB
20 total pagecache pages
0 pages in swap cache
Swap cache stats: add 3573, delete 3573, find 8/15
Free swap  = 4116588kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
46 pages shared
462585 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  354]     0   354     2733        0   0     -17         -1000 udevd
[  630]     0   630     2732        0   0     -17         -1000 udevd
[ 1354]     0  1354     6910        1   0     -17         -1000 auditd
[ 1370]     0  1370    62271        1   0       0             0 rsyslogd
[ 1412]    32  1412     4744        1   0       0             0 rpcbind
[ 1430]    29  1430     5837        1   0       0             0 rpc.statd
[ 1543]    81  1543     5387        1   0       0             0 dbus-daemon
[ 1581]     0  1581     1020        0   0       0             0 acpid
[ 1590]    68  1590     9408        1   0       0             0 hald
[ 1591]     0  1591     5082        1   0       0             0 hald-runner
[ 1623]     0  1623     5612        1   0       0             0 hald-addon-inpu
[ 1634]    68  1634     4484        1   0       0             0 hald-addon-acpi
[ 1652]     0  1652    96433        1   0       0             0 automount
[ 1672]     0  1672    16656        0   0     -17         -1000 sshd
[ 1748]     0  1748    20326        1   0       0             0 master
[ 1757]    89  1757    20389        1   0       0             0 qmgr
[ 1772]     0  1772    27580        1   0       0             0 abrtd
[ 1780]     0  1780    29324        1   0       0             0 crond
[ 1791]     0  1791     5385        0   0       0             0 atd
[ 1804]     0  1804    15590        0   0       0             0 certmonger
[ 1817]     0  1817     1016        1   0       0             0 mingetty
[ 1819]     0  1819     1016        1   0       0             0 mingetty
[ 1821]     0  1821    19853        1   0       0             0 login
[ 1822]     0  1822     1016        1   0       0             0 mingetty
[ 1824]     0  1824     1016        1   0       0             0 mingetty
[ 1826]     0  1826     1016        1   0       0             0 mingetty
[ 1828]     0  1828     1016        1   0       0             0 mingetty
[ 2244]     0  2244   144390        1   0       0             0 console-kit-dae
[ 2310]     0  2310    27084        1   0       0             0 bash
[ 8491]    89  8491    20346        1   0       0             0 pickup
[ 8769]     0  8769     4903        1   0       0             0 lnetctl
Out of memory: Kill process 1370 (rsyslogd) score 1 or sacrifice child
Killed process 1370, UID 0, (rsyslogd) total-vm:249084kB, anon-rss:0kB, file-rss:4kB


 Comments   
Comment by Sarah Liu [ 14/Jan/15 ]

The system hit OOM again after setting large_buffers to valid values:
1. set large_buffers 0 # pass
2. set large_buffers 257 # pass
3. set large_buffers 0 # OOM
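
For scale, the figures in the output below can be combined into a rough estimate of what a full large-buffer pool costs on this node. This is a back-of-the-envelope sketch, not taken from the LNet source: it assumes a 4 KiB page size (consistent with the page and kB counts in the Mem-Info dump) and a single CPT, and the helper name is made up for illustration.

```python
PAGE_KIB = 4  # assumed page size; consistent with free:13270 pages ~ 53080 kB in Mem-Info


def pool_mib(npages: int, nbuffers: int, page_kib: int = PAGE_KIB) -> float:
    """Memory held by one router buffer pool, in MiB (hypothetical helper)."""
    return npages * nbuffers * page_kib / 1024


# "524284 pages RAM" from the OOM report, converted to MiB
ram_mib = 524284 * PAGE_KIB / 1024

# nbuffers values seen for the large pool (npages: 256) in this comment
for nbuffers in (256, 257, 1024):
    mib = pool_mib(256, nbuffers)
    print(f"large pool, {nbuffers} buffers: {mib:.0f} MiB of {ram_mib:.0f} MiB RAM")
```

Under these assumptions each large buffer is 1 MiB, so the 1024-buffer pool seen after `set large_buffers 0` is about half of the node's roughly 2 GiB of RAM.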

[root@eagle-54vm5 ~]# lnetctl routing show
routing:
    - cpt[0]:
          tiny:
              npages: 0
              nbuffers: 2048
              credits: 3072
              mincredits: 3072
          small:
              npages: 1
              nbuffers: 16384
              credits: 28672
              mincredits: 28672
          large:
              npages: 256
              nbuffers: 256
              credits: 256
              mincredits: 256
    - enable: 1
[root@eagle-54vm5 ~]# lnetctl set large_buffers 0
[root@eagle-54vm5 ~]# lnetctl routing show
routing:
    - cpt[0]:
          tiny:
              npages: 0
              nbuffers: 2048
              credits: 3072
              mincredits: 3072
          small:
              npages: 1
              nbuffers: 16384
              credits: 28672
              mincredits: 28672
          large:
              npages: 256
              nbuffers: 1024
              credits: 1024
              mincredits: 1024
    - enable: 1
[root@eagle-54vm5 ~]# lnetctl set large_buffers 257
[root@eagle-54vm5 ~]# lnetctl routing show
routing:
    - cpt[0]:
          tiny:
              npages: 0
              nbuffers: 2048
              credits: 3072
              mincredits: 3072
          small:
              npages: 1
              nbuffers: 16384
              credits: 28672
              mincredits: 28672
          large:
              npages: 256
              nbuffers: 257
              credits: 1024
              mincredits: 1024
    - enable: 1
[root@eagle-54vm5 ~]# lnetctl set large_buffers 0
master invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
master cpuset=/ mems_allowed=0
Pid: 1748, comm: master Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167cca>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8111ff57>] ? __page_cache_alloc+0x87/0x90
 [<ffffffff8111f93e>] ? find_get_page+0x1e/0xa0
 [<ffffffff81120ef7>] ? filemap_fault+0x1a7/0x500
 [<ffffffff8114a234>] ? __do_fault+0x54/0x530
 [<ffffffff81069973>] ? dequeue_entity+0x113/0x2e0
 [<ffffffff8114a807>] ? handle_pte_fault+0xf7/0xb00
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff811d27c6>] ? ep_poll+0x306/0x330
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   4
active_anon:213 inactive_anon:242 isolated_anon:0
 active_file:0 inactive_file:20 isolated_file:0
 unevictable:0 dirty:0 writeback:4 unstable:0
 free:13041 slab_reclaimable:1396 slab_unreclaimable:8898
 mapped:0 shmem:8 pagetables:720 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:140kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:43808kB min:44720kB low:55900kB high:67080kB active_anon:852kB inactive_anon:968kB active_file:0kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:16kB mapped:0kB shmem:32kB slab_reclaimable:5584kB slab_unreclaimable:35452kB kernel_stack:1288kB pagetables:2880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:73080 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 344*4kB 154*8kB 61*16kB 45*32kB 28*64kB 21*128kB 16*256kB 11*512kB 8*1024kB 4*2048kB 2*4096kB = 43808kB
271 total pagecache pages
252 pages in swap cache
Swap cache stats: add 3333, delete 3081, find 0/0
Free swap  = 4115432kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
50 pages shared
462857 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  353]     0   353     2733        0   0     -17         -1000 udevd
[  965]     0   965     2280        1   0       0             0 dhclient
[ 1078]     0  1078     2280        1   0       0             0 dhclient
[ 1191]     0  1191     2280        1   0       0             0 dhclient
[ 1304]     0  1304     2280        1   0       0             0 dhclient
[ 1354]     0  1354     6910       24   0     -17         -1000 auditd
[ 1370]     0  1370    62272        1   0       0             0 rsyslogd
[ 1412]    32  1412     4744        1   0       0             0 rpcbind
[ 1430]    29  1430     5837        1   0       0             0 rpc.statd
[ 1543]    81  1543     5387        4   0       0             0 dbus-daemon
[ 1581]     0  1581     1020        0   0       0             0 acpid
[ 1590]    68  1590     9408        1   0       0             0 hald
[ 1591]     0  1591     5082        1   0       0             0 hald-runner
[ 1623]     0  1623     5612        1   0       0             0 hald-addon-inpu
[ 1637]    68  1637     4484        1   0       0             0 hald-addon-acpi
[ 1652]     0  1652    96435        1   0       0             0 automount
[ 1672]     0  1672    16656        0   0     -17         -1000 sshd
[ 1748]     0  1748    20326        1   0       0             0 master
[ 1756]    89  1756    20346        1   0       0             0 pickup
[ 1757]    89  1757    20389        2   0       0             0 qmgr
[ 1772]     0  1772    27580        1   0       0             0 abrtd
[ 1780]     0  1780    29325        9   0       0             0 crond
[ 1791]     0  1791     5385        0   0       0             0 atd
[ 1805]     0  1805    15590        0   0       0             0 certmonger
[ 1817]     0  1817     1016        1   0       0             0 mingetty
[ 1819]     0  1819    19853       12   0       0             0 login
[ 1820]     0  1820     1016        1   0       0             0 mingetty
[ 1822]     0  1822     1016        1   0       0             0 mingetty
[ 1824]     0  1824     1016        1   0       0             0 mingetty
[ 1826]     0  1826     2732        1   0     -17         -1000 udevd
[ 1827]     0  1827     2732        0   0     -17         -1000 udevd
[ 1828]     0  1828     1016        1   0       0             0 mingetty
[ 1830]     0  1830     1016        1   0       0             0 mingetty
[ 1848]     0  1848   144390       48   0       0             0 console-kit-dae
[ 1914]     0  1914    27084       90   0       0             0 bash
[ 2051]     0  2051     4903       31   0       0             0 lnetctl
Out of memory: Kill process 965 (dhclient) score 1 or sacrifice child
Killed process 965, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
master invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
master cpuset=/ mems_allowed=0
Pid: 1748, comm: master Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167cca>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8111ff57>] ? __page_cache_alloc+0x87/0x90
 [<ffffffff8111f93e>] ? find_get_page+0x1e/0xa0
 [<ffffffff81120ef7>] ? filemap_fault+0x1a7/0x500
 [<ffffffff8114a234>] ? __do_fault+0x54/0x530
 [<ffffffff81069973>] ? dequeue_entity+0x113/0x2e0
 [<ffffffff8114a807>] ? handle_pte_fault+0xf7/0xb00
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff811d27c6>] ? ep_poll+0x306/0x330
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:85 inactive_anon:118 isolated_anon:0
 active_file:11 inactive_file:0 isolated_file:0
 unevictable:0 dirty:0 writeback:4 unstable:0
 free:13053 slab_reclaimable:1396 slab_unreclaimable:8898
 mapped:0 shmem:8 pagetables:720 bounce:0
Node 0 DMA free:8356kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:140kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2004 2004 2004
Node 0 DMA32 free:43856kB min:44720kB low:55900kB high:67080kB active_anon:340kB inactive_anon:472kB active_file:44kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:16kB mapped:0kB shmem:32kB slab_reclaimable:5584kB slab_unreclaimable:35452kB kernel_stack:1288kB pagetables:2880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2724 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8356kB
Node 0 DMA32: 326*4kB 151*8kB 56*16kB 46*32kB 29*64kB 20*128kB 17*256kB 11*512kB 8*1024kB 6*2048kB 1*4096kB = 43856kB
143 total pagecache pages
128 pages in swap cache
Swap cache stats: add 3469, delete 3341, find 2/5
Free swap  = 4115416kB
Total swap = 4128764kB
524284 pages RAM
43654 pages reserved
51 pages shared
462846 pages non-shared
[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  353]     0   353     2733        0   0     -17         -1000 udevd
[ 1078]     0  1078     2280        1   0       0             0 dhclient
[ 1191]     0  1191     2280        1   0       0             0 dhclient
[ 1304]     0  1304     2280        1   0       0             0 dhclient
[ 1354]     0  1354     6910       24   0     -17         -1000 auditd
[ 1370]     0  1370    62272        1   0       0             0 rsyslogd
[ 1412]    32  1412     4744        1   0       0             0 rpcbind
[ 1430]    29  1430     5837        1   0       0             0 rpc.statd
[ 1543]    81  1543     5387        1   0       0             0 dbus-daemon
[ 1581]     0  1581     1020        0   0       0             0 acpid
[ 1590]    68  1590     9408        1   0       0             0 hald
[ 1591]     0  1591     5082        1   0       0             0 hald-runner
[ 1623]     0  1623     5612        1   0       0             0 hald-addon-inpu
[ 1637]    68  1637     4484        1   0       0             0 hald-addon-acpi
[ 1652]     0  1652    96435        1   0       0             0 automount
[ 1672]     0  1672    16656        0   0     -17         -1000 sshd
[ 1748]     0  1748    20326        1   0       0             0 master
[ 1756]    89  1756    20346        1   0       0             0 pickup
[ 1757]    89  1757    20389        1   0       0             0 qmgr
[ 1772]     0  1772    27580        1   0       0             0 abrtd
[ 1780]     0  1780    29325        9   0       0             0 crond
[ 1791]     0  1791     5385        0   0       0             0 atd
[ 1805]     0  1805    15590        0   0       0             0 certmonger
[ 1817]     0  1817     1016        1   0       0             0 mingetty
[ 1819]     0  1819    19853        1   0       0             0 login
[ 1820]     0  1820     1016        1   0       0             0 mingetty
[ 1822]     0  1822     1016        1   0       0             0 mingetty
[ 1824]     0  1824     1016        1   0       0             0 mingetty
[ 1826]     0  1826     2732        0   0     -17         -1000 udevd
[ 1827]     0  1827     2732        0   0     -17         -1000 udevd
[ 1828]     0  1828     1016        1   0       0             0 mingetty
[ 1830]     0  1830     1016        1   0       0             0 mingetty
[ 1848]     0  1848   144390        1   0       0             0 console-kit-dae
[ 1914]     0  1914    27084       25   0       0             0 bash
[ 2051]     0  2051     4903       31   0       0             0 lnetctl
Out of memory: Kill process 1078 (dhclient) score 1 or sacrifice child
Killed process 1078, UID 0, (dhclient) total-vm:9120kB, anon-rss:0kB, file-rss:4kB
master invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
master cpuset=/ mems_allowed=0
Pid: 1748, comm: master Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
Call Trace:
 [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
 [<ffffffff8122892c>] ? security_real_capable_noaudit+0x3c/0x70
 [<ffffffff81122fe2>] ? oom_kill_process+0x82/0x2a0
 [<ffffffff81122f21>] ? select_bad_process+0xe1/0x120
 [<ffffffff81123420>] ? out_of_memory+0x220/0x3c0
 [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167cca>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8111ff57>] ? __page_cache_alloc+0x87/0x90
 [<ffffffff8111f93e>] ? find_get_page+0x1e/0xa0
 [<ffffffff81120ef7>] ? filemap_fault+0x1a7/0x500
 [<ffffffff8114a234>] ? __do_fault+0x54/0x530
 [<ffffffff81069973>] ? dequeue_entity+0x113/0x2e0
 [<ffffffff8114a807>] ? handle_pte_fault+0xf7/0xb00
 [<ffffffff815296ee>] ? thread_return+0x4e/0x770
 [<ffffffff8109fde3>] ? __hrtimer_start_range_ns+0x1a3/0x460
 [<ffffffff8109f4a1>] ? lock_hrtimer_base+0x31/0x60
 [<ffffffff810a011f>] ? hrtimer_try_to_cancel+0x3f/0xd0
 [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff811d27c6>] ? ep_poll+0x306/0x330
 [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
 [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
Comment by Amir Shehata (Inactive) [ 17/Jan/15 ]

There are two issues here: 1) why we're running out of memory, and 2) why we're crashing.

I believe there is an issue that can cause LNet to consume more memory than it needs. Buffer pools are allocated and put on a list. When a pool is resized upward, more buffers are allocated. When it is resized downward, the configured number of buffers is changed, but the buffers themselves are freed only when they are used and returned to the pool.

If the system is idle, which I believe is the case in this test, and you increase the number of buffers, more buffers are allocated, but none are in use. When the pool is then decreased, only the count is decreased; the buffers remain allocated on the linked list. When the pool is increased again, more buffers are allocated even though there are already unused buffers on the list, so more memory is used than needed.

This could be a culprit in the OOM case.
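
The behavior described above can be modelled in a few lines. This is a toy simulation of the description in this comment, not LNet code: the class and method names are hypothetical, and the "reuse idle" variant is only one plausible remedy (count the buffers already on the list before allocating), not necessarily what the eventual patch does. The resize sequence 256, 1024, 257, 1024 mirrors the large-pool nbuffers values shown in the lnetctl routing show output in the previous comment, on an idle system.

```python
class BufferPool:
    """Toy model of one router buffer pool on an idle system (hypothetical)."""

    def __init__(self, nbuffers: int, reuse_idle: bool):
        self.target = nbuffers    # configured pool size
        self.on_list = nbuffers   # buffers actually allocated and sitting on the list
        self.reuse_idle = reuse_idle

    def resize(self, new_target: int) -> None:
        if new_target > self.target:
            if self.reuse_idle:
                # Variant: allocate only what the list does not already hold.
                need = max(0, new_target - self.on_list)
            else:
                # Described behavior: allocate the difference between targets,
                # ignoring idle buffers left over from an earlier shrink.
                need = new_target - self.target
            self.on_list += need
        # On shrink nothing is freed here: buffers are released only when
        # they are used and returned, which never happens while idle.
        self.target = new_target


def run(reuse_idle: bool) -> int:
    """Replay the resize sequence and return the buffers left on the list."""
    pool = BufferPool(256, reuse_idle)
    for size in (1024, 257, 1024):
        pool.resize(size)
    return pool.on_list


print("described behavior:", run(reuse_idle=False), "buffers on list")
print("reuse idle buffers:", run(reuse_idle=True), "buffers on list")
```

In this model the described behavior ends with 1791 buffers on the list for a configured size of 1024, while the variant that counts idle buffers stays at 1024.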

For 2, the reason for the crash needs more investigation.

Comment by Amir Shehata (Inactive) [ 21/Jan/15 ]

I believe the crash is due to TEI-2286.

This leaves the other part of the issue, which I'm addressing. However, I believe this can drop in priority if need be.

Comment by Gerrit Updater [ 23/Jan/15 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/13519
Subject: LU-6122 lnet: Allocate the correct number of rtr buffers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1e81abafca154136d602f244acca36c9f75935f4

Comment by Gerrit Updater [ 18/Aug/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13519/
Subject: LU-6122 lnet: Allocate the correct number of rtr buffers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6182af3703026ac633b6f0bddc3e90958dc9631d

Comment by Peter Jones [ 18/Aug/15 ]

Landed for 2.8
