[LU-6257] Interop 2.6.0<->2.7 replay-vbr test_7b: MDS OOM Created: 19/Feb/15  Updated: 19/Feb/15  Resolved: 19/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

server: 2.6.0
client: lustre-master build #2856


Severity: 3
Rank (Obsolete): 17540

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/1dc4ae10-b7fd-11e4-a8e9-5254006e85c2.

The sub-test test_7b failed with the following error:

test failed to respond and timed out

MDS console log leading up to the panic:
16:18:19:LustreError: 11-0: lustre-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
16:18:19:LustreError: Skipped 5 previous similar messages
16:18:19:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
16:18:19:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
16:18:20:Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0001_UUID (at 10.2.4.208@tcp), waiting for all 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:56
16:18:20:Lustre: Skipped 561 previous similar messages
16:18:20:ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
16:18:20:ntpd cpuset=/ mems_allowed=0
16:18:20:Pid: 2199, comm: ntpd Not tainted 2.6.32-431.20.3.el6_lustre.x86_64 #1
16:18:20:Call Trace:
16:18:20: [<ffffffff810d03d1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
16:18:20: [<ffffffff81122780>] ? dump_header+0x90/0x1b0
16:18:20: [<ffffffff811228ee>] ? check_panic_on_oom+0x4e/0x80
16:18:20: [<ffffffff81122fdb>] ? out_of_memory+0x1bb/0x3c0
16:18:20: [<ffffffff8112f95f>] ? __alloc_pages_nodemask+0x89f/0x8d0
16:18:20: [<ffffffff8116795a>] ? alloc_pages_vma+0x9a/0x150
16:18:20: [<ffffffff8115b632>] ? read_swap_cache_async+0xf2/0x160
16:18:20: [<ffffffff8115c159>] ? valid_swaphandles+0x69/0x150
16:18:20: [<ffffffff8115b727>] ? swapin_readahead+0x87/0xc0
16:18:20: [<ffffffff8114a9fd>] ? handle_pte_fault+0x6dd/0xb00
16:18:20: [<ffffffff812272c6>] ? security_task_to_inode+0x16/0x20
16:18:20: [<ffffffff8114b04a>] ? handle_mm_fault+0x22a/0x300
16:18:20: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
16:18:20: [<ffffffff811a07b0>] ? pollwake+0x0/0x60
16:18:20: [<ffffffff811a07b0>] ? pollwake+0x0/0x60
16:18:20: [<ffffffff811a07b0>] ? pollwake+0x0/0x60
16:18:20: [<ffffffff8152e7ee>] ? do_page_fault+0x3e/0xa0
16:18:20: [<ffffffff8152bba5>] ? page_fault+0x25/0x30
16:18:20: [<ffffffff8128e1e6>] ? copy_user_generic_unrolled+0x86/0xb0
16:18:20: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20
16:18:20: [<ffffffff811a04c9>] ? set_fd_set+0x49/0x60
16:18:20: [<ffffffff811a198c>] ? core_sys_select+0x1bc/0x2c0
16:18:20: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
16:18:20: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
16:18:20: [<ffffffff8109530f>] ? queue_work+0x1f/0x30
16:18:20: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
16:18:20: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
16:18:20: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
16:18:20: [<ffffffff810a6d21>] ? ktime_get_ts+0xb1/0xf0
16:18:20: [<ffffffff811a1ce7>] ? sys_select+0x47/0x110
16:18:20: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
16:18:20:Mem-Info:
16:18:20:Node 0 DMA per-cpu:
16:18:20:CPU    0: hi:    0, btch:   1 usd:   0
16:18:20:CPU    1: hi:    0, btch:   1 usd:   0
16:18:20:Node 0 DMA32 per-cpu:
16:18:20:CPU    0: hi:  186, btch:  31 usd:  86
16:18:20:CPU    1: hi:  186, btch:  31 usd: 178
16:18:20:active_anon:0 inactive_anon:0 isolated_anon:0
16:18:20: active_file:1044 inactive_file:916 isolated_file:0
16:18:20: unevictable:0 dirty:0 writeback:0 unstable:0
16:18:20: free:13246 slab_reclaimable:2102 slab_unreclaimable:436918
16:18:20: mapped:1 shmem:0 pagetables:1010 bounce:0
16:18:20:Node 0 DMA free:8336kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7408kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
16:18:20:lowmem_reserve[]: 0 2004 2004 2004
16:18:20:Node 0 DMA32 free:44648kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:0kB active_file:4176kB inactive_file:3664kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:8408kB slab_unreclaimable:1740264kB kernel_stack:1736kB pagetables:4040kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:376672 all_unreclaimable? yes
16:18:20:lowmem_reserve[]: 0 0 0 0
16:18:20:Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8336kB
16:18:20:Node 0 DMA32: 778*4kB 612*8kB 380*16kB 169*32kB 69*64kB 32*128kB 21*256kB 8*512kB 3*1024kB 0*2048kB 1*4096kB = 44648kB
16:18:20:116 total pagecache pages
16:18:20:0 pages in swap cache
16:18:20:Swap cache stats: add 6362, delete 6362, find 3766/3856
16:18:20:Free swap  = 4107060kB
16:18:20:Total swap = 4128760kB
16:18:20:524284 pages RAM
16:18:20:43694 pages reserved
16:18:20:151 pages shared
16:18:20:462141 pages non-shared
16:18:20:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
16:18:20:[  366]     0   366     2663        0   1     -17         -1000 udevd
16:18:20:[ 1014]     0  1014     2280        1   0       0             0 dhclient
16:18:20:[ 1066]     0  1066    23300        1   1     -17         -1000 auditd
16:18:20:[ 1082]     0  1082    63903        1   1       0             0 rsyslogd
16:18:20:[ 1111]     0  1111     2705        1   1       0             0 irqbalance
16:18:20:[ 1125]    32  1125     4744        1   1       0             0 rpcbind
16:18:20:[ 1137]     0  1137    49913        1   0       0             0 sssd
16:18:20:[ 1138]     0  1138    64294        1   1       0             0 sssd_be
16:18:20:[ 1140]     0  1140    50479        1   1       0             0 sssd_nss
16:18:20:[ 1141]     0  1141    48029        1   0       0             0 sssd_pam
16:18:20:[ 1142]     0  1142    47518        1   0       0             0 sssd_ssh
16:18:20:[ 1143]     0  1143    52608        1   0       0             0 sssd_pac
16:18:20:[ 1160]    29  1160     5837        1   1       0             0 rpc.statd
16:18:20:[ 1274]    81  1274     5871        1   0       0             0 dbus-daemon
16:18:20:[ 1312]     0  1312     1020        0   1       0             0 acpid
16:18:20:[ 1321]    68  1321     9920        1   0       0             0 hald
16:18:20:[ 1322]     0  1322     5081        1   0       0             0 hald-runner
16:18:20:[ 1354]     0  1354     5611        1   1       0             0 hald-addon-inpu
16:18:20:[ 1364]    68  1364     4483        1   1       0             0 hald-addon-acpi
16:18:20:[ 1384]     0  1384   168326        1   1       0             0 automount
16:18:20:[ 1430]     0  1430    26827        0   0       0             0 rpc.rquotad
16:18:20:[ 1434]     0  1434     5414        0   0       0             0 rpc.mountd
16:18:20:[ 1470]     0  1470     5773        1   0       0             0 rpc.idmapd
16:18:20:[ 1501]   496  1501    56785        1   0       0             0 munged
16:18:20:[ 1516]     0  1516    16656        0   0     -17         -1000 sshd
16:18:20:[ 1524]     0  1524     5545        1   0       0             0 xinetd
16:18:20:[ 1608]     0  1608    20846        1   1       0             0 master
16:18:20:[ 1628]    89  1628    20909        1   1       0             0 qmgr
16:18:20:[ 1631]     0  1631    29325        1   1       0             0 crond
16:18:20:[ 1642]     0  1642     5385        0   0       0             0 atd
16:18:20:[ 1656]     0  1656    15585        1   0       0             0 certmonger
16:18:20:[ 1669]     0  1669     1020        1   1       0             0 agetty
16:18:20:[ 1671]     0  1671     1016        1   0       0             0 mingetty
16:18:20:[ 1673]     0  1673     1016        1   1       0             0 mingetty
16:18:20:[ 1675]     0  1675     1016        1   0       0             0 mingetty
16:18:20:[ 1677]     0  1677     1016        1   1       0             0 mingetty
16:18:20:[ 1679]     0  1679     2664        0   1     -17         -1000 udevd
16:18:20:[ 1680]     0  1680     2662        0   1     -17         -1000 udevd
16:18:20:[ 1681]     0  1681     1016        1   0       0             0 mingetty
16:18:20:[ 1683]     0  1683     1016        1   0       0             0 mingetty
16:18:20:[ 2199]    38  2199     8205        1   0       0             0 ntpd
16:18:20:[13001]    89 13001    20866        1   1       0             0 pickup
16:18:20:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
16:18:20:
16:18:20:Pid: 2199, comm: ntpd Not tainted 2.6.32-431.20.3.el6_lustre.x86_64 #1
16:18:20:Call Trace:
16:18:20: [<ffffffff8152859c>] ? panic+0xa7/0x16f
16:18:20: [<ffffffff811227f1>] ? dump_header+0x101/0x1b0
16:18:20: [<ffffffff8112291c>] ? check_panic_on_oom+0x7c/0x80
16:18:20: [<ffffffff81122fdb>] ? out_of_memory+0x1bb/0x3c0
16:18:20: [<ffffffff8112f95f>] ? __alloc_pages_nodemask+0x89f/0x8d0
16:18:20: [<ffffffff8116795a>] ? alloc_pages_vma+0x9a/0x150
16:18:20: [<ffffffff8115b632>] ? read_swap_cache_async+0xf2/0x160
16:18:20: [<ffffffff8115c159>] ? valid_swaphandles+0x69/0x150
16:18:20: [<ffffffff8115b727>] ? swapin_readahead+0x87/0xc0
16:18:20: [<ffffffff8114a9fd>] ? handle_pte_fault+0x6dd/0xb00
16:18:20: [<ffffffff812272c6>] ? security_task_to_inode+0x16/0x20
16:18:20: [<ffffffff8114b04a>] ? handle_mm_fault+0x22a/0x300
16:18:20: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
16:18:20: [<ffffffff811a07b0>] ? pollwake+0x0/0x60
16:18:20: [<ffffffff811a07b0>] ? pollwake+0x0/0x60
16:18:20: [<ffffffff811a07b0>] ? pollwake+0x0/0x60
16:18:20: [<ffffffff8152e7ee>] ? do_page_fault+0x3e/0xa0
16:18:20: [<ffffffff8152bba5>] ? page_fault+0x25/0x30
16:18:20: [<ffffffff8128e1e6>] ? copy_user_generic_unrolled+0x86/0xb0
16:18:20: [<ffffffff810129de>] ? copy_user_generic+0xe/0x20
16:18:20: [<ffffffff811a04c9>] ? set_fd_set+0x49/0x60
16:18:20: [<ffffffff811a198c>] ? core_sys_select+0x1bc/0x2c0
16:18:20: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
16:18:20: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
16:18:20: [<ffffffff8109530f>] ? queue_work+0x1f/0x30
16:18:20: [<ffffffff8103f9d8>] ? pvclock_clocksource_read+0x58/0xd0
16:18:20: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
16:18:20: [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
16:18:20: [<ffffffff810a6d21>] ? ktime_get_ts+0xb1/0xf0
16:18:20: [<ffffffff811a1ce7>] ? sys_select+0x47/0x110
16:18:20: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
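Reading the Mem-Info block above: slab_unreclaimable is 436918 pages (436918 x 4 kB = 1747672 kB, matching the per-zone figures of 1740264 kB in DMA32 plus 7408 kB in DMA) on a node with only 524284 pages (~2 GB) of RAM, while swap is almost entirely free (4107060 kB of 4128760 kB). In other words, memory was consumed by unreclaimable kernel slab on the MDS rather than by user processes, and with panic_on_oom enabled the first failed allocation (here from ntpd, an innocent bystander) panicked the node. The repeated mds_connect failures with -11 (-EAGAIN) and the "Denying connection ... waiting for all 4 known clients" messages show the MDS was still in recovery while the slab usage grew.

A minimal sketch of how the offending cache could be ranked on a live node before the panic, assuming the standard /proc/slabinfo version 2.1 layout (reading it requires root); the helper name is hypothetical, not part of any test script:

#!/usr/bin/env python3
# Rank slab caches by total footprint from /proc/slabinfo.
# In an OOM like the one above, slab_unreclaimable (~1.7 GB here:
# 436918 pages * 4 kB) dominates, so the top entries point at the leak.

def top_slab_caches(path="/proc/slabinfo", count=10):
    caches = []
    with open(path) as f:
        for line in f:
            # Skip the "slabinfo - version: 2.1" header and the "#" column legend.
            if line.startswith("slabinfo") or line.startswith("#"):
                continue
            fields = line.split()
            # Columns: <name> <active_objs> <num_objs> <objsize> <objperslab> ...
            name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
            caches.append((num_objs * objsize, name))
    return sorted(caches, reverse=True)[:count]

if __name__ == "__main__":
    for total_bytes, name in top_slab_caches():
        print(f"{total_bytes / 1024 / 1024:10.1f} MiB  {name}")

On a node that has already panicked, the same ranking is available from the vmcore via the crash utility (kmem -s), or live via slabtop.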


 Comments   
Comment by Andreas Dilger [ 19/Feb/15 ]

Since this is failing on a 2.6.0 server, it may well already be fixed in 2.7. If this is hit with a 2.7 server, the ticket should be reopened.
