[LU-6145] lfsck-performance test_6: out of memory on MDS Created: 21/Jan/15  Updated: 28/Feb/20  Resolved: 28/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 17157

 Description   

This issue was created by maloo for nasf <fan.yong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/b91acdba-a103-11e4-87d1-5254006e85c2.

20:46:44:rpm invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
20:46:44:rpm cpuset=/ mems_allowed=0
20:46:44:Pid: 16160, comm: rpm Not tainted 2.6.32-431.29.2.el6_lustre.gffd1fc2.x86_64 #1
20:46:44:Call Trace:
20:46:44: [<ffffffff810d0791>] ? cpuset_print_task_mems_allowed+0x91/0xb0
20:46:44: [<ffffffff81122b60>] ? dump_header+0x90/0x1b0
20:46:44: [<ffffffff81122cce>] ? check_panic_on_oom+0x4e/0x80
20:46:44: [<ffffffff811233bb>] ? out_of_memory+0x1bb/0x3c0
20:46:44: [<ffffffff8112fd3f>] ? __alloc_pages_nodemask+0x89f/0x8d0
20:46:44: [<ffffffff81167cca>] ? alloc_pages_current+0xaa/0x110
20:46:44: [<ffffffff8111ff57>] ? __page_cache_alloc+0x87/0x90
20:46:44: [<ffffffff8111f93e>] ? find_get_page+0x1e/0xa0
20:46:44: [<ffffffff81120ef7>] ? filemap_fault+0x1a7/0x500
20:46:44: [<ffffffff8114a234>] ? __do_fault+0x54/0x530
20:46:44: [<ffffffff8114a807>] ? handle_pte_fault+0xf7/0xb00
20:46:44: [<ffffffff8114b43a>] ? handle_mm_fault+0x22a/0x300
20:46:44: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
20:46:44: [<ffffffff811ab820>] ? mntput_no_expire+0x30/0x110
20:46:44: [<ffffffff8118aba1>] ? __fput+0x1a1/0x210
20:46:44: [<ffffffff810890a1>] ? do_sigaction+0x91/0x1d0
20:46:44: [<ffffffff8152f23e>] ? do_page_fault+0x3e/0xa0
20:46:44: [<ffffffff8152c5f5>] ? page_fault+0x25/0x30
20:46:44:Mem-Info:
20:46:44:Node 0 DMA per-cpu:
20:46:44:CPU    0: hi:    0, btch:   1 usd:   0
20:46:44:CPU    1: hi:    0, btch:   1 usd:   0
20:46:44:Node 0 DMA32 per-cpu:
20:46:44:CPU    0: hi:  186, btch:  31 usd:   1
20:46:44:CPU    1: hi:  186, btch:  31 usd:  30
20:46:44:active_anon:79 inactive_anon:74 isolated_anon:0
20:46:44: active_file:111 inactive_file:71 isolated_file:32
20:46:44: unevictable:0 dirty:0 writeback:82 unstable:0
20:46:44: free:13242 slab_reclaimable:2265 slab_unreclaimable:438364
20:46:44: mapped:0 shmem:6 pagetables:621 bounce:0
20:46:44:Node 0 DMA free:8340kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7404kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
20:46:44:lowmem_reserve[]: 0 2004 2004 2004
20:46:44:Node 0 DMA32 free:44628kB min:44720kB low:55900kB high:67080kB active_anon:316kB inactive_anon:296kB active_file:368kB inactive_file:420kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2052308kB mlocked:0kB dirty:0kB writeback:328kB mapped:0kB shmem:24kB slab_reclaimable:9060kB slab_unreclaimable:1746052kB kernel_stack:1720kB pagetables:2484kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:9792 all_unreclaimable? no
20:46:44:lowmem_reserve[]: 0 0 0 0
20:46:44:Node 0 DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8340kB
20:46:44:Node 0 DMA32: 4719*4kB 2275*8kB 288*16kB 56*32kB 12*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB
[console log truncated here; the OOM-killer process table resumes mid-entry below]
      1   1       0             0 hald-addon-inpu
20:46:44:[ 1303]    68  1303     4483        1   1       0             0 hald-addon-acpi
20:46:44:[ 1342]     0  1342    26827        0   0       0             0 rpc.rquotad
20:46:44:[ 1346]     0  1346     5414        0   0       0             0 rpc.mountd
20:46:44:[ 1381]     0  1381     6291        1   0       0             0 rpc.idmapd
20:46:44:[ 1413]   498  1413    57325        1   0       0             0 munged
20:46:44:[ 1428]     0  1428    16656        0   0     -17         -1000 sshd
20:46:44:[ 1436]     0  1436     5545        1   0       0             0 xinetd
20:46:44:[ 1460]     0  1460    22321        0   1       0             0 sendmail
20:46:44:[ 1468]    51  1468    20183        0   0       0             0 sendmail
20:46:44:[ 1490]     0  1490    29324        1   1       0             0 crond
20:46:44:[ 1501]     0  1501     5385        0   0       0             0 atd
20:46:44:[ 1514]     0  1514     1020        1   1       0             0 agetty
20:46:44:[ 1516]     0  1516     1016        1   1       0             0 mingetty
20:46:44:[ 1518]     0  1518     1016        1   1       0             0 mingetty
20:46:44:[ 1520]     0  1520     1016        1   1       0             0 mingetty
20:46:44:[ 1522]     0  1522     1016        1   0       0             0 mingetty
20:46:44:[ 1523]     0  1523     2663        0   1     -17         -1000 udevd
20:46:44:[ 1524]     0  1524     2696        0   0     -17         -1000 udevd
20:46:44:[ 1526]     0  1526     1016        1   0       0             0 mingetty
20:46:44:[ 1528]     0  1528     1016        1   0       0             0 mingetty
20:46:44:[ 2055]    38  2055     7687        1   0       0             0 ntpd
20:46:44:[22506]     0 22506     4346        0   1       0             0 anacron
20:46:44:[16144]     0 16144    14862        1   0       0             0 in.mrshd
20:46:44:[16145]     0 16145    26515        1   1       0             0 bash
20:46:44:[16159]     0 16159    26515        0   1       0             0 bash
20:46:44:[16160]     0 16160    15217       89   1       0             0 rpm


 Comments   
Comment by Oleg Drokin [ 21/Jan/15 ]

So the first strange thing: how come rpm is running? It's not part of the test. Some cron job? Something that came over the network by mistake? This should be possible to see in the crash dump.

Additionally - how come we have panic-on-OOM set in this run? TEI-2286 - I wonder how that went in unnoticed.
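
For reference, a quick way to confirm whether panic-on-OOM is actually enabled on a node is to check the vm.panic_on_oom sysctl and whatever persistent sysctl configuration the test setup drops in. A minimal sketch (the config file locations are the usual suspects, not confirmed from this run):

# 0 = no panic, 1/2 = panic when the OOM killer fires
sysctl vm.panic_on_oom
cat /proc/sys/vm/panic_on_oom
# find whatever sets it persistently (standard locations; may differ on the test nodes)
grep -rn panic_on_oom /etc/sysctl.conf /etc/sysctl.d/ 2>/dev/null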

Comment by John Hammond [ 22/Jan/15 ]
crash> ps
...
   1436      1   0  ffff88007b4f2ae0  IN   0.0   22180      4  xinetd
...
  16144   1436   0  ffff88005404eae0  IN   0.0   59448      4  in.mrshd
  16145  16144   1  ffff880063bbeaa0  IN   0.0  106060      4  bash
  16159  16145   1  ffff8800595bd540  IN   0.0  106060      0  bash
> 16160  16159   1  ffff8800628fb500  RU   0.0   60868    356  rpm

Are we sending messages in a tight loop?

crash> ps | grep -v IN
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      0      0   0  ffffffff81a8d020  RU   0.0       0      0  [swapper]
      0      0   1  ffff88007e509540  RU   0.0       0      0  [swapper]
     23      2   1  ffff88007e5d8080  RU   0.0       0      0  [kblockd/1]
>  1009      1   0  ffff880037471540  RU   0.0  249092      4  rsyslogd
   2695      2   0  ffff88007cdf5500  RU   0.0       0      0  [socknal_sd00_01]
   2699      2   0  ffff88003790c080  RU   0.0       0      0  [ptlrpcd_0]
   2700      2   1  ffff880037bd1500  RU   0.0       0      0  [ptlrpcd_1]
> 16160  16159   1  ffff8800628fb500  RU   0.0   60868    356  rpm
crash> bt 2695
PID: 2695   TASK: ffff88007cdf5500  CPU: 0   COMMAND: "socknal_sd00_01"
 #0 [ffff88006afbb810] schedule at ffffffff815296a0
 #1 [ffff88006afbb8d8] __cond_resched at ffffffff810695fa
 #2 [ffff88006afbb8f8] _cond_resched at ffffffff8152a0e0
 #3 [ffff88006afbb908] lock_sock_nested at ffffffff8144ca40
 #4 [ffff88006afbb968] tcp_recvmsg at ffffffff814a5b48
 #5 [ffff88006afbba78] inet_recvmsg at ffffffff814c750a
 #6 [ffff88006afbbab8] sock_recvmsg at ffffffff8144b1c3
 #7 [ffff88006afbbc78] kernel_recvmsg at ffffffff8144b234
 #8 [ffff88006afbbc98] ksocknal_lib_recv_iov at ffffffffa0a2651a [ksocklnd]
 #9 [ffff88006afbbd28] ksocknal_process_receive at ffffffffa0a202aa [ksocklnd]
#10 [ffff88006afbbdc8] ksocknal_scheduler at ffffffffa0a229bb [ksocklnd]
#11 [ffff88006afbbee8] kthread at ffffffff8109abf6
#12 [ffff88006afbbf48] kernel_thread at ffffffff8100c20a
crash> bt 2699
PID: 2699   TASK: ffff88003790c080  CPU: 0   COMMAND: "ptlrpcd_0"
 #0 [ffff88007a8d3bb0] schedule at ffffffff815296a0
 #1 [ffff88007a8d3c78] __cond_resched at ffffffff810695fa
 #2 [ffff88007a8d3c98] _cond_resched at ffffffff8152a0e0
 #3 [ffff88007a8d3ca8] ptlrpc_check_set at ffffffffa08045a7 [ptlrpc]
 #4 [ffff88007a8d3d68] ptlrpcd_check at ffffffffa0831c63 [ptlrpc]
 #5 [ffff88007a8d3dc8] ptlrpcd at ffffffffa083228b [ptlrpc]
 #6 [ffff88007a8d3ee8] kthread at ffffffff8109abf6
 #7 [ffff88007a8d3f48] kernel_thread at ffffffff8100c20a
crash> bt 2700
PID: 2700   TASK: ffff880037bd1500  CPU: 1   COMMAND: "ptlrpcd_1"
 #0 [ffff88007a8d5440] schedule at ffffffff815296a0
 #1 [ffff88007a8d5508] __cond_resched at ffffffff810695fa
 #2 [ffff88007a8d5528] _cond_resched at ffffffff8152a0e0
 #3 [ffff88007a8d5538] shrink_active_list at ffffffff81139ccd
 #4 [ffff88007a8d55f8] shrink_mem_cgroup_zone at ffffffff8113aa75
 #5 [ffff88007a8d56a8] shrink_zone at ffffffff8113ac3a
 #6 [ffff88007a8d5728] do_try_to_free_pages at ffffffff8113ae55
 #7 [ffff88007a8d57c8] try_to_free_pages at ffffffff8113b522
 #8 [ffff88007a8d5868] __alloc_pages_nodemask at ffffffff8112f91e
 #9 [ffff88007a8d59a8] kmem_getpages at ffffffff8116e6b2
#10 [ffff88007a8d59d8] fallback_alloc at ffffffff8116f2ca
#11 [ffff88007a8d5a58] ____cache_alloc_node at ffffffff8116f049
#12 [ffff88007a8d5ab8] __kmalloc at ffffffff8116fe19
#13 [ffff88007a8d5b08] null_alloc_repbuf at ffffffffa084c5ba [ptlrpc]
#14 [ffff88007a8d5b38] sptlrpc_cli_alloc_repbuf at ffffffffa083a7a5 [ptlrpc]
#15 [ffff88007a8d5b68] ptl_send_rpc at ffffffffa080cc51 [ptlrpc]
#16 [ffff88007a8d5c38] ptlrpc_send_new_req at ffffffffa0800cd3 [ptlrpc]
#17 [ffff88007a8d5ca8] ptlrpc_check_set at ffffffffa0804e60 [ptlrpc]
#18 [ffff88007a8d5d68] ptlrpcd_check at ffffffffa0831c63 [ptlrpc]
#19 [ffff88007a8d5dc8] ptlrpcd at ffffffffa08321fa [ptlrpc]
#20 [ffff88007a8d5ee8] kthread at ffffffff8109abf6
#21 [ffff88007a8d5f48] kernel_thread at ffffffff8100c20a
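As a sanity check on the Mem-Info in the description: slab_unreclaimable is 438364 pages, i.e. roughly 1.7 GB (7404 kB in DMA plus 1746052 kB in DMA32, out of 2052308 kB present), so essentially the whole node is tied up in unreclaimable slab while ptlrpcd_1 is stuck in direct reclaim under null_alloc_repbuf. A minimal sketch of how to see which cache is holding it, from the same crash dump (not output from this session; on this 2.6.32 kernel the suspect would show up as one of the size-* kmalloc caches):

crash> kmem -i     # overall page/slab accounting for the dump
crash> kmem -s     # per-cache slab usage; whichever size-* cache holds the ~1.7GB is the one to chase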
Comment by Andreas Dilger [ 28/Feb/20 ]

Closing old bug that hasn't been seen in a long time.
