[LU-5121] replay-ost-single test_0b: OOM on the OST
Created: 30/May/14  Updated: 28/Oct/14  Resolved: 11/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-5077 insanity test_1: out of memory on MDT... Resolved
Related
is related to LU-5077 insanity test_1: out of memory on MDT... Resolved
is related to LU-5131 insanity test 13: ll_ost00_006 invoke... Resolved
Severity: 3
Rank (Obsolete): 14130

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/90389636-e54f-11e3-bb3a-52540035b04c.

The sub-test test_0b failed with the following error:

8:54:10:LustreError: 137-5: lustre-OST0000_UUID: not available for connect from 10.1.5.230@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
18:54:10:LustreError: Skipped 7 previous similar messages
18:54:10:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
18:54:10:LNet: 29343:0:(debug.c:218:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
18:54:10:LNet: 29343:0:(debug.c:218:libcfs_debug_str2mask()) Skipped 3 previous similar messages
18:54:10:Lustre: DEBUG MARKER: e2label /dev/lvm-Role_OSS/P1 2>/dev/null
18:54:10:Lustre: lustre-OST0000: Will be in recovery for at least 1:00, or until 7 clients reconnect
18:54:10:ll_ost00_003 invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
18:54:10:ll_ost00_003 cpuset=/ mems_allowed=0
18:54:10:Pid: 26086, comm: ll_ost00_003 Not tainted 2.6.32-431.17.1.el6_lustre.g2e529c5.x86_64 #1
18:54:10:Call Trace:
18:54:10: [<ffffffff810d0211>] ? cpuset_print_task_mems_allowed+0x91/0xb0
18:54:10: [<ffffffff811225c0>] ? dump_header+0x90/0x1b0
18:54:10: [<ffffffff8122781c>] ? security_real_capable_noaudit+0x3c/0x70
18:54:10: [<ffffffff81122a42>] ? oom_kill_process+0x82/0x2a0
18:54:10: [<ffffffff8112293e>] ? select_bad_process+0x9e/0x120
18:54:10: [<ffffffff81122e80>] ? out_of_memory+0x220/0x3c0
18:54:10: [<ffffffff8112f79f>] ? __alloc_pages_nodemask+0x89f/0x8d0
18:54:10: [<ffffffff8116e082>] ? kmem_getpages+0x62/0x170
18:54:10: [<ffffffff8116ec9a>] ? fallback_alloc+0x1ba/0x270
18:54:10: [<ffffffff8116e6ef>] ? cache_grow+0x2cf/0x320
18:54:10: [<ffffffff8116ea19>] ? ____cache_alloc_node+0x99/0x160
18:54:10: [<ffffffff8124bc3c>] ? crypto_create_tfm+0x3c/0xe0
18:54:10: [<ffffffff8116f7e9>] ? __kmalloc+0x189/0x220
18:54:10: [<ffffffff8124bc3c>] ? crypto_create_tfm+0x3c/0xe0
18:54:10: [<ffffffff812525d8>] ? crypto_init_shash_ops+0x68/0x100
18:54:10: [<ffffffff8124bd4a>] ? __crypto_alloc_tfm+0x6a/0x130
18:54:10: [<ffffffff8124c5ba>] ? crypto_alloc_base+0x5a/0xb0
18:54:10: [<ffffffff810554f8>] ? resched_task+0x68/0x80
18:54:10: [<ffffffffa048d2ca>] ? cfs_crypto_hash_alloc+0x7a/0x290 [libcfs]
18:54:10: [<ffffffffa048d5da>] ? cfs_crypto_hash_digest+0x6a/0xf0 [libcfs]
18:54:10: [<ffffffff8116f86c>] ? __kmalloc+0x20c/0x220
18:54:10: [<ffffffffa082bd73>] ? lustre_msg_calc_cksum+0xd3/0x130 [ptlrpc]
18:54:10: [<ffffffffa0865a81>] ? null_authorize+0xa1/0x100 [ptlrpc]
18:54:10: [<ffffffffa0854c56>] ? sptlrpc_svc_wrap_reply+0x56/0x1c0 [ptlrpc]
18:54:10: [<ffffffffa08241ec>] ? ptlrpc_send_reply+0x1fc/0x7f0 [ptlrpc]
18:54:10: [<ffffffffa083b675>] ? ptlrpc_at_check_timed+0xc05/0x1360 [ptlrpc]
18:54:10: [<ffffffffa0832c09>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
18:54:10: [<ffffffffa083cf68>] ? ptlrpc_main+0x1198/0x1980 [ptlrpc]
18:54:10: [<ffffffffa083bdd0>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
18:54:10: [<ffffffff8109ab56>] ? kthread+0x96/0xa0
18:54:10: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
18:54:10: [<ffffffff8109aac0>] ? kthread+0x0/0xa0
18:54:10: [<ffffffff8100c200>] ? child_rip+0x0/0x20
18:54:10:Mem-Info:
test_0b returned 1

Info required for matching: replay-ost-single 0b



 Comments   
Comment by nasf (Inactive) [ 08/Oct/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/663963a4-4ebf-11e4-872e-5254006e85c2

Comment by nasf (Inactive) [ 08/Oct/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_logs/403f289c-4c25-11e4-b821-5254006e85c2

Comment by Johann Lombardi (Inactive) [ 20/Oct/14 ]

Another instance:
https://testing.hpdd.intel.com/test_sets/02faa1b0-575a-11e4-9132-5254006e85c2

From the crash dump:

<0>Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
<0>
<4>Pid: 26633, comm: ll_ost00_010 Not tainted 2.6.32-431.29.2.el6_lustre.g01c1143.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528ffc>] ? panic+0xa7/0x16f
<4> [<ffffffff81122bf1>] ? dump_header+0x101/0x1b0
<4> [<ffffffff81122d1c>] ? check_panic_on_oom+0x7c/0x80
<4> [<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
<4> [<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
<4> [<ffffffff8116e6d2>] ? kmem_getpages+0x62/0x170
<4> [<ffffffff8116f2ea>] ? fallback_alloc+0x1ba/0x270
<4> [<ffffffff8116ed3f>] ? cache_grow+0x2cf/0x320
<4> [<ffffffff8116f069>] ? ____cache_alloc_node+0x99/0x160
<4> [<ffffffff8124cd7c>] ? crypto_create_tfm+0x3c/0xe0
<4> [<ffffffff8116fe39>] ? __kmalloc+0x189/0x220
<4> [<ffffffff8124cd7c>] ? crypto_create_tfm+0x3c/0xe0
<4> [<ffffffff81253718>] ? crypto_init_shash_ops+0x68/0x100
<4> [<ffffffff8124ce8a>] ? __crypto_alloc_tfm+0x6a/0x130
<4> [<ffffffff8124d6fa>] ? crypto_alloc_base+0x5a/0xb0
<4> [<ffffffffa048d107>] ? cfs_crypto_hash_alloc+0x77/0x290 [libcfs]
<4> [<ffffffffa048d7e6>] ? cfs_crypto_hash_digest+0x66/0xf0 [libcfs]
<4> [<ffffffff8116febc>] ? __kmalloc+0x20c/0x220
<4> [<ffffffffa081f3e3>] ? lustre_msg_calc_cksum+0xd3/0x140 [ptlrpc]
<4> [<ffffffffa0858e01>] ? null_authorize+0xa1/0x100 [ptlrpc]
<4> [<ffffffffa0847e96>] ? sptlrpc_svc_wrap_reply+0x56/0x1c0 [ptlrpc]
<4> [<ffffffffa081780c>] ? ptlrpc_send_reply+0x1fc/0x7f0 [ptlrpc]
<4> [<ffffffffa082ee85>] ? ptlrpc_at_check_timed+0xcd5/0x1370 [ptlrpc]
<4> [<ffffffffa08262f9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffffa0830808>] ? ptlrpc_main+0x12e8/0x1990 [ptlrpc]
<4> [<ffffffffa082f520>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff811bffc0>] ? sync_buffer+0x0/0x50
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
Comment by Bruno Faccini (Inactive) [ 28/Oct/14 ]

One more at https://testing.hpdd.intel.com/test_sets/3840229a-5e22-11e4-b92b-5254006e85c2.
