[LU-1724] Test failure on test suite performance-sanity, subtest test_3 Created: 08/Aug/12  Updated: 13/Aug/12  Resolved: 13/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Keith Mannthey (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6353

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/0596b3e0-dcd3-11e1-8744-52540035b04c.

The sub-test test_3 failed with the following error:

test_3 returned 1

01:24:41:Lustre: DEBUG MARKER: performance-sanity test_3: @@@@@@ FAIL: test_3 failed with 10


 Comments   
Comment by Sarah Liu [ 08/Aug/12 ]

This error may have been caused by the previous failure of the mds-survey test.

Comment by Keith Mannthey (Inactive) [ 10/Aug/12 ]

What happened to client-27vm3 in this test run? It was the MDS; did it panic, or something else?

Comment by Sarah Liu [ 10/Aug/12 ]

There was a previous failure of mds-survey, which may have left the MDS in an abnormal state.

https://maloo.whamcloud.com/test_sets/7fa0e0bc-dcd2-11e1-8744-52540035b04c

Comment by Keith Mannthey (Inactive) [ 10/Aug/12 ]

From the MDS console log of the second run:

00:54:47:Lustre: DEBUG MARKER: == mds-survey test 2: Metadata survey with stripe_count = 1 == 00:54:45 (1343894085)
00:54:50:Lustre: DEBUG MARKER: lctl dl
00:54:52:LustreError: 17365:0:(echo_client.c:1607:echo_md_lookup()) lookup tests: rc = -2
00:54:52:LustreError: 17365:0:(echo_client.c:1607:echo_md_lookup()) Skipped 2 previous similar messages
00:54:52:LustreError: 17365:0:(echo_client.c:1806:echo_md_destroy_internal()) Can't find child tests: rc = -2
00:54:52:LustreError: 17365:0:(echo_client.c:1806:echo_md_destroy_internal()) Skipped 2 previous similar messages
00:54:54:LustreError: 17387:0:(echo_client.c:1607:echo_md_lookup()) lookup tests1: rc = -2
00:54:54:LustreError: 17387:0:(echo_client.c:1806:echo_md_destroy_internal()) Can't find child tests1: rc = -2
01:04:40:lctl invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

The OOM killer was active. Once the OOM killer is running, system behaviour becomes non-deterministic, since it may choose different processes to kill.

Why is the MDS running out of memory?
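As a minimal sketch of how one might confirm OOM-killer activity on the MDS node (assuming a standard Linux node like client-27vm3 with dmesg and procfs available; these commands are generic diagnostics, not taken from this ticket):

```shell
# Look for OOM-killer invocations in the kernel ring buffer
# (matches lines like "lctl invoked oom-killer: gfp_mask=...").
dmesg | grep -iE 'invoked oom-killer|out of memory' || echo "no OOM events found"

# Snapshot memory state to see how close the node is to exhaustion;
# a large Slab value can point at kernel-side (e.g. Lustre) allocations.
grep -E 'MemTotal|MemFree|Slab' /proc/meminfo
```

With console logs already collected by autotest, the same grep can be run against the captured log files instead of the live node.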

Comment by Peter Jones [ 10/Aug/12 ]

I think the theory is that this is due to LU-1548. Do you think it is reasonable to consider this a follow-on issue, close it as a duplicate, and reopen it if it reoccurs with the LU-1548 fix applied?

Comment by Keith Mannthey (Inactive) [ 10/Aug/12 ]

After reviewing LU-1548, I feel these are very likely the same issue: MDS OOM under the right conditions. Marking this as a duplicate is the right way to go. There are no MDS logs from the first failure, but OOM can cause the network to drop, and the second failure was clearly OOM.

Comment by Peter Jones [ 13/Aug/12 ]

Closing as a duplicate of LU-1548. Please reopen if this reoccurs with that issue fixed.

Generated at Sat Feb 10 01:19:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.