[LU-5759] replay-dual test_21b: Restart of mds0 failed Created: 16/Oct/14  Updated: 02/Jul/15  Resolved: 08/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

client and server: lustre-master build #2690


Issue Links:
Related
is related to LU-6006 replay-dual test_22a: Remote creation... Resolved
Severity: 3
Rank (Obsolete): 16158

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/aff6c30c-5427-11e4-abcf-5254006e85c2.

The sub-test test_21b failed with the following error:

Restart of mds0 failed!
08:51:44:Lustre: DEBUG MARKER: == replay-dual test 21b: commit on sharing, two clients == 08:51:21 (1413301881)
08:51:44:LustreError: 21059:0:(qsd_reint.c:54:qsd_reint_completion()) lustre-MDT0001: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
08:51:44:LustreError: 21059:0:(qsd_reint.c:54:qsd_reint_completion()) Skipped 3 previous similar messages
08:51:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-dual test_21b: @@@@@@ FAIL: Restart of mds0 failed! 
08:51:44:Lustre: DEBUG MARKER: replay-dual test_21b: @@@@@@ FAIL: Restart of mds0 failed!

Info required for matching: replay-dual 21b



 Comments   
Comment by Jodi Levi (Inactive) [ 17/Oct/14 ]

Mike,
Could you please have a look at this one and comment?
Thank you!

Comment by Andreas Dilger [ 17/Oct/14 ]

It looks like this started failing 100% on 2014-10-06, but wasn't noticed because replay-dual doesn't run as part of the per-patch review tests. According to the maloo test results, the last passing run was with commit 0b4b33592c09 "LU-5613 lustre: unused variable in tgt_brw_read()" and the first failing run was with commit 6039fc8fd47, but neither that commit nor the previous three appears to have anything to do with the failing test. The most likely source of this problem is commit e2677595ab7ff "LU-5003 llog: do not fix remote llogs". It would make sense to start by reverting this patch with a commit comment using Test-Parameters: testlist=replay-dual,replay-dual,replay-dual (or testing this locally) to verify whether this fixes the problem.

Comment by Mikhail Pershin [ 21/Oct/14 ]

The failure is caused by a wrong MDS index in the test:

Starting mds0:    /mnt/mds0
CMD: onyx-45vm3 mkdir -p /mnt/mds0; mount -t lustre   		                    /mnt/mds0
onyx-45vm3: Usage: mount -V                 : print version
onyx-45vm3:        mount -h                 : print this help
onyx-45vm3:        mount                    : list mounted filesystems
onyx-45vm3:        mount -l                 : idem, including volume labels
onyx-45vm3: So far the informational part. Next the mounting.

Further investigation shows that the get_mds_dir() function in test-framework.sh was corrupted by commit 745c19c70319. Unfortunately, regular testing didn't catch that regression. It affects replay-dual.sh test_21b and several tests in the test_27 group of sanity.sh.
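
For illustration only (the variable names are hypothetical, not taken from test-framework.sh), a minimal sketch of the failure mode visible in the log above: if the corrupted helper yields an empty device name, the assembled command becomes a bare `mount -t lustre /mnt/mds0`, and mount(8) responds by printing its usage help, exactly as captured in the console output:

```shell
#!/bin/sh
# Sketch of the failure mode: an empty device string (as a corrupted
# get_mds_dir()/get_mds_num() chain would produce) drops the device
# argument from the mount command line entirely.
device=""                 # hypothetical: helper returned nothing
mntpt="/mnt/mds0"
cmd="mount -t lustre $device $mntpt"
echo "$cmd"               # prints "mount -t lustre  /mnt/mds0" (no device)
# Run without a device argument, mount falls back to printing its
# "Usage: mount -V ..." help text, matching the log excerpt above.
```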

Comment by Andreas Dilger [ 22/Oct/14 ]

See also http://review.whamcloud.com/12149 to clean up this code a bit more.

Comment by Mikhail Pershin [ 27/Oct/14 ]

http://review.whamcloud.com/12363 - the patch is derived from Andreas's fix but eliminates get_mds_num() entirely and uses "lfs getstripe -M" in the tests instead.
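
A minimal sketch of the replacement approach (the wrapper name is hypothetical; only the `lfs getstripe -M` invocation comes from the patch subject): instead of deriving the MDS number in shell, ask Lustre directly which MDT holds a given directory:

```shell
#!/bin/sh
# Hypothetical wrapper in the spirit of the patch: query the MDT index
# of a directory from Lustre itself rather than computing it in shell.
LFS=${LFS:-lfs}

get_dir_mdt_index() {
    # Prints the MDT index (e.g. 0 or 1) holding the directory "$1".
    $LFS getstripe -M "$1"
}

# Example usage (requires a mounted Lustre client; path is illustrative):
# idx=$(get_dir_mdt_index /mnt/lustre/d21b)
# echo "directory lives on MDT$idx"
```

This avoids the fragile bookkeeping that got corrupted in get_mds_dir(), since the filesystem is the authority on directory placement.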

Comment by Jian Yu [ 11/Nov/14 ]

We need to add replay-dual back into the patch review test group, but this failure is preventing replay-dual from passing on the master branch, so I am raising the priority of this ticket to Blocker.

Comment by Gerrit Updater [ 08/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12363/
Subject: LU-5759 tests: use lfs getstripe -M instead of get_mds_num
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9804f872d28315a7ddfe835138bb33e2206ffe52

Comment by Jodi Levi (Inactive) [ 08/Jan/15 ]

Patch landed to Master.

Generated at Sat Feb 10 01:54:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.