[LU-4433] mds-survey test_1: FAIL: OST lustre-MDT0001 not setup
Created: 04/Jan/14  Updated: 20/Nov/17  Resolved: 22/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.6.0, Lustre 2.5.1, Lustre 2.5.3, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Major
Reporter: Jian Yu
Assignee: nasf (Inactive)
Resolution: Fixed
Votes: 0
Labels: dne
Environment:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/5/
Distro/Arch: RHEL6.4/x86_64
MDSCOUNT=2


Issue Links:
Duplicate
Related
is related to LU-6341 Do not check security when accessing ... Resolved
Severity: 3
Rank (Obsolete): 12177

 Description   

While running mds-survey in DNE mode, test 1 failed as follows:

== mds-survey test 1: Metadata survey with zero-stripe == 18:52:57 (1388717577)
CMD: client-7vm3 lctl dl
+ file_count=102962 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=0 rslt_loc=/tmp targets="client-7vm3:lustre-MDT0000 lustre-MDT0001" /usr/bin/mds-survey
Warning: Permanently added 'client-7vm3,10.10.4.236' (RSA) to the list of known hosts.
OST lustre-MDT0001 not setup
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 1279000 126456 355052    0    0     6    25  490  332  1  6 90  2  1	
 mds-survey test_1: @@@@@@ FAIL: mds-survey failed

Maloo report: https://maloo.whamcloud.com/test_sets/64322640-749a-11e3-8b21-52540035b04c



 Comments   
Comment by Jian Yu [ 04/Jan/14 ]

In DNE mode, mds-survey test 2 failed as follows:

== mds-survey test 2: Metadata survey with stripe_count = 1 == 18:53:03 (1388717583)
CMD: client-7vm3 lctl dl
+ file_count=102962 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=1 rslt_loc=/tmp targets="client-7vm3:lustre-MDT0000 lustre-MDT0001" /usr/bin/mds-survey
Need obdfilter to test stripe_count
cat: /tmp/mds_survey*: No such file or directory
 mds-survey test_2: @@@@@@ FAIL: mds-survey failed

Maloo report: https://maloo.whamcloud.com/test_sets/64322640-749a-11e3-8b21-52540035b04c

Comment by Jian Yu [ 07/Mar/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/39/ (2.5.1 RC1)
MDSCOUNT=2

The same failure occurred:
https://maloo.whamcloud.com/test_sets/8aea0bde-a557-11e3-a61d-52540035b04c

Comment by Sarah Liu [ 26/Jul/14 ]

Hit this in b2_6-rc2 DNE testing

Client and server: lustre-b2_6-rc2, 1 MDS with 2 MDTs

https://testing.hpdd.intel.com/test_sets/b99b1d4e-14ef-11e4-bb6a-5254006e85c2

== mds-survey test 1: Metadata survey with zero-stripe == 22:17:04 (1406351824)
CMD: onyx-46vm7 lctl dl
+ file_count=102960 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=0 rslt_loc=/tmp targets="onyx-46vm7:lustre-MDT0000 lustre-MDT0001" /usr/bin/mds-survey
Warning: Permanently added 'onyx-46vm7' (RSA) to the list of known hosts.
OST lustre-MDT0001 not setup
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 1386232 136280 238416    0    0    14   114 1219  687  1 10 85  4  0	
 mds-survey test_1: @@@@@@ FAIL: mds-survey failed

Comment by Jian Yu [ 31/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)

The same failure occurred: https://testing.hpdd.intel.com/test_sets/3a8c8fc2-308a-11e4-a3d9-5254006e85c2

Comment by Sarah Liu [ 14/Mar/16 ]

Hit the same issue on master/#3324 in DNE mode:
https://testing.hpdd.intel.com/test_sets/6cf67e46-e7db-11e5-afa2-5254006e85c2

Comment by Jian Yu [ 11/Apr/16 ]

Hi Di,

Does /usr/bin/mds-survey support multiple MDT targets?

While running the following command on master branch:

file_count=117549 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=0 rslt_loc=/tmp targets="eagle-36vm6:lustre-MDT0000 eagle-36vm6:lustre-MDT0002 eagle-37vm5:lustre-MDT0001 eagle-37vm5:lustre-MDT0003" /usr/bin/mds-survey

I hit the following errors:

Mon Apr 11 01:28:35 UTC 2016 /usr/bin/mds-survey from eagle-30vm6
error: test_mkdir: Object is remote
ERROR: fail test_mkdir
created directories on eagle-36vm6:lustre-MDT0002_ecc failed
program exited with error
=======> Create 4 directories on eagle-36vm6:lustre-MDT0000_ecc
=======> Create 4 directories on eagle-36vm6:lustre-MDT0002_ecc
error: test_mkdir: Object is remote
Mon Apr 11 01:28:35 UTC 2016 /usr/bin/mds-survey from eagle-30vm6
created directories on eagle-36vm6:lustre-MDT0002_ecc failed

Comment by Di Wang [ 11/Apr/16 ]

Hmm, if I remember correctly, b2_5 should support mds-survey on multiple MDTs. Master definitely does. Maybe there is a bug in b2_5.

Comment by Jian Yu [ 11/Apr/16 ]

Hi Di,
The above command was run on master branch. It failed.

Comment by Gerrit Updater [ 11/Apr/16 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: http://review.whamcloud.com/19437
Subject: LU-4433 tests: fix mds-survey.sh to support multiple MDTs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2589497e3b2ea3b1d09e35742a12cf54e79ceeb7

Comment by Jian Yu [ 11/Apr/16 ]

Hi Di,

The above patch only fixes the mds-survey.sh test script to support multiple MDTs. mds-survey itself still fails with multiple MDTs.
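
For illustration: in the failing test runs earlier in this ticket, the targets string carries a host prefix only on the first MDT ("client-7vm3:lustre-MDT0000 lustre-MDT0001"), while the manual run lists every MDT as host:target. Below is a minimal sketch of building the latter form, assuming each entry is meant to be host:target. The build_targets helper and the input list are hypothetical and are not taken from mds-survey.sh or the patch.

```python
def build_targets(mdts):
    """Join (host, target) pairs into a space-separated 'host:target' list."""
    return " ".join(f"{host}:{target}" for host, target in mdts)


# Hypothetical layout mirroring the manual invocation quoted in this ticket.
mdts = [
    ("eagle-36vm6", "lustre-MDT0000"),
    ("eagle-36vm6", "lustre-MDT0002"),
    ("eagle-37vm5", "lustre-MDT0001"),
    ("eagle-37vm5", "lustre-MDT0003"),
]

# The result would be passed as targets="..." to /usr/bin/mds-survey.
print(build_targets(mdts))
```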

Comment by Di Wang [ 12/Apr/16 ]

Could you please re-run with debug-level = -1? thanks.

Comment by Jian Yu [ 13/Apr/16 ]

Hi Di,

Here is the report with debug=-1 and debug_mb=150:
https://testing.hpdd.intel.com/test_sets/9bb65320-011a-11e6-9ccf-5254006e85c2

Comment by Di Wang [ 15/Apr/16 ]

According to the debug log, creation fails in lod_declare_object_create():

00000004:00000001:0.0:1460512310.071969:0:15039:0:(lod_object.c:3470:lod_declare_object_create()) Process leaving via out (rc=18446744073709551550 : -66 : 0xffffffffffffffbe)
00000004:00000001:0.0:1460512310.071972:0:15039:0:(lod_object.c:3508:lod_declare_object_create()) Process leaving (rc=18446744073709551550 : -66 : ffffffffffffffbe)
00000004:00000001:0.0:1460512310.071974:0:15039:0:(mdd_object.c:374:mdd_declare_object_create_internal()) Process leaving (rc=18446744073709551550 : -66 : ffffffffffffffbe)
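
The rc in these trace lines is printed three ways: as an unsigned 64-bit value, as a signed value, and in hex. A small sketch of how the unsigned form maps to -66 follows. The to_signed64 helper is hypothetical, and the errno name lookup depends on the platform (on Linux, errno 66 is EREMOTE, "Object is remote", which matches the test_mkdir error above).

```python
import errno
import os


def to_signed64(value):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - (1 << 64) if value >= (1 << 63) else value


rc = to_signed64(18446744073709551550)
print(rc)  # -66, the same value as 0xffffffffffffffbe

# Platform dependent: on Linux this reports EREMOTE for errno 66.
print(errno.errorcode.get(-rc, "unknown"), os.strerror(-rc))
```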

The relevant part of lod_declare_object_create():

static int lod_declare_object_create(const struct lu_env *env,
                                     struct dt_object *dt,
                                     struct lu_attr *attr,
                                     struct dt_allocation_hint *hint,
                                     struct dt_object_format *dof,
                                     struct thandle *th)
{
...........

                /* If the parent has default stripeEA, and client
                 * did not find it before sending create request,
                 * then MDT will return -EREMOTE, and client will
                 * retrieve the default stripeEA and re-create the
                 * sub directory.
                 *
                 * Note: if dah_eadata != NULL, it means creating the
                 * striped directory with specified stripeEA, then it
                 * should ignore the default stripeEA */
                if (hint != NULL && hint->dah_eadata == NULL) {
                        if (OBD_FAIL_CHECK(OBD_FAIL_MDS_STALE_DIR_LAYOUT))
                                GOTO(out, rc = -EREMOTE);

       ........   
}

If the create request carries no EA and the parent has no default stripe EA, the newly created child must be on the same MDT as its parent; otherwise the create returns -EREMOTE. See LU-6341 (http://review.whamcloud.com/13990). Unfortunately, that change did not take the echo client into account.

So there are two options to fix the problem:

1. If the echo client wants to create a remote directory, add lum into the creation spec; see echo_create_md_object().
2. Or add some flags to spec and hint, so that lod_declare_object_create() skips this check for the echo client.

Either way is fine with me.
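
The rule and the two options can be sketched as a toy model. This is not Lustre code: declare_create_rc and all of its parameters are hypothetical, and the model covers only the case described in this comment (no EA in the request, no default stripe EA on the parent). Every other case is simply treated as "not rejected here".

```python
import errno


def declare_create_rc(parent_mdt, child_mdt, has_eadata,
                      parent_has_default_stripe, skip_remote_check=False):
    """Toy model of the same-MDT rule; returns 0 or -EREMOTE."""
    if skip_remote_check:
        # Option 2: a (hypothetical) flag lets the echo client bypass the check.
        return 0
    if has_eadata or parent_has_default_stripe:
        # The rule discussed here does not apply; simplified to success.
        return 0
    if child_mdt != parent_mdt:
        return -errno.EREMOTE
    return 0


# Echo client creating a directory on MDT2 under a parent on MDT0, with no EA.
print(declare_create_rc(0, 2, False, False))
# Option 1: the request carries an EA (lum in the creation spec).
print(declare_create_rc(0, 2, True, False))
# Option 2: the check is skipped for the echo client.
print(declare_create_rc(0, 2, False, False, skip_remote_check=True))
```

In this model the first call is rejected with -EREMOTE, while both options let the create proceed.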

Comment by Peter Jones [ 05/Aug/16 ]

Reassigning to Lai

Comment by nasf (Inactive) [ 15/Aug/16 ]

Take it as Peter suggested.

Comment by nasf (Inactive) [ 15/Aug/16 ]

Updated the patch http://review.whamcloud.com/#/c/19437/. Hopefully it passes the Maloo tests.

Comment by Gerrit Updater [ 22/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19437/
Subject: LU-4433 tests: fix mds-survey.sh to support multiple MDTs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 530b16b187de9a27723205d9d759f260bfd350b8

Comment by nasf (Inactive) [ 22/Aug/16 ]

The patch has been landed to master.

Comment by Gerrit Updater [ 13/Oct/16 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: http://review.whamcloud.com/23146
Subject: LU-4433 obd: use ktime_t for calculating elapsed time
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c52ac644ab7bf1d3d9fe3fac5918ca2d2d584b80

Comment by James A Simmons [ 13/Oct/16 ]

Sorry mistyped
