[LU-3886] sanity test_56a: @@@@@@ FAIL: /usr/bin/lfs getstripe --obd wrong: found 6, expected 3 Created: 05/Sep/13  Updated: 17/Mar/20  Resolved: 17/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Li Xi (Inactive) Assignee: Emoly Liu
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File run-sanity.sh    
Issue Links:
Duplicate
is duplicated by LU-7071 Interop 2.5.3<->master DNE: sanity te... Resolved
Related
is related to LU-3846 Sanity 56u error with two OSTs Resolved
is related to LU-3858 sanity test_27A: @@@@@@ FAIL: stripe ... Resolved
Severity: 3
Rank (Obsolete): 10111

 Description   

This problem is similar to LU-3846 and LU-3858. The test suite should wait a few seconds after it clears the striping of the directory. Otherwise, newly created entries under the directory will have a stripe count of 2 rather than 1.



 Comments   
Comment by Peter Jones [ 06/Sep/13 ]

Emoly

Could you please comment on this one?

Thanks

Peter

Comment by Emoly Liu [ 09/Sep/13 ]

"lfs setstripe -d" should be enough to clear directory striping information. LiXi, could you tell me how you hit this problem?

IMO, we can add "lfs getstripe -v" after "setstripe -d" to print the striping information and see whether this problem happens again.

Comment by Li Xi (Inactive) [ 10/Sep/13 ]

I've posted the script (run-sanity.sh) that hits this problem (and LU-3858, LU-3846). It hits the problem every time it runs.

Comment by Li Xi (Inactive) [ 10/Sep/13 ]

What is interesting is that when I add a sleep to the test suite, the problem goes away. That makes me believe the problem is similar to LU-3858. Emoly, would you please check the attachment of LU-3858 first? I.e. https://jira.hpdd.intel.com/secure/attachment/13414/run.sh. It shows that the effect of the default stripe is delayed. Thanks!

test_56a() { # was test_56
        rm -rf $DIR/$tdir
        $SETSTRIPE -d $DIR
        test_mkdir -p $DIR/$tdir/dir
        NUMFILES=3
        NUMFILESx2=$(($NUMFILES * 2))
        sleep 10 # This will fix the problem.
        for i in `seq 1 $NUMFILES` ; do
                touch $DIR/$tdir/file$i
                touch $DIR/$tdir/dir/file$i
        done
        ......
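
As an aside, a fixed "sleep 10" both slows the test and can still race if the delay ever exceeds 10 seconds. A bounded poll that waits only as long as needed would be more robust. Below is a minimal, Lustre-independent sketch of such a helper; the commented check_stripe_count usage is hypothetical, standing in for an actual "lfs getstripe -c" query:

```shell
#!/bin/sh
# wait_until CMD TIMEOUT_SECS -- run CMD once per second until it
# succeeds or TIMEOUT_SECS seconds elapse; the final exit status is
# that of the last CMD invocation.
wait_until() {
        cmd=$1
        timeout=${2:-10}
        i=0
        while [ "$i" -lt "$timeout" ]; do
                if $cmd; then
                        return 0
                fi
                sleep 1
                i=$((i + 1))
        done
        $cmd
}

# Hypothetical use in test_56a, replacing the fixed sleep: poll until
# the default stripe count reported for $DIR matches the expected value.
#   wait_until "check_stripe_count $DIR 1" 10 || error "stripe not cleared"
```

With a helper like this, the common case (striping already cleared) costs nothing, and the worst case is bounded by the timeout rather than a guessed constant.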

Comment by Emoly Liu [ 10/Sep/13 ]

I just ran the script from "https://jira.hpdd.intel.com/secure/attachment/13414/run.sh" on my local VM. It printed "No errors after xxx iters" 10000 times.

My steps were:
1. mount lustre
2. mkdir /mnt/lustre/d
3. sh run.sh /mnt/lustre/d

I tried several times; no error occurred.

Comment by Li Xi (Inactive) [ 10/Sep/13 ]

Oh, sorry, please run it on the Lustre mount point '/mnt/lustre' rather than a directory under it such as '/mnt/lustre/dir', i.e. sh run.sh /mnt/lustre/

I got the following output:

No error after 640 iters
-1 != 1
Does not become correct after 0 seconds
Does not become correct after 1 seconds
Does not become correct after 2 seconds
Does not become correct after 3 seconds
Does not become correct after 4 seconds
Does not become correct after 5 seconds
Does not become correct after 6 seconds
Become correct after 7 seconds

Comment by Emoly Liu [ 10/Sep/13 ]

Yes, this time I hit that. I will investigate it.

Comment by Emoly Liu [ 11/Sep/13 ]

The problem you found with run.sh is probably related to the following code:

When we set the stripe for the root (mount point), set_default is enabled in ll_dir_ioctl():

        case LL_IOC_LOV_SETSTRIPE: {
...
                int set_default = 0;
...
                if (inode->i_sb->s_root == file->f_dentry)
                        set_default = 1;

                /* in v1 and v3 cases lumv1 points to data */
                rc = ll_dir_setstripe(inode, lumv1, set_default);

Then, in ll_dir_setstripe(), if set_default=1, we call ll_send_mgc_param() to set the information asynchronously.

        if (set_default && mgc->u.cli.cl_mgc_mgsexp) {
                /* Set root stripesize */
                /* Set root stripecount */
                /* Set root stripeoffset */
        }

Since run.sh runs setstripe very frequently and many times, the config log queue might grow very long (a bottleneck), and the MGS will take more time to process it.

BTW, can you hit this problem if you don't use run.sh and just run sanity.sh regularly?

Comment by Li Xi (Inactive) [ 11/Sep/13 ]

Yeah, I hit the 'found 6, expected 3' problem every time I run sanity.sh.

Comment by James Nunez (Inactive) [ 25/Jan/15 ]

I've hit this problem with lustre-master tag 2.6.92. Results at https://testing.hpdd.intel.com/test_sets/37e63f92-9f0d-11e4-91b3-5254006e85c2

Comment by Andreas Dilger [ 17/Mar/20 ]

Haven't seen this in a long time.

Generated at Sat Feb 10 01:37:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.