sanity test_56a: @@@@@@ FAIL: /usr/bin/lfs getstripe --obd wrong: found 6, expected 3

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.5.0
    • None
    • 3
    • 10111

    Description

      This problem is similar to LU-3846 and LU-3858. The test suite should wait a few seconds after it clears the stripe of the directory. Otherwise, newly created entries under the directory will have a stripe count of 2 rather than 1.

          Activity

            [LU-3886] sanity test_56a: @@@@@@ FAIL: /usr/bin/lfs getstripe --obd wrong: found 6, expected 3

            adilger Andreas Dilger added a comment -

            Haven't seen this in a long time.

            jamesanunez James Nunez (Inactive) added a comment -

            I've hit this problem with lustre-master tag 2.6.92. Results are at https://testing.hpdd.intel.com/test_sets/37e63f92-9f0d-11e4-91b3-5254006e85c2

            lixi Li Xi (Inactive) added a comment -

            Yeah, I hit the 'found 6, expected 3' problem every time I run sanity.sh.
            emoly.liu Emoly Liu added a comment -

            The problem you found with run.sh is probably related to the following code:

            When we set the stripe for the root (mount point), set_default is enabled in ll_dir_ioctl():

                    case LL_IOC_LOV_SETSTRIPE: {
            ...
                            int set_default = 0;
            ...
                            if (inode->i_sb->s_root == file->f_dentry)
                                    set_default = 1;
            
                            /* in v1 and v3 cases lumv1 points to data */
                            rc = ll_dir_setstripe(inode, lumv1, set_default);
            

            Then, in ll_dir_setstripe(), if set_default is 1, we call ll_send_mgc_param() to set the information asynchronously.

                    if (set_default && mgc->u.cli.cl_mgc_mgsexp) {
                            /* Set root stripesize */
                            /* Set root stripecount */
                            /* Set root stripeoffset */
                    }
            

            Since run.sh runs setstripe many times in quick succession, the config log queue might become very long (a bottleneck), and the MGS will take more time to process it.

            BTW, can you hit this problem if you don't use run.sh and just run sanity.sh normally?

            emoly.liu Emoly Liu added a comment -

            Yes, this time I hit that. I will investigate it.


            lixi Li Xi (Inactive) added a comment -

            Oh, sorry, please run it on the Lustre mount point '/mnt/lustre' rather than on a directory under it such as '/mnt/lustre/dir', i.e. sh run.sh /mnt/lustre/

            I got the following output:

            No error after 640 iters
            -1 != 1
            Does not become correct after 0 seconds
            Does not become correct after 1 seconds
            Does not become correct after 2 seconds
            Does not become correct after 3 seconds
            Does not become correct after 4 seconds
            Does not become correct after 5 seconds
            Does not become correct after 6 seconds
            Become correct after 7 seconds
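The output above reads like a once-per-second poll that reports how long a value takes to become correct. The following is a minimal sketch of that kind of measurement, not the attached run.sh: the function name `seconds_until_correct`, its arguments, and the example probe are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (NOT the attached run.sh): poll a probe command once per
# second and report how long it takes to print the expected value.
# usage: seconds_until_correct EXPECTED MAX_SECONDS PROBE [ARGS...]
seconds_until_correct() {
    expected=$1
    max=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -le "$max" ]; do
        # Run the probe and capture whatever it prints on stdout.
        actual=$("$@" 2>/dev/null)
        if [ "$actual" = "$expected" ]; then
            echo "Become correct after $elapsed seconds"
            return 0
        fi
        echo "Does not become correct after $elapsed seconds"
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The value never matched within MAX_SECONDS.
    return 1
}

# Illustrative call only; assumes a mounted Lustre client and that the probe
# prints a single value to compare against:
#   seconds_until_correct 1 30 lfs getstripe -c /mnt/lustre
```

Passing the probe as a command keeps the loop independent of how the stripe count is actually read, so the same sketch can be pointed at a file, a directory, or a stub.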
            emoly.liu Emoly Liu added a comment -

            I just ran the script at "https://jira.hpdd.intel.com/secure/attachment/13414/run.sh" on my local VM. It printed "No errors after xxx iters" 10000 times.

            My steps were:
            1. mount lustre
            2. mkdir /mnt/lustre/d
            3. sh run.sh /mnt/lustre/d

            I tried several times and no error occurred.


            lixi Li Xi (Inactive) added a comment -

            What is interesting is that when I add a sleep to the test suite, the problem goes away. That makes me believe the problem is similar to LU-3858. Emoly, would you please check the attachment of LU-3858 first, i.e. https://jira.hpdd.intel.com/secure/attachment/13414/run.sh? It shows that the effect of the default stripe is delayed. Thanks!

            test_56a() { # was test_56
                    rm -rf $DIR/$tdir
                    $SETSTRIPE -d $DIR
                    test_mkdir -p $DIR/$tdir/dir
                    NUMFILES=3
                    NUMFILESx2=$(($NUMFILES * 2))
                    sleep 10 # This will fix the problem.
                    for i in `seq 1 $NUMFILES` ; do
                            touch $DIR/$tdir/file$i
                            touch $DIR/$tdir/dir/file$i
                    done
            ......
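A fixed `sleep 10` hides the delay but costs ten seconds on every run and can still be too short on a slow setup. One possible alternative is a bounded wait that returns as soon as a check passes. The sketch below is only an illustration under stated assumptions: `wait_until` and `default_count_is` are hypothetical names that are not part of sanity.sh, and the expected stripe count in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper (not part of sanity.sh): retry a check once per second
# until it succeeds or MAX_SECONDS have passed.
# usage: wait_until MAX_SECONDS CHECK [ARGS...]
wait_until() {
    max=$1
    shift
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$max" ]; then
            return 1    # gave up: the check never succeeded in time
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical check: succeeds once the stripe count read back for path $1
# equals $2. Assumes `lfs getstripe -c` prints the stripe count.
default_count_is() {
    [ "$(lfs getstripe -c "$1" 2>/dev/null)" = "$2" ]
}

# In test_56a this might stand in for the fixed sleep (expected value is
# illustrative):
#   wait_until 30 default_count_is "$DIR" 1 || error "default stripe not applied"
```

The check is passed as a command so the wait logic stays reusable for other delayed settings.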
            lixi Li Xi (Inactive) added a comment - - edited

            I've posted a script (run-sanity.sh) that hits this problem (and LU-3858, LU-3846). It hits the problem every time it runs.

            emoly.liu Emoly Liu added a comment -

            "lfs setstripe -d" should be enough to clear directory striping information. LiXi, could you tell me how you hit this problem?

            IMO, we can add "lfs getstripe -v" after "setstripe -d" to print the striping information and see whether this problem happens again.
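The suggestion above could be tried with a small wrapper along these lines. This is a sketch only: `clear_stripe_and_report` is a hypothetical name, and the `LFS` override exists here purely so the wrapper can be exercised without a Lustre mount.

```shell
#!/bin/sh
# Hypothetical debugging wrapper (not in sanity.sh): clear a directory's
# striping, then print what getstripe reports so a failed run leaves a record.
# LFS may be overridden; by default it is the `lfs` binary found in PATH.
LFS=${LFS:-lfs}

clear_stripe_and_report() {
    dir=$1
    # Stop early if the striping could not be cleared.
    "$LFS" setstripe -d "$dir" || return 1
    echo "striping of $dir after setstripe -d:"
    "$LFS" getstripe -v "$dir"
}

# Illustrative call, assuming a mounted Lustre client:
#   clear_stripe_and_report /mnt/lustre
```

If the failure comes from a delayed default, the printed striping right after the clear should differ from what a later getstripe shows.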


            People

              emoly.liu Emoly Liu
              lixi Li Xi (Inactive)