Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-4515

Test failure sanity-quota test_34: Used space for user 60000 is 0, expected 2048

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.6.0
    • Lustre 2.6.0
    • None
    • 3
    • 12354

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/0c69b0d8-7e9d-11e3-b6d1-52540035b04c
      https://maloo.whamcloud.com/test_sets/e355e8f2-8148-11e3-81ba-52540035b04c

      The sub-test test_34 failed with the following error:

      Used space for user 60000 is 0, expected 2048

      Info required for matching: sanity-quota 34

      Attachments

        Issue Links

          Activity

            [LU-4515] Test failure sanity-quota test_34: Used space for user 60000 is 0, expected 2048

            Jodi, how has this test been disabled? Could this be related to issues with TEI-1403?

            utopiabound Nathaniel Clark added a comment - Jodi, how has this test been disabled? Could this be related to issues with TEI-1403?

            Test has been disabled. Will re-enable once fix is in place.

            jlevi Jodi Levi (Inactive) added a comment - Test has been disabled. Will re-enable once fix is in place.

            As Johann said the test depends on wait_delete_completed to operate correctly. wait_delete_completed relies on lproc stats maintained in osp_sync.c. I see http://review.whamcloud.com/7158 landed that touches osp code just about the time these errors started up. I don't see anything wrong in that mod, but it does make changes in the close neighborhood and the coincidence in time seems suspicious.

            bogl Bob Glossman (Inactive) added a comment - As Johann said the test depends on wait_delete_completed to operate correctly. wait_delete_completed relies on lproc stats maintained in osp_sync.c. I see http://review.whamcloud.com/7158 landed that touches osp code just about the time these errors started up. I don't see anything wrong in that mod, but it does make changes in the close neighborhood and the coincidence in time seems suspicious.

            Bobijam, since 2.4, this is the MDT which is responsible for sending the OST_SETATTR RPCs to change ownership of objects on the OSTs. This is done through OSP and that's why the test calls wait_delete_completed to wait for OSP to flush all pending requests.

            johann Johann Lombardi (Inactive) added a comment - Bobijam, since 2.4, this is the MDT which is responsible for sending the OST_SETATTR RPCs to change ownership of objects on the OSTs. This is done through OSP and that's why the test calls wait_delete_completed to wait for OSP to flush all pending requests.
            bobijam Zhenyu Xu added a comment - - edited

            from client ll_setattr_raw(), after it set attr on MDS, client only set ost attr for size & time change, won't do anything about uid change on OST.

                    if (attr->ia_valid & (ATTR_SIZE |
                                          ATTR_ATIME | ATTR_ATIME_SET |
                                          ATTR_MTIME | ATTR_MTIME_SET)) {
                            /* For truncate and utimes sending attributes to OSTs, setting
                             * mtime/atime to the past will be performed under PW [0:EOF]
                             * extent lock (new_size:EOF for truncate).  It may seem
                             * excessive to send mtime/atime updates to OSTs when not
                             * setting times to past, but it is necessary due to possible
                             * time de-synchronization between MDT inode and OST objects */
                            rc = ll_setattr_ost(inode, attr);
                    }
            

            I verified this behavior on ldiskfs backend, and the test passed for ldiskfs backend. From what's been observed, it likely relates to the quota implementation with zfs backend.

            bobijam Zhenyu Xu added a comment - - edited from client ll_setattr_raw(), after it set attr on MDS, client only set ost attr for size & time change, won't do anything about uid change on OST. if (attr->ia_valid & (ATTR_SIZE | ATTR_ATIME | ATTR_ATIME_SET | ATTR_MTIME | ATTR_MTIME_SET)) { /* For truncate and utimes sending attributes to OSTs, setting * mtime/atime to the past will be performed under PW [0:EOF] * extent lock (new_size:EOF for truncate). It may seem * excessive to send mtime/atime updates to OSTs when not * setting times to past, but it is necessary due to possible * time de-synchronization between MDT inode and OST objects */ rc = ll_setattr_ost(inode, attr); } I verified this behavior on ldiskfs backend, and the test passed for ldiskfs backend. From what's been observed, it likely relates to the quota implementation with zfs backend.

            The lack of OST_SETATTR requests is weird. The test relies on wait_delete_completed to wait for setattr RPCs to be sent, is it possible that this function does not work any more for setattr?

            johann Johann Lombardi (Inactive) added a comment - The lack of OST_SETATTR requests is weird. The test relies on wait_delete_completed to wait for setattr RPCs to be sent, is it possible that this function does not work any more for setattr?

            Niu: I have submitted a patch (http://review.whamcloud.com/#/c/8981/) to disable sanity-quota LU-4515 to avoid other patches to be failed for this bug. Please enable it when you submit patch to fix related issues. Thanks!

            yong.fan nasf (Inactive) added a comment - Niu: I have submitted a patch ( http://review.whamcloud.com/#/c/8981/ ) to disable sanity-quota LU-4515 to avoid other patches to be failed for this bug. Please enable it when you submit patch to fix related issues. Thanks!

            I was looking at the debug logs for https://maloo.whamcloud.com/test_sets/0c69b0d8-7e9d-11e3-b6d1-52540035b04c and it appears that the "chown" RPC ("MDS_REINT") is sent to the MDS, but no matching "OST_SETATTR" RPC is sent to the OST to change the object ownership. I also see in the logs:

            1389835717.361895:0:4338:0:(osd_quota.c:98:osd_acct_index_lookup()) lustre-MDT0000: id ea60 not found in DMU accounting ZAP
            

            which indicates to me (a non quota expert) that this user (0xea60=60000) doesn't have any files in quota accounting.

            adilger Andreas Dilger added a comment - I was looking at the debug logs for https://maloo.whamcloud.com/test_sets/0c69b0d8-7e9d-11e3-b6d1-52540035b04c and it appears that the " chown " RPC (" MDS_REINT ") is sent to the MDS, but no matching " OST_SETATTR " RPC is sent to the OST to change the object ownership. I also see in the logs: 1389835717.361895:0:4338:0:(osd_quota.c:98:osd_acct_index_lookup()) lustre-MDT0000: id ea60 not found in DMU accounting ZAP which indicates to me (a non quota expert) that this user (0xea60=60000) doesn't have any files in quota accounting.

            I did some digging into the test failures, and the earliest failures were hit with parent commit 035f3e4bf7532839dd88a4ae330fd67542e17cdd (http://review.whamcloud.com/8528) but I don't think that patch is the source of the problem. Nothing similar was hit in the two weeks prior to this.

            adilger Andreas Dilger added a comment - I did some digging into the test failures, and the earliest failures were hit with parent commit 035f3e4bf7532839dd88a4ae330fd67542e17cdd ( http://review.whamcloud.com/8528 ) but I don't think that patch is the source of the problem. Nothing similar was hit in the two weeks prior to this.

            I would suspect this is a regression in a patch that landed recently. The problem only appeared on 2014-01-16, so failing tests should be checked for the earliest parent that is common to all of the test failures (probably landed within a day of the start of tests failing).

            I also wrote a small fixup to the sanity-quota.sh test script because it was spewing a lot of garbage into the logs: http://review.whamcloud.com/8978 Note that this is NOT a fix to the problem reported here.

            adilger Andreas Dilger added a comment - I would suspect this is a regression in a patch that landed recently. The problem only appeared on 2014-01-16, so failing tests should be checked for the earliest parent that is common to all of the test failures (probably landed within a day of the start of tests failing). I also wrote a small fixup to the sanity-quota.sh test script because it was spewing a lot of garbage into the logs: http://review.whamcloud.com/8978 Note that this is NOT a fix to the problem reported here.

            Niu,
            Could you please take a look at this one?
            Thank you!

            jlevi Jodi Levi (Inactive) added a comment - Niu, Could you please take a look at this one? Thank you!

            People

              niu Niu Yawei (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: