slow chgrp as user when quotas are enabled

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • Lustre 2.10.4
    • None
    • 3

    Description

      Hi,

      we have had a user complain that chgrp of a directory tree of a few thousand files takes 3x longer than the untar of that data.

      it seems likely that this is due to LU-5152 which AFAICT introduced code that forces a dt_sync for each chgrp as a user.

      is there another way to do this which avoids the dt_sync?

      in my experience most HPC sites use secondary (supplementary) groups extensively so that users can be members of several research projects. for various reasons this results in lots of files being created with the wrong group for the file's location. as root we periodically trawl the filesystem to correct the group ownership of files to match their physical location (i.e. poor man's directory/project quotas), but sometimes users still want to change the group ownership themselves to "do the right thing", and now this goes a lot slower for them.

      so I suppose your expectation that unpriv users doing chgrp is rare is sort of valid because we do most of it manually and sporadically for them as root, but (again, in my experience) because of extensive use of supplementary groups in HPC, users wanting to do a chgrp is perhaps more common than you might think.

      project quotas would remove most of our reasons for using chgrp but maybe not all. unfortunately we aren't likely to try any more new things like project quotas any time soon.

      BTW it would be good to have lustre test users that had secondary groups in order to find problems like this. I don't see any at the moment. I was looking because I need one to make a regression test case for LU-11227 (related to LU-5152).

      cheers,
      robin

          Activity

            [LU-11303] slow chgrp as user when quotas are enabled
            pjones Peter Jones added a comment -

            Landed for 2.15


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33996/
            Subject: LU-11303 quota: enforce block quota for chgrp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 83f5544d8518ad12ea49e27829fff8f2739b86e2

            gerrit Gerrit Updater added a comment - edited

            https://review.whamcloud.com/33996 has been updated

            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33996
            Subject: LU-11303 quota: enforce block quota for chgrp
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f8cf68fcb0432a9c293f678428e6f4ac6fa53c37

            adilger Andreas Dilger added a comment -

            I see that patch https://review.whamcloud.com/16699 "LU-7239 mdd: make mdd_attr_set() synchronous less often" removes one source of sync operations on the MDS for chgrp, but does not address the dt_sync() call for chgrp to avoid over-quota on the OSS nodes.

            It possibly makes sense to do a simple check if the user is close to exceeding the quotas before enforcing the sync behaviour (e.g. quota free > file size). If they are not close to the quota limit there is no need to enforce the sync behaviour.
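
The check suggested above (only force the sync when the group is close to its limit) could be sketched roughly as follows. This is an illustration in Python, not Lustre code: `GroupQuota`, `needs_sync_chgrp`, and the convention that a hard limit of 0 means "no limit" are all assumptions made for the example, and the real decision would have to be made on the server side with the actual quota accounting.

```python
from dataclasses import dataclass


@dataclass
class GroupQuota:
    """Hypothetical snapshot of the target group's block quota, in bytes."""
    hard_limit: int  # assumption for this sketch: 0 means no limit is set
    used: int


def needs_sync_chgrp(quota: GroupQuota, file_size: int) -> bool:
    """Return True if the chgrp should take the slow, synchronous path.

    Mirrors the "quota free > file size" idea: when the target group has
    more free quota than the file occupies, the sync can be skipped.
    """
    if quota.hard_limit == 0:
        return False  # no limit, so the group cannot go over quota
    free = quota.hard_limit - quota.used
    return free <= file_size


if __name__ == "__main__":
    roomy = GroupQuota(hard_limit=10 * 2**30, used=1 * 2**30)
    tight = GroupQuota(hard_limit=10 * 2**30, used=10 * 2**30 - 4096)
    one_mib = 2**20
    print("roomy group needs sync:", needs_sync_chgrp(roomy, one_mib))
    print("tight group needs sync:", needs_sync_chgrp(tight, one_mib))
```

The sketch compares against the file size only, as in the comment. A real check would probably want to use the allocated block count, since sparse or preallocated files can differ from their apparent size.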

            adilger Andreas Dilger added a comment -

            The landed patch was just a code cleanup and did not address the issue in this ticket.

            lflis Lukasz Flis added a comment -

            @adilger could you please comment on whether this patch solves the problem with slow chgrp introduced by LU-5152, or is it just a cosmetic cleanup to drop unknown opcodes in the RPC?

            I have backported this patch (https://review.whamcloud.com/33107/) to b2_10,
            but I wanted to be sure it fixes the problem before we go to production with it.

            lflis Lukasz Flis added a comment -

            We can confirm the same problem on 2.10.5 on the HPC system at CYFRONET.

            quota enforcement: enabled

            A single chgrp of a single file to a secondary group, executed by a non-root user, can take from 10 to 140 seconds on a busy filesystem.

            The chgrp command blocks on the fchownat syscall.

            @Peter Jones: do you plan to include the fix in the next b2_10 release (2.10.6)?

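
For anyone wanting to reproduce the timing reported above, a minimal sketch along these lines may help. It is only an illustration: the helper names `time_chgrp` and `secondary_gids` are made up for this example, and it uses Python's `os.chown`, which is not guaranteed to issue the same chown-family syscall as the `chgrp` utility.

```python
import os
import sys
import time
from typing import List


def time_chgrp(path: str, gid: int) -> float:
    """Return the wall-clock seconds taken by one group change on path."""
    start = time.perf_counter()
    os.chown(path, -1, gid)  # uid of -1 leaves the owner unchanged
    return time.perf_counter() - start


def secondary_gids() -> List[int]:
    """Supplementary groups of the calling user, excluding the primary gid."""
    primary = os.getgid()
    return [g for g in os.getgroups() if g != primary]


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 time_chgrp.py FILE [GID]
    if len(sys.argv) < 2:
        sys.exit("usage: time_chgrp.py FILE [GID]")
    target = sys.argv[1]
    candidates = secondary_gids()
    if len(sys.argv) > 2:
        gid = int(sys.argv[2])
    elif candidates:
        gid = candidates[0]
    else:
        sys.exit("no secondary group available; pass a GID explicitly")
    print(f"chgrp of {target} to gid {gid} took {time_chgrp(target, gid):.3f} s")
```

Run as a non-root user against a file on the affected filesystem, with quota enforcement enabled, to compare against the numbers in this comment.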
            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33107/
            Subject: LU-11303 out: clean up osp_update_rpc_pack() macro
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3314af7c7b18bbd60e6a540105fd0ed6d7de6848

            People

              hongchao.zhang Hongchao Zhang
              scadmin SC Admin
              Votes: 2
              Watchers: 11
