[LU-11303] slow chgrp as user when quotas are enabled Created: 30/Aug/18  Updated: 18/Jan/22  Resolved: 25/Aug/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: SC Admin (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 2
Labels: None

Issue Links:
Related
is related to LU-12351 quota not enforced on chgrp Open
is related to LU-13176 rename() to another directory should ... Resolved
is related to LU-12826 Project quotas: users can change proj... Resolved
is related to LU-5152 Can't enforce block quota when unpriv... Resolved
is related to LU-7239 mdd_attr_set() synchronous when it ne... Resolved
is related to LU-11227 client process hangs when lod_sync ac... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

we have had a user complain that a chgrp of a directory tree of a few thousand files takes 3x longer than the untar of that data.

it seems likely that this is due to LU-5152, which AFAICT introduced code that forces a dt_sync for each chgrp done as an unprivileged user.

is there another way to do this which avoids the dt_sync?

in my experience most HPC sites use secondary (supplementary) groups extensively so that users can be members of several research projects. for various reasons this results in lots of files being created with the wrong group for the file's location. as root we periodically trawl the filesystem to correct the group ownership of files to match their physical location (i.e. poor man's directory/project quotas), but sometimes users still want to change the group ownership themselves to "do the right thing", and now this goes a lot slower for them.

so I suppose your expectation that unpriv users doing chgrp is rare is sort of valid because we do most of it manually and sporadically for them as root, but (again, in my experience) because of extensive use of supplementary groups in HPC, users wanting to do a chgrp is perhaps more common than you might think.

project quotas would remove most of our reasons for using chgrp but maybe not all. unfortunately we aren't likely to try any more new things like project quotas any time soon.

BTW it would be good to have Lustre test users that have secondary groups in order to find problems like this. I don't see any at the moment. I was looking because I need one to make a regression test case for LU-11227 (related to LU-5152).

cheers,
robin



 Comments   
Comment by Andreas Dilger [ 30/Aug/18 ]

Robin, do you also have quotas enabled on this filesystem?

Comment by SC Admin (Inactive) [ 30/Aug/18 ]

yes, the big dagg filesystem has group quotas enforced.

we have user quotas enforced on the /home Lustre filesystem. the other 2 small filesystems (/apps and /images) don't use quotas.

cheers,
robin

Comment by Peter Jones [ 30/Aug/18 ]

Hongchao

Can you please investigate?

Thanks

Peter

Comment by Gerrit Updater [ 04/Sep/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33107
Subject: LU-11303 out: clean up osp_update_rpc_pack() macro
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 95682e519ed2c4e630f4a0ab17265ab91653ed99

Comment by Gerrit Updater [ 21/Sep/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33107/
Subject: LU-11303 out: clean up osp_update_rpc_pack() macro
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3314af7c7b18bbd60e6a540105fd0ed6d7de6848

Comment by Peter Jones [ 21/Sep/18 ]

Landed for 2.12

Comment by Lukasz Flis [ 12/Oct/18 ]

We can confirm the same problem with 2.10.5 on the HPC system at CYFRONET.

quota enforcement: enabled

A single chgrp of a single file to a secondary group, executed by a non-root user, can take 10-140 seconds on a busy filesystem.

The chgrp command blocks in the fchownat syscall.
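For anyone wanting to put numbers on the per-file latency described here, a minimal timing sketch in Python follows. It is not taken from this report: `time_chgrp` is a hypothetical helper name, and it relies only on the standard `os.chown` call, passing a uid of -1 so that only the group changes.

```python
import os
import time


def time_chgrp(paths, gid):
    """Change the group of each path to gid and return (path, seconds) pairs.

    Hypothetical helper for illustration; run it as a non-root user against
    files on the affected filesystem, with gid being one of the caller's
    supplementary groups (see os.getgroups()).
    """
    timings = []
    for path in paths:
        start = time.perf_counter()
        # A uid of -1 leaves the owner unchanged; only the group is modified.
        os.chown(path, -1, gid)
        timings.append((path, time.perf_counter() - start))
    return timings
```

Comparing the timings for the same files as root and as an unprivileged user should show whether the slowdown is specific to the non-root path.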

 

@Peter Jones: do you plan to include a fix in the next b2_10 release (2.10.6)?

Comment by Lukasz Flis [ 12/Oct/18 ]

@adilger could you please comment on whether this patch solves the problem with slow chgrp introduced by LU-5152, or is it just a cosmetic cleanup to drop unknown opcodes in the RPC?

I have backported this patch (https://review.whamcloud.com/33107/) to b2_10, but I wanted to be sure it fixes the problem before we go to production with it.

Comment by Andreas Dilger [ 06/Nov/18 ]

The landed patch was just a code cleanup and did not address the issue in this ticket.

Comment by Andreas Dilger [ 06/Nov/18 ]

I see that patch https://review.whamcloud.com/16699 "LU-7239 mdd: make mdd_attr_set() synchronous less often" removes one source of sync operations on the MDS for chgrp, but does not address the dt_sync() call for chgrp to avoid over-quota on the OSS nodes.

It possibly makes sense to do a simple check if the user is close to exceeding the quotas before enforcing the sync behaviour (e.g. quota free > file size). If they are not close to the quota limit there is no need to enforce the sync behaviour.
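The headroom check suggested above could look roughly like the following. This is only an illustrative sketch in Python, not Lustre code and not a description of what any patch on this ticket does: `GroupQuota`, `needs_sync_on_chgrp` and `safety_margin` are hypothetical names, and treating a hard limit of 0 as "no limit set" is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class GroupQuota:
    """Hypothetical snapshot of a group's block quota, in bytes."""
    hard_limit: int  # assumption: 0 means no limit is set
    usage: int


def needs_sync_on_chgrp(quota: GroupQuota, file_size: int,
                        safety_margin: int = 0) -> bool:
    """Return True when the chgrp should take the slow, synchronous path.

    The sync is skipped only when the target group clearly has room for the
    file, i.e. quota free > file size (plus an optional margin).
    """
    if quota.hard_limit == 0:
        # No limit configured, so the file cannot push the group over quota.
        return False
    free = quota.hard_limit - quota.usage
    return free <= file_size + safety_margin
```

With a 100-byte limit and 10 bytes used, a 50-byte file would skip the sync; with 95 bytes used, the same file would still go through the synchronous path.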

Comment by Gerrit Updater [ 09/Jan/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33996
Subject: LU-11303 quota: enforce block quota for chgrp
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f8cf68fcb0432a9c293f678428e6f4ac6fa53c37

Comment by Gerrit Updater [ 19/Jan/19 ]

https://review.whamcloud.com/33996 has been updated

Comment by Gerrit Updater [ 25/Aug/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/33996/
Subject: LU-11303 quota: enforce block quota for chgrp
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 83f5544d8518ad12ea49e27829fff8f2739b86e2

Comment by Peter Jones [ 25/Aug/21 ]

Landed for 2.15
