[LU-17191] sanity-quota test_1b, 1d, 1f, 1i: FAIL: user write success, but expect EDQUOT Created: 13/Oct/23  Updated: 09/Nov/23  Resolved: 09/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Sergey Cheremencev Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16771 statfs_max_age not used with statfs()... Reopened
is related to LU-17046 sanity-quota test_1g: user write succ... Resolved
is related to LU-8130 Migrate from libcfs hash to rhashtable Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Tests sanity-quota 1b, 1d, 1f, and 1i regularly fail on my local VM on the latest master (d8d4df24c6924). No special steps are needed to reproduce the failures:

uname -a
Linux vm1 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Fri Jun 17 18:46:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
...
bash ./llmount.sh
ONLY=1 bash ./sanity-quota.sh
...
== sanity-quota test complete, duration 287 sec ========== 02:28:36 (1697149716)
sanity-quota: FAIL: test_1b user write success, but expect EDQUOT
sanity-quota: FAIL: test_1d user write success, but expect EDQUOT
sanity-quota: FAIL: test_1f user write success, but expect EDQUOT
sanity-quota: FAIL: test_1i user write success, but expect EDQUOT
=== sanity-quota: start cleanup 02:28:36 (1697149716) === 

At first glance the problem is on the client side: osc_quota_chkdq doesn't return EDQUOT even though the client received the appropriate flag from the server:

00000008:00000001:1.0:1697151056.504596:0:14647:0:(osc_request.c:2130:osc_brw_fini_request()) Process entered
00000008:04000000:1.0:1697151056.504599:0:14647:0:(osc_request.c:2153:osc_brw_fini_request()) setdq for [1000 1000 0] with valid 0x18000006b584fb9, flags 6100
00000001:00000001:1.0:1697151056.504604:0:14647:0:(osc_quota.c:92:osc_quota_setdq()) Process entered
00000001:00000001:1.0:1697151056.504609:0:14647:0:(osc_quota.c:166:osc_quota_setdq()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
00000008:00000001:1.0:1697151056.504614:0:14647:0:(osc_request.c:2185:osc_brw_fini_request()) Process leaving via out (rc=0 : 0 : 0x0) 
00000008:00000001:1.0:1697151056.504618:0:14647:0:(osc_request.c:2399:osc_brw_fini_request()) Process leaving (rc=0 : 0 : 0) 
...
00000001:00000001:3.0:1697151061.710836:0:2118:0:(osc_quota.c:40:osc_quota_chkdq()) Process entered
00000001:00000001:3.0:1697151061.710837:0:2118:0:(osc_quota.c:55:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0)
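
For readers decoding the trace above: the three fields in "rc=18446744073709551600 : -16 : fffffffffffffff0" are one return value shown as unsigned decimal, signed decimal, and hex, and -16 is -EBUSY on Linux. A minimal userspace sketch that reproduces the same string follows; format_rc is a hypothetical helper written for this note, not a Lustre function.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: render a 64-bit return code the way the trace
 * line shows it (unsigned decimal, signed decimal, hex). */
static void format_rc(int64_t rc, char *buf, size_t len)
{
	snprintf(buf, len, "rc=%" PRIu64 " : %" PRId64 " : %" PRIx64,
		 (uint64_t)rc, rc, (uint64_t)rc);
}
```

Calling format_rc(-EBUSY, buf, sizeof(buf)) yields the same text as the osc_quota_setdq "Process leaving" line, which confirms that the error reported there is -EBUSY.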
 

There is a -EBUSY error that, in my view, should be handled differently:

diff --git a/lustre/osc/osc_quota.c b/lustre/osc/osc_quota.c
index b127361..f06276e 100644
--- a/lustre/osc/osc_quota.c
+++ b/lustre/osc/osc_quota.c
@@ -129,6 +129,8 @@ int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
                        bits |= BIT(type);
                        rc = xa_insert(&cli->cl_quota_exceeded_ids, qid[type],
                                       xa_mk_value(bits), GFP_KERNEL);
+                       if (rc == -EBUSY)
+                               continue;
                        if (rc)
                                break; 

However, the above fix doesn't help in my case and the tests continue to fail. I guess xa_insert should return 0 here, and the fact that it doesn't is the problem.
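
To illustrate why skipping -EBUSY cannot be enough: in the kernel XArray API, xa_insert() refuses to replace an entry that is already present and returns -EBUSY, while xa_store() overwrites it. The sketch below is a userspace model of that difference, not the Lustre code; struct model_xa and the model_* functions are hypothetical stand-ins, and the "insert when absent, otherwise update" helper is only one possible way to handle it.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_SLOTS 2048

/* Userspace stand-in for an xarray: one optional value per index. */
struct model_xa {
	unsigned long value[MODEL_SLOTS];
	unsigned char present[MODEL_SLOTS];
};

/* Models xa_insert(): fails with -EBUSY if the index is occupied. */
static int model_insert(struct model_xa *xa, unsigned int idx,
			unsigned long val)
{
	if (idx >= MODEL_SLOTS)
		return -EINVAL;
	if (xa->present[idx])
		return -EBUSY;
	xa->present[idx] = 1;
	xa->value[idx] = val;
	return 0;
}

/* Models xa_store(): overwrites whatever is at the index. */
static int model_store(struct model_xa *xa, unsigned int idx,
		       unsigned long val)
{
	if (idx >= MODEL_SLOTS)
		return -EINVAL;
	xa->present[idx] = 1;
	xa->value[idx] = val;
	return 0;
}

/* Returns the stored value, or 0 when the index is empty. */
static unsigned long model_load(const struct model_xa *xa, unsigned int idx)
{
	return (idx < MODEL_SLOTS && xa->present[idx]) ? xa->value[idx] : 0;
}

/* Set one quota-type bit for an ID: insert when absent, else update. */
static int model_set_bit(struct model_xa *xa, unsigned int id, int type)
{
	unsigned long bits = model_load(xa, id);

	if (bits & (1UL << type))
		return 0;		/* already marked */
	bits |= 1UL << type;
	if (id < MODEL_SLOTS && xa->present[id])
		return model_store(xa, id, bits);
	return model_insert(xa, id, bits);
}
```

In this model, once index 1000 holds any value, a second model_insert() at 1000 returns -EBUSY and leaves the old value in place, so ignoring the error still loses the new bit; model_set_bit() records it.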

I reverted "LU-8130 osc: convert osc_quota hash to xarray" (ac8c28f959d87c) and the tests stopped failing.

simmonsja , can you take a look? I'll push a revert for ac8c28f959d, but if you can prepare a quick fix I will abandon my revert and help you move it forward.



 Comments   
Comment by Gerrit Updater [ 13/Oct/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52685
Subject: LU-17191 osc: osc_quota_setdq returns EBUSY
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8e12eb2c69f9b1495a6e324eee4a143696e6756f

Comment by James A Simmons [ 16/Oct/23 ]

Sure. I didn't notice this ticket until now. I see the issue. We found this while working on the NRS xarray patch, but this patch missed it. Testing now; I will have something soon.

Comment by Gerrit Updater [ 16/Oct/23 ]

"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52713
Subject: LU-17191 osc: handle xa_insert returing -EBUSY
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0b6fe9ffd7acca9281fa9df50c0754d371da6f01

Comment by Sergey Cheremencev [ 17/Oct/23 ]

Hi simmonsja ,

https://review.whamcloud.com/c/fs/lustre-release/+/52713 doesn't help. I tried a similar quick fix before pushing the revert. Yes, this problem should be fixed eventually, but the sanity-quota tests fail because of a different one. 52713 can't help because the tests fail on a wrong EDQUOT check for the user quota. The user type is the first one in the loop, so it doesn't matter whether we continue or break out of the loop.

From what I found, it fails for the following reason:
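
The point about loop order can be shown with a small userspace sketch. It assumes the usual type order (user = 0, group = 1, project = 2) and uses a hypothetical stub_insert() that fails with -EBUSY for selected types; none of these names come from the Lustre tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { USR_T = 0, GRP_T = 1, PRJ_T = 2, MAX_T = 3 };

/* Hypothetical stub: the insert fails with -EBUSY for types in busy_mask. */
static int stub_insert(unsigned int busy_mask, int type)
{
	return (busy_mask & (1U << type)) ? -EBUSY : 0;
}

/* Walk the quota types in order and return a mask of the types whose
 * insert succeeded. skip_ebusy selects "continue" instead of "break". */
static unsigned int mark_types(unsigned int busy_mask, bool skip_ebusy)
{
	unsigned int marked = 0;
	int type;

	for (type = 0; type < MAX_T; type++) {
		int rc = stub_insert(busy_mask, type);

		if (rc == -EBUSY && skip_ebusy)
			continue;
		if (rc)
			break;
		marked |= 1U << type;
	}
	return marked;
}
```

With only the user insert failing, mark_types(1U << USR_T, true) marks group and project while mark_types(1U << USR_T, false) marks nothing, but the user bit is clear in both results, which is the case the failing tests exercise.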

00000001:00000001:1.0:1697151056.504609:0:14647:0:(osc_quota.c:166:osc_quota_setdq()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
...
00000001:00000001:3.0:1697151061.710836:0:2118:0:(osc_quota.c:40:osc_quota_chkdq()) Process entered
00000001:00000001:3.0:1697151061.710837:0:2118:0:(osc_quota.c:55:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0) 

osc_quota_chkdq returns 0 instead of -EDQUOT. At the same time, osc_quota_setdq can't add a new index to the Xarray.

Comment by James A Simmons [ 19/Oct/23 ]

Sorry, I really needed to work on the RCU stall issues first, before this could even be looked at. I'm taking another look at this patch.

Comment by Andreas Dilger [ 04/Nov/23 ]

Hmm, the most recent patch still fails Janitor for sanity-quota test_1g.

Stephane had an interesting issue with project quota on the client (LU-16771), which suggests that caching the (project) quota results on the client for a few seconds would improve performance for applications that are statfs() intensive when project quotas are in use.

I wonder if it makes sense to change the current xarray implementation for the quota to be able to cache at least the project quota information (usage/limit), but potentially also user/group quota, to avoid frequent RPCs.  The slight drawback is that the quota tests would probably need to disable this cache, but that could easily be done by setting "llite.*.statfs_max_age=0" or =1.

It might make sense to change to an rhashtable at that point, since the Xarray implementation continues to have problems.  Alternately, we could store the project (+user+group?) quota as the Xarray value and store the "over quota" state as a mark on the Xarray entry?
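
A rough sketch of what such a cached entry and its freshness check might look like, purely to make the idea concrete. The structure, field names, and the rule that a max_age of 0 disables caching are assumptions of this sketch, not existing Lustre code or the defined semantics of statfs_max_age.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cached quota record for one ID; not a Lustre structure. */
struct quota_cache_entry {
	uint64_t usage;		/* last usage reported by the server */
	uint64_t limit;		/* last limit reported by the server */
	uint64_t fetched_at;	/* seconds, from any monotonic clock */
	bool over_quota;	/* stands in for a mark on the Xarray entry */
};

/* True when the cached record may be used without a new RPC.
 * Assumption: max_age == 0 disables the cache entirely. */
static bool quota_cache_fresh(const struct quota_cache_entry *e,
			      uint64_t now, uint64_t max_age)
{
	if (max_age == 0)
		return false;
	if (now < e->fetched_at)	/* clock went backwards: refetch */
		return false;
	return now - e->fetched_at < max_age;
}
```

Under these assumptions the quota tests could force a fresh RPC on every check by running with max_age set to 0.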

Thoughts?

Comment by James A Simmons [ 04/Nov/23 ]

The failure of sanity-quota 1g is LU-17046 which was reported before the Xarray patch landed.  So I wouldn't count out the Xarray work.

Comment by Sergey Cheremencev [ 06/Nov/23 ]

Hi simmonsja , if you say that sanity-quota_1g fails due to LU-17046 can you give any details?

You have probably looked into the logs or found why it fails. Ideally this would be a link to the failure and a couple of words. It could save us time in finally fixing LU-17046.

Thanks.

Comment by James A Simmons [ 06/Nov/23 ]

I found the source of the failures. sanity-quota_1g is failing due to ("LU-13810 tests: increase limit for 1g").

Comment by Gerrit Updater [ 08/Nov/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52713/
Subject: LU-17191 osc: only call xa_insert for new entries
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 67e0d9e40acc6adcebf89e2a4ac3860f0c4273d2

Comment by Peter Jones [ 09/Nov/23 ]

Merged for 2.16

Generated at Sat Feb 10 03:33:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.