[LU-17191] sanity-quota test_1b, 1d, 1f, 1i: FAIL: user write success, but expect EDQUOT Created: 13/Oct/23 Updated: 09/Nov/23 Resolved: 09/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sergey Cheremencev | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Tests sanity-quota 1b, 1d, 1f, 1i regularly fail on my local VM on the latest master (d8d4df24c6924). Nothing specific needs to be done to reproduce it:

uname -a
Linux vm1 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Fri Jun 17 18:46:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
...
bash ./llmount.sh
ONLY=1 bash ./sanity-quota.sh
...
== sanity-quota test complete, duration 287 sec ========== 02:28:36 (1697149716)
sanity-quota: FAIL: test_1b user write success, but expect EDQUOT
sanity-quota: FAIL: test_1d user write success, but expect EDQUOT
sanity-quota: FAIL: test_1f user write success, but expect EDQUOT
sanity-quota: FAIL: test_1i user write success, but expect EDQUOT
=== sanity-quota: start cleanup 02:28:36 (1697149716) ===

At first look the problem comes from the client side: osc_quota_chkdq doesn't return EDQUOT even though the client got the appropriate flag from the server:

00000008:00000001:1.0:1697151056.504596:0:14647:0:(osc_request.c:2130:osc_brw_fini_request()) Process entered
00000008:04000000:1.0:1697151056.504599:0:14647:0:(osc_request.c:2153:osc_brw_fini_request()) setdq for [1000 1000 0] with valid 0x18000006b584fb9, flags 6100
00000001:00000001:1.0:1697151056.504604:0:14647:0:(osc_quota.c:92:osc_quota_setdq()) Process entered
00000001:00000001:1.0:1697151056.504609:0:14647:0:(osc_quota.c:166:osc_quota_setdq()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
00000008:00000001:1.0:1697151056.504614:0:14647:0:(osc_request.c:2185:osc_brw_fini_request()) Process leaving via out (rc=0 : 0 : 0x0)
00000008:00000001:1.0:1697151056.504618:0:14647:0:(osc_request.c:2399:osc_brw_fini_request()) Process leaving (rc=0 : 0 : 0)
...
00000001:00000001:3.0:1697151061.710836:0:2118:0:(osc_quota.c:40:osc_quota_chkdq()) Process entered
00000001:00000001:3.0:1697151061.710837:0:2118:0:(osc_quota.c:55:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0)

There is a -EBUSY error that, from my point of view, should be handled differently:

diff --git a/lustre/osc/osc_quota.c b/lustre/osc/osc_quota.c
index b127361..f06276e 100644
--- a/lustre/osc/osc_quota.c
+++ b/lustre/osc/osc_quota.c
@@ -129,6 +129,8 @@ int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
 	bits |= BIT(type);
 	rc = xa_insert(&cli->cl_quota_exceeded_ids, qid[type],
 		       xa_mk_value(bits), GFP_KERNEL);
+	if (rc == -EBUSY)
+		continue;
 	if (rc)
 		break;

However, the above fix doesn't help in my case and the tests continue to fail. I guess xa_insert should return 0 and this is the problem. I tried reverting "LU-8130 osc: convert osc_quota hash to xarray" (ac8c28f959d87c) and the tests stopped failing.

simmonsja, can you take a look? I'll push a revert for ac8c28f959d, but if you can prepare a quick fix I will abandon my revert and help you to move on with that. |
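As background on the -16 (-EBUSY) return code in the log: the kernel's xa_insert() stores an entry only when the index is empty and returns -EBUSY when something is already there, while xa_store() replaces the entry and hands back the old one. The user-space toy below models only that difference. It is not Lustre or kernel code, and every toy_* name and the table size are made up for illustration.

```c
#include <assert.h>
#include <errno.h>

#define TOY_SLOTS 4096	/* hypothetical table size for this sketch */

/* Toy stand-in for an xarray: slot[id] == 0 means "no entry". */
struct toy_xa {
	unsigned long slot[TOY_SLOTS];
};

/* Models xa_insert(): store only if the index is empty, else -EBUSY. */
int toy_insert(struct toy_xa *xa, unsigned long id, unsigned long val)
{
	assert(id < TOY_SLOTS && val != 0);
	if (xa->slot[id] != 0)
		return -EBUSY;
	xa->slot[id] = val;
	return 0;
}

/* Models xa_store(): overwrite the entry and return the previous one. */
unsigned long toy_store(struct toy_xa *xa, unsigned long id, unsigned long val)
{
	unsigned long old;

	assert(id < TOY_SLOTS && val != 0);
	old = xa->slot[id];
	xa->slot[id] = val;
	return old;
}

/* Models xa_load(): 0 when nothing is stored at the index. */
unsigned long toy_load(const struct toy_xa *xa, unsigned long id)
{
	assert(id < TOY_SLOTS);
	return xa->slot[id];
}
```

In this model a second toy_insert() for the same ID fails with -EBUSY and leaves the first value in place, so a caller that wants to change the stored bits has to use the store path instead.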
| Comments |
| Comment by Gerrit Updater [ 13/Oct/23 ] |
|
"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52685 |
| Comment by James A Simmons [ 16/Oct/23 ] |
|
Sure. I didn't notice this ticket until now. I see the issue. We found this while working on the NRS xarray patch but this patch missed it. Testing now and will have something soon. |
| Comment by Gerrit Updater [ 16/Oct/23 ] |
|
"James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52713 |
| Comment by Sergey Cheremencev [ 17/Oct/23 ] |
|
Hi simmonsja, https://review.whamcloud.com/c/fs/lustre-release/+/52713 doesn't help. I also tried a similar quick fix before pushing the revert. Yes, this problem should be fixed in the future, but the sanity-quota tests fail because of a different one. The reason 52713 can't help is that the tests fail due to a wrong EDQUOT check for the user quota. User is the first type in the loop, so it doesn't matter whether we continue or break out of the loop. From my findings, it fails for the following reason:

00000001:00000001:1.0:1697151056.504609:0:14647:0:(osc_quota.c:166:osc_quota_setdq()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
...
00000001:00000001:3.0:1697151061.710836:0:2118:0:(osc_quota.c:40:osc_quota_chkdq()) Process entered
00000001:00000001:3.0:1697151061.710837:0:2118:0:(osc_quota.c:55:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0)

osc_quota_chkdq returns 0 instead of -EDQUOT. At the same time osc_quota_setdq can't add a new index into the Xarray. |
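The continue-versus-break argument above can be sketched with a simplified model of the per-type loop. This is an illustration only, not the osc_quota_setdq() code: toy_setdq(), the busy[] input, and the recorded[] output are hypothetical, and the insert is reduced to "fails with -EBUSY for the flagged types".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { TOY_USR, TOY_GRP, TOY_PRJ, TOY_MAXQUOTAS };

/*
 * Walk the quota types in order. busy[type] says the pretend insert for
 * that type returns -EBUSY. With skip_busy set, -EBUSY is skipped with
 * "continue" (as in the proposed diff); otherwise any error breaks out.
 * recorded[type] is set only when the insert for that type succeeded.
 */
int toy_setdq(const bool busy[TOY_MAXQUOTAS], bool skip_busy,
	      bool recorded[TOY_MAXQUOTAS])
{
	int rc = 0;

	assert(busy && recorded);
	for (int type = 0; type < TOY_MAXQUOTAS; type++) {
		rc = busy[type] ? -EBUSY : 0;
		if (rc == -EBUSY && skip_busy)
			continue;
		if (rc)
			break;
		recorded[type] = true;
	}
	return rc;
}
```

When only the user type is busy, the two variants differ for group and project, but the user ID is left unrecorded either way, which matches the observation that the user-quota tests keep failing with or without the continue.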
| Comment by James A Simmons [ 19/Oct/23 ] |
|
Sorry, I really needed to work on the RCU stall issues first, before this could even be looked at. I'm taking another look at this patch. |
| Comment by Andreas Dilger [ 04/Nov/23 ] |
|
Hmm, the most recent patch still fails Janitor for sanity-quota test_1g. Stephane had an interesting issue with project quota on the client (LU-16771), which suggests that being able to cache the (project) quota results on the client for a few seconds would improve performance for applications that are statfs()-intensive when project quotas are in use. I wonder if it makes sense to change the current xarray implementation for the quota to be able to cache at least the project quota information (usage/limit), but potentially also user/group quota, to avoid frequent RPCs. The slight drawback is that the quota tests would probably need to disable this cache, but that could easily be done by setting "llite.*.statfs_max_age=0" or =1. It might make sense to change to an rhashtable at that point, since the Xarray implementation continues to have problems. Alternatively, we could store the project (+user+group?) quota as the Xarray value and store the "over quota" state as a mark on the Xarray entry? Thoughts? |
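A rough sketch of the caching idea floated above, under several assumptions: the entry layout, the toy_* names, the "usage >= limit" over-quota rule, and treating a max age of 0 as "cache disabled" are all invented for illustration and do not describe any existing Lustre structure or tunable behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ID cache entry; not a Lustre structure. */
struct toy_quota_entry {
	uint64_t usage;		/* cached usage */
	uint64_t limit;		/* cached limit, 0 means no limit */
	int64_t  fetched;	/* time of the last refresh, in seconds */
	bool     over_quota;	/* "over quota" state kept beside the value */
};

/* Refresh the entry from freshly fetched numbers. */
void toy_quota_update(struct toy_quota_entry *e, uint64_t usage,
		      uint64_t limit, int64_t now)
{
	assert(e);
	e->usage = usage;
	e->limit = limit;
	e->fetched = now;
	/* Assumed rule for this sketch: over quota once usage reaches limit. */
	e->over_quota = limit != 0 && usage >= limit;
}

/* True when the cached numbers may be reused without a new RPC. */
bool toy_quota_fresh(const struct toy_quota_entry *e, int64_t now,
		     int64_t max_age)
{
	assert(e);
	if (max_age <= 0)	/* assumed: 0 disables the cache, e.g. for tests */
		return false;
	return now >= e->fetched && now - e->fetched <= max_age;
}
```

A caller would check toy_quota_fresh() first and only fall back to fetching new numbers (followed by toy_quota_update()) when it returns false.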
| Comment by James A Simmons [ 04/Nov/23 ] |
|
The failure of sanity-quota 1g is |
| Comment by Sergey Cheremencev [ 06/Nov/23 ] |
|
Hi simmonsja , if you say that sanity-quota_1g fails due to You probably looked into the logs or found why it fails. Ideally this should be the link to the failure and a couple of words. It could save our time to finally fix Thanks. |
| Comment by James A Simmons [ 06/Nov/23 ] |
|
I found the source of the failures. sanity-quota_1g is failing due to (" |
| Comment by Gerrit Updater [ 08/Nov/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52713/ |
| Comment by Peter Jones [ 09/Nov/23 ] |
|
Merged for 2.16 |