Details
Bug
Resolution: Fixed
Minor
Lustre 2.16.0
Description
Tests sanity-quota 1b, 1d, 1f, and 1i regularly fail on my local VM on the latest master (d8d4df24c6924). No special steps are needed to reproduce the failures:
uname -a
Linux vm1 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Fri Jun 17 18:46:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
...
bash ./llmount.sh
ONLY=1 bash ./sanity-quota.sh
...
== sanity-quota test complete, duration 287 sec ========== 02:28:36 (1697149716)
sanity-quota: FAIL: test_1b user write success, but expect EDQUOT
sanity-quota: FAIL: test_1d user write success, but expect EDQUOT
sanity-quota: FAIL: test_1f user write success, but expect EDQUOT
sanity-quota: FAIL: test_1i user write success, but expect EDQUOT
=== sanity-quota: start cleanup 02:28:36 (1697149716) ===
At first glance the problem is on the client side: osc_quota_chkdq doesn't return EDQUOT even though the client received the appropriate flag from the server:
00000008:00000001:1.0:1697151056.504596:0:14647:0:(osc_request.c:2130:osc_brw_fini_request()) Process entered
00000008:04000000:1.0:1697151056.504599:0:14647:0:(osc_request.c:2153:osc_brw_fini_request()) setdq for [1000 1000 0] with valid 0x18000006b584fb9, flags 6100
00000001:00000001:1.0:1697151056.504604:0:14647:0:(osc_quota.c:92:osc_quota_setdq()) Process entered
00000001:00000001:1.0:1697151056.504609:0:14647:0:(osc_quota.c:166:osc_quota_setdq()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
00000008:00000001:1.0:1697151056.504614:0:14647:0:(osc_request.c:2185:osc_brw_fini_request()) Process leaving via out (rc=0 : 0 : 0x0)
00000008:00000001:1.0:1697151056.504618:0:14647:0:(osc_request.c:2399:osc_brw_fini_request()) Process leaving (rc=0 : 0 : 0)
...
00000001:00000001:3.0:1697151061.710836:0:2118:0:(osc_quota.c:40:osc_quota_chkdq()) Process entered
00000001:00000001:3.0:1697151061.710837:0:2118:0:(osc_quota.c:55:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0)
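For readers unfamiliar with the "rc=" triple in the trace, the three fields appear to be the same return code printed as unsigned decimal, signed decimal, and hex. The following minimal sketch (the helper name `rc_as_u64` is hypothetical, not a Lustre function) checks that the value returned by osc_quota_setdq() corresponds to -16, which is -EBUSY on Linux:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: reinterpret a negative errno-style return code
 * as the unsigned 64-bit value shown first in the "rc=..." triple.
 * Signed-to-unsigned conversion is well defined in C (modulo 2^64).
 */
static uint64_t rc_as_u64(int64_t rc)
{
	return (uint64_t)rc;
}
```

With this helper, `rc_as_u64(-EBUSY)` gives 18446744073709551600 (0xfffffffffffffff0), matching the osc_quota_setdq() line in the trace.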
The trace shows osc_quota_setdq() returning -EBUSY (rc=-16), which in my view should be handled differently:
diff --git a/lustre/osc/osc_quota.c b/lustre/osc/osc_quota.c
index b127361..f06276e 100644
--- a/lustre/osc/osc_quota.c
+++ b/lustre/osc/osc_quota.c
@@ -129,6 +129,8 @@ int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
 			bits |= BIT(type);
 			rc = xa_insert(&cli->cl_quota_exceeded_ids, qid[type],
 				       xa_mk_value(bits), GFP_KERNEL);
+			if (rc == -EBUSY)
+				continue;
 			if (rc)
 				break;
However, the above fix doesn't help in my case and the tests continue to fail. I guess xa_insert is expected to return 0 here, and the fact that it doesn't is the problem.
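One possible reason why skipping -EBUSY is not enough can be sketched in userspace. In the kernel, xa_insert() fails with -EBUSY when an entry already exists at the index, while xa_store() overwrites it. The trace shows qid [1000 1000 0], so the user and group IDs map to the same index. The toy model below is only an illustration under that assumption: `toy_xa`, `toy_insert`, `toy_store`, `toy_load`, `mark_insert_only`, and `mark_store` are hypothetical names, and the real osc_quota_setdq() logic may differ.

```c
#include <errno.h>

/* Hypothetical stand-in for an xarray: a tiny index -> value map. */
#define TOY_SLOTS 16

struct toy_xa {
	unsigned long index[TOY_SLOTS];
	unsigned long value[TOY_SLOTS];
	int used[TOY_SLOTS];
};

/* Return the slot holding @index, or -1 if it is absent. */
static int toy_find(const struct toy_xa *xa, unsigned long index)
{
	for (int i = 0; i < TOY_SLOTS; i++)
		if (xa->used[i] && xa->index[i] == index)
			return i;
	return -1;
}

/* Lookup; 0 means "no bits recorded for this ID". */
static unsigned long toy_load(const struct toy_xa *xa, unsigned long index)
{
	int i = toy_find(xa, index);

	return i >= 0 ? xa->value[i] : 0;
}

/* Insert-or-overwrite, modelling xa_store() semantics. */
static int toy_store(struct toy_xa *xa, unsigned long index,
		     unsigned long value)
{
	int i = toy_find(xa, index);

	for (int j = 0; i < 0 && j < TOY_SLOTS; j++)
		if (!xa->used[j])
			i = j;	/* first free slot */
	if (i < 0)
		return -ENOMEM;
	xa->used[i] = 1;
	xa->index[i] = index;
	xa->value[i] = value;
	return 0;
}

/* Insert-only, modelling xa_insert(): -EBUSY if @index already exists. */
static int toy_insert(struct toy_xa *xa, unsigned long index,
		      unsigned long value)
{
	if (toy_find(xa, index) >= 0)
		return -EBUSY;
	return toy_store(xa, index, value);
}

/* Mark quota @type exceeded for @id, ignoring -EBUSY as in the patch. */
static int mark_insert_only(struct toy_xa *xa, unsigned long id, int type)
{
	unsigned long bits = toy_load(xa, id) | (1UL << type);
	int rc = toy_insert(xa, id, bits);

	return rc == -EBUSY ? 0 : rc;	/* the new bit is silently dropped */
}

/* Same, but overwrite the existing entry so the new bit is kept. */
static int mark_store(struct toy_xa *xa, unsigned long id, int type)
{
	unsigned long bits = toy_load(xa, id) | (1UL << type);

	return toy_store(xa, id, bits);
}
```

In this model, marking type 0 and then type 1 for ID 1000 with `mark_insert_only` leaves only bit 0 set, whereas `mark_store` ends with both bits set. Whether this is what actually happens in the failing tests is not confirmed here.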
I tried reverting "LU-8130 osc: convert osc_quota hash to xarray" (ac8c28f959d87c), and the tests stopped failing.
simmonsja, can you take a look? I'll push a revert for ac8c28f959d, but if you can prepare a quick fix, I will abandon my revert and help you move forward with it.