
sanity-quota test_1b, 1d, 1f, 1i: FAIL: user write success, but expect EDQUOT

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      Tests sanity-quota 1b, 1d, 1f, 1i regularly fail on my local VM on the latest master (d8d4df24c6924). No special steps are needed to reproduce it:

      uname -a
      Linux vm1 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Fri Jun 17 18:46:08 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
      ...
      bash ./llmount.sh
      ONLY=1 bash ./sanity-quota.sh
      ...
      == sanity-quota test complete, duration 287 sec ========== 02:28:36 (1697149716)
      sanity-quota: FAIL: test_1b user write success, but expect EDQUOT
      sanity-quota: FAIL: test_1d user write success, but expect EDQUOT
      sanity-quota: FAIL: test_1f user write success, but expect EDQUOT
      sanity-quota: FAIL: test_1i user write success, but expect EDQUOT
      === sanity-quota: start cleanup 02:28:36 (1697149716) === 

      At first glance the problem is on the client side: osc_quota_chkdq doesn't return EDQUOT even though the client got the appropriate flag from the server:

      00000008:00000001:1.0:1697151056.504596:0:14647:0:(osc_request.c:2130:osc_brw_fini_request()) Process entered
      00000008:04000000:1.0:1697151056.504599:0:14647:0:(osc_request.c:2153:osc_brw_fini_request()) setdq for [1000 1000 0] with valid 0x18000006b584fb9, flags 6100
      00000001:00000001:1.0:1697151056.504604:0:14647:0:(osc_quota.c:92:osc_quota_setdq()) Process entered
      00000001:00000001:1.0:1697151056.504609:0:14647:0:(osc_quota.c:166:osc_quota_setdq()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
      00000008:00000001:1.0:1697151056.504614:0:14647:0:(osc_request.c:2185:osc_brw_fini_request()) Process leaving via out (rc=0 : 0 : 0x0) 
      00000008:00000001:1.0:1697151056.504618:0:14647:0:(osc_request.c:2399:osc_brw_fini_request()) Process leaving (rc=0 : 0 : 0) 
      ...
      00000001:00000001:3.0:1697151061.710836:0:2118:0:(osc_quota.c:40:osc_quota_chkdq()) Process entered
      00000001:00000001:3.0:1697151061.710837:0:2118:0:(osc_quota.c:55:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0)
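
The `rc=18446744073709551600 : -16 : fffffffffffffff0` field in the osc_quota_setdq() exit line prints the same return code three ways: as an unsigned 64-bit integer, as a signed integer, and in hex. A minimal sketch of that conversion, assuming a Linux errno.h where EBUSY is 16; decode_rc is a hypothetical helper name for illustration, not a Lustre function:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Reinterpret the unsigned 64-bit value printed in the debug trace as the
 * signed return code it represents. This relies on two's complement
 * representation, which holds on the platforms Lustre runs on.
 */
static int64_t decode_rc(uint64_t raw)
{
	return (int64_t)raw;
}
```

Calling decode_rc(18446744073709551600ULL) yields -16, which is -EBUSY on Linux, while the osc_quota_chkdq() line that follows reports a plain 0.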
       

      There is an -EBUSY error that, in my view, should be handled differently:

      diff --git a/lustre/osc/osc_quota.c b/lustre/osc/osc_quota.c
      index b127361..f06276e 100644
      --- a/lustre/osc/osc_quota.c
      +++ b/lustre/osc/osc_quota.c
      @@ -129,6 +129,8 @@ int osc_quota_setdq(struct client_obd *cli, u64 xid, const unsigned int qid[],
                              bits |= BIT(type);
                              rc = xa_insert(&cli->cl_quota_exceeded_ids, qid[type],
                                             xa_mk_value(bits), GFP_KERNEL);
      +                       if (rc == -EBUSY)
      +                               continue;
                              if (rc)
                                      break; 

      However, the above fix doesn't help in my case and the tests continue to fail. I guess xa_insert should return 0 here, and the fact that it doesn't is the problem.
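
For context, the kernel's xa_insert() refuses to replace an entry that is already present and returns -EBUSY, whereas xa_store() overwrites it. The userspace toy below models only that difference; it is a sketch, not the Lustre code. All toy_* names and TOY_SLOTS are hypothetical. It assumes, as in the diff above, a table keyed by quota id whose value is a bitmask of exceeded quota types, and it simplifies by treating a zero slot as "no entry". The id 1000 shared by user and group is taken from the `[1000 1000 0]` trace line. It shows one way -EBUSY can arise, without claiming this is the root cause of the test failures.

```c
#include <errno.h>

#define TOY_SLOTS 4096	/* hypothetical capacity, large enough for id 1000 */

/* Toy stand-in for an xarray: a slot value of 0 means "no entry". */
struct toy_xa {
	unsigned long slot[TOY_SLOTS];
};

/* Models xa_insert(): fails with -EBUSY if the index is already occupied. */
static int toy_insert(struct toy_xa *xa, unsigned int idx, unsigned long val)
{
	if (idx >= TOY_SLOTS)
		return -EINVAL;
	if (xa->slot[idx] != 0)
		return -EBUSY;
	xa->slot[idx] = val;
	return 0;
}

/* Models xa_store(): overwrites whatever is stored at the index. */
static int toy_store(struct toy_xa *xa, unsigned int idx, unsigned long val)
{
	if (idx >= TOY_SLOTS)
		return -EINVAL;
	xa->slot[idx] = val;
	return 0;
}

/*
 * Mark quota type `type` as exceeded for `id`, merging with any bits that
 * are already recorded. `use_store` selects which primitive writes the
 * merged bitmask back.
 */
static int toy_mark_exceeded(struct toy_xa *xa, unsigned int id, int type,
			     int use_store)
{
	unsigned long bits;

	if (id >= TOY_SLOTS)
		return -EINVAL;
	bits = xa->slot[id] | (1UL << type);
	return use_store ? toy_store(xa, id, bits) : toy_insert(xa, id, bits);
}

/* Returns 1 if `type` is recorded as exceeded for `id`, else 0. */
static int toy_is_exceeded(const struct toy_xa *xa, unsigned int id, int type)
{
	if (id >= TOY_SLOTS)
		return 0;
	return (int)((xa->slot[id] >> type) & 1UL);
}
```

In this model, marking type 0 for id 1000 through the insert path succeeds, but marking type 1 for the same id through the insert path returns -EBUSY and leaves the type 1 bit unset. The store path records both bits.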

      I tried reverting "LU-8130 osc: convert osc_quota hash to xarray" (ac8c28f959d87c) and the tests stopped failing.

      simmonsja, can you take a look? I'll push a revert for ac8c28f959d, but if you can prepare a quick fix I will abandon my revert and help you move forward with it.


          Activity

            [LU-17191] sanity-quota test_1b, 1d, 1f, 1i: FAIL: user write success, but expect EDQUOT
            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52713/
            Subject: LU-17191 osc: only call xa_insert for new entries
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 67e0d9e40acc6adcebf89e2a4ac3860f0c4273d2

            simmonsja James A Simmons added a comment -

            I found the source of the failures. sanity-quota_1g is failing due to ("LU-13810 tests: increase limit for 1g").

            scherementsev Sergey Cheremencev added a comment -

            Hi simmonsja, if you say that sanity-quota_1g fails due to LU-17046, can you give any details?

            You have probably looked into the logs or found out why it fails. Ideally this would be a link to the failure plus a few words of explanation. It could save us time in finally fixing LU-17046.

            Thanks.

            simmonsja James A Simmons added a comment -

            The failure of sanity-quota 1g is LU-17046, which was reported before the Xarray patch landed. So I wouldn't count out the Xarray work.

            adilger Andreas Dilger added a comment -

            Hmm, the most recent patch still fails Janitor for sanity-quota test_1g.

            Stephane had an interesting issue with project quota on the client (LU-16771), which suggests that caching the (project) quota results on the client for a few seconds would improve performance for applications that are statfs() intensive when project quotas are in use.

            I wonder if it makes sense to change the current xarray implementation for the quota so that it can cache at least the project quota information (usage/limit), and potentially also user/group quota, to avoid frequent RPCs. The slight drawback is that the quota tests would probably need to disable this cache, but that could easily be done by setting "llite.*.statfs_max_age=0" or =1.

            It might make sense to change to an rhashtable at that point, since the Xarray implementation continues to have problems. Alternatively, we could store the project (+user+group?) quota as the Xarray value and store the "over quota" state as a mark on the Xarray entry?

            Thoughts?

            simmonsja James A Simmons added a comment -

            Sorry, I really needed to work on the RCU stall issues first, before this could even be looked at. I'm taking another look at this patch.

            scherementsev Sergey Cheremencev added a comment -

            Hi simmonsja,

            https://review.whamcloud.com/c/fs/lustre-release/+/52713 doesn't help. I also tried a similar quick fix before pushing the revert. Yes, this problem should be fixed eventually, but the sanity-quota tests fail because of a different one. The reason 52713 is not expected to help is that the tests fail due to a wrong EDQUOT check for User. User is the first type handled in the loop, so it doesn't matter whether we continue or break out of the loop.

            From my findings, it fails for the following reason:

            00000001:00000001:1.0:1697151056.504609:0:14647:0:(osc_quota.c:166:osc_quota_setdq()) Process leaving (rc=18446744073709551600 : -16 : fffffffffffffff0)
            ...
            00000001:00000001:3.0:1697151061.710836:0:2118:0:(osc_quota.c:40:osc_quota_chkdq()) Process entered
            00000001:00000001:3.0:1697151061.710837:0:2118:0:(osc_quota.c:55:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0)

            osc_quota_chkdq returns 0 instead of -EDQUOT. At the same time, osc_quota_setdq can't add a new index to the Xarray.

            gerrit Gerrit Updater added a comment -

            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52713
            Subject: LU-17191 osc: handle xa_insert returing -EBUSY
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0b6fe9ffd7acca9281fa9df50c0754d371da6f01

            simmonsja James A Simmons added a comment - edited

            Sure. I didn't notice this ticket until now. I see the issue. We found this while working on the NRS xarray patch, but this patch missed it. Testing now and will have something soon.

            People

              Assignee: simmonsja James A Simmons
              Reporter: scherementsev Sergey Cheremencev