  Lustre / LU-17046

sanity-quota test_1g: user write success, but expect EDQUOT

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Severity: 3

    Description

      This issue was created by maloo for Feng Lei <flei@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/199b4874-fb26-4062-b506-70ca71268b02

      test_1g failed with the following error:

      user write success, but expect EDQUOT
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/97057 - 4.18.0-477.15.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/97057 - 4.18.0-477.15.1.el8_lustre.x86_64

      <<Please provide additional information about the failure here>>

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-quota test_1g - user write success, but expect EDQUOT

      Attachments

        Issue Links

          Activity

            [LU-17046] sanity-quota test_1g: user write success, but expect EDQUOT

            At first glance, a client can't add more than 256 pages in osc_enter_cache with max_dirty_mb == 1, but according to the logs it sometimes happens. I'm going to change sanity-quota_1g to write $((OSTCOUNT*5)) MB instead of $((OSTCOUNT*3)) MB in https://review.whamcloud.com/c/fs/lustre-release/+/54208.

            scherementsev Sergey Cheremencev added a comment
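            A minimal bash sketch of what that write-size change amounts to. The variable names (OSTCOUNT, testfile), the quota_usr account, and the plain runuser/dd invocation are assumptions for illustration, not the actual patch:

            OSTCOUNT=${OSTCOUNT:-2}
            testfile=/mnt/lustre/d1g.sanity-quota/f1g.sanity-quota-0

            # Old amount: 3 MB per OST; proposed amount: 5 MB per OST, so the
            # write overshoots the block limit by a wider margin.
            new_mb=$((OSTCOUNT * 5))

            # The write is expected to fail with EDQUOT; if it completes,
            # the test reports "user write success, but expect EDQUOT".
            if runuser -u quota_usr -- dd if=/dev/zero of=$testfile bs=1M count=$new_mb; then
                echo "user write success, but expect EDQUOT"
                exit 1
            fi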

            From https://testing.whamcloud.com/test_sets/324929ef-a971-44cc-a217-c4a32fc464f0

            == SERVER SIDE == 
            00040000:04000000:1.0:1732203628.134190:0:15186:0:(qsd_entry.c:236:qsd_refresh_usage()) $$$ disk usage: 4096  qsd:lustre-OST0005 qtype:usr id:60000 enforced:1 granted: 5120 pending:0 waiting:0 req:0 usage: 4096 qunit:1024 qtune:512 edquot:1 default:no revoke:0
            == CLIENT SIDE ==
            00000001:00000001:1.0:1732203630.871117:0:56995:0:(osc_quota.c:25:osc_quota_chkdq()) Process entered
            00000001:00000001:1.0:1732203630.871118:0:56995:0:(osc_quota.c:40:osc_quota_chkdq()) Process leaving (rc=0 : 0 : 0) 
            == SERVER SIDE ==
            00040000:04000000:0.0:1732203630.884073:0:15191:0:(qsd_handler.c:744:qsd_op_begin0()) $$$ op_begin space:4160  qsd:lustre-OST0005 qtype:usr id:60000 enforced:1 granted: 5120 pending:0 waiting:0 req:0 usage: 4096 qunit:1024 qtune:512 edquot:1 default:no revoke:0
            ...
            00040000:04000000:0.0:1732203630.884539:0:15191:0:(qsd_handler.c:327:qsd_req_completion()) $$$ DQACQ returned -122, flags:0x1  qsd:lustre-OST0005 qtype:usr id:60000 enforced:1 granted: 5120 pending:0 waiting:4160 req:1 usage: 4096 qunit:1024 qtune:512 edquot:1 default:no revoke:0
            == CLIENT SIDE ==
            00000001:04000000:1.0:1732203630.888326:0:7021:0:(osc_quota.c:132:osc_quota_setdq()) lustre-OST0005-osc-ffff9c5b02677800: setdq to insert for usr 60000: rc = 0

            From the above logs it is clear that everything is fine with EDQUOT on the server side: the flag had been set at 1732203628. The latest write is 4096 KB. I think the problem is that the client sent a 4 MB BRW at one time despite max_dirty_mb == 1.

            scherementsev Sergey Cheremencev added a comment
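            For reference, max_dirty_mb is the per-OSC cap on dirty pages that the analysis above refers to. A sketch of pinning it to 1 MB on the client and reading it back (standard lctl parameter path; the 1 MB value is taken from the comment above):

            # Cap the dirty page cache of every OSC device at 1 MB.
            lctl set_param osc.*.max_dirty_mb=1

            # Read the effective value back to confirm the setting.
            lctl get_param osc.*.max_dirty_mb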

            https://testing.whamcloud.com/test_sets/256d94dd-78e0-4590-9f4b-9be204cadda4

            One more failure, and I see a lot of similar failures over the past couple of months. I was looking into the logs but haven't found the reason. The client log covers just a couple of seconds. From the MDS and OSS logs it is clear that everything was working as expected; at least I see that the edquot flag had been set and there was no problem with that.

            I don't think anything is wrong with the quota code; I guess it is again a problem with the test.

            scherementsev Sergey Cheremencev added a comment

            Thank you Oleg.

            simmonsja James A Simmons added a comment

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53038/
            Subject: LU-17046 tests: fix write success in 1g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: de352465eb6a02aeb20357208c54e903585e12e3

            gerrit Gerrit Updater added a comment
            scherementsev Sergey Cheremencev added a comment - - edited

            There have been a lot of sanity-quota_1g failures lately; however, most of them have different causes. https://review.whamcloud.com/c/fs/lustre-release/+/53038 only fixes the issue from the description. Other failures could be fixed by https://review.whamcloud.com/c/fs/lustre-release/+/52713 (LU-17191). If sanity-quota_1g is still failing after both of these patches land, the new failures should be worked out in a separate ticket.


            "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53038
            Subject: LU-17046 tests: fix write success in 1g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 072e51154c13e7abf035a75018e52dc94c04806c

            gerrit Gerrit Updater added a comment

            From https://testing.whamcloud.com/test_sets/199b4874-fb26-4062-b506-70ca71268b02

             [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d1g.sanity-quota/f1g.sanity-quota-0] [count=10] [seek=10]
            dd: error writing '/mnt/lustre/d1g.sanity-quota/f1g.sanity-quota-0': Disk quota exceeded
            8+0 records in
            7+0 records out
            

            Due to pre-acquire requests it gets EDQUOT earlier and writes only 7 MB instead of 10 MB. Although there are no logs showing the edquot flag changing from 1 to 0 (somehow there is nothing about that, and I believe this part of the logs is missing), I think that because only 17 MB were written instead of 20, the OSTs could send release requests, causing the edquot flag to change from 1 to 0. Thus, when it writes the remaining 2 MB (there are 2 OSTs) it doesn't fail, since the EDQUOT flag is no longer set, and it ends up writing just 19 (17+2) MB instead of 22 (20+2).

            scherementsev Sergey Cheremencev added a comment
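            An illustrative reconstruction of the write sequence being analysed. The ~20 MB hard limit, the quota_usr account, and the lfs setquota invocation are assumptions; the dd lines mirror the output quoted above:

            testfile=/mnt/lustre/d1g.sanity-quota/f1g.sanity-quota-0

            # Assumed hard block limit for the test user.
            lfs setquota -u quota_usr -B 20M /mnt/lustre

            # First 10 MB from offset 0, then 10 MB more from offset 10 MB;
            # the second dd is the one from the log that stopped after 7 MB
            # with "Disk quota exceeded".
            dd if=/dev/zero bs=1M of=$testfile count=10
            dd if=/dev/zero bs=1M of=$testfile count=10 seek=10

            # Per the analysis: 10 + 7 = 17 MB landed instead of 20, the OSTs
            # could release unused qunits and clear the edquot flag, so a later
            # 2 MB write (1 MB per OST) succeeded: 17 + 2 = 19 MB instead of 22.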
            simmonsja James A Simmons added a comment - - edited

            Tracked down the commit that is causing this failure.

            commit 5176c0494338de34be9b2bc35e55f91daaab67a6
            ("LU-13810 tests: increase limit for 1g")

            Further testing suggests this bug existed before this patch, but the patch really makes it surface.


            "James Simmons <jsimmons@infradead.org>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52995
            Subject: LU-17046 tests: increase sleep for sanity-quota 1g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f3e49e5c530451426a0a2505b9cd16ba6662e4f4

            gerrit Gerrit Updater added a comment

            People

              Assignee: simmonsja James A Simmons
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: