[LU-14299] sanity-quota test 61 fails with 'write failed, expect succeed' Created: 06/Jan/21  Updated: 13/Jun/22  Resolved: 08/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

ZFS


Issue Links:
Related
is related to LU-12829 sanity-quota test 61 fails with 'writ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-quota test_61 fails with 'write failed, expect succeed'. The first failure with this error was on 05 May 2020 at https://testing.whamcloud.com/test_sets/f918d217-9487-479e-8294-16936426fe25 for Lustre 2.13.53.162. The failures are mostly seen on ZFS: 200 of the 218 total failures for this test with this error are on ZFS. So far, we are only seeing this failure on master (the future 2.14.0).

Looking at a recent failure with no other sanity-quota test failures, at https://testing.whamcloud.com/test_sets/8ced1c69-2747-41c9-a580-8a3c5fcdf857, we see that sanity-quota test 61 fails with 'Disk quota exceeded' after increasing the default quota:

set to use default quota
set default quota
get default quota
Disk default usr quota:
     Filesystem   bquota  blimit  bgrace   iquota  ilimit  igrace
    /mnt/lustre  20480   20480       0      0       0      10
Test not out of quota
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d61.sanity-quota/f61.sanity-quota-0] [count=10] [oflag=sync]
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 1.71261 s, 6.1 MB/s
Test out of quota
CMD: trevis-201vm4 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: trevis-201vm3 lctl set_param -n osd*.*OS*.force_sync=1
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d61.sanity-quota/f61.sanity-quota-0] [count=40] [oflag=sync]
dd: error writing '/mnt/lustre/d61.sanity-quota/f61.sanity-quota-0': Disk quota exceeded
20+0 records in
19+0 records out
19922944 bytes (20 MB, 19 MiB) copied, 3.91893 s, 5.1 MB/s
Increase default quota
CMD: trevis-201vm4 lctl set_param -n os[cd]*.*MDT*.force_sync=1
CMD: trevis-201vm3 lctl set_param -n osd*.*OS*.force_sync=1
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d61.sanity-quota/f61.sanity-quota-0] [count=40] [oflag=sync]
dd: error writing '/mnt/lustre/d61.sanity-quota/f61.sanity-quota-0': Disk quota exceeded
1+0 records in
0+0 records out
0 bytes copied, 0.00246058 s, 0.0 kB/s
CMD: trevis-201vm4 /usr/sbin/lctl get_param -n version 2>/dev/null
CMD: trevis-201vm4 zpool get all
 sanity-quota test_61: @@@@@@ FAIL: write failed, expect succeed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
  = /usr/lib64/lustre/tests/sanity-quota.sh:159:quota_error()
  = /usr/lib64/lustre/tests/sanity-quota.sh:3996:test_default_quota()
  = /usr/lib64/lustre/tests/sanity-quota.sh:4051:test_61()
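
For context, the default-quota workflow the test exercises can be reproduced by hand along these lines. This is a hedged sketch: the mount point, user name, and limits are illustrative placeholders, and the lfs setquota -U / lfs quota -U default user quota options should be double-checked against your Lustre version.

    # Set a 20 MB default block quota for all users, then exceed it.
    lfs setquota -U -b 20M -B 20M /mnt/lustre
    lfs quota -U /mnt/lustre            # show the default usr quota
    # quota_usr is a placeholder for an unprivileged test user
    su quota_usr -c 'dd if=/dev/zero of=/mnt/lustre/d61/f61 bs=1M count=40 oflag=sync'
    # expected to stop with "Disk quota exceeded" around the 20 MB mark
    lfs setquota -U -b 60M -B 60M /mnt/lustre   # raise the default limit (LIMIT*3)
    # the write retried after this point is the one that fails in this ticket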

Looking at the test, we increase the default quota, cancel the OSC and MDC locks, and sync data. We then run dd again and get the quota exceeded error:

3986         log "Increase default quota"
3987         # increase default quota
3988         $LFS setquota $qdtype $qs $((LIMIT*3)) $qh $((LIMIT*3)) $DIR ||
3989                 error "set default quota failed"
3990 
3991         cancel_lru_locks osc
3992         cancel_lru_locks mdc
3993         sync; sync_all_data || true
3994         if [ $qpool == "data" ]; then
3995                 $RUNAS $DD of=$TESTFILE count=$((LIMIT*2 >> 10)) oflag=sync ||
3996                         quota_error $qtype $qid "write failed, expect succeed"
3997         else
3998                 $RUNAS createmany -m $TESTFILE $((LIMIT*2)) ||
3999                         quota_error $qtype $qid "create failed, expect succeed"
4000 
4001                 unlinkmany $TESTFILE $((LIMIT*2))
4002         fi
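
The symptom suggests the quota slaves have not yet re-acquired quota under the raised limit by the time dd runs, so the write transiently returns EDQUOT. A hedged illustration of the race, reusing the test's own variables (this retry loop is illustrative only, not the landed fix; see the comments below):

    # Illustrative retry: the write can fail with EDQUOT for a short
    # window until the quota slaves re-acquire under the raised limit.
    for i in $(seq 1 10); do
            $RUNAS $DD of=$TESTFILE count=$((LIMIT*2 >> 10)) oflag=sync && break
            sleep 2
    done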

Logs for other failures are at:
https://testing.whamcloud.com/test_sets/470f88c5-e555-4352-bd2f-ddb2f281e7b6
https://testing.whamcloud.com/test_sets/a973395d-3213-4b34-ae57-45155f98ee26



 Comments   
Comment by James Nunez (Inactive) [ 06/Jan/21 ]

Sergey -
These quota failures started around the time the OST pool quotas patch landed. Would you please review this failure? Could it be due to that patch? Does this test need to change based on the OST pool quota patch?

Thanks

Comment by Peter Jones [ 18/Jan/21 ]

Sergey, how is your investigation progressing?

Comment by Gerrit Updater [ 02/Feb/21 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41389
Subject: LU-14299 test: sleep to enable quota acquire again
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d233ba95e6b53dd94ac2ef8930c7f5e7037ca71b
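
The subject line suggests the fix is to pause after the default limit is raised so the quota slaves can acquire quota again before the next write. A minimal sketch of that shape of change in test_default_quota (the sleep duration and exact placement are assumptions; the gerrit change above has the actual diff):

    $LFS setquota $qdtype $qs $((LIMIT*3)) $qh $((LIMIT*3)) $DIR ||
            error "set default quota failed"
    # assumed fix shape: give the slaves time to re-acquire quota
    # before the write below is expected to succeed
    sleep 5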

Comment by Gerrit Updater [ 08/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41389/
Subject: LU-14299 test: sleep to enable quota acquire again
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 430e3f01ef2dc83ed317cf2b97be8a2ad50d9f13

Comment by Peter Jones [ 08/Feb/21 ]

Landed for 2.14

Comment by Gerrit Updater [ 13/Jun/22 ]

"Minh Diep <mdiep@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47618
Subject: LU-14299 test: sleep to enable quota acquire again
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4070cf9f90d1b37268fd6bc5ac9a2c419cd63e56
