[LU-11929] sanity-quota test 7d hangs with ‘(qmt_handler.c:421:qmt_dqacq0()) $$$ Release too much!’ on the MDS Created: 05/Feb/19  Updated: 11/Feb/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.10.7, Lustre 2.10.8, Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: ZFS
Environment:

ZFS


Issue Links:
Related
is related to LU-10313 sanity-lfsck test_33: DQACQ failed wi... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-quota test_7d hangs during ZFS testing for 2.10.7; so far we have seen it only on 2.10.6.13 and 2.10.6.35.

Looking at the logs at https://testing.whamcloud.com/test_sessions/6d3f7b3a-f92d-4924-9192-f7eecd7d5b49 , the last lines of the client (vm9) test_log are:

CMD: trevis-40vm11 /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0006.quota_slave.info |
				grep user | awk '{ print \$3 }'
CMD: trevis-40vm12 lctl set_param fail_val=0 fail_loc=0
fail_val=0
fail_loc=0
running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d7d.sanity-quota/f7d.sanity-quota] [count=21] [oflag=sync]
dd: error writing '/mnt/lustre/d7d.sanity-quota/f7d.sanity-quota': Disk quota exceeded
20+0 records in
19+0 records out
19922944 bytes (20 MB) copied, 2033.43 s, 9.8 kB/s
running as uid/gid/euid/egid 60001/60001/60001/60001, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d7d.sanity-quota/f7d.sanity-quota-1] [count=21] [oflag=sync]
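For reference, the get_param pipeline in the client log above reads the `user` line of the OST's quota_slave.info and prints its third field, which the test uses to check the quota slave's reintegration status. A minimal shell sketch of that extraction, using a hypothetical sample of the quota_slave.info output (the exact lines and field layout here are an assumption, for illustration only):

```shell
# Hypothetical sample of `lctl get_param -n osd-zfs.lustre-OST0006.quota_slave.info`;
# the field layout is assumed for illustration, not copied from a real system.
info='target name:    lustre-OST0006
quota enabled:  ug
conn to master: setup
user uptodate:  glb[1],slv[1],reint[0]
group uptodate: glb[1],slv[1],reint[0]'

# Same extraction as the test: keep the "user" line, print its third field.
status=$(echo "$info" | grep user | awk '{ print $3 }')
echo "$status"
```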

On the MDS (vm12), we see:

[44392.869341] Lustre: DEBUG MARKER: == sanity-quota test 7d: Quota reintegration (Transfer index in multiple bulks) ====================== 23:08:17 (1549148897)
[44393.070435] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
[44399.467416] Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0
[44399.859790] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=none
[44414.275402] Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0x608
[44414.633037] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=u
[44438.618804] Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0
[44953.937332] LustreError: 27855:0:(qmt_handler.c:421:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:1048576 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:14150438 soft:13476608 granted:0 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
[44953.941728] LustreError: 27855:0:(qmt_handler.c:421:qmt_dqacq0()) Skipped 18 previous similar messages
[45553.974287] LustreError: 27856:0:(qmt_handler.c:421:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:1048576 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:14150438 soft:13476608 granted:0 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
[45553.974290] LustreError: 27855:0:(qmt_handler.c:421:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:1048576 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:14150438 soft:13476608 granted:0 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
[45553.974295] LustreError: 27855:0:(qmt_handler.c:421:qmt_dqacq0()) Skipped 17 previous similar messages
[46154.360051] LustreError: 27856:0:(qmt_handler.c:421:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:1048576 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:14150438 soft:13476608 granted:0 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
[46154.365743] LustreError: 27856:0:(qmt_handler.c:421:qmt_dqacq0()) Skipped 18 previous similar messages
[46814.027256] LustreError: 26491:0:(qmt_handler.c:421:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:1048576 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:14150438 soft:13476608 granted:0 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
[46814.032250] LustreError: 26491:0:(qmt_handler.c:421:qmt_dqacq0()) Skipped 21 previous similar messages
[47414.195940] LustreError: 27855:0:(qmt_handler.c:421:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release:1048576 granted:0, total:0 qmt:lustre-QMT0000 pool:0-dt id:500 enforced:1 hard:14150438 soft:13476608 granted:0 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
[47414.200389] LustreError: 27855:0:(qmt_handler.c:421:qmt_dqacq0()) Skipped 19 previous similar messages

On the OSS, we see:

[44434.303364] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0006.quota_slave.info |
[44434.303364] 				grep user | awk '{ print $3 }'
[44434.662998] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osd-zfs.lustre-OST0006.quota_slave.info |
[44434.662998] 				grep user | awk '{ print $3 }'
[44530.264427] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:usr id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
[44530.268212] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 19 previous similar messages
[45130.433168] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:usr id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
[45130.436534] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 19 previous similar messages
[45790.255910] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:usr id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
[45790.259658] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 21 previous similar messages
[46450.453245] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:grp id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
[46450.456931] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 21 previous similar messages
[47110.650703] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:usr id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
[47110.654670] LustreError: 23139:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 21 previous similar messages
[47770.563329] LustreError: 23140:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:lustre-OST0000 qtype:grp id:500 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
[47770.568287] LustreError: 23140:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 21 previous similar messages

We have seen this test hang with these error messages only twice; the other failure's logs are at
https://testing.whamcloud.com/test_sets/feb38b6a-1a93-11e9-8388-52540065bddc .

In contrast, we have seen sanity-quota test_7d hang many times in a DNE configuration with ZFS, but those failures do not show the error messages above. We have seen the hangs without the error messages on 2.10.6, 2.10.7, and 2.11.56.



 Comments   
Comment by James Nunez (Inactive) [ 05/Feb/19 ]

This issue may be the same as LU-10313.

Comment by James Nunez (Inactive) [ 17/Sep/19 ]

It looks like we are seeing a very similar error for obdfilter-survey test 1b, with logs at:
https://testing.whamcloud.com/test_sets/b533b6e6-d739-11e9-9fc9-52540065bddc

Comment by Xiao Zhenggang [ 08/Mar/20 ]

We hit the same problem on 2.12.2, using ldiskfs with quota enabled and without DNE.

Comment by James Nunez (Inactive) [ 26/Jan/21 ]

We're seeing obdfilter-survey test 2a hang, for ldiskfs, with the following on the MDS console:

[64922.450035] Lustre: DEBUG MARKER: == obdfilter-survey test 2a: Stripe F/S over the Network ============================================= 16:49:13 (1611593353)
[65230.880156] LustreError: 2589254:0:(qmt_handler.c:699:qmt_dqacq0()) $$$ Release too much! uuid:lustre-MDT0000-lwp-OST0000_UUID release: 786421 granted:0, total:262145  qmt:lustre-QMT0000 pool:dt-0x0 id:500 enforced:1 hard:13350758 soft:12715008 granted:262145 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:no

and the following connection errors on the OSS:

[145874.903451] Lustre: DEBUG MARKER: == obdfilter-survey test 2a: Stripe F/S over the Network ============================================= 16:49:13 (1611593353)
[145875.141081] LustreError: 1713491:0:(tgt_grant.c:919:tgt_grant_alloc()) lustre-OST0000: client lustre-OST0000_osc_UUID/00000000fa1440da requesting > max (2147483647), 34359738368
[145885.380313] Lustre: lustre-OST0000: Client lustre-OST0000_osc_UUID (at 10.9.53.1@tcp) reconnecting
[145885.382119] Lustre: Skipped 6 previous similar messages
[145892.718592] LustreError: 1776368:0:(ldlm_lib.c:3462:target_bulk_io()) @@@ bulk WRITE failed: rc = -107  req@000000003b27e49f x1689865548924928/t0(0) o4->lustre-OST0000_osc_UUID@10.9.53.1@tcp:0/0 lens 488/448 e 0 to 0 dl 1611593384 ref 1 fl Interpret:/0/0 rc 0/0 job:'ptlrpcd_00_01.0'
[145892.718838] Lustre: lustre-OST0000: Bulk IO write error with lustre-OST0000_osc_UUID (at 10.9.53.1@tcp), client will retry: rc = -107
[145892.722920] LustreError: 1776368:0:(ldlm_lib.c:3462:target_bulk_io()) Skipped 1 previous similar message
Generated at Sat Feb 10 02:48:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.