[LU-4857] (qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -22 Created: 03/Apr/14  Updated: 23/Jun/16  Resolved: 23/Jun/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: javed shaikh (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS 6.4 / Kernel 2.6.32-358.18.1.el6_lustre.x86_64


Severity: 3
Rank (Obsolete): 13395

 Description   

Hi,

I'm seeing the following messages from a 2.4.2 OSS every few minutes:

2014-04-02T16:32:48+11:00 lemming17 kernel: LustreError: 17701:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:short-OST011f qtype:grp id:6644 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
2014-04-02T16:32:48+11:00 lemming17 kernel: LustreError: 17701:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:short-OST011f qtype:grp id:6644 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
2014-04-02T16:31:50+11:00 lemming27 kernel: LustreError: 21284:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:short-OST0115 qtype:grp id:6644 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0
2014-04-02T16:31:50+11:00 lemming27 kernel: LustreError: 21284:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:short-OST0115 qtype:grp id:6644 enforced:1 granted:1048576 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0

Surprisingly, the errors are spewed from only two OSSes out of the whole set, and only for specific OSTs.

At the same time, the MDS is throwing the following:

2014-04-02T16:32:36+11:00 gerbil5 kernel: LustreError: 17470:0:(qmt_handler.c:431:qmt_dqacq0()) $$$ Release too much! uuid:short-MDT0000-lwp-OST011f_UUID release:1048576 granted:0, total:354880716 qmt:short-QMT0000 pool:0-dt id:6644 enforced:1 hard:2516582400 soft:1258291200 granted:354880716 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
2014-04-02T16:32:36+11:00 gerbil5 kernel: LustreError: 17470:0:(qmt_handler.c:431:qmt_dqacq0()) $$$ Release too much! uuid:short-MDT0000-lwp-OST011f_UUID release:1048576 granted:0, total:354880716 qmt:short-QMT0000 pool:0-dt id:6644 enforced:1 hard:2516582400 soft:1258291200 granted:354880716 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
2014-04-02T16:32:39+11:00 gerbil5 kernel: LustreError: 4733:0:(qmt_handler.c:431:qmt_dqacq0()) $$$ Release too much! uuid:short-MDT0000-lwp-OST0115_UUID release:1048576 granted:0, total:354880716 qmt:short-QMT0000 pool:0-dt id:6644 enforced:1 hard:2516582400 soft:1258291200 granted:354880716 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0
2014-04-02T16:32:39+11:00 gerbil5 kernel: LustreError: 4733:0:(qmt_handler.c:431:qmt_dqacq0()) $$$ Release too much! uuid:short-MDT0000-lwp-OST0115_UUID release:1048576 granted:0, total:354880716 qmt:short-QMT0000 pool:0-dt id:6644 enforced:1 hard:2516582400 soft:1258291200 granted:354880716 time:0 qunit:1048576 edquot:0 may_rel:0 revoke:0

These errors appeared after all the servers were rebooted afresh as part of a maintenance cycle (some other LBUGs were fixed).

Any pointers on what could be causing this?



 Comments   
Comment by Niu Yawei (Inactive) [ 03/Apr/14 ]

This looks like the quota slave index files on the master and the slave have somehow become unsynchronized.

1. Did the error messages show up non-stop, or did they disappear after a while?
2. Check the quota_slave/info proc file for the OSTs with errors (cat /proc/fs/lustre/osd-ldiskfs/$OSTNAME/quota_slave/info); see the sketch after this list for checking all OSTs at once.
3. Try to trigger reintegration to see if it solves the problem (echo 1 > /proc/fs/lustre/osd-ldiskfs/$OSTNAME/quota_slave/force_reint).
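
For reference, a minimal shell sketch (assuming the osd-ldiskfs proc layout used above) that prints the quota slave state of every OST on an OSS, so the "uptodate" flags are easy to compare across OSTs:

   #!/bin/bash
   # Sketch only: list the quota slave state for every OST served by this OSS.
   # Assumes the /proc/fs/lustre/osd-ldiskfs layout shown in step 2 above.
   for info in /proc/fs/lustre/osd-ldiskfs/*/quota_slave/info; do
       ost=$(basename "$(dirname "$(dirname "$info")")")
       echo "== $ost =="
       grep -E 'conn to master|uptodate' "$info"
   done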

If forced reintegration can't resolve the problem, could you collect a Lustre log with D_QUOTA and D_TRACE enabled? (Execute the following instructions on the OSS to collect the log; a consolidated script sketch follows the list.)

  • lctl clear
  • lctl set_param debug=+trace; lctl set_param debug=+quota
  • lctl debug_daemon start $TMPFILE 500
  • lctl mark "============= start reint ================="
  • echo 1 > /proc/fs/lustre/osd-ldiskfs/$OSTNAME/quota_slave/force_reint
  • wait for reintegration to finish (the "user uptodate" and "group uptodate" lines in /proc/fs/lustre/osd-ldiskfs/$OSTNAME/quota_slave/info change to "glb[1],slv[1],reint[0]")
  • lctl debug_daemon stop
  • lctl debug_file $TMPFILE $LOGFILE
  • attach the log file to the ticket.
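
Roughly, the whole sequence above could be scripted as follows (a sketch only; OSTNAME, TMPFILE and LOGFILE are placeholders, here filled with example values from this ticket):

   #!/bin/bash
   # Sketch consolidating the steps above; adjust the placeholders for the local setup.
   OSTNAME=short-OST0115          # example OST from this ticket
   TMPFILE=/tmp/lustre-debug.bin  # binary debug buffer written by debug_daemon
   LOGFILE=/tmp/lustre-debug.txt  # decoded text log to attach to the ticket
   QS=/proc/fs/lustre/osd-ldiskfs/$OSTNAME/quota_slave

   lctl clear
   lctl set_param debug=+trace
   lctl set_param debug=+quota
   lctl debug_daemon start $TMPFILE 500
   lctl mark "============= start reint ================="
   echo 1 > $QS/force_reint

   # Wait for reintegration to finish. Only group quota is enabled on this
   # system, so only the group line is checked; add a similar check for the
   # "user uptodate" line if user quota is enabled.
   until grep -q 'group uptodate: *glb\[1\],slv\[1\],reint\[0\]' $QS/info; do
       sleep 5
   done

   lctl debug_daemon stop
   lctl debug_file $TMPFILE $LOGFILE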
Comment by javed shaikh (Inactive) [ 04/Apr/14 ]

1. The error messages rolled in every 10 minutes.
3. Triggered reintegration, and it seems to have stopped both the OSS and MDS errors.
2. This is the state after reintegration:

   $ cat /proc/fs/lustre/osd-ldiskfs/short-OST0115/quota_slave/info
   target name:    short-OST0115
   pool ID:        0
   type:           dt
   quota enabled:  g
   conn to master: setup
   space acct:     ug
   user uptodate:  glb[0],slv[0],reint[0]
   group uptodate: glb[1],slv[1],reint[0]
   

I'll observe for a day and then get back. Thanks.

Comment by javed shaikh (Inactive) [ 07/Apr/14 ]

Update:
It did stop on the above OSS/OSTs, but it started on other OSS/OSTs in the last two days. Forced reintegration stops it, so I guess I'll need to keep forcing it as cases crop up.
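
Until the root cause is found, one hypothetical way to automate the workaround (a sketch only; it assumes the DQACQ errors reach dmesg and uses the osd-ldiskfs proc layout from the earlier comments) would be to sweep for affected OSTs and force reintegration on each:

   #!/bin/bash
   # Hypothetical sweep (sketch only): find OSTs that have logged the
   # "DQACQ failed" error and force quota reintegration on each of them.
   for ost in $(dmesg | grep 'DQACQ failed' | sed -n 's/.*qsd:\([^ ]*\).*/\1/p' | sort -u); do
       reint=/proc/fs/lustre/osd-ldiskfs/$ost/quota_slave/force_reint
       if [ -e "$reint" ]; then
           echo 1 > "$reint"
           echo "forced quota reintegration on $ost"
       fi
   done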

Comment by javed shaikh (Inactive) [ 07/Apr/14 ]

This ticket can be closed. Thanks.

Comment by Niu Yawei (Inactive) [ 08/Apr/14 ]

Hi Javed, are there any error messages on the MDT or OST before this starts? I'd like to see why the reintegration failed.

Comment by javed shaikh (Inactive) [ 08/Apr/14 ]

Just checked one of the affected OSTs: no error messages, only messages related to client evictions and orphan object deletions.

Comment by javed shaikh (Inactive) [ 14/Apr/14 ]

Just got this from another OSS:

2014-04-10T17:56:09+10:00 lemming2 kernel: Lustre: short-OST008e: haven't heard from client 4ab84a25-da95-49af-a512-e406eb4af84f (at 10.9.3.2@o2ib3) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8820371ff400, cur 1397116569 expire 1397116419 last 1397116342
2014-04-10T17:56:09+10:00 lemming2 kernel: Lustre: Skipped 20 previous similar messages

2014-04-11T09:20:49+10:00 lemming2 kernel: LustreError: 25971:0:(ldlm_resource.c:1165:ldlm_resource_get()) short-OST008e: lvbo_init failed for resource 0xb65a43:0x0: rc = -2

2014-04-11T10:08:10+10:00 lemming2 kernel: LustreError: 25593:0:(qsd_handler.c:344:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4 qsd:short-OST008e qtype:grp id:4093 enforced:1 granted:65536 pending:0 waiting:0 req:1 usage:0 qunit:0 qtune:0 edquot:0

Though there is one more OSS where the DQACQ error occurred on an OST without any prior errors on that OST.

Comment by Niu Yawei (Inactive) [ 23/Jun/16 ]

Duplicate of LU-6629.
