[LU-4180] ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2) Created: 29/Oct/13 Updated: 07/Aug/14 Resolved: 07/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 4 |
| Rank (Obsolete): | 11317 |
| Description |
|
The MDS hung with the following errors. There is some evidence that it happened right after setting a user's quota to zero. After reboot and recovery it hung again.
---- First set of error messages ----
---- Some stack traces ----
---- Second set of errors after recovery ----
---- Stack trace after reboot/recovery ----
[3]kdb> btp 7664 |
| Comments |
| Comment by Peter Jones [ 29/Oct/13 ] |
|
Niu, could you please comment? Thanks, Peter |
| Comment by Jay Lan (Inactive) [ 29/Oct/13 ] |
|
This problem affects production systems. |
| Comment by Mahmoud Hanafi [ 29/Oct/13 ] |
|
Setting console logging to 0 allowed the system to respond again, but /var/log/messages is still scrolling these messages. |
| Comment by Mahmoud Hanafi [ 29/Oct/13 ] |
|
I was able to get a D_TRACE and D_QUOTA log on the MDS and have uploaded the files. |
| Comment by Niu Yawei (Inactive) [ 30/Oct/13 ] |
|
The log shows that dqacq_handler() returns -ENOENT from lustre_dqget(), but I don't see why we can't afford a fake dquot in dqacq_handler(); it looks like a bug to me. See dqacq_handler():

    cfs_down_write(&mds->mds_qonoff_sem);
    dquot = lustre_dqget(obd, info, qdata->qd_id, QDATA_IS_GRP(qdata), 0);
                                                                       ^ last argument: I think this should be 1
    if (IS_ERR(dquot)) {
            cfs_up_write(&mds->mds_qonoff_sem);
            GOTO(skip, rc = PTR_ERR(dquot));
    }

However, I don't see why the quota_search_lqs() call made before lustre_dqget() didn't return -ENOENT; I don't see where the lqs is inserted into the hash. Did you re-set the limit for the user after the reboot (before capturing the log)? |
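For readers unfamiliar with the pattern discussed above, here is a minimal, self-contained sketch of a lookup that either fails with -ENOENT or hands back a placeholder ("fake") entry, depending on a flag. It assumes, as the comment suggests, that the final argument of lustre_dqget() controls whether a fake dquot may be returned; that reading is not confirmed here. Every name in the sketch (toy_dqget, toy_dquot, toy_table, err_ptr, is_err, ptr_err) is hypothetical and is not Lustre or kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel's ERR_PTR(), IS_ERR()
 * and PTR_ERR(): a small negative errno is encoded in the pointer value. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

#define TABLE_SIZE 8

/* Toy quota entry; NOT the real Lustre dquot structure. */
struct toy_dquot {
    int in_use;
    int fake;                     /* 1 if created as a placeholder */
    unsigned int id;
    unsigned long long hardlimit;
};

struct toy_table {
    struct toy_dquot slots[TABLE_SIZE];
};

/* Look up the entry for `id`. If it is missing and can_fake is 0, return
 * an encoded -ENOENT; if can_fake is 1, create and return a zero-limit
 * placeholder instead, so the caller never sees -ENOENT. */
static struct toy_dquot *toy_dqget(struct toy_table *t, unsigned int id,
                                   int can_fake)
{
    struct toy_dquot *free_slot = NULL;

    assert(t != NULL);

    for (size_t i = 0; i < TABLE_SIZE; i++) {
        struct toy_dquot *dq = &t->slots[i];

        if (dq->in_use && dq->id == id)
            return dq;
        if (!dq->in_use && free_slot == NULL)
            free_slot = dq;
    }

    if (!can_fake)
        return err_ptr(-ENOENT);
    if (free_slot == NULL)
        return err_ptr(-ENOSPC);

    free_slot->in_use = 1;
    free_slot->fake = 1;
    free_slot->id = id;
    free_slot->hardlimit = 0;
    return free_slot;
}
```

With can_fake set to 0, a caller that checks is_err() takes the error path with rc = -ENOENT, which mirrors the rc:-2 in the reported message; with can_fake set to 1 the same lookup succeeds with a zero-limit placeholder.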
| Comment by Mahmoud Hanafi [ 30/Oct/13 ] |
|
The limit was set BEFORE the log capture. This is the sequence of events. |
| Comment by Niu Yawei (Inactive) [ 31/Oct/13 ] |
|
I see, thanks Mahmoud. It looks like something went wrong when you cleared the quota limit (lfs setquota -u pmoram -B 0 -b 0 -I 0 -i 0), which left a partial global limit setting, and I believe quota recovery can't recover from such a failure (it can only repair inconsistencies between the limits on master and slave that arise during dqacq/dqrel). I think re-setting the quota limit is the only way to recover from this (just as you did). Could you also try re-clearing the limit to see if things go well? In the new quota architecture (2.4), this kind of partial global limit setting can be recovered automatically. |
| Comment by Mahmoud Hanafi [ 04/Nov/13 ] |
|
Re-clearing the quota didn't cause the issue, but the MDS has been rebooted a few times. We would still like to understand why this happened and would like to see it fixed. |
| Comment by Niu Yawei (Inactive) [ 05/Nov/13 ] |
Do you mean that re-clearing the quota limit didn't resolve the problem of error messages like "ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)"? |
| Comment by Mahmoud Hanafi [ 05/Nov/13 ] |
|
The error was resolved when I set the quota. I then tried to reproduce the error by clearing the quota (setting it to zero), but I was not able to reproduce it, at least for now. |
| Comment by Niu Yawei (Inactive) [ 06/Nov/13 ] |
To reproduce the problem: 1. set 10M hardlimit for user xxx; |
| Comment by Mahmoud Hanafi [ 07/Aug/14 ] |
|
Please close; this is no longer an issue in 2.4.3. |
| Comment by Peter Jones [ 07/Aug/14 ] |
|
Thanks Mahmoud |