[LU-4180] ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2) Created: 29/Oct/13  Updated: 07/Aug/14  Resolved: 07/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File quota.out.gz     Text File trace.out.gz    
Severity: 4
Rank (Obsolete): 11317

 Description   

MDS hung with the following errors. There is some evidence that it happened right after setting a user's quota to zero. Then, after reboot/recovery, it hung again.

---- First set of error messages ----
<3>LustreError: 8304:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
<3>LustreError: 8304:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
<3>LustreError: 8304:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
<3>LustreError: 8304:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
<3>LustreError: 8304:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)

---- SOME STACK TRACES ----
Stack traceback for pid 8262
0xffff883f8749cae0 8262 2 1 15 R 0xffff883f8749d180 ldlm_cb_26
sp ip Function (args)
======================= <nmi>
======================= <normal>
0xffff883f9c8cf900 0xffffffff8151f04c _spin_lock+0x1c (0xffffffff8204c3a0)
0xffff883f9c8cf908 0xffffffff8135ce06 serial8250_console_write+0x156 (unknown, 0xffffffff81dc7012, unknown)
0xffff883f9c8cf958 0xffffffff81069745 __call_console_drivers+0x75 (unknown, unknown)
0xffff883f9c8cf988 0xffffffff810697aa _call_console_drivers+0x4a (unknown, unknown, unknown)
0xffff883f9c8cf9a8 0xffffffff81069d78 release_console_sem+0x108
0xffff883f9c8cf9e8 0xffffffff8106a478 vprintk+0x248 (unknown, unknown)
0xffff883f9c8cfa88 0xffffffff8151c151 printk+0x41
0xffff883f9c8cfae8 0xffffffffa04f8105 [libcfs]cfs_print_to_console+0x75 (unknown, unknown, 0xffff883fd42b9182, unknown, unknown, 0xffffffffa07d7180)
0xffff883f9c8cfb18 0xffffffffa05038de [libcfs]libcfs_debug_vmsg2+0x5de (0xffffffffa083de20, 0xffffffffa07f1c0e, 0xffff883f9c8cfc90, 0x0, 0x3, 0xffff8839c7b1cc00)
0xffff883f9c8cfc88 0xffffffffa0503ea1 [libcfs]libcfs_debug_msg+0x41 (unknown, unknown, 0xfffffffe, 0xfffffffe, 0x3, 0xffff8839c7b1cc00)
0xffff883f9c8cfce8 0xffffffffa075e253 [ptlrpc]target_handle_dqacq_callback+0x6e3 (0xffff881e0ef11400)
0xffff883f9c8cfd38 0xffffffffa076df87 [ptlrpc]ldlm_callback_handler+0xa17 (0xffff881e0ef11400)
0xffff883f9c8cfdc8 0xffffffffa079fb4e [ptlrpc]ptlrpc_main+0xc4e (unknown)
0xffff883f9c8cff48 0xffffffff8100c0ca child_rip+0xa (unknown, unknown)
Enter <q> to end, <cr> to continue:

---- Second set of errors after recovery ----
Lustre: MDS mdd_obd-nbp8-MDT0000: nbp8-OST001a_UUID now active, resetting orphans
LustreError: 6186:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
LustreError: 7771:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
LustreError: 6185:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
LustreError: 7798:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)

---- Stack trace after reboot/recovery ----
Stack traceback for pid 8063
0xffff883fc6e92aa0 8063 2 1 8 R 0xffff883fc6e93140 qmaster_recovd
sp ip Function (args)
0xffff883fc6ef7a18 0xffffffff8100c313 kdb_interrupt+0x13 (0xffffffffa05ada40, 0x0, 0x4e3a, 0x0, 0x0, 0xffff883f8a5bd940)
0xffff883fc6ef7a78 0xffffffff8151f04e _spin_lock+0x1e (0xffffffffa05ada40)
0xffff883fc6ef7aa8 0xffffffffa0582d76 [lnet]LNetMDBind+0x1d6 (unknown, 0xffff883f9310c978, unknown, unknown, unknown, unknown, 0xffff883fc2e91e00, 0x100000130, 0xc6ef7ba0)
0xffff883fc6ef7af8 0xffffffffa0788478 [ptlrpc]ptl_send_buf+0x128 (0xffff883f9310c978, 0xffff883fc2e91e00, unknown, unknown, 0xffff883f9310c980, 0xffff881e28d19b40, 0xffff883f0000001c, 0x52700f3927367, 0x1500000000)
0xffff883fc6ef7ba8 0xffffffffa078b774 [ptlrpc]ptl_send_rpc+0x544 (0xffff883f9310c800, unknown)
0xffff883fc6ef7c68 0xffffffffa0781993 [ptlrpc]ptlrpc_send_new_req+0x433 (0xffff883f9310c800)
0xffff883fc6ef7cd8 0xffffffffa078553a [ptlrpc]ptlrpc_set_wait+0x6a (0xffff883f8bcdd8c0)
0xffff883fc6ef7d88 0xffffffffa0785cfc [ptlrpc]ptlrpc_queue_wait+0x7c (0xffff883f9310c800)
0xffff883fc6ef7da8 0xffffffffa09c6011 [lquota]client_quota_ctl+0x151 (unknown, unknown, 0xffff883fa47df5c0)
0xffff883fc6ef7de8 0xffffffffa09c6af7 [lquota]lov_quota_ctl+0x427 (unknown, unknown, 0xffff883fa47df5c0)
0xffff883fc6ef7e48 0xffffffffa09cd9d9 [lquota]obd_quotactl+0xd9 (0xffff881e9c387400, 0xffff883fa47df5c0)
0xffff883fc6ef7e78 0xffffffffa09d4ade [lquota]qmaster_recovery_main+0x8ae (unknown)
0xffff883fc6ef7f48 0xffffffff8100c0ca child_rip+0xa (unknown, unknown)

[3]kdb> btp 7664
Stack traceback for pid 7664
0xffff883fcdcad500 7664 2 1 27 R 0xffff883fcdcadba0 qmaster_recovd
sp ip Function (args)
0xffff883fcdcaf998 0xffffffff8100c313 kdb_interrupt+0x13 (0xffffffffa05ada40, 0xffff883fc646e400, 0x4e3a, 0x0, 0x0, 0x130)
0xffff883fcdcaf9f8 0xffffffff8151f051 _spin_lock+0x21 (0xffffffffa05ada40)
0xffff883fcdcafa28 0xffffffffa05889f9 [lnet]lnet_send+0x59 (0x500000a971b3c, 0xffff883fc646e400)
0xffff883fcdcafa98 0xffffffffa0589f04 [lnet]LNetPut+0x2d4 (0x500000a971b3c, unknown, unknown, 0x500000a971b49, 0x3039, unknown, 0x52700f3927364, 0x100000000, 0x0)
0xffff883fcdcafaf8 0xffffffffa078852a [ptlrpc]ptl_send_buf+0x1da (0xffff883fd7360978, 0xffff883f8e6a3c00, unknown, unknown, 0xffff883fd7360980, 0xffff881e28eca740, 0xffff883f0000001c, 0x52700f3927364, 0x1500000000)
0xffff883fcdcafba8 0xffffffffa078b774 [ptlrpc]ptl_send_rpc+0x544 (0xffff883fd7360800, unknown)
0xffff883fcdcafc68 0xffffffffa0781993 [ptlrpc]ptlrpc_send_new_req+0x433 (0xffff883fd7360800)
0xffff883fcdcafcd8 0xffffffffa078553a [ptlrpc]ptlrpc_set_wait+0x6a (0xffff883f8a8b26c0)
0xffff883fcdcafd88 0xffffffffa0785cfc [ptlrpc]ptlrpc_queue_wait+0x7c (0xffff883fd7360800)
0xffff883fcdcafda8 0xffffffffa09c6011 [lquota]client_quota_ctl+0x151 (unknown, unknown, 0xffff883f8e8b3f40)
0xffff883fcdcafde8 0xffffffffa09c6af7 [lquota]lov_quota_ctl+0x427 (unknown, unknown, 0xffff883f8e8b3f40)
0xffff883fcdcafe48 0xffffffffa09cd9d9 [lquota]obd_quotactl+0xd9 (0xffff881e9c387400, 0xffff883f8e8b3f40)
0xffff883fcdcafe78 0xffffffffa09d4ade [lquota]qmaster_recovery_main+0x8ae (unknown)
0xffff883fcdcaff48 0xffffffff8100c0ca child_rip+0xa (unknown, unknown)



 Comments   
Comment by Peter Jones [ 29/Oct/13 ]

Niu

Could you please comment?

Thanks

Peter

Comment by Jay Lan (Inactive) [ 29/Oct/13 ]

This problem affects production systems.

Comment by Mahmoud Hanafi [ 29/Oct/13 ]

Setting console logging to 0 allowed the system to respond again.

But /var/log/messages is still scrolling with these messages:
Oct 29 13:42:30 nbp8-mds1 kernel: LustreError: 6724:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
Oct 29 13:42:30 nbp8-mds1 kernel: LustreError: 6724:0:(ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)
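
(The exact command used to turn console logging off isn't recorded in this ticket; on Lustre of this vintage the libcfs console log mask is usually cleared roughly as follows, where the lnet.printk sysctl name is an assumption:)

    # Stop libcfs from copying error messages to the serial console, which was
    # keeping the MDS busy in serial8250_console_write() (see the stack above).
    sysctl -w lnet.printk=0
    # equivalently (path assumed for this kernel/Lustre build):
    echo 0 > /proc/sys/lnet/printk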

Comment by Mahmoud Hanafi [ 29/Oct/13 ]

Was able to get a D_TRACE and D_QUOTA debug log on the MDS and uploaded the files.
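
(The exact capture commands aren't recorded here either; a sketch of the usual lctl procedure, assuming the masks were set with set_param and the buffer dumped with 'lctl dk':)

    # Enable the trace and quota debug masks (the ticket only says D_TRACE and
    # D_QUOTA were enabled; the set_param syntax is assumed).
    lctl set_param debug="+trace +quota"
    # ... let the dqacq/dqrel errors occur ...
    # Dump the kernel debug buffer and compress it for upload.
    lctl dk /tmp/trace.out
    gzip /tmp/trace.out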

Comment by Niu Yawei (Inactive) [ 30/Oct/13 ]

The log shows dqacq_handler() returning -ENOENT from lustre_dqget(), but I don't see why we can't allow a fake dquot in dqacq_handler(); it looks like a bug to me. See dqacq_handler():

        cfs_down_write(&mds->mds_qonoff_sem);
        dquot = lustre_dqget(obd, info, qdata->qd_id, QDATA_IS_GRP(qdata), 0);
                                                                          ^^^ I think this should be 1
        if (IS_ERR(dquot)) {
                cfs_up_write(&mds->mds_qonoff_sem);
                GOTO(skip, rc = PTR_ERR(dquot));
        }

However, I don't see why the quota_search_lqs() call before lustre_dqget() didn't return -ENOENT; I don't see where the lqs is inserted into the hash. Did you re-set the limit for the user after the reboot (before capturing the log)?

Comment by Mahmoud Hanafi [ 30/Oct/13 ]

The limit was set BEFORE the log capture.

This is the sequence of events:
1. User id 2885's quota was set to zero (lfs setquota -u pmoran -B 0 -b 0 -I 0 -i 0). This caused the errors on the MDS and made it unresponsive.
2. The MDS was dropped to kdb and stack traces were taken.
3. MDS was rebooted.
4. MDS recovered and started printing the errors.
5. MDS was rebooted again, but this time the MDT was mounted with abort recovery.
6. MDS started printing errors.
7. Console logging was turned off and the host became responsive enough to allow the debug dumps.
8. Quota was set for user id 2885, which stopped the errors (lfs setquota -u pmoran -b 80000000000 -B 100000000000 -i 10000 -I 50000).

Comment by Niu Yawei (Inactive) [ 31/Oct/13 ]

I see, thanks Mahmoud. It looks like something went wrong when you cleared the quota limit (lfs setquota -u pmoran -B 0 -b 0 -I 0 -i 0), which caused a partial global limit setting, and I believe quota recovery can't recover from such a failure (it can only recover an inconsistency of limits between master and slave that happened on dqacq/dqrel).

I think re-setting the quota limit is the only way to recover from this (just as you did). Could you also try re-clearing the limit to see if things go well?
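
For reference, the re-set / re-clear test being suggested would look roughly like this; /mnt/nbp8 stands in for the client mount point (not shown in this ticket) and the limit values are the ones Mahmoud used earlier:

    # Re-set non-zero limits for the affected user to recover from the partial
    # global limit setting (values from the earlier comment).
    lfs setquota -u pmoran -b 80000000000 -B 100000000000 -i 10000 -I 50000 /mnt/nbp8
    # Then clear the limits again to check whether the dqacq/dqrel errors return.
    lfs setquota -u pmoran -b 0 -B 0 -i 0 -I 0 /mnt/nbp8
    # Verify the resulting limits.
    lfs quota -u pmoran /mnt/nbp8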

In the new quota architecture (2.4), this kind of partial global limit setting can be recovered automatically.

Comment by Mahmoud Hanafi [ 04/Nov/13 ]

Re-clearing the quota didn't cause the issue, but the MDS has been rebooted a few times. We still would like to understand why this happened and would like to see it fixed.

Comment by Niu Yawei (Inactive) [ 05/Nov/13 ]

Re-clearing the quota didn't cause the issue.

Do you mean that re-clearing the quota limit didn't resolve the problem of error messages like "ldlm_lib.c:2467:target_handle_dqacq_callback()) dqacq/dqrel failed! (rc:-2)"?

Comment by Mahmoud Hanafi [ 05/Nov/13 ]

The error was resolved when I set the quota. I then wanted to reproduce the error by clearing/setting the quota to zero, but I was not able to reproduce it, at least for now.

Comment by Niu Yawei (Inactive) [ 06/Nov/13 ]

The error was resolved when I set the quota. I then wanted to reproduce the error by clearing/setting the quota to zero, but I was not able to reproduce it, at least for now.

To reproduce the problem (a script sketch of these steps follows the list):

1. Set a 10M hard limit for user xxx;
2. Stripe a file owned by user xxx on OST1 (use command 'lfs setstripe -i 0 -c 1 $filename');
3. Deactivate the OSC for OST1 on the MDT to simulate a connection problem between the MDT and OST1 (use command 'lctl --device $oscdev deactivate' on the MDS);
4. Clear the quota limit for user xxx;
5. Reboot the MDS;
6. Write to the file as user xxx to see if the error messages show up on the MDS.
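
Put together as a script, this is only a sketch: /mnt/nbp8 (client mount point), the test file name, and $oscdev (the MDS-side OSC device index for OST1, found with 'lctl dl') are assumptions.

    #!/bin/sh
    # Rough reproducer for the partial global limit setting (sketch only).
    MNT=/mnt/nbp8            # assumed client mount point
    FILE=$MNT/quota_repro    # assumed test file
    TESTUSER=xxx             # test user from the steps above

    # 1. Set a 10M block hard limit for the user (setquota limits are in KB).
    lfs setquota -u $TESTUSER -b 0 -B 10240 -i 0 -I 0 $MNT

    # 2. Create a single-stripe file on the first OST, owned by the user.
    lfs setstripe -i 0 -c 1 $FILE
    chown $TESTUSER $FILE

    # 3. On the MDS: deactivate the OSC for that OST to simulate a lost
    #    MDT<->OST connection (look up $oscdev with 'lctl dl'):
    #      lctl --device $oscdev deactivate

    # 4. Clear the quota limit for the user.
    lfs setquota -u $TESTUSER -b 0 -B 0 -i 0 -I 0 $MNT

    # 5. Reboot the MDS (manual step).

    # 6. Write to the file as the user and watch the MDS console for
    #    "dqacq/dqrel failed!" messages.
    su $TESTUSER -c "dd if=/dev/zero of=$FILE bs=1M count=20"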

Comment by Mahmoud Hanafi [ 07/Aug/14 ]

Closing; no longer an issue in 2.4.3.

Comment by Peter Jones [ 07/Aug/14 ]

Thanks Mahmoud
