[LU-3460] recovery-small test_51 timeout: lqe_iter_cb(): Inuse quota entry Created: 12/Jun/13  Updated: 06/Nov/13  Resolved: 11/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: Lustre 2.5.0, Lustre 2.4.2

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 8656

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f96a0ba0-d2f2-11e2-a349-52540035b04c.

The sub-test test_51 failed with the following error in the MDS log:

lqe_iter_cb(): Inuse quota entry

Info required for matching: recovery-small 51



 Comments   
Comment by Jodi Levi (Inactive) [ 17/Jun/13 ]

Niu, Johann,
Could one of you have a look at this ticket and please comment?
Thank you!

Comment by Niu Yawei (Inactive) [ 18/Jun/13 ]
 /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d /mnt/mds1");echo XXRETCODE:$?'
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:810:ldlm_resource_complain()) Namespace lustre-MDT0000-lwp-MDT0000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:810:ldlm_resource_complain()) Skipped 1 previous similar message
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:816:ldlm_resource_complain()) Resource: ffff88006a8fec00 (8589934598/65536/0/0) (rc: 1)
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:816:ldlm_resource_complain()) Skipped 1 previous similar message
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1457:ldlm_resource_dump()) --- Resource: ffff88006a8fec00 (8589934598/65536/0/0) (rc: 2)
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1460:ldlm_resource_dump()) Granted locks (in reverse order):
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1463:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff88006a8ff340/0x21e5b86f50e944e2 lrc: 2/1,0 mode: CR/CR res: 8589934598/65536 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0x21e5b86f50e9450c expref: -99 pid: 28646 timeout: 0lvb_type: 2
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1463:ldlm_resource_dump()) Skipped 1 previous similar message
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1457:ldlm_resource_dump()) --- Resource: ffff88006a8feac0 (8589934598/16842752/0/0) (rc: 2)
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1460:ldlm_resource_dump()) Granted locks (in reverse order):

Looks like somebody is still holding a reference on the id lock, so the lqe can't be put on cleanup (the id lock holds an lqe reference), but I don't see why the id locks haven't been freed yet.

Comment by Niu Yawei (Inactive) [ 21/Jun/13 ]

This is probably related to LU-2067.

It is possible that the lock reader/writer count hasn't dropped to zero when ldlm_namespace_cleanup() is called. Imagine the following scenario:

  • ldlm_cli_enqueue() is called to create the lock and increases the lock reader/writer count;
  • before the enqueue request is added to imp_sending_list or imp_delayed_list, shutdown happens;
  • the shutdown procedure aborts in-flight RPCs, but the enqueue request can't be aborted since it is on neither the sending list nor the delayed list;
  • the shutdown procedure moves on to obd_import_event(IMP_EVENT_ACTIVE)->ldlm_namespace_cleanup() to clean up all locks;
  • ldlm_namespace_cleanup() finds that the just-created lock still has one reader/writer, because the interpret callback for this lock enqueue (where the reader/writer count is dropped) hasn't been called yet.

That's why we see the warning message from ldlm_namespace_cleanup(), even though the lock is eventually released.
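
To make the window concrete, here is a minimal stand-alone C sketch of the race; it is not actual Lustre code, and struct fake_lock plus the three helpers are illustrative stand-ins for the ldlm_cli_enqueue() / ldlm_namespace_cleanup() / enqueue-interpret steps described above:

#include <stdio.h>

struct fake_lock {
	int	readers;	/* stands in for the ldlm lock reader count */
	int	writers;	/* stands in for the ldlm lock writer count */
};

/* Step 1: ldlm_cli_enqueue() creates the lock and takes a reader reference
 * before the enqueue RPC has been linked on imp_sending_list/imp_delayed_list. */
static void enqueue_start(struct fake_lock *lk)
{
	lk->readers++;
}

/* Step 2: shutdown aborts in-flight RPCs, but this request sits on neither
 * list yet, so the interpret callback that would drop the reference has not
 * run when obd_import_event(IMP_EVENT_ACTIVE)->ldlm_namespace_cleanup()
 * walks the namespace. */
static void namespace_cleanup(const struct fake_lock *lk)
{
	if (lk->readers + lk->writers != 0)
		printf("resource refcount nonzero (%d) after lock cleanup\n",
		       lk->readers + lk->writers);
}

/* Step 3: the interpret callback eventually runs and drops the reference,
 * so the lock is released in the end, just after the warning was printed. */
static void enqueue_interpret(struct fake_lock *lk)
{
	lk->readers--;
}

int main(void)
{
	struct fake_lock lk = { 0, 0 };

	enqueue_start(&lk);	/* reader count is now 1 */
	namespace_cleanup(&lk);	/* warning fires: reference still held */
	enqueue_interpret(&lk);	/* reference finally dropped */
	return 0;
}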

The problem with the lqe refcount is that even if the lock is cancelled in the end, the lqe can be re-added to the adjust list (see qsd_id_blocking_ast()->qsd_adjust()), and its refcount will then never be dropped because the quota writeback thread has already been stopped. I'll cook a patch to fix this problem.
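
For illustration only, a stand-alone C sketch of the leak and of the kind of guard a fix would need; the qsd_stopping flag, the adjust-list bookkeeping, and the helper names are assumptions made for this sketch, not the actual change in http://review.whamcloud.com/6731:

#include <stdbool.h>
#include <stdio.h>

struct fake_lqe {
	int	refcount;	/* stands in for the lqe reference count */
	bool	on_adjust_list;	/* queued for the quota writeback thread */
};

struct fake_qsd {
	bool	stopping;	/* set once the writeback thread has been stopped */
};

/* Blocking AST path: when the per-ID lock is cancelled the lqe may be queued
 * for adjustment (cf. qsd_id_blocking_ast()->qsd_adjust()).  The adjust list
 * holds a reference; if the writeback thread is already gone, nothing ever
 * dequeues the entry, the reference is never dropped, and lqe_iter_cb()
 * later complains "Inuse quota entry" at cleanup time. */
static void blocking_ast_adjust(struct fake_qsd *qsd, struct fake_lqe *lqe)
{
	if (qsd->stopping)
		return;			/* guard: skip re-queueing during shutdown */

	lqe->refcount++;		/* reference held by the adjust list */
	lqe->on_adjust_list = true;
}

int main(void)
{
	struct fake_qsd qsd = { .stopping = true };	/* umount in progress */
	struct fake_lqe lqe = { .refcount = 0 };

	blocking_ast_adjust(&qsd, &lqe);
	printf("lqe refcount at cleanup: %d\n", lqe.refcount);	/* 0 with the guard */
	return 0;
}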

Comment by Niu Yawei (Inactive) [ 21/Jun/13 ]

http://review.whamcloud.com/6731

Comment by Sarah Liu [ 27/Jun/13 ]

Another instance on ZFS:
https://maloo.whamcloud.com/test_sets/cb382a8e-ddcc-11e2-85a3-52540035b04c

Comment by Niu Yawei (Inactive) [ 11/Jul/13 ]

Patch landed.

Comment by Jian Yu [ 04/Nov/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
OSTCOUNT=2

recovery-small test 51 hung as follows:

Stopping /mnt/mds1 (opts:) on fat-intel-2vm3
CMD: fat-intel-2vm3 umount -d /mnt/mds1

Console log on MDS:

15:12:27:Lustre: DEBUG MARKER: umount -d /mnt/mds1
15:12:27:LustreError: 26988:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x1010000:0x0].0 (ffff88007017ea80) refcount nonzero (1) after lock cleanup; forcing cleanup.
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0 (ffff88007017ea80) refcount = 2
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff88006ef0bb40/0xdd1be0eccfc5313 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0xdd1be0eccfc5359 expref: -99 pid: 26657 timeout: 0 lvb_type: 2
15:12:27:LustreError: 26988:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x10000:0x0].0 (ffff88007017ebc0) refcount nonzero (1) after lock cleanup; forcing cleanup.
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x10000:0x0].0 (ffff88007017ebc0) refcount = 2
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff88007017f240/0xdd1be0eccfc531a lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x10000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0xdd1be0eccfc5360 expref: -99 pid: 26656 timeout: 0 lvb_type: 2
15:12:28:LustreError: 26988:0:(ldlm_lib.c:2137:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
15:12:28:Lustre: 26659:0:(ldlm_lib.c:1801:target_recovery_overseer()) recovery is aborted, evict exports in recovery
15:12:28:Lustre: 26659:0:(ldlm_lib.c:1801:target_recovery_overseer()) Skipped 2 previous similar messages
15:12:28:LustreError: 26650:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -5
15:12:28:LustreError: 26988:0:(lquota_entry.c:147:lqe_iter_cb()) $$$ Inuse quota entry qsd:lustre-MDT0000 qtype:usr id:60000 enforced:1 granted:0 pending:0 waiting:0 req:0 usage:0 qunit:0 qtune:0 edquot:0

Maloo report: https://maloo.whamcloud.com/test_sets/2edc7780-4451-11e3-8472-52540035b04c

replay-single test 61b also hit the same failure:
https://maloo.whamcloud.com/test_sets/74517e7e-4450-11e3-8472-52540035b04c

replay-dual test 16 also hit the same failure:
https://maloo.whamcloud.com/test_sets/9bf98132-4451-11e3-8472-52540035b04c

The patch is needed on the Lustre b2_4 branch.

Comment by Bob Glossman (Inactive) [ 04/Nov/13 ]

backport to b2_4: http://review.whamcloud.com/8169
