[LU-3460] recovery-small test_51 timeout: lqe_iter_cb(): Inuse quota entry Created: 12/Jun/13 Updated: 06/Nov/13 Resolved: 11/Jul/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.5.0, Lustre 2.4.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 8656 |
| Description |
|
This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/f96a0ba0-d2f2-11e2-a349-52540035b04c. The sub-test test_51 failed with the following error in the MDS log:
Info required for matching: recovery-small 51 |
| Comments |
| Comment by Jodi Levi (Inactive) [ 17/Jun/13 ] |
|
Niu, Johann, |
| Comment by Niu Yawei (Inactive) [ 18/Jun/13 ] |
/usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d /mnt/mds1");echo XXRETCODE:$?'
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:810:ldlm_resource_complain()) Namespace lustre-MDT0000-lwp-MDT0000 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:810:ldlm_resource_complain()) Skipped 1 previous similar message
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:816:ldlm_resource_complain()) Resource: ffff88006a8fec00 (8589934598/65536/0/0) (rc: 1)
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:816:ldlm_resource_complain()) Skipped 1 previous similar message
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1457:ldlm_resource_dump()) --- Resource: ffff88006a8fec00 (8589934598/65536/0/0) (rc: 2)
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1460:ldlm_resource_dump()) Granted locks (in reverse order):
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1463:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff88006a8ff340/0x21e5b86f50e944e2 lrc: 2/1,0 mode: CR/CR res: 8589934598/65536 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0x21e5b86f50e9450c expref: -99 pid: 28646 timeout: 0 lvb_type: 2
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1463:ldlm_resource_dump()) Skipped 1 previous similar message
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1457:ldlm_resource_dump()) --- Resource: ffff88006a8feac0 (8589934598/16842752/0/0) (rc: 2)
Jun 11 06:22:36 client-27vm7 kernel: LustreError: 28977:0:(ldlm_resource.c:1460:ldlm_resource_dump()) Granted locks (in reverse order):

It looks like somebody is still holding a reference on the id lock, so the lqe can't be released on cleanup (the id lock holds an lqe reference), but I don't see why the id locks haven't been freed yet. |
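To make the "Inuse quota entry" failure mode concrete, here is a minimal standalone sketch of a cleanup-time iterator that complains when a quota entry is still referenced. The types and the "hash holds one reference" convention below are simplified stand-ins, not the actual Lustre definitions of lquota_entry or lqe_iter_cb(); the point is only that any uncancelled per-ID lock pinning the lqe keeps its refcount above what cleanup expects, so the entry cannot be freed.

```c
/* Standalone sketch, NOT actual Lustre code: the real lquota_entry and
 * lqe_iter_cb() live in lustre/quota/lquota_entry.c and are far richer.
 * This only models why a lingering id-lock reference trips the check. */
#include <stdio.h>

struct lquota_entry {
	unsigned int lqe_ref;	/* reference count on this entry */
	unsigned int lqe_id;	/* quota ID (uid/gid) */
};

/*
 * Called for each entry during shutdown.  In this sketch the entry hash
 * itself holds one reference, so anything above 1 means another holder
 * (e.g. an uncancelled per-ID lock) still pins the entry.
 */
static int lqe_iter_cb(struct lquota_entry *lqe)
{
	if (lqe->lqe_ref > 1) {
		fprintf(stderr, "Inuse quota entry id:%u ref:%u\n",
			lqe->lqe_id, lqe->lqe_ref);
		return -1;	/* leaked reference: entry can't be freed */
	}
	return 0;
}

int main(void)
{
	/* one ref from the hash + one still held by an id lock */
	struct lquota_entry pinned = { .lqe_ref = 2, .lqe_id = 60000 };

	return lqe_iter_cb(&pinned) ? 1 : 0;
}
```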
| Comment by Niu Yawei (Inactive) [ 21/Jun/13 ] |
|
This is probably related to the following race: a lock's reader/writer count may not have dropped to zero by the time ldlm_namespace_cleanup() is called, for instance when another thread is still in the middle of cancelling the lock.
That's why we can see the warning message from ldlm_namespace_cleanup(), even though the lock is eventually released. The real problem with the lqe refcount is that even if the lock is cancelled in the end, the lqe may be re-added to the adjust list (see qsd_id_blocking_ast()->qsd_adjust()), in which case its refcount is never dropped because the quota writeback thread has already been stopped. I'll cook a patch to fix this problem. |
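Sketching the fix direction described above (hypothetical and simplified; the real change is in the quota slave code around qsd_adjust()/qsd_id_blocking_ast(), and these types are stand-ins): once the qsd instance is stopping, qsd_adjust() must not queue the lqe for later adjustment, because the writeback thread that would dequeue it and drop the reference is already gone.

```c
/* Standalone sketch, NOT the actual patch: simplified stand-ins showing
 * how skipping the adjust-list enqueue on shutdown avoids the lqe leak. */
#include <stdbool.h>
#include <stdio.h>

struct lquota_entry {
	int lqe_ref;		/* reference count */
	bool lqe_adjust;	/* queued on the adjust list? */
};

struct qsd_instance {
	bool qsd_stopping;	/* writeback thread being shut down */
};

static void lqe_getref(struct lquota_entry *lqe) { lqe->lqe_ref++; }
static void lqe_putref(struct lquota_entry *lqe) { lqe->lqe_ref--; }

/* Called from the blocking AST: schedule the entry for quota adjustment. */
static void qsd_adjust(struct qsd_instance *qsd, struct lquota_entry *lqe)
{
	if (qsd->qsd_stopping)
		/* Writeback thread is gone; nobody would ever dequeue the
		 * entry and drop its reference, so don't take one. */
		return;

	lqe_getref(lqe);	/* reference owned by the adjust list */
	lqe->lqe_adjust = true;
}

int main(void)
{
	struct qsd_instance qsd = { .qsd_stopping = true };
	struct lquota_entry lqe = { .lqe_ref = 1 };

	qsd_adjust(&qsd, &lqe);	/* no-op while stopping */
	lqe_putref(&lqe);	/* last holder drops its reference */
	printf("lqe_ref = %d (0 => no leak at cleanup)\n", lqe.lqe_ref);
	return 0;
}
```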
| Comment by Niu Yawei (Inactive) [ 21/Jun/13 ] |
| Comment by Sarah Liu [ 27/Jun/13 ] |
|
Another instance on ZFS. |
| Comment by Niu Yawei (Inactive) [ 11/Jul/13 ] |
|
Patch landed. |
| Comment by Jian Yu [ 04/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/
FSTYPE=zfs

recovery-small test 51 hung as follows:

Stopping /mnt/mds1 (opts:) on fat-intel-2vm3
CMD: fat-intel-2vm3 umount -d /mnt/mds1

Console log on MDS:

15:12:27:Lustre: DEBUG MARKER: umount -d /mnt/mds1
15:12:27:LustreError: 26988:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x1010000:0x0].0 (ffff88007017ea80) refcount nonzero (1) after lock cleanup; forcing cleanup.
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0 (ffff88007017ea80) refcount = 2
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff88006ef0bb40/0xdd1be0eccfc5313 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0xdd1be0eccfc5359 expref: -99 pid: 26657 timeout: 0 lvb_type: 2
15:12:27:LustreError: 26988:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x10000:0x0].0 (ffff88007017ebc0) refcount nonzero (1) after lock cleanup; forcing cleanup.
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x10000:0x0].0 (ffff88007017ebc0) refcount = 2
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
15:12:27:LustreError: 26988:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff88007017f240/0xdd1be0eccfc531a lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x10000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0xdd1be0eccfc5360 expref: -99 pid: 26656 timeout: 0 lvb_type: 2
15:12:28:LustreError: 26988:0:(ldlm_lib.c:2137:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
15:12:28:Lustre: 26659:0:(ldlm_lib.c:1801:target_recovery_overseer()) recovery is aborted, evict exports in recovery
15:12:28:Lustre: 26659:0:(ldlm_lib.c:1801:target_recovery_overseer()) Skipped 2 previous similar messages
15:12:28:LustreError: 26650:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -5
15:12:28:LustreError: 26988:0:(lquota_entry.c:147:lqe_iter_cb()) $$$ Inuse quota entry qsd:lustre-MDT0000 qtype:usr id:60000 enforced:1 granted:0 pending:0 waiting:0 req:0 usage:0 qunit:0 qtune:0 edquot:0

Maloo report: https://maloo.whamcloud.com/test_sets/2edc7780-4451-11e3-8472-52540035b04c

replay-single test 61b also hit the same failure. replay-dual test 16 also hit the same failure.

The patch is needed on the Lustre b2_4 branch. |
| Comment by Bob Glossman (Inactive) [ 04/Nov/13 ] |
|
backport to b2_4: http://review.whamcloud.com/8169 |