[LU-4365] recovery-small test_51: LBUG: (qsd_lib.c:294:qsd_qtype_fini()) ASSERTION( atomic_read(&qqi->qqi_ref) == 1 ) failed Created: 09/Dec/13  Updated: 31/Dec/13  Resolved: 17/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs


Severity: 3
Rank (Obsolete): 11948

 Description   

recovery-small test_51 hung as follows:

Failing mds1 on client-30vm3
CMD: client-30vm3 grep -c /mnt/mds1' ' /proc/mounts
Stopping /mnt/mds1 (opts:) on client-30vm3
CMD: client-30vm3 umount -d /mnt/mds1

LBUG occurred on MDS:

23:04:50:Lustre: DEBUG MARKER: umount -d /mnt/mds1
23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount nonzero (1) after lock cleanup; forcing cleanup.
23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff880063768b40/0x3b627f62d662915 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0x3b627f62d66293f expref: -99 pid: 6945 timeout: 0 lvb_type: 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x10000:0x0].0 (ffff8800709d3ac0) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
23:04:50:LustreError: 7276:0:(ldlm_lib.c:2137:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
23:04:50:Lustre: 6947:0:(ldlm_lib.c:1801:target_recovery_overseer()) recovery is aborted, evict exports in recovery
23:04:50:Lustre: 6947:0:(ldlm_lib.c:1801:target_recovery_overseer()) Skipped 2 previous similar messages
23:04:50:LustreError: 6937:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -5
23:04:50:LustreError: 7276:0:(qsd_lib.c:294:qsd_qtype_fini()) ASSERTION( atomic_read(&qqi->qqi_ref) == 1 ) failed: 
23:04:50:LustreError: 7276:0:(qsd_lib.c:294:qsd_qtype_fini()) LBUG
23:04:50:Pid: 7276, comm: umount
23:04:50:
23:04:50:Call Trace:
23:04:50: [<ffffffffa05c2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
23:04:50: [<ffffffffa05c2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
23:04:51: [<ffffffffa0efcc23>] qsd_qtype_fini+0x363/0x3a0 [lquota]
23:04:51: [<ffffffffa0efcfc8>] qsd_fini+0xd8/0x460 [lquota]
23:04:51: [<ffffffffa0d751a8>] osd_shutdown+0x38/0xe0 [osd_zfs]
23:04:51: [<ffffffffa0d78571>] osd_process_config+0x141/0x190 [osd_zfs]
23:04:51: [<ffffffffa0843b41>] lod_process_config+0x4e1/0x17a0 [lod]
23:04:51: [<ffffffffa05d32d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
23:04:51: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
23:04:52: [<ffffffffa118ed00>] mdd_process_config+0x210/0x610 [mdd]
23:04:52: [<ffffffffa1722c56>] mdt_stack_fini+0x176/0xbe0 [mdt]
23:04:52: [<ffffffffa1189950>] ? mdd_init_capa_ctxt+0x120/0x130 [mdd]
23:04:52: [<ffffffffa1723dca>] mdt_device_fini+0x70a/0xd90 [mdt]
23:04:52: [<ffffffffa071fbc7>] class_cleanup+0x577/0xda0 [obdclass]
23:04:52: [<ffffffffa06f4b06>] ? class_name2dev+0x56/0xe0 [obdclass]
23:04:52: [<ffffffffa07214ac>] class_process_config+0x10bc/0x1c80 [obdclass]
23:04:52: [<ffffffffa05cdd98>] ? libcfs_log_return+0x28/0x40 [libcfs]
23:04:52: [<ffffffffa071ad11>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
23:04:52: [<ffffffffa07221e9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
23:04:52: [<ffffffffa06f4b06>] ? class_name2dev+0x56/0xe0 [obdclass]
23:04:52: [<ffffffffa07571dc>] server_put_super+0x5bc/0xf00 [obdclass]
23:04:52: [<ffffffff8118366b>] generic_shutdown_super+0x5b/0xe0
23:04:52: [<ffffffff81183756>] kill_anon_super+0x16/0x60
23:04:52: [<ffffffffa07240a6>] lustre_kill_super+0x36/0x60 [obdclass]
23:04:52: [<ffffffff81183ef7>] deactivate_super+0x57/0x80
23:04:52: [<ffffffff811a21ef>] mntput_no_expire+0xbf/0x110
23:04:52: [<ffffffff811a2c5b>] sys_umount+0x7b/0x3a0
23:04:52: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
23:04:52:
23:04:52:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/beee4bb4-60cb-11e3-bd66-52540035b04c



 Comments   
Comment by Jian Yu [ 09/Dec/13 ]

The same test passed in another ZFS full group test session on Lustre b2_4 build #63:
https://maloo.whamcloud.com/test_sets/40487e62-5fb6-11e3-85c5-52540035b04c

By searching on Maloo, I found that all of the previous recovery-small test 51 failures were due to LU-3460. Is this a potential regression introduced by the fix for LU-3460?

Comment by Peter Jones [ 10/Dec/13 ]

Niu

Could you please comment?

Thanks

Peter

Comment by Zhenyu Xu [ 12/Dec/13 ]

Hi Niu,

I have a question about qsd_glb_blocking_ast(), in which

        case LDLM_CB_CANCELING: {
                struct qsd_qtype_info *qqi;

                LDLM_DEBUG(lock, "canceling global quota lock");

                qqi = qsd_glb_ast_data_get(lock, true);
                if (qqi == NULL)
                        break;

...
                lu_ref_del(&qqi->qqi_reference, "ast_data_get", lock);
                qqi_putref(qqi);

Since it calls qsd_glb_ast_data_get(lock, true) to reset lock->l_ast_data, and the qqi's reference has already been put inside that call, why does the blocking AST put the reference again here?

By contrast, qsd_glb_glimpse_ast() calls qsd_glb_ast_data_get(lock, false), so the qqi's reference is kept.

Comment by Jian Yu [ 13/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/67/
Distro/Arch: RHEL6.4/x86_64 (server), SLES11SP2/x86_64 (client)

sanity-scrub test 3 also hit this LBUG:
https://maloo.whamcloud.com/test_sets/4bdfe456-6347-11e3-8c76-52540035b04c

Comment by Niu Yawei (Inactive) [ 16/Dec/13 ]

Hi, Bobi

The blocking AST calls qsd_glb_ast_data_get(lock, true) to put the qqi reference held by the lock (since the lock is going to be canceled), and then puts the reference it just acquired in qsd_glb_ast_data_get() (for the caller). In qsd_glb_glimpse_ast(), only the reference held for the caller needs to be put.

23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount nonzero (1) after lock cleanup; forcing cleanup.
23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff880063768b40/0x3b627f62d662915 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0x3b627f62d66293f expref: -99 pid: 6945 timeout: 0 lvb_type: 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x10000:0x0].0 (ffff8800709d3ac0) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):

The qqi references are held by the global quota locks, and the two global quota locks were not cleaned up because their reader count was not 0, but I don't see who is holding the locks.

23:04:38:LustreError: 5385:0:(qsd_reint.c:58:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5
23:04:38:LustreError: 5385:0:(qsd_reint.c:58:qsd_reint_completion()) Skipped 1 previous similar message

There were some lock enqueue failures in the log; perhaps a bug in the error path of lock enqueue caused the lock not to be dropped properly?

Comment by Niu Yawei (Inactive) [ 16/Dec/13 ]

It looks like there is a race: on server umount, if the quota reint process happens to be holding the global locks for reintegration, the locks will not be cleared by server_put_super() -> lustre_disconnect_lwp(). Later, when qsd_fini() stops the reint process, the global locks will be dropped; however, because lock cancel is done asynchronously, by the time qsd_fini() reaches LASSERT(cfs_atomic_read(&qqi->qqi_ref) == 1), the global locks haven't been canceled yet and they still hold qqi references.

It seems we should wait a while in qsd_fini() for such a case; I'll try to cook up a patch soon.

Comment by Niu Yawei (Inactive) [ 16/Dec/13 ]

patch for master: http://review.whamcloud.com/8583

Comment by Niu Yawei (Inactive) [ 16/Dec/13 ]

patch for b2_4: http://review.whamcloud.com/8586

Comment by Jian Yu [ 17/Dec/13 ]

Patch landed on master branch for 2.6.0 and b2_4 branch for 2.4.2.

Comment by Jodi Levi (Inactive) [ 17/Dec/13 ]

Patches have landed. Can this ticket be closed, or is more work expected/needed in this ticket?

Comment by Peter Jones [ 17/Dec/13 ]

Landed for 2.4.2 and 2.5. Will land to 2.5.1 when work starts on that release
