[LU-4365] recovery-small test_51: LBUG: (qsd_lib.c:294:qsd_qtype_fini()) ASSERTION( atomic_read(&qqi->qqi_ref) == 1 ) failed Created: 09/Dec/13 Updated: 31/Dec/13 Resolved: 17/Dec/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/ |
| Severity: | 3 |
| Rank (Obsolete): | 11948 |
| Description |
|
recovery-small test_51 hung as follows:

Failing mds1 on client-30vm3
CMD: client-30vm3 grep -c /mnt/mds1' ' /proc/mounts
Stopping /mnt/mds1 (opts:) on client-30vm3
CMD: client-30vm3 umount -d /mnt/mds1

LBUG occurred on MDS:

23:04:50:Lustre: DEBUG MARKER: umount -d /mnt/mds1
23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount nonzero (1) after lock cleanup; forcing cleanup.
23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff880063768b40/0x3b627f62d662915 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0x3b627f62d66293f expref: -99 pid: 6945 timeout: 0 lvb_type: 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x10000:0x0].0 (ffff8800709d3ac0) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
23:04:50:LustreError: 7276:0:(ldlm_lib.c:2137:target_stop_recovery_thread()) lustre-MDT0000: Aborting recovery
23:04:50:Lustre: 6947:0:(ldlm_lib.c:1801:target_recovery_overseer()) recovery is aborted, evict exports in recovery
23:04:50:Lustre: 6947:0:(ldlm_lib.c:1801:target_recovery_overseer()) Skipped 2 previous similar messages
23:04:50:LustreError: 6937:0:(osp_precreate.c:737:osp_precreate_cleanup_orphans()) lustre-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -5
23:04:50:LustreError: 7276:0:(qsd_lib.c:294:qsd_qtype_fini()) ASSERTION( atomic_read(&qqi->qqi_ref) == 1 ) failed:
23:04:50:LustreError: 7276:0:(qsd_lib.c:294:qsd_qtype_fini()) LBUG
23:04:50:Pid: 7276, comm: umount
23:04:50:
23:04:50:Call Trace:
23:04:50: [<ffffffffa05c2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
23:04:50: [<ffffffffa05c2e97>] lbug_with_loc+0x47/0xb0 [libcfs]
23:04:51: [<ffffffffa0efcc23>] qsd_qtype_fini+0x363/0x3a0 [lquota]
23:04:51: [<ffffffffa0efcfc8>] qsd_fini+0xd8/0x460 [lquota]
23:04:51: [<ffffffffa0d751a8>] osd_shutdown+0x38/0xe0 [osd_zfs]
23:04:51: [<ffffffffa0d78571>] osd_process_config+0x141/0x190 [osd_zfs]
23:04:51: [<ffffffffa0843b41>] lod_process_config+0x4e1/0x17a0 [lod]
23:04:51: [<ffffffffa05d32d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
23:04:51: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
23:04:52: [<ffffffffa118ed00>] mdd_process_config+0x210/0x610 [mdd]
23:04:52: [<ffffffffa1722c56>] mdt_stack_fini+0x176/0xbe0 [mdt]
23:04:52: [<ffffffffa1189950>] ? mdd_init_capa_ctxt+0x120/0x130 [mdd]
23:04:52: [<ffffffffa1723dca>] mdt_device_fini+0x70a/0xd90 [mdt]
23:04:52: [<ffffffffa071fbc7>] class_cleanup+0x577/0xda0 [obdclass]
23:04:52: [<ffffffffa06f4b06>] ? class_name2dev+0x56/0xe0 [obdclass]
23:04:52: [<ffffffffa07214ac>] class_process_config+0x10bc/0x1c80 [obdclass]
23:04:52: [<ffffffffa05cdd98>] ? libcfs_log_return+0x28/0x40 [libcfs]
23:04:52: [<ffffffffa071ad11>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
23:04:52: [<ffffffffa07221e9>] class_manual_cleanup+0x179/0x6f0 [obdclass]
23:04:52: [<ffffffffa06f4b06>] ? class_name2dev+0x56/0xe0 [obdclass]
23:04:52: [<ffffffffa07571dc>] server_put_super+0x5bc/0xf00 [obdclass]
23:04:52: [<ffffffff8118366b>] generic_shutdown_super+0x5b/0xe0
23:04:52: [<ffffffff81183756>] kill_anon_super+0x16/0x60
23:04:52: [<ffffffffa07240a6>] lustre_kill_super+0x36/0x60 [obdclass]
23:04:52: [<ffffffff81183ef7>] deactivate_super+0x57/0x80
23:04:52: [<ffffffff811a21ef>] mntput_no_expire+0xbf/0x110
23:04:52: [<ffffffff811a2c5b>] sys_umount+0x7b/0x3a0
23:04:52: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
23:04:52:
23:04:52:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/beee4bb4-60cb-11e3-bd66-52540035b04c |
| Comments |
| Comment by Jian Yu [ 09/Dec/13 ] |
|
The same test passed in another ZFS full group test session on Lustre b2_4 build #63.

By searching on Maloo, I found all of the previous recovery-small test 51 failures were |
| Comment by Peter Jones [ 10/Dec/13 ] |
|
Niu, could you please comment? Thanks, Peter |
| Comment by Zhenyu Xu [ 12/Dec/13 ] |
|
Hi Niu, I have a question about qsd_glb_blocking_ast(), in the LDLM_CB_CANCELING case:

    case LDLM_CB_CANCELING: {
            struct qsd_qtype_info *qqi;

            LDLM_DEBUG(lock, "canceling global quota lock");

            qqi = qsd_glb_ast_data_get(lock, true);
            if (qqi == NULL)
                    break;
            ...
            lu_ref_del(&qqi->qqi_reference, "ast_data_get", lock);
            qqi_putref(qqi);

Since it calls qsd_glb_ast_data_get(lock, true) to reset lock->l_ast_data, and the qqi reference has already been put inside that call, why does the blocking AST put the reference again here? By contrast, qsd_glb_glimpse_ast() calls qsd_glb_ast_data_get(lock, false) and the qqi reference is kept. |
| Comment by Jian Yu [ 13/Dec/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/67/

sanity-scrub test 3 also hit this LBUG: |
| Comment by Niu Yawei (Inactive) [ 16/Dec/13 ] |
|
Hi Bobi,

The blocking AST calls qsd_glb_ast_data_get(lock, true) to put the qqi reference held by the lock (since the lock is going to be canceled), and then puts the reference it just got from qsd_glb_ast_data_get() (the caller's reference). In qsd_glb_glimpse_ast(), only the reference held for the caller needs to be put.

23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) lustre-MDT0000-lwp-MDT0000: namespace resource [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount nonzero (1) after lock cleanup; forcing cleanup.
23:04:50:LustreError: 7276:0:(ldlm_resource.c:804:ldlm_resource_complain()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x1010000:0x0].0 (ffff8800709d3980) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-MDT0000 lock: ffff880063768b40/0x3b627f62d662915 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x1010000:0x0].0 rrc: 2 type: PLN flags: 0x1106400000000 nid: local remote: 0x3b627f62d66293f expref: -99 pid: 6945 timeout: 0 lvb_type: 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1421:ldlm_resource_dump()) Skipped 1 previous similar message
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1415:ldlm_resource_dump()) --- Resource: [0x200000006:0x10000:0x0].0 (ffff8800709d3ac0) refcount = 2
23:04:50:LustreError: 7276:0:(ldlm_resource.c:1418:ldlm_resource_dump()) Granted locks (in reverse order):

The qqi reference is held by the global quota locks, and these two global quota locks were not cleaned up because the lock reader count was not 0, but I don't see who is holding the lock.

23:04:38:LustreError: 5385:0:(qsd_reint.c:58:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1010000:0x0], rc:-5
23:04:38:LustreError: 5385:0:(qsd_reint.c:58:qsd_reint_completion()) Skipped 1 previous similar message

I see there were some lock enqueue failures in the log; maybe there is a bug in the error path of lock enqueue which caused the lock not to be dropped properly? |
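To make the reference flow above easier to follow, here is a small standalone C model of it. This is illustrative only, not Lustre source: the names mirror the quoted code, and the reset semantics of ast_data_get() are assumed from Niu's explanation. The cancel path drops two references (the lock's plus the caller's), the glimpse path drops only the caller's, so once every global lock has been canceled, exactly the instance's own reference should remain -- which is what the qsd_qtype_fini() assertion checks.

    /*
     * Standalone model of the reference flow described above -- not
     * Lustre source.  The names mirror the quoted code; the behaviour
     * of ast_data_get(reset) is an assumption based on Niu's comment:
     * with reset=true it drops the reference held by the lock and, in
     * both cases, returns with an extra reference taken for the caller.
     */
    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct qqi_model { int qqi_ref; };

    /* one reference for the qsd instance itself + one per granted global lock */
    static struct qqi_model qqi = { .qqi_ref = 2 };

    static struct qqi_model *ast_data_get(bool reset)
    {
            qqi.qqi_ref++;          /* reference taken for the caller */
            if (reset)
                    qqi.qqi_ref--;  /* drop the reference held by the lock */
            return &qqi;
    }

    static void blocking_ast_canceling(void)
    {
            struct qqi_model *q = ast_data_get(true);

            q->qqi_ref--;           /* put the caller's reference */
    }

    static void glimpse_ast(void)
    {
            struct qqi_model *q = ast_data_get(false);

            q->qqi_ref--;           /* put the caller's reference only */
    }

    int main(void)
    {
            glimpse_ast();             /* the lock keeps its reference */
            blocking_ast_canceling();  /* cancel drops the lock's reference */

            /* teardown expects only the instance's own reference left */
            assert(qqi.qqi_ref == 1);
            printf("qqi_ref = %d\n", qqi.qqi_ref);
            return 0;
    }

In this model, a granted lock that is never canceled before teardown leaves the count above 1, which matches the assertion failure seen in this ticket.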
| Comment by Niu Yawei (Inactive) [ 16/Dec/13 ] |
|
Looks like there is a race: on server umount, if the quota reint process happens to be holding the global locks for reintegration, the locks will not be cleared by server_put_super(). It seems we should wait a while for such a case in qsd_fini(); I'll try to cook up a patch soon. |
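As a rough illustration of the proposed approach (waiting in qsd_fini() for the reint thread to release the global locks before the final reference check), here is a minimal standalone sketch. The helper names, the polling loop, and the timeout value are hypothetical; the actual change is in the patches linked below.

    /*
     * Minimal standalone sketch of the idea above: instead of asserting
     * immediately at teardown, wait (bounded) for the reintegration
     * thread to drop the qqi reference it holds via the global quota
     * lock.  Illustrative only -- names, the polling and the timeout
     * are hypothetical; the actual fix is in the patches linked below.
     */
    #include <assert.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QSD_FINI_TIMEOUT 10             /* hypothetical bound, in seconds */

    static int qqi_ref = 2;                 /* instance ref + reint's lock ref */
    static int reint_busy_secs = 3;         /* pretend reint finishes in 3s */

    /* stand-in for the reint thread eventually releasing the global lock */
    static void simulate_reint_progress(void)
    {
            if (reint_busy_secs > 0 && --reint_busy_secs == 0)
                    qqi_ref--;              /* reint drops its qqi reference */
    }

    static void qsd_qtype_fini_sketch(void)
    {
            int waited = 0;

            /* don't run the final check while reint may still hold a ref */
            while (qqi_ref > 1 && waited++ < QSD_FINI_TIMEOUT) {
                    sleep(1);
                    simulate_reint_progress();
            }

            assert(qqi_ref == 1);           /* the check that LBUGed before */
            printf("teardown ok, qqi_ref = %d\n", qqi_ref);
    }

    int main(void)
    {
            qsd_qtype_fini_sketch();
            return 0;
    }

The key design point is that the wait is bounded: if the reint thread drops its reference in time, teardown proceeds cleanly; otherwise the check still catches a genuine leak.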
| Comment by Niu Yawei (Inactive) [ 16/Dec/13 ] |
|
patch for master: http://review.whamcloud.com/8583 |
| Comment by Niu Yawei (Inactive) [ 16/Dec/13 ] |
|
patch for b2_4: http://review.whamcloud.com/8586 |
| Comment by Jian Yu [ 17/Dec/13 ] |
|
Patch landed on master branch for 2.6.0 and b2_4 branch for 2.4.2. |
| Comment by Jodi Levi (Inactive) [ 17/Dec/13 ] |
|
Patches have landed. Can this ticket be closed, or is more work expected/needed in this ticket? |
| Comment by Peter Jones [ 17/Dec/13 ] |
|
Landed for 2.4.2 and 2.5. Will land on 2.5.1 when work starts on that release.