[LU-8736] stuck during umount in soak-test Created: 19/Oct/16 Updated: 18/Dec/17 |
|
| Status: | In Progress |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Di Wang | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | soak | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
In the latest soak test, one of the MDTs got stuck during umount:

LustreError: 0-0: Forced cleanup waiting for soaked-MDT0000-osp-MDT0002 namespace with 1 resources in use, (rc=-110)

The stack trace:

umount S 0000000000000011 0 8015 8013 0x00000080
 ffff8803d9b33808 0000000000000086 ffff8803d9b337d0 ffff8803d9b337cc
 ffff8803d9b33868 ffff88043fe84000 00001b24f314dc54 ffff880038635a00
 00000000000003ff 0000000101c3089b ffff8803f3c31ad8 ffff8803d9b33fd8
Call Trace:
 [<ffffffff8153a9b2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81089fa0>] ? process_timeout+0x0/0x10
 [<ffffffffa0abded0>] __ldlm_namespace_free+0x1c0/0x560 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0abe2df>] ldlm_namespace_free_prior+0x6f/0x220 [ptlrpc]
 [<ffffffffa13b0db2>] osp_process_config+0x4a2/0x680 [osp]
 [<ffffffff81291947>] ? find_first_bit+0x47/0x80
 [<ffffffffa12c5650>] lod_sub_process_config+0x100/0x1f0 [lod]
 [<ffffffffa12cad66>] lod_process_config+0x646/0x1580 [lod]
 [<ffffffffa113e4ff>] ? lfsck_stop+0x15f/0x4c0 [lfsck]
 [<ffffffffa0801032>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
 [<ffffffffa1343253>] mdd_process_config+0x113/0x5e0 [mdd]
 [<ffffffffa11fee62>] mdt_device_fini+0x482/0x13e0 [mdt]
 [<ffffffffa08df626>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
 [<ffffffffa08f82c2>] class_cleanup+0x582/0xd30 [obdclass]
 [<ffffffffa08dae56>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa08fa5d6>] class_process_config+0x1b66/0x24c0 [obdclass]
 [<ffffffffa07fc151>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff8117904c>] ? __kmalloc+0x21c/0x230
 [<ffffffffa08fb3ef>] class_manual_cleanup+0x4bf/0xc90 [obdclass]
 [<ffffffffa08dae56>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa092983c>] server_put_super+0x8bc/0xcd0 [obdclass]
 [<ffffffff81194aeb>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff81194bd6>] kill_anon_super+0x16/0x60
 [<ffffffffa08fe596>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff81195377>] deactivate_super+0x57/0x80
 [<ffffffff811b533f>] mntput_no_expire+0xbf/0x110
 [<ffffffff811b5e8b>] sys_umount+0x7b/0x3a0
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

And it seems there is an MDT handler thread (mdt_rename) which holds the remote lock on soaked-MDT0000-osp-MDT0002 but is then stuck on a local lock enqueue, which in turn blocks the namespace cleanup during umount.

mdt01_016 S 000000000000000a 0 7405 2 0x00000080
 ffff8804027ab900 0000000000000046 0000000000000000 ffffffff810a1c1c
 ffff880433fef520 ffff8804027ab880 00000a768c137fd5 0000000000000000
 ffff8804027ab8c0 0000000100ab043e ffff880433fefad8 ffff8804027abfd8
Call Trace:
 [<ffffffff810a1c1c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffffa0ad54b0>] ? ldlm_expired_completion_wait+0x0/0x250 [ptlrpc]
 [<ffffffffa0ada07d>] ldlm_completion_ast+0x68d/0x9b0 [ptlrpc]
 [<ffffffff81067650>] ? default_wake_function+0x0/0x20
 [<ffffffffa0ad93fe>] ldlm_cli_enqueue_local+0x21e/0x810 [ptlrpc]
 [<ffffffffa0ad99f0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa11fa770>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [<ffffffffa12074a4>] mdt_object_local_lock+0x3a4/0xb00 [mdt]
 [<ffffffffa11fa770>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [<ffffffffa0ad99f0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa1208103>] mdt_object_lock_internal+0x63/0x320 [mdt]
 [<ffffffffa1218e9e>] ? mdt_lookup_version_check+0x9e/0x350 [mdt]
 [<ffffffffa1208580>] mdt_reint_object_lock+0x20/0x60 [mdt]
 [<ffffffffa121cba7>] mdt_reint_rename_or_migrate+0x1317/0x2690 [mdt]
 [<ffffffffa11fa770>] ? mdt_blocking_ast+0x0/0x2e0 [mdt]
 [<ffffffffa0ad99f0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
 [<ffffffffa09238c0>] ? lu_ucred+0x20/0x30 [obdclass]
 [<ffffffffa0b06b00>] ? lustre_pack_reply_v2+0xf0/0x280 [ptlrpc]
 [<ffffffffa121df53>] mdt_reint_rename+0x13/0x20 [mdt]
 [<ffffffffa121704d>] mdt_reint_rec+0x5d/0x200 [mdt]
 [<ffffffffa1201d5b>] mdt_reint_internal+0x62b/0xa50 [mdt]
 [<ffffffffa120262b>] mdt_reint+0x6b/0x120 [mdt]
 [<ffffffffa0b6b0cc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
 [<ffffffffa0b17821>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
 [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
 [<ffffffffa0b16af0>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
 [<ffffffff810a138e>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a12f0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
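To make the blocking pattern above easier to follow, here is a minimal userspace sketch using plain pthreads. This is not Lustre code; the thread roles, helper names, and counters are stand-ins chosen for illustration. One thread models the mdt_reint_rename handler that holds the remote lock and then blocks on a local lock enqueue; the main thread models umount in __ldlm_namespace_free(), polling for the namespace to drain and timing out repeatedly, like the rc=-110 (-ETIMEDOUT) message above.

/* Hypothetical model of the hang, not Lustre code (build with -pthread). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t remote_lock = PTHREAD_MUTEX_INITIALIZER; /* lock_A */
static pthread_mutex_t local_lock  = PTHREAD_MUTEX_INITIALIZER; /* lock_B */
static int resources_in_use;            /* namespace resources held      */

static void *handler_thread(void *arg)
{
	pthread_mutex_lock(&remote_lock);             /* got the remote lock */
	__sync_fetch_and_add(&resources_in_use, 1);

	/* lock_B is already held elsewhere (MDT0 in the ticket), so this
	 * enqueue never completes and remote_lock is never released. */
	pthread_mutex_lock(&local_lock);
	return arg;                                   /* never reached */
}

int main(void)
{
	pthread_t handler;
	int retry;

	pthread_mutex_lock(&local_lock);              /* simulate the other holder */
	pthread_create(&handler, NULL, handler_thread, NULL);
	sleep(1);                                     /* let the handler block */

	/* "umount": wait for the namespace to drain, retrying on timeout. */
	for (retry = 0; retry < 3; retry++) {
		int in_use = __sync_fetch_and_add(&resources_in_use, 0);

		if (in_use == 0)
			break;
		printf("Forced cleanup waiting for namespace with %d resources in use (rc=-110)\n",
		       in_use);
		sleep(1);                             /* models schedule_timeout() */
	}
	printf("umount still stuck: the handler holds the remote lock while waiting on the local one\n");
	return 0;
}

In this model the drain loop can never succeed, because the only thread that could drop the remote lock is itself blocked on another lock; that is the shape of the hang shown by the two traces above.
|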
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 20/Oct/16 ] |
|
Cliff, can you file the above comment as a separate ticket? |
| Comment by Joseph Gmitter (Inactive) [ 20/Oct/16 ] |
|
Hi Lai, can you please look at the first issue in this ticket? Thanks. |
| Comment by Andreas Dilger [ 20/Oct/16 ] |
|
Cliff, Di, the other possibility is that this is a circular locking deadlock, which would need stack traces on both MDS nodes to see if there is another thread also stuck waiting for a lock. |
| Comment by Cliff White (Inactive) [ 20/Oct/16 ] |
|
For the initial problem, no: MDT0002 is mounted on lola-10 and was not involved in the failover at all. It was neither stopped nor unmounted. |
| Comment by Bob Glossman (Inactive) [ 20/Oct/16 ] |
|
console logs covering the time period |
| Comment by Di Wang [ 20/Oct/16 ] |
|
"The other possibility is that this is a circular locking deadlock, which would need stack traces on both MDS nodes to see if there is another thread also stuck waiting for a lock."

Yes, it looks like a circular locking deadlock here; see the mdt_reint_rename trace (on MDT2) posted above. After the rename process on MDT2 got the remote lock (lock_A), it was blocked enqueueing the local lock (lock_B), which is probably held by MDT0. Then umount happened in the meantime, so MDT0 could not return the lock to MDT2 successfully because of the umount on MDT2 (can it? that is my guess, but it needs some investigation). Then we saw umount hanging in namespace cleanup because lock_A is still held by the rename process. So it looks like we need to re-order the umount process to make sure all remote locks have been released before the namespace cleanup. Need to think about it a bit.
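As a rough illustration of the re-ordering Di suggests (release remote locks before the namespace drain), here is a minimal pthreads sketch. It is an assumption about the shape of a fix, not the actual Lustre patch, and all names and the shutdown flag are hypothetical: the teardown path first aborts the pending local enqueue so the handler can drop its remote lock, and only then waits for the namespace to drain.

/* Hypothetical sketch of the proposed ordering, not Lustre code (build with -pthread). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int shutting_down;          /* set by the teardown path       */
static int resources_in_use;       /* remote locks still referenced  */

static void *handler_thread(void *arg)
{
	pthread_mutex_lock(&mtx);
	resources_in_use++;                   /* "got the remote lock" */

	/* Interruptible local lock enqueue: give up once teardown starts,
	 * instead of waiting forever as in the original hang. */
	while (!shutting_down)
		pthread_cond_wait(&cv, &mtx);

	resources_in_use--;                   /* drop the remote lock */
	pthread_mutex_unlock(&mtx);
	return arg;
}

int main(void)
{
	pthread_t handler;

	pthread_create(&handler, NULL, handler_thread, NULL);
	sleep(1);                             /* let the handler block */

	/* Step 1: abort in-flight enqueues so remote locks get released. */
	pthread_mutex_lock(&mtx);
	shutting_down = 1;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&mtx);
	pthread_join(handler, NULL);

	/* Step 2: only now wait for the namespace to drain; it succeeds. */
	pthread_mutex_lock(&mtx);
	printf("namespace cleanup: %d resources in use\n", resources_in_use);
	pthread_mutex_unlock(&mtx);
	return 0;
}

With the two steps in this order, the drain in step 2 sees zero resources; doing step 2 first reproduces the hang in the description.
|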
| Comment by Andreas Dilger [ 19/Apr/17 ] |
|
Di, any further ideas on this? |