[LU-8746] lustre umount hangs in distribute_txn_fini Created: 21/Oct/16 Updated: 09/May/17 Resolved: 09/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Giuseppe Di Natale (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
on our DNE testbed, umount hangs with the following call stack:

2016-10-20 15:07:39 [14285.581098] INFO: task umount:182090 blocked for more than 120 seconds.
2016-10-20 15:07:39 [14285.589651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2016-10-20 15:07:39 [14285.599447] umount D ffff887918187c20 0 182090 182089 0x00000080
2016-10-20 15:07:39 [14285.608492] ffff887918187a58 0000000000000086 ffff887918187fd8 ffff887f6e073ec0
2016-10-20 15:07:39 [14285.617837] ffff887918187fd8 ffff887918187fd8 ffff887f6e073ec0 ffff887efb6fb0b0
2016-10-20 15:07:39 [14285.627161] ffff887ec3f50240 ffff887efb6fb100 ffff887918187c20 ffff887918187c20
2016-10-20 15:07:39 [14285.636476] Call Trace:
2016-10-20 15:07:39 [14285.640213] [<ffffffff8169f039>] schedule+0x29/0x70
2016-10-20 15:07:39 [14285.646853] [<ffffffffa0c679bd>] distribute_txn_fini+0xcd/0x130 [ptlrpc]
2016-10-20 15:07:39 [14285.655433] [<ffffffff810b4520>] ? wake_up_atomic_t+0x30/0x30
2016-10-20 15:07:39 [14285.662928] [<ffffffffa0f953ee>] lod_process_config+0x93e/0x15c0 [lod]
2016-10-20 15:07:39 [14285.671314] [<ffffffffa0e170a8>] ? lfsck_stop+0x1b8/0x4f0 [lfsck]
2016-10-20 15:07:39 [14285.679182] [<ffffffff811e5a63>] ? __kmalloc+0x233/0x280
2016-10-20 15:07:39 [14285.686155] [<ffffffffa0ff6ee2>] mdd_process_config+0x82/0x5c0 [mdd]
2016-10-20 15:07:39 [14285.694330] [<ffffffffa0eb91d6>] mdt_device_fini+0x1c6/0xfc0 [mdt]
2016-10-20 15:07:39 [14285.702290] [<ffffffffa099821c>] class_cleanup+0x8dc/0xd70 [obdclass]
2016-10-20 15:07:39 [14285.710504] [<ffffffffa099abfc>] class_process_config+0x1e2c/0x2f70 [obdclass]
2016-10-20 15:07:39 [14285.719555] [<ffffffff811e5a63>] ? __kmalloc+0x233/0x280
2016-10-20 15:07:39 [14285.726494] [<ffffffffa099411b>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
2016-10-20 15:07:39 [14285.734866] [<ffffffffa099be2f>] class_manual_cleanup+0xef/0x810 [obdclass]
2016-10-20 15:07:39 [14285.743601] [<ffffffffa09cde8e>] server_put_super+0x8de/0xcd0 [obdclass]
2016-10-20 15:07:39 [14285.752002] [<ffffffff81209572>] generic_shutdown_super+0x72/0xf0
2016-10-20 15:07:39 [14285.759721] [<ffffffff81209942>] kill_anon_super+0x12/0x20
2016-10-20 15:07:39 [14285.766748] [<ffffffffa099f592>] lustre_kill_super+0x32/0x50 [obdclass]
2016-10-20 15:07:39 [14285.775033] [<ffffffff81209cf9>] deactivate_locked_super+0x49/0x60
2016-10-20 15:07:39 [14285.782803] [<ffffffff8120a2f6>] deactivate_super+0x46/0x60
2016-10-20 15:07:39 [14285.789884] [<ffffffff812282c5>] mntput_no_expire+0xc5/0x120
2016-10-20 15:07:39 [14285.797046] [<ffffffff81229440>] SyS_umount+0xa0/0x3b0
2016-10-20 15:07:39 [14285.803615] [<ffffffff816aa4c9>] system_call_fastpath+0x16/0x1b
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 21/Oct/16 ] |
|
Hi Lai, can you please investigate this issue? Thanks. |
| Comment by Andreas Dilger [ 21/Oct/16 ] |
|
Are the other MDS nodes being stopped at this time as well? It looks like this MDS is waiting to finish some operation with another MDS while it is being shut down, and if that other MDS is already stopped it won't get the answer it is waiting for any time soon. It might make sense to add checks in the places where the MDS is waiting on RPCs to other MDSs, so that it notices it is in the process of unmounting and stops waiting. |
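Something along these lines, as a sketch only (tgt_is_umounting() and reply_received are hypothetical names just to illustrate the idea, not existing Lustre symbols):

    /* hypothetical helper: has this target started shutting down? */
    static inline bool tgt_is_umounting(struct lu_target *lut)
    {
            return lut->lut_tdtd_commit_thread.t_flags & SVC_STOPPING;
    }

    /* instead of waiting unconditionally for the remote MDS reply,
     * also wake up and bail out once the local target is stopping */
    wait_event(waitq, reply_received || tgt_is_umounting(lut));
    if (tgt_is_umounting(lut))
            RETURN(-ESHUTDOWN);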
| Comment by Di Wang [ 21/Oct/16 ] |
|
Hmm, distribute_txn_fini() is waiting here for the tdtd thread to stop; that thread is used to track the commit status of distributed transactions, cancel committed logs, etc. So that thread must be stuck somewhere. Could you please post the stack trace for that thread as well (named tdtd-xx)? Thanks |
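For reference, the shutdown handshake in distribute_txn_fini() looks roughly like this (lustre/target/update_trans.c): it flags the thread as stopping, wakes it, and then blocks until the thread confirms it has stopped:

    spin_lock(&tdtd->tdtd_batchid_lock);
    lut->lut_tdtd_commit_thread.t_flags = SVC_STOPPING;
    spin_unlock(&tdtd->tdtd_batchid_lock);
    wake_up(&tdtd->tdtd_commit_thread_waitq);

    /* umount blocks here until the tdtd thread sets SVC_STOPPED;
     * if the thread never exits, this wait hangs forever */
    wait_event(lut->lut_tdtd_commit_thread.t_ctl_waitq,
               lut->lut_tdtd_commit_thread.t_flags & SVC_STOPPED);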
| Comment by Giuseppe Di Natale (Inactive) [ 21/Oct/16 ] |
|
I'm going to have to try to reproduce this issue, at which point I can try to get the call stacks you need. Also, Andreas, looking through the logs, it does in fact appear that two MDSs were trying to unmount their targets. |
| Comment by Andreas Dilger [ 26/Apr/17 ] |
|
I hit this same hang during cleanup with v2.9.55_78. It seems like the threads aren't doing anything, just waiting for the refcount to drop to zero:

kernel: umount D ffff8800251afc30 0 18172 18171 0x00000080
kernel: Call Trace:
kernel: schedule+0x29/0x70
kernel: distribute_txn_fini+0xcd/0x130 [ptlrpc]
kernel: lod_process_config+0x98f/0x15f0 [lod]
kernel: mdd_process_config+0x88/0x600 [mdd]
kernel: mdt_device_fini+0x1c6/0x10b0 [mdt]
kernel: class_cleanup+0x7f4/0xd80 [obdclass]
kernel: class_process_config+0x1f84/0x2c30 [obdclass]
kernel: class_manual_cleanup+0xef/0x810 [obdclass]
kernel: server_put_super+0x8de/0xcd0 [obdclass]
kernel: generic_shutdown_super+0x56/0xe0
kernel: kill_anon_super+0x12/0x20
kernel: lustre_kill_super+0x32/0x50 [obdclass]
kernel: deactivate_locked_super+0x49/0x60
kernel: deactivate_super+0x46/0x60
kernel: mntput_no_expire+0xc5/0x120
kernel: SyS_umount+0x9f/0x3c0

(gdb) list *distribute_txn_fini+0xcd
0xcb4dd is in distribute_txn_fini (lustre/target/update_trans.c:1744).
1739
1740            spin_lock(&tdtd->tdtd_batchid_lock);
1741            lut->lut_tdtd_commit_thread.t_flags = SVC_STOPPING;
1742            spin_unlock(&tdtd->tdtd_batchid_lock);
1743            wake_up(&tdtd->tdtd_commit_thread_waitq);
1744            wait_event(lut->lut_tdtd_commit_thread.t_ctl_waitq,
1745                       lut->lut_tdtd_commit_thread.t_flags & SVC_STOPPED);
1746
1747            dtrq_list_destroy(tdtd);

(Minor note: it would be good to give these threads a better name than "tdtd", like "tgt_txn" or "dist_txn" or similar.)

kernel: tdtd-0 S ffff8800376dd080 0 17234 2 0x00000080
kernel: Call Trace:
kernel: schedule+0x29/0x70
kernel: distribute_txn_commit_thread+0x110c/0x1410 [ptlrpc]
kernel: kthread+0xcf/0xe0

(gdb) list *distribute_txn_commit_thread+0x110c
0xcd621 is in distribute_txn_commit_thread (lustre/target/update_trans.c:1644).
1641    };
1642
1643            l_wait_event(tdtd->tdtd_commit_thread_waitq,
1644                         atomic_read(&tdtd->tdtd_refcount) == 0, &lwi);
1645
1646            spin_lock(&tdtd->tdtd_batchid_lock);
| Comment by Di Wang [ 26/Apr/17 ] |
1643            l_wait_event(tdtd->tdtd_commit_thread_waitq,
1644                         atomic_read(&tdtd->tdtd_refcount) == 0, &lwi);

tdtd_refcount != 0 means there are still in-flight transactions waiting for the batchid update. The transaction commit callback should decrease the refcount; somehow the transaction appears to have been cancelled, but the refcount was not decreased. Hmm, it looks like tdtd_refcount is not restored in some of the error-handling cases in distribute_txn_commit_batchid_update(). I will cook a patch.
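The suspected pattern looks roughly like this (an illustrative sketch of the leak and the fix, not the actual patch; the error path and the tdtd_lut/lut_bottom usage are simplified):

    /* the refcount is taken up front for the batchid update... */
    atomic_inc(&tdtd->tdtd_refcount);

    th = dt_trans_create(env, tdtd->tdtd_lut->lut_bottom);
    if (IS_ERR(th)) {
            /* ...but if an early error path returns without dropping
             * it, the commit thread waits on tdtd_refcount == 0
             * forever.  The fix: restore the refcount on every error
             * path and wake the thread so it can re-check and exit. */
            atomic_dec(&tdtd->tdtd_refcount);
            wake_up(&tdtd->tdtd_commit_thread_waitq);
            RETURN(PTR_ERR(th));
    }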
| Comment by Gerrit Updater [ 28/Apr/17 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/26888 |
| Comment by Gerrit Updater [ 09/May/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26888/ |
| Comment by Peter Jones [ 09/May/17 ] |
|
Landed for 2.10 |