[LU-8746] lustre umount hangs in distribute_txn_fini Created: 21/Oct/16  Updated: 09/May/17  Resolved: 09/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Major
Reporter: Giuseppe Di Natale (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: llnl

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

On our DNE testbed, umount hangs with the following call stack:

2016-10-20 15:07:39 [14285.581098] INFO: task umount:182090 blocked for more than 120 seconds.
2016-10-20 15:07:39 [14285.589651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2016-10-20 15:07:39 [14285.599447] umount          D ffff887918187c20     0 182090 182089 0x00000080
2016-10-20 15:07:39 [14285.608492]  ffff887918187a58 0000000000000086 ffff887918187fd8 ffff887f6e073ec0
2016-10-20 15:07:39 [14285.617837]  ffff887918187fd8 ffff887918187fd8 ffff887f6e073ec0 ffff887efb6fb0b0
2016-10-20 15:07:39 [14285.627161]  ffff887ec3f50240 ffff887efb6fb100 ffff887918187c20 ffff887918187c20
2016-10-20 15:07:39 [14285.636476] Call Trace:
2016-10-20 15:07:39 [14285.640213]  [<ffffffff8169f039>] schedule+0x29/0x70
2016-10-20 15:07:39 [14285.646853]  [<ffffffffa0c679bd>] distribute_txn_fini+0xcd/0x130 [ptlrpc]
2016-10-20 15:07:39 [14285.655433]  [<ffffffff810b4520>] ? wake_up_atomic_t+0x30/0x30
2016-10-20 15:07:39 [14285.662928]  [<ffffffffa0f953ee>] lod_process_config+0x93e/0x15c0 [lod]
2016-10-20 15:07:39 [14285.671314]  [<ffffffffa0e170a8>] ? lfsck_stop+0x1b8/0x4f0 [lfsck]
2016-10-20 15:07:39 [14285.679182]  [<ffffffff811e5a63>] ? __kmalloc+0x233/0x280
2016-10-20 15:07:39 [14285.686155]  [<ffffffffa0ff6ee2>] mdd_process_config+0x82/0x5c0 [mdd]
2016-10-20 15:07:39 [14285.694330]  [<ffffffffa0eb91d6>] mdt_device_fini+0x1c6/0xfc0 [mdt]
2016-10-20 15:07:39 [14285.702290]  [<ffffffffa099821c>] class_cleanup+0x8dc/0xd70 [obdclass]
2016-10-20 15:07:39 [14285.710504]  [<ffffffffa099abfc>] class_process_config+0x1e2c/0x2f70 [obdclass]
2016-10-20 15:07:39 [14285.719555]  [<ffffffff811e5a63>] ? __kmalloc+0x233/0x280
2016-10-20 15:07:39 [14285.726494]  [<ffffffffa099411b>] ? lustre_cfg_new+0x8b/0x400 [obdclass]
2016-10-20 15:07:39 [14285.734866]  [<ffffffffa099be2f>] class_manual_cleanup+0xef/0x810 [obdclass]
2016-10-20 15:07:39 [14285.743601]  [<ffffffffa09cde8e>] server_put_super+0x8de/0xcd0 [obdclass]
2016-10-20 15:07:39 [14285.752002]  [<ffffffff81209572>] generic_shutdown_super+0x72/0xf0
2016-10-20 15:07:39 [14285.759721]  [<ffffffff81209942>] kill_anon_super+0x12/0x20
2016-10-20 15:07:39 [14285.766748]  [<ffffffffa099f592>] lustre_kill_super+0x32/0x50 [obdclass]
2016-10-20 15:07:39 [14285.775033]  [<ffffffff81209cf9>] deactivate_locked_super+0x49/0x60
2016-10-20 15:07:39 [14285.782803]  [<ffffffff8120a2f6>] deactivate_super+0x46/0x60
2016-10-20 15:07:39 [14285.789884]  [<ffffffff812282c5>] mntput_no_expire+0xc5/0x120
2016-10-20 15:07:39 [14285.797046]  [<ffffffff81229440>] SyS_umount+0xa0/0x3b0
2016-10-20 15:07:39 [14285.803615]  [<ffffffff816aa4c9>] system_call_fastpath+0x16/0x1b


 Comments   
Comment by Joseph Gmitter (Inactive) [ 21/Oct/16 ]

Hi Lai,

Can you please investigate this issue?

Thanks.
Joe

Comment by Andreas Dilger [ 21/Oct/16 ]

Are the other MDS nodes being stopped at this time also? It looks like this MDS is waiting to finish some operation with another MDS while it is being shut down, and if that other MDS has already stopped, it won't get the answer it is waiting for any time soon.

It might make sense to add checks in the places where the MDS is waiting on RPCs from other MDSes, so that it notices when it is in the process of unmounting and stops waiting. A rough sketch of such a check is below.
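For illustration only, a minimal sketch of that idea in C, assuming a hypothetical wait site and a hypothetical reply_arrived() helper (obd_stopping is the existing per-device shutdown flag; nothing here is the actual Lustre call site):

/* Hypothetical wait site: instead of blocking only until the remote
 * MDS replies, also wake up if this device is being unmounted. */
wait_event(tdtd->tdtd_commit_thread_waitq,
           reply_arrived(req) || lut->lut_obd->obd_stopping);

if (lut->lut_obd->obd_stopping)
        /* stop waiting on the remote MDS and clean up locally */
        GOTO(out, rc = -ESHUTDOWN);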

Comment by Di Wang [ 21/Oct/16 ]

Hmm, distribute_txn_fini() is waiting here for the tdtd thread to stop; that thread is used to track the commit status of distributed transactions, cancel committed logs, etc. So that thread must be stuck somewhere. Could you please post the stack trace for that thread as well (it is named tdtd-xx)? Thanks. The shutdown handshake in question is sketched below.
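For context, the stop handshake that distribute_txn_fini() performs is roughly the following (condensed from lustre/target/update_trans.c, see the gdb listing in a later comment; if the tdtd thread never reaches SVC_STOPPED, this wait never returns and umount hangs):

/* distribute_txn_fini(): ask the tdtd commit thread to stop, then
 * block until the thread acknowledges by setting SVC_STOPPED. */
spin_lock(&tdtd->tdtd_batchid_lock);
lut->lut_tdtd_commit_thread.t_flags = SVC_STOPPING;
spin_unlock(&tdtd->tdtd_batchid_lock);
wake_up(&tdtd->tdtd_commit_thread_waitq);
wait_event(lut->lut_tdtd_commit_thread.t_ctl_waitq,
           lut->lut_tdtd_commit_thread.t_flags & SVC_STOPPED);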

Comment by Giuseppe Di Natale (Inactive) [ 21/Oct/16 ]

I'm going to have to try to reproduce this issue, at which point I can try to get the call stacks you need.

Also, Andreas, looking through the logs, it does in fact appear that two MDSs were trying to unmount their targets.

Comment by Andreas Dilger [ 26/Apr/17 ]

I hit this same hang during cleanup with v2.9.55_78. It seems like the threads aren't doing anything, just waiting for the refcount to drop to zero:

kernel: umount          D ffff8800251afc30     0 18172  18171 0x00000080
kernel: Call Trace:
kernel: schedule+0x29/0x70
kernel: distribute_txn_fini+0xcd/0x130 [ptlrpc]
kernel: lod_process_config+0x98f/0x15f0 [lod]
kernel: mdd_process_config+0x88/0x600 [mdd]
kernel: mdt_device_fini+0x1c6/0x10b0 [mdt]
kernel: class_cleanup+0x7f4/0xd80 [obdclass]
kernel: class_process_config+0x1f84/0x2c30 [obdclass]
kernel: class_manual_cleanup+0xef/0x810 [obdclass]
kernel: server_put_super+0x8de/0xcd0 [obdclass]
kernel: generic_shutdown_super+0x56/0xe0
kernel: kill_anon_super+0x12/0x20
kernel: lustre_kill_super+0x32/0x50 [obdclass]
kernel: deactivate_locked_super+0x49/0x60
kernel: deactivate_super+0x46/0x60
kernel: mntput_no_expire+0xc5/0x120
kernel: SyS_umount+0x9f/0x3c0

(gdb) list *distribute_txn_fini+0xcd
0xcb4dd is in distribute_txn_fini (lustre/target/update_trans.c:1744).
1739
1740            spin_lock(&tdtd->tdtd_batchid_lock);
1741            lut->lut_tdtd_commit_thread.t_flags = SVC_STOPPING;
1742            spin_unlock(&tdtd->tdtd_batchid_lock);
1743            wake_up(&tdtd->tdtd_commit_thread_waitq);
1744            wait_event(lut->lut_tdtd_commit_thread.t_ctl_waitq,
1745                       lut->lut_tdtd_commit_thread.t_flags & SVC_STOPPED);
1746
1747            dtrq_list_destroy(tdtd);

(Minor note: it would be good to give these threads a better name than "tdtd", such as "tgt_txn" or "dist_txn".)

kernel: tdtd-0          S ffff8800376dd080     0 17234      2 0x00000080
kernel: Call Trace:
kernel: schedule+0x29/0x70
kernel: distribute_txn_commit_thread+0x110c/0x1410 [ptlrpc]
kernel: kthread+0xcf/0xe0

(gdb) list *distribute_txn_commit_thread+0x110c
0xcd621 is in distribute_txn_commit_thread (lustre/target/update_trans.c:1644).
1641         };
1642 
1643         l_wait_event(tdtd->tdtd_commit_thread_waitq,
1644                      atomic_read(&tdtd->tdtd_refcount) == 0, &lwi);
1645 
1646         spin_lock(&tdtd->tdtd_batchid_lock);

Comment by Di Wang [ 26/Apr/17 ]

1643         l_wait_event(tdtd->tdtd_commit_thread_waitq,
1644                      atomic_read(&tdtd->tdtd_refcount) == 0, &lwi);

tdtd_refcount != 0 means there are in-flight transactions still waiting for the batchid update. The transaction commit callback should decrease the refcount. Somehow the transaction seems to have been cancelled, but the refcount was not decreased. Hmm, it looks like tdtd_refcount is not restored on some error-handling paths in distribute_txn_commit_batchid_update(). I will cook a patch; a sketch of the pattern is below.
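As a rough illustration of that pattern (the prepare_and_send_batchid_update() helper and the exact structure are hypothetical; the real change is in the Gerrit patch linked below):

/* distribute_txn_commit_batchid_update() takes a reference for the
 * in-flight batchid update; if sending it fails, the commit callback
 * never runs, so the reference must be dropped on the error path or
 * the tdtd commit thread waits forever for tdtd_refcount to reach 0. */
atomic_inc(&tdtd->tdtd_refcount);

rc = prepare_and_send_batchid_update(tdtd, batchid);    /* hypothetical */
if (rc != 0) {
        atomic_dec(&tdtd->tdtd_refcount);               /* restore on failure */
        wake_up(&tdtd->tdtd_commit_thread_waitq);
        return rc;
}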

Comment by Gerrit Updater [ 28/Apr/17 ]

wangdi (di.wang@intel.com) uploaded a new patch: https://review.whamcloud.com/26888
Subject: LU-8746 update: restore tdtd_refcount during failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5a74baab7cf18a42090f9b348537f0cd5ae173ad

Comment by Gerrit Updater [ 09/May/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26888/
Subject: LU-8746 update: restore tdtd_refcount during failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 92bbd06e4ddd2505fceaada9cda4bb0974733204

Comment by Peter Jones [ 09/May/17 ]

Landed for 2.10
