[LU-5039] MDS mount hangs on orphan recovery Created: 09/May/14 Updated: 12/Aug/14 Resolved: 23/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Christopher Morrone | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4 | ||
| Severity: | 3 |
| Rank (Obsolete): | 13931 |
| Description |
|
Running Lustre 2.4.0-28chaos (see github.com/chaos/lustre), we find that sometimes after a reboot the MDS can get stuck during mount cleaning up the orphan files in the PENDING directory. Sometimes we have 100,000+ files to process, and this can take literally hours. The symptoms are pretty similar to those of a previously reported issue. Here is a backtrace of the offending thread:

```
2014-03-06 22:34:12 Process tgt_recov (pid: 15478, threadinfo ffff8807bc436000, task ffff88081a6e2080)
2014-03-06 22:34:12 Stack:
2014-03-06 22:34:12  ffff88072e3df000 0000000000000000 0000000000003f14 ffff88072e3df060
2014-03-06 22:34:12 <d> ffff8807bc437a40 ffffffffa0341396 ffff8807bc437a20 ffff88072e3df038
2014-03-06 22:34:12 <d> 0000000000000014 ffff8807f9fbf530 0000000000000000 0000000000003f14
2014-03-06 22:34:12 Call Trace:
2014-03-06 22:34:12  [<ffffffffa0341396>] __dbuf_hold_impl+0x66/0x480 [zfs]
2014-03-06 22:34:12  [<ffffffffa034182f>] dbuf_hold_impl+0x7f/0xb0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03428e0>] dbuf_hold+0x20/0x30 [zfs]
2014-03-06 22:34:12  [<ffffffffa03486e7>] dmu_buf_hold+0x97/0x1d0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03369a0>] ? remove_reference+0xa0/0xc0 [zfs]
2014-03-06 22:34:12  [<ffffffffa039e76b>] zap_idx_to_blk+0xab/0x140 [zfs]
2014-03-06 22:34:12  [<ffffffffa039ff61>] zap_deref_leaf+0x51/0x80 [zfs]
2014-03-06 22:34:12  [<ffffffffa039f956>] ? zap_put_leaf+0x86/0xe0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03a03dc>] fzap_cursor_retrieve+0xfc/0x2a0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03a593b>] zap_cursor_retrieve+0x17b/0x2f0 [zfs]
2014-03-06 22:34:12  [<ffffffffa0d1739c>] ? udmu_zap_cursor_init_serialized+0x2c/0x30 [osd_zfs]
2014-03-06 22:34:12  [<ffffffffa0d29058>] osd_index_retrieve_skip_dots+0x28/0x60 [osd_zfs]
2014-03-06 22:34:12  [<ffffffffa0d29638>] osd_dir_it_next+0x98/0x120 [osd_zfs]
2014-03-06 22:34:12  [<ffffffffa0f08161>] lod_it_next+0x21/0x90 [lod]
2014-03-06 22:34:12  [<ffffffffa0dd1989>] __mdd_orphan_cleanup+0xa9/0xca0 [mdd]
2014-03-06 22:34:12  [<ffffffffa0de134d>] mdd_recovery_complete+0xed/0x170 [mdd]
2014-03-06 22:34:12  [<ffffffffa0e34cb5>] mdt_postrecov+0x35/0xd0 [mdt]
2014-03-06 22:34:12  [<ffffffffa0e36178>] mdt_obd_postrecov+0x78/0x90 [mdt]
2014-03-06 22:34:12  [<ffffffffa08745c0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa086f8ae>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa0885004>] target_recovery_thread+0xc64/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa08843a0>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffff8100c10a>] child_rip+0xa/0x20
2014-03-06 22:34:12  [<ffffffffa08843a0>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa08843a0>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffff8100c100>] ? child_rip+0x0/0x20
```

The mount process is blocked while this is going on. The cleanup is completely sequential and, on ZFS, very slow: on the order of 10 files per second. The orphan cleanup task really needs to be backgrounded (and perhaps parallelized) rather than blocking the MDT mount process. |
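To illustrate the backgrounding idea, here is a minimal sketch assuming Linux kernel thread APIs; `orphan_scan()` stands in for the sequential `__mdd_orphan_cleanup()` walk visible in the backtrace, and the context structure and function names are hypothetical rather than actual Lustre symbols:

```c
#include <linux/err.h>
#include <linux/kthread.h>

/* Hypothetical stand-in for the device whose PENDING directory is scanned. */
struct orphan_ctx {
	void *oc_dev;
};

/* Placeholder for the existing sequential scan (__mdd_orphan_cleanup in
 * the backtrace above), which walks PENDING and unlinks each orphan. */
static void orphan_scan(struct orphan_ctx *ctx)
{
}

static int orphan_cleanup_thread(void *data)
{
	/* The slow scan now runs here, outside the mount/recovery path,
	 * so mount returns immediately instead of blocking for hours. */
	orphan_scan(data);
	return 0;
}

/* Called from the post-recovery hook (mdd_recovery_complete in the
 * backtrace) instead of scanning inline. */
static int start_orphan_cleanup(struct orphan_ctx *ctx)
{
	struct task_struct *task;

	task = kthread_run(orphan_cleanup_thread, ctx, "orph_cleanup");
	return IS_ERR(task) ? PTR_ERR(task) : 0;
}
```

The post-recovery hook would call `start_orphan_cleanup()` and return, leaving the PENDING scan to the new thread. |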
| Comments |
| Comment by Peter Jones [ 10/May/14 ] |
|
Alex, I think that it is best that you comment on this one. Peter |
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
The idea is correct and fine, though I'm very confused by 10/second; we should be able to go much faster, given no LDLM contention, etc. |
| Comment by Peter Jones [ 23/May/14 ] |
|
Niu, could you please create a patch based on Oleg's suggestion (to follow)? Thanks, Peter |
| Comment by Oleg Drokin [ 23/May/14 ] |
|
While these slow deletions are extreme, at least we can speed up the startup by doing deletions from a separate thread once recovery is complete. We need to be careful about an MDS failure while doing this split handling, followed by another recovery restart; we would probably need to move all entries to an old PENDING and redo the process. There should not be many entries due to recovery, I hope. |
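As a user-space analogy of that restart handling (the real code would operate on internal MDT objects; the paths and helper below are purely illustrative, not the landed patch), moving entries aside one by one keeps the step idempotent, so a crash during cleanup can simply be redone on the next mount:

```c
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Move every entry from PENDING into PENDING.old before the background
 * scan starts.  Entry-by-entry renames are idempotent: if the server
 * crashes mid-cleanup, the next mount moves whatever is left and
 * resumes, and only the few orphans created during recovery are new.
 */
static int move_pending_aside(const char *pending, const char *old)
{
	struct dirent *de;
	DIR *dir;
	int dst, moved, rc = 0;

	if (mkdir(old, 0700) != 0 && errno != EEXIST)
		return -errno;
	dir = opendir(pending);
	if (dir == NULL)
		return -errno;
	dst = open(old, O_DIRECTORY | O_RDONLY);
	if (dst < 0) {
		rc = -errno;
		closedir(dir);
		return rc;
	}
	do {
		/* Rescan until a pass moves nothing: readdir() may skip
		 * entries when the directory changes underneath it. */
		moved = 0;
		rewinddir(dir);
		while ((de = readdir(dir)) != NULL) {
			if (de->d_name[0] == '.')
				continue;
			if (renameat(dirfd(dir), de->d_name,
				     dst, de->d_name) == 0)
				moved = 1;
			else
				rc = -errno;
		}
	} while (moved);
	close(dst);
	closedir(dir);
	return rc;
}
```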
| Comment by Niu Yawei (Inactive) [ 26/May/14 ] |
What harm would be done if we just started a thread to delete orphans from the original PENDING?
| Comment by Oleg Drokin [ 02/Jun/14 ] |
|
The possible problem is that the list of items in PENDING is not fixed; new files might be added. |
| Comment by Alex Zhuravlev [ 03/Jun/14 ] |
|
There is an open count in the mdd object which tells whether the file is in use. We will probably have to add locking to protect the last close against the cleanup procedure. |
| Comment by Niu Yawei (Inactive) [ 03/Jun/14 ] |
Indeed. I've talked with Oleg about this, and it looks like we already have a lock serializing the last close and the orphan cleanup. I'll compose a patch soon. Thank you all.
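For illustration, a minimal sketch of that serialization with hypothetical names (the real mdd structures and the landed patch differ): both the last-close path and the cleanup thread take the same per-object lock, so an orphan still held open is skipped rather than destroyed under a racing close.

```c
#include <linux/mutex.h>
#include <linux/types.h>

/* Illustrative object; the real struct mdd_object differs. */
struct orphan_obj {
	struct mutex	oo_lock;	/* serializes last close vs. cleanup */
	int		oo_open_count;	/* opens still referencing the file */
	bool		oo_destroyed;	/* orphan already unlinked */
};

/* Cleanup thread: destroy the orphan only if nobody holds it open. */
static void cleanup_one_orphan(struct orphan_obj *obj)
{
	mutex_lock(&obj->oo_lock);
	if (obj->oo_open_count == 0 && !obj->oo_destroyed) {
		obj->oo_destroyed = true;
		/* ... unlink the object and its PENDING entry ... */
	}
	mutex_unlock(&obj->oo_lock);
}

/* Last-close path: if the cleanup thread got there first, the close
 * must not try to unlink the orphan again. */
static void close_last_ref(struct orphan_obj *obj)
{
	mutex_lock(&obj->oo_lock);
	if (--obj->oo_open_count == 0 && !obj->oo_destroyed) {
		obj->oo_destroyed = true;
		/* ... unlink the object and its PENDING entry ... */
	}
	mutex_unlock(&obj->oo_lock);
}
```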
| Comment by Niu Yawei (Inactive) [ 04/Jun/14 ] |
|
Clean up orphans asynchronously: http://review.whamcloud.com/10584
| Comment by Niu Yawei (Inactive) [ 20/Jun/14 ] |
|
The patch landed on master; should we backport it to b2_4 & b2_5? |
| Comment by Peter Jones [ 20/Jun/14 ] |
|
Yes, I think that we should. |
| Comment by Niu Yawei (Inactive) [ 23/Jun/14 ] |
|
b2_4: http://review.whamcloud.com/10779 |
| Comment by Peter Jones [ 23/Jun/14 ] |
|
Landed for 2.6 |