[LU-5039] MDS mount hangs on orphan recovery Created: 09/May/14  Updated: 12/Aug/14  Resolved: 23/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: mn4

Severity: 3
Rank (Obsolete): 13931

Description

Running Lustre 2.4.0-28chaos (see github.com/chaos/lustre), we find that sometimes after a reboot the MDS can get stuck during mount while cleaning up the orphan files in the PENDING directory. Sometimes we have 100,000+ files to process, and this can take literally hours. The symptoms are pretty similar to LU-5038, but I believe that the cause is different.

Here is a backtrace of the offending thread:

2014-03-06 22:34:12 Process tgt_recov (pid: 15478, threadinfo ffff8807bc436000, task ffff88081a6e2080)
2014-03-06 22:34:12 Stack:
2014-03-06 22:34:12  ffff88072e3df000 0000000000000000 0000000000003f14 ffff88072e3df060
2014-03-06 22:34:12 <d> ffff8807bc437a40 ffffffffa0341396 ffff8807bc437a20 ffff88072e3df038
2014-03-06 22:34:12 <d> 0000000000000014 ffff8807f9fbf530 0000000000000000 0000000000003f14
2014-03-06 22:34:12 Call Trace:
2014-03-06 22:34:12  [<ffffffffa0341396>] __dbuf_hold_impl+0x66/0x480 [zfs]
2014-03-06 22:34:12  [<ffffffffa034182f>] dbuf_hold_impl+0x7f/0xb0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03428e0>] dbuf_hold+0x20/0x30 [zfs]
2014-03-06 22:34:12  [<ffffffffa03486e7>] dmu_buf_hold+0x97/0x1d0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03369a0>] ? remove_reference+0xa0/0xc0 [zfs]
2014-03-06 22:34:12  [<ffffffffa039e76b>] zap_idx_to_blk+0xab/0x140 [zfs]
2014-03-06 22:34:12  [<ffffffffa039ff61>] zap_deref_leaf+0x51/0x80 [zfs]
2014-03-06 22:34:12  [<ffffffffa039f956>] ? zap_put_leaf+0x86/0xe0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03a03dc>] fzap_cursor_retrieve+0xfc/0x2a0 [zfs]
2014-03-06 22:34:12  [<ffffffffa03a593b>] zap_cursor_retrieve+0x17b/0x2f0 [zfs]
2014-03-06 22:34:12  [<ffffffffa0d1739c>] ? udmu_zap_cursor_init_serialized+0x2c/0x30 [osd_zfs]
2014-03-06 22:34:12  [<ffffffffa0d29058>] osd_index_retrieve_skip_dots+0x28/0x60 [osd_zfs]
2014-03-06 22:34:12  [<ffffffffa0d29638>] osd_dir_it_next+0x98/0x120 [osd_zfs]
2014-03-06 22:34:12  [<ffffffffa0f08161>] lod_it_next+0x21/0x90 [lod]
2014-03-06 22:34:12  [<ffffffffa0dd1989>] __mdd_orphan_cleanup+0xa9/0xca0 [mdd]
2014-03-06 22:34:12  [<ffffffffa0de134d>] mdd_recovery_complete+0xed/0x170 [mdd]
2014-03-06 22:34:12  [<ffffffffa0e34cb5>] mdt_postrecov+0x35/0xd0 [mdt]
2014-03-06 22:34:12  [<ffffffffa0e36178>] mdt_obd_postrecov+0x78/0x90 [mdt]
2014-03-06 22:34:12  [<ffffffffa08745c0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa086f8ae>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa0885004>] target_recovery_thread+0xc64/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa08843a0>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffff8100c10a>] child_rip+0xa/0x20
2014-03-06 22:34:12  [<ffffffffa08843a0>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffffa08843a0>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
2014-03-06 22:34:12  [<ffffffff8100c100>] ? child_rip+0x0/0x20

The mount process is blocked while this is going on. The cleanup is completely sequential and, on ZFS, very slow: on the order of 10 files per second.

The orphan cleanup task really needs to be backgrounded (and perhaps parallelized) rather than blocking the MDT mount process.
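
For illustration, a minimal sketch of the backgrounding, assuming a hypothetical mdd_orphan_cleanup() entry point for the __mdd_orphan_cleanup() walk shown in the backtrace; only kthread_run() and the err.h macros are real kernel API, the rest of the names are made up:

#include <linux/err.h>
#include <linux/kthread.h>

struct mdd_device;			/* Lustre MDD device */

/* Illustrative stand-in for the __mdd_orphan_cleanup() walk of
 * PENDING seen in the backtrace above. */
int mdd_orphan_cleanup(struct mdd_device *mdd);

static int mdd_orphan_cleanup_thread(void *data)
{
	struct mdd_device *mdd = data;

	return mdd_orphan_cleanup(mdd);
}

/* Called where mdd_recovery_complete() currently runs the cleanup
 * inline: spawn a kthread and let the mount finish. */
static int mdd_orphan_cleanup_async(struct mdd_device *mdd)
{
	struct task_struct *task;

	task = kthread_run(mdd_orphan_cleanup_thread, mdd, "orph_cleanup");
	if (IS_ERR(task))
		return PTR_ERR(task);	/* could fall back to inline cleanup */
	return 0;
}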



Comments
Comment by Peter Jones [ 10/May/14 ]

Alex

I think that it is best that you comment on this one

Peter

Comment by Alex Zhuravlev [ 12/May/14 ]

The idea is correct and fine, though I'm very confused by 10/second - we should be able to go much faster, given no LDLM contention, etc.

Comment by Peter Jones [ 23/May/14 ]

Niu

Could you please create a patch based on Oleg's suggestion (to follow)?

Thanks

Peter

Comment by Oleg Drokin [ 23/May/14 ]

While these slow deletions are an extreme case, at least we can speed up the startup by doing the deletions from a separate thread once recovery is complete.
Basically we'll create a new PENDING dir and move all entries claimed by recovery from the old PENDING there. Then we just spawn another thread to delete the old PENDING and its contents.

We need to be careful about an MDS failure while doing this split handling and another recovery restart - we would probably need to move all entries back to the old PENDING and redo the process. There should not be many due to recovery, I hope.
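
A minimal sketch of that split, under the assumption that "old pending" is implemented by renaming PENDING to PENDING.old; every mdd_* helper below is hypothetical, named only for the sketch, and only kthread_run() is real kernel API:

#include <linux/err.h>
#include <linux/kthread.h>

struct mdd_device;

/* Hypothetical helpers, not Lustre API. */
int mdd_rename_dir(struct mdd_device *mdd, const char *from, const char *to);
int mdd_create_dir(struct mdd_device *mdd, const char *name);
int mdd_move_claimed_entries(struct mdd_device *mdd,
			     const char *from, const char *to);
int mdd_reap_dir_thread(void *data);	/* deletes PENDING.old and contents */

static int mdd_pending_split_and_reap(struct mdd_device *mdd)
{
	struct task_struct *task;
	int rc;

	/* 1. Park the populated directory out of the way. */
	rc = mdd_rename_dir(mdd, "PENDING", "PENDING.old");
	if (rc)
		return rc;

	/* 2. Recreate an empty PENDING for new open-unlinked files. */
	rc = mdd_create_dir(mdd, "PENDING");
	if (rc)
		return rc;

	/* 3. Move the entries recovery still claims into the new dir;
	 *    everything left behind in PENDING.old is stale. */
	rc = mdd_move_claimed_entries(mdd, "PENDING.old", "PENDING");
	if (rc)
		return rc;

	/* 4. Reap the stale entries from a separate thread, off the
	 *    mount path. */
	task = kthread_run(mdd_reap_dir_thread, mdd, "orph_reap");
	return IS_ERR(task) ? PTR_ERR(task) : 0;
}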

Comment by Niu Yawei (Inactive) [ 26/May/14 ]

Basically we'll create a new PENDING dir and move all entries claimed by recovery from the old PENDING there. Then we just spawn another thread to delete the old PENDING and its contents.
We need to be careful about an MDS failure while doing this split handling and another recovery restart - we would probably need to move all entries back to the old PENDING and redo the process. There should not be many due to recovery, I hope.

What bad things could happen if we just start a thread to delete orphans from the original PENDING?

Comment by Oleg Drokin [ 02/Jun/14 ]

The possible problem is that the list of items in PENDING is not fixed; new files might be added.
Can we reliably, and race-free, tell the ones that are still needed from those that are stale and need to be killed?
Also, things like NFS further complicate things by possibly briefly reattaching to files that were about to be deleted.

Comment by Alex Zhuravlev [ 03/Jun/14 ]

There is an open count in the mdd object which tells whether the file is in use. We'll probably have to add locking to protect the last close vs. the cleanup procedure.

Comment by Niu Yawei (Inactive) [ 03/Jun/14 ]

There is an open count in the mdd object which tells whether the file is in use. We'll probably have to add locking to protect the last close vs. the cleanup procedure.

Indeed, I've talked with Oleg about this, and it looks like we already have a lock that serializes the last close and orphan cleanup. I'll compose a patch soon. Thank you all.
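
For illustration, a minimal sketch of the check this implies; mod_count stands in for the open count in the mdd object that Alex mentions, but the lock and destroy helper here are hypothetical, not the actual Lustre code:

#include <linux/errno.h>
#include <linux/mutex.h>

/* Mock of the relevant mdd_object fields: mod_count mirrors the open
 * count; the mutex stands in for whatever lock already serializes the
 * last close vs. the cleanup procedure. */
struct mdd_object {
	struct mutex	mod_lock;
	unsigned int	mod_count;	/* number of openers */
	/* ... */
};

int mdd_orphan_destroy(struct mdd_object *obj);	/* illustrative */

static int mdd_orphan_destroy_if_unused(struct mdd_object *obj)
{
	int rc;

	mutex_lock(&obj->mod_lock);	/* vs. last close */
	if (obj->mod_count > 0)
		rc = -EBUSY;		/* still open; last close reaps it */
	else
		rc = mdd_orphan_destroy(obj);
	mutex_unlock(&obj->mod_lock);

	return rc;
}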

Comment by Niu Yawei (Inactive) [ 04/Jun/14 ]

Clean up orphans asynchronously: http://review.whamcloud.com/10584

Comment by Niu Yawei (Inactive) [ 20/Jun/14 ]

The patch landed on master; should we backport it to b2_4 & b2_5?

Comment by Peter Jones [ 20/Jun/14 ]

Yes, I think that we should.

Comment by Niu Yawei (Inactive) [ 23/Jun/14 ]

b2_4: http://review.whamcloud.com/10779
b2_5: http://review.whamcloud.com/10780

Comment by Peter Jones [ 23/Jun/14 ]

Landed for 2.6
