[LU-5038] Mount hangs for hours processing some llog Created: 09/May/14 Updated: 14/Jun/18 Resolved: 26/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl, mn4 |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 13930 |
| Description |
|
Running lustre 2.4.0-28chaos (see github.com/chaos/lustre) over ZFS, we have repeatedly seen the MDS hang for hours on mount after an ungraceful reboot. Recovery looks like it is complete:

# porter-mds1 /proc/fs/lustre > cat mdt/lse-MDT0000/recovery_status
status: COMPLETE
recovery_start: 0
recovery_duration: 0
completed_clients: 3846/3846
replayed_requests: 0
last_transno: -1
VBR: DISABLED
IR: ENABLED

The MGS mount seems to be complete, but the mount command for the MDT is still hung after two hours. The backtrace is:

crash> bt 14674
PID: 14674  TASK: ffff880832fd0040  CPU: 1  COMMAND: "mount.lustre"
#0 [ffff8807d29370f8] schedule at ffffffff81510bd2
#1 [ffff8807d29371c0] io_schedule at ffffffff815113b3
#2 [ffff8807d29371e0] cv_wait_common at ffffffffa02915cc [spl]
#3 [ffff8807d2937250] __cv_wait_io at ffffffffa0291658 [spl]
#4 [ffff8807d2937260] zio_wait at ffffffffa03d686b [zfs]
#5 [ffff8807d29372a0] dbuf_read at ffffffffa0343f8d [zfs]
#6 [ffff8807d2937360] dmu_buf_hold at ffffffffa034b728 [zfs]
#7 [ffff8807d29373d0] zap_get_leaf_byblk at ffffffffa03a286f [zfs]
#8 [ffff8807d2937440] zap_deref_leaf at ffffffffa03a2b2a [zfs]
#9 [ffff8807d2937480] fzap_lookup at ffffffffa03a2ef0 [zfs]
#10 [ffff8807d2937500] zap_lookup_norm at ffffffffa03a88f1 [zfs]
#11 [ffff8807d2937570] zap_lookup at ffffffffa03a8a33 [zfs]
#12 [ffff8807d29375a0] osd_fid_lookup at ffffffffa0d2a710 [osd_zfs]
#13 [ffff8807d29375f0] osd_object_init at ffffffffa0d24311 [osd_zfs]
#14 [ffff8807d2937650] lu_object_alloc at ffffffffa0737c9d [obdclass]
#15 [ffff8807d29376b0] lu_object_find_at at ffffffffa0738805 [obdclass]
#16 [ffff8807d2937770] dt_locate_at at ffffffffa073ad9a [obdclass]
#17 [ffff8807d29377b0] llog_osd_open at ffffffffa071867c [obdclass]
#18 [ffff8807d2937820] llog_open at ffffffffa06e360a [obdclass]
#19 [ffff8807d2937870] llog_cat_id2handle at ffffffffa06e8772 [obdclass]
#20 [ffff8807d29378e0] cat_cancel_cb at ffffffffa06e9cf9 [obdclass]
#21 [ffff8807d2937930] llog_process_thread at ffffffffa06e510b [obdclass]
#22 [ffff8807d29379e0] llog_process_or_fork at ffffffffa06e6e1d [obdclass]
#23 [ffff8807d2937a30] llog_cat_init_and_process at ffffffffa06e779a [obdclass]
#24 [ffff8807d2937a60] mdd_prepare at ffffffffa0de8f77 [mdd]
#25 [ffff8807d2937af0] mdt_prepare at ffffffffa0e3e10a [mdt]
#26 [ffff8807d2937b50] server_start_targets at ffffffffa0761c06 [obdclass]
#27 [ffff8807d2937c90] server_fill_super at ffffffffa07631e4 [obdclass]
#28 [ffff8807d2937d70] lustre_fill_super at ffffffffa0733838 [obdclass]
#29 [ffff8807d2937da0] get_sb_nodev at ffffffff811855af
#30 [ffff8807d2937de0] lustre_get_sb at ffffffffa072b2a5 [obdclass]
#31 [ffff8807d2937e00] vfs_kern_mount at ffffffff81184bcb
#32 [ffff8807d2937e50] do_kern_mount at ffffffff81184d72
#33 [ffff8807d2937ea0] do_mount at ffffffff811a4dd2
#34 [ffff8807d2937f20] sys_mount at ffffffff811a5460
#35 [ffff8807d2937f80] system_call_fastpath at ffffffff8100b0b2

This time around it cleared up after I grabbed this info, so I don't have much more to go on. Hopefully you folks can explain what crazy llogs are being processed under the mount command, and why that is happening (and prevents any filesystem usage until the processing is finished). There is probably some issue with processing going slower on the zfs osd, but I'd like to understand if what Lustre is doing is at all sane. |
| Comments |
| Comment by Peter Jones [ 10/May/14 ] |
|
Alex, I think that it is best that you comment on this one. Peter |
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
Could you please grab all the stacks once this happens? (echo t > /proc/sysrq-trigger) Notice the process got stuck in dbuf_read() -> zio_wait(), which is ZFS's internal path to read a block. |
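For reference, a minimal sketch of one way to capture the full task dump Alex is asking for; the output path below is only an example, and depending on configuration the same text also lands in syslog or the console log:

```sh
# Ask the kernel to dump every task's stack into the kernel ring buffer.
echo t > /proc/sysrq-trigger

# Save the ring buffer somewhere persistent (path is just an example);
# the dump usually also appears in /var/log/messages or the console log.
dmesg > /tmp/all-task-stacks.txt
```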
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
Hmm, probably I was too hasty.. I guess for some reason one of the llogs (changelog or HSM action log) is huge.. Chris, would you mind providing debug logs to see what's happening exactly? |
| Comment by Prakash Surya (Inactive) [ 12/May/14 ] |
|
Alex, what exactly do the llogs do? Are there docs that explain them at a high level? I've seen something similar, if not the same, on our test systems. In that case, it appeared like the mount process was making forward progress, just at an abysmal rate. The disks were not fully utilized, but they were continuously doing reads originating from the mount.lustre process. Without really understanding how llogs fit into the bigger picture, it seemed like there were simply "a lot" of logs that it needed to process, and doing so was really slow since it is a serial, single-threaded process limited by reads from the disk. I could be wrong, but that's just how it appeared from the outside; I never verified this against the source code. How do changelogs play into this? Does that mean running robinhood might be affecting the time it takes to mount? |
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
I'm asking for a log to understand which specific llog the system was reading. Unless you're using HSM, this should be the changelogs - basically a log of all the changes (including opens) - so 3rd-party tools (like robinhood) can watch the activity. The important bit here is that the records are supposed to be cancelled within seconds by the consumer, so it might be a bug in our cancellation code, or the consumer wasn't acting quickly enough. It would be interesting to learn the size of the actual llogs the system was processing - see the llog_reader utility. |
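llog_reader works on a copy of the llog object; with a ZFS OSD there is no ldiskfs-style mount to copy the object from, so the file path below is only a placeholder for whatever extraction method is used:

```sh
# Dump the header and records of a plain llog that has been copied out of
# the backend (on ldiskfs targets the object can be copied from the target
# mounted read-only as type ldiskfs; for ZFS an extra extraction step is
# needed, and the path here is an assumption).
llog_reader /tmp/copy-of-plain-llog
```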
| Comment by Prakash Surya (Inactive) [ 12/May/14 ] |
AFAIK, this is not the case for our file systems using the changelogs. Often, robinhood is not able to read changelogs at the rate they are being created. As a result, changelogs tend to accumulate on a system and eventually get consumed when user activity dies down, allowing for the changelog consumer to "catch up". So, if the accumulation of "a lot" of changelogs causes the mount to take a significant amount of time to complete, I'm not sure we'll be able to keep using that feature for the time being. EDIT: And I just want to make sure it's clear.. By "eventually get consumed", I mean possibly weeks later; not seconds as you said. This is especially so when we first start using robinhood on an existing filesystem. It can take weeks to fully traverse the filesystem and populate the database with the initial state, meanwhile changelogs get created and are not consumed until the initial filesystem scan completes. |
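One way to quantify how far a consumer such as robinhood has fallen behind is to compare its changelog index with the MDT's current index; a minimal sketch, run on the MDS:

```sh
# Show registered changelog consumers and the current changelog index.
# The difference between "current index" and a user's index is roughly
# the number of records still waiting to be consumed and cancelled.
lctl get_param mdd.*.changelog_users
```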
| Comment by Prakash Surya (Inactive) [ 12/May/14 ] |
I'll try and get some real data regarding the llogs and push it your way, I need to look into some other issues first though. |
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
OK, if there is no very strong need for the changelogs, I'd recommend turning them off. Another reason is that currently the llog is protected by a single mutex serializing all the modifications (not totally, but to some extent). In general, it seems that due to the COW nature of ZFS, small modifications (like with OI and llogs) lead to significant fragmentation, which in turn makes subsequent reads quite expensive. So something like the zap_prefetch() Brian suggested should be used in more cases.. llog can be another one. |
| Comment by Prakash Surya (Inactive) [ 12/May/14 ] |
But we do have a strong need for it. Our admins have been trying to lean more heavily on using robinhood for certain things, but if it is going to take hours to mount as a result of using robinhood, we'll have to seriously reconsider its usage. |
| Comment by Alex Zhuravlev [ 12/May/14 ] |
|
OK, both clarifications are very important. I'd hope that relatively small changes to prefetch things can improve this significantly, but I'm going to look at the llog code to check a couple of ideas. Thanks. |
| Comment by Christopher Morrone [ 12/May/14 ] |
|
Can you please explain why the changelog llog needs to be walked at filesystem mount time, blocking all access to the filesystem until that is complete? |
| Comment by Alex Zhuravlev [ 13/May/14 ] |
|
There is no strong requirement to do this in the context of mount; we can do it in the background. To my knowledge this was never an issue because usually the number of live records in the catalog is very small. Then the whole process seems to be badly affected by slow OI population (that issue with FID->dnode and cache misses in ZAP). I'll make a patch, but it will take a bit as the locking is tricky here. |
| Comment by Alex Zhuravlev [ 13/May/14 ] |
|
http://review.whamcloud.com/10308 – I can't ask you to test the patch immediately though. |
| Comment by Prakash Surya (Inactive) [ 13/May/14 ] |
|
Alex, just to give you some more context. Our admin has decided to turn off robinhood on the particular filesystem that we're seeing the most issues on, with the feeling that it has been the cause of some of our grief. According to our calculations, it would have taken just over a month to complete the initial scan of the filesystem; it has over 700 million files in it currently. That's not a good feeling, to have to ditch a "feature" because it doesn't actually work when we try to use it on a production filesystem. He tried to deregister the changelog user on the MDS, letting it run overnight, and it still has not completed. Is the deregister process essentially the same as what is going on at mount time? Also, there were over 400 million changelog entries at the point that we tried to stop and deregister the changelog user; so definitely not a "very small" amount.
I can see why you might think that, but I'm not sure I buy it. For a large filesystem, I don't see why you would have a reasonable expectation of getting a large percentage of cache hits during the initial scan; you'll probably be pulling in information for files that haven't been touched recently, which will come from disk. Whereas changelog entries will accumulate for "new" files, so I'd expect much better cache hit rates there. Thus, at least during the initial scan of a heavily utilized filesystem, I'd expect the changelogs to accumulate much faster than the rate at which robinhood can traverse the FS. |
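For reference, deregistering a consumer (which makes the MDT cancel the records only that consumer was still holding) is a single lctl call; a sketch using the MDT name from the report and assuming the consumer ID is cl1:

```sh
# Drop changelog consumer cl1; with a backlog of hundreds of millions of
# records the resulting cancellation can run for a very long time, as
# described above.
lctl --device lse-MDT0000 changelog_deregister cl1
```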
| Comment by Alex Zhuravlev [ 13/May/14 ] |
|
Prakash, we do not scan all the entries; at mount we scan only the headers of the plain llogs which contain the actual changelog entries. Given a plain llog contains up to 64K records, we'd need to touch 400M/64K = ~6.1K records in the catalog at mount. |
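The arithmetic behind that estimate, treating 64K as 65,536 records per plain llog:

```sh
# 400 million changelog records / 65,536 records per plain llog
# ~= 6,103 plain-llog headers (catalog entries) to touch at mount time.
echo $((400000000 / 65536))   # prints 6103
```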
| Comment by Prakash Surya (Inactive) [ 13/May/14 ] |
Ah, I see! So that means roughly 6.1K random 64K reads? Why does it need to read the llogs in the first place? I can't seem to remember exactly what purpose they serve.
Why the 8K read? |
| Comment by Alex Zhuravlev [ 13/May/14 ] |
|
Actually, I wrote that in my first attempt to reply but removed it - I'm happy you asked. So, we do have a notion of an "llog" - basically a relatively small object containing a header (8K) and up to 64K records. The idea is to be able to reclaim space when some fraction of the records is consumed; with a single object that would be an issue, given that not every disk filesystem provides a way to punch (deallocate) space in the middle of an object. |
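To relate that layout to a running MDT, the catalog and its plain llogs can also be inspected through lctl; a rough sketch, using the MDT name from the report and assuming the standard changelog catalog name:

```sh
# List the llog catalogs known to this MDT.
lctl --device lse-MDT0000 llog_catlist

# Print header information (record count, first/last index) for the
# changelog catalog; each live catalog record points at one plain llog
# holding up to 64K changelog records behind its 8K header.
lctl --device lse-MDT0000 llog_info changelog_catalog
```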
| Comment by Niu Yawei (Inactive) [ 06/Aug/14 ] |
|
backport this patch and two dependent fixes to b2_4: |
| Comment by Christopher Morrone [ 06/Aug/14 ] |
|
Not only do we need a fix for b2_4, more specifically we need a fix for b2_4 that is compatible with Patch Set 7 of change 9488 from |
| Comment by Niu Yawei (Inactive) [ 07/Aug/14 ] |
|
Chris, these three patches and the patch from |
| Comment by Christopher Morrone [ 11/Aug/14 ] |
|
Hmmm, so they are. My bad. I'm not sure what I was trying to pull last Wednesday, but it must have been the wrong thing. Thanks! |
| Comment by Michael MacDonald (Inactive) [ 06/Nov/14 ] |
|
FYI, the landing on master seems to have introduced a regression in changelog functionality: |
| Comment by Andreas Dilger [ 26/Feb/15 ] |
|
Patch was landed to master for 2.7.0. |