
[LU-5038] Mount hangs for hours processing some llog

Details

    • Type: Bug
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version: Lustre 2.7.0

    Description

      Running lustre 2.4.0-28chaos (see github.com/chaos/lustre) over ZFS, we have repeatedly seen the MDS hang for hours on mount after an ungraceful reboot.

      Recovery looks like it is complete:

      # porter-mds1 /proc/fs/lustre > cat mdt/lse-MDT0000/recovery_status
      status: COMPLETE
      recovery_start: 0
      recovery_duration: 0
      completed_clients: 3846/3846
      replayed_requests: 0
      last_transno: -1
      VBR: DISABLED
      IR: ENABLED
      

      The MGS mount seems to be complete, but the mount command for the MDT is still hung after two hours. The backtrace is:

      crash> bt 14674
      PID: 14674  TASK: ffff880832fd0040  CPU: 1   COMMAND: "mount.lustre"
       #0 [ffff8807d29370f8] schedule at ffffffff81510bd2
       #1 [ffff8807d29371c0] io_schedule at ffffffff815113b3
       #2 [ffff8807d29371e0] cv_wait_common at ffffffffa02915cc [spl]
       #3 [ffff8807d2937250] __cv_wait_io at ffffffffa0291658 [spl]
       #4 [ffff8807d2937260] zio_wait at ffffffffa03d686b [zfs]
       #5 [ffff8807d29372a0] dbuf_read at ffffffffa0343f8d [zfs]
       #6 [ffff8807d2937360] dmu_buf_hold at ffffffffa034b728 [zfs]
       #7 [ffff8807d29373d0] zap_get_leaf_byblk at ffffffffa03a286f [zfs]
       #8 [ffff8807d2937440] zap_deref_leaf at ffffffffa03a2b2a [zfs]
       #9 [ffff8807d2937480] fzap_lookup at ffffffffa03a2ef0 [zfs]
      #10 [ffff8807d2937500] zap_lookup_norm at ffffffffa03a88f1 [zfs]
      #11 [ffff8807d2937570] zap_lookup at ffffffffa03a8a33 [zfs]
      #12 [ffff8807d29375a0] osd_fid_lookup at ffffffffa0d2a710 [osd_zfs]
      #13 [ffff8807d29375f0] osd_object_init at ffffffffa0d24311 [osd_zfs]
      #14 [ffff8807d2937650] lu_object_alloc at ffffffffa0737c9d [obdclass]
      #15 [ffff8807d29376b0] lu_object_find_at at ffffffffa0738805 [obdclass]
      #16 [ffff8807d2937770] dt_locate_at at ffffffffa073ad9a [obdclass]
      #17 [ffff8807d29377b0] llog_osd_open at ffffffffa071867c [obdclass]
      #18 [ffff8807d2937820] llog_open at ffffffffa06e360a [obdclass]
      #19 [ffff8807d2937870] llog_cat_id2handle at ffffffffa06e8772 [obdclass]
      #20 [ffff8807d29378e0] cat_cancel_cb at ffffffffa06e9cf9 [obdclass]
      #21 [ffff8807d2937930] llog_process_thread at ffffffffa06e510b [obdclass]
      #22 [ffff8807d29379e0] llog_process_or_fork at ffffffffa06e6e1d [obdclass]
      #23 [ffff8807d2937a30] llog_cat_init_and_process at ffffffffa06e779a [obdclass]
      #24 [ffff8807d2937a60] mdd_prepare at ffffffffa0de8f77 [mdd]
      #25 [ffff8807d2937af0] mdt_prepare at ffffffffa0e3e10a [mdt]
      #26 [ffff8807d2937b50] server_start_targets at ffffffffa0761c06 [obdclass]
      #27 [ffff8807d2937c90] server_fill_super at ffffffffa07631e4 [obdclass]
      #28 [ffff8807d2937d70] lustre_fill_super at ffffffffa0733838 [obdclass]
      #29 [ffff8807d2937da0] get_sb_nodev at ffffffff811855af
      #30 [ffff8807d2937de0] lustre_get_sb at ffffffffa072b2a5 [obdclass]
      #31 [ffff8807d2937e00] vfs_kern_mount at ffffffff81184bcb
      #32 [ffff8807d2937e50] do_kern_mount at ffffffff81184d72
      #33 [ffff8807d2937ea0] do_mount at ffffffff811a4dd2
      #34 [ffff8807d2937f20] sys_mount at ffffffff811a5460
      #35 [ffff8807d2937f80] system_call_fastpath at ffffffff8100b0b2
      

      This time around it cleared up after I grabbed this info, so I don't have much more to go on.

      Hopefully you folks can explain what crazy llogs are being processed under the mount command, and why that is happening (and prevents any filesystem usage until the processing is finished).

      There is probably some issue with the processing going slower on the ZFS OSD, but I'd like to understand whether what Lustre is doing here is at all sane.

    Activity

            adilger Andreas Dilger added a comment - Patch was landed to master for 2.7.0.

            mjmac Michael MacDonald (Inactive) added a comment - FYI, the landing on master seems to have introduced a regression in changelog functionality: LU-5859 .

            morrone Christopher Morrone (Inactive) added a comment - - edited Hmmm, so they are. My bad. I'm not sure what I was trying to pull last Wednesday, but it must have been the wrong thing. Thanks!

            niu Niu Yawei (Inactive) added a comment - Chris, these three patches and the patch from LU-4584 are independent, they can be applied without any conflict.

            morrone Christopher Morrone (Inactive) added a comment - Not only do we need a fix for b2_4, more specifically we need a fix for b2_4 that is compatible with Patch Set 7 of change 9488 from LU-4584 . That was the officially designated fix for that bug on b2_4, so we will need patches compatible with that.
            niu Niu Yawei (Inactive) added a comment - - edited

            backport this patch and two dependent fixes to b2_4:
            http://review.whamcloud.com/#/c/11331/
            http://review.whamcloud.com/#/c/11332/
            http://review.whamcloud.com/#/c/11333/

            bzzz Alex Zhuravlev added a comment -

            Actually, I wrote that in my first try at a reply, but removed it. But I'm happy you asked.

            So, we do have a notion of an "llog" - basically a relatively small object containing a header (8K) and up to 64K records. The idea is to be able to reclaim space when some fraction of the records has been consumed; with a single object that would be an issue, given that not every disk filesystem provides a way to punch (deallocate) space in the middle of an object.
            Now, to be able to add more than 64K records we add a catalog, which is essentially a log containing pointers to those "plain" llogs (which in turn store the actual records, like changelog records, unlink records, etc.).
            When some record is cancelled we drop a bit in the header. Once all the records in an llog are cancelled, we can destroy that llog and cancel the corresponding record in the catalog. For a few reasons, this cancellation/destroy process isn't atomic; it is done in a couple of different transactions. This can lead to "leaked" llogs which contain no useful records but consume space, and to clean them up we used to scan the catalog at the beginning, checking all the headers and removing the empty plain llogs. This is exactly what was happening in the backtrace you provided.
            As said before, there is no strong need to do this at initialization, but it never was an issue (to my knowledge at least), because: 1) the number of entries to scan is usually very small, just a few, since the records are consumed quickly; 2) there is *probably* a higher cost for random I/O with ZFS, though I can't say I totally disagree with your argument about a cold cache.
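
            To make the cleanup pass described above more concrete, here is a minimal illustrative sketch in C. The struct layouts and helper functions are invented placeholders, not the actual Lustre code; the real call path is the one in the backtrace, llog_cat_init_and_process() -> llog_process_thread() -> cat_cancel_cb(), which walks the catalog and, for each entry, opens the plain llog, reads its ~8K header, and destroys the llog if every record in it has already been cancelled.

            #include <stddef.h>

            /* Illustrative sketch only - all types and helpers are placeholders. */

            struct plain_llog_header {
                    unsigned int live_records;          /* records not yet cancelled */
            };

            struct catalog_rec {
                    unsigned long long fid;             /* identifies one plain llog */
            };

            struct catalog {
                    struct catalog_rec *recs;
                    size_t nr_recs;
            };

            /* placeholders: opening a plain llog costs a FID->dnode lookup (the
             * zap_lookup()/dbuf_read() seen in the backtrace) plus an ~8K header read */
            int read_llog_header(unsigned long long fid, struct plain_llog_header *hdr);
            void destroy_llog(unsigned long long fid);
            void cancel_catalog_rec(struct catalog *cat, size_t idx);

            /* one pass over the catalog, run at mount before the MDT is usable */
            static void scan_catalog_for_leaked_llogs(struct catalog *cat)
            {
                    for (size_t i = 0; i < cat->nr_recs; i++) {
                            struct plain_llog_header hdr;

                            if (read_llog_header(cat->recs[i].fid, &hdr) != 0)
                                    continue;

                            if (hdr.live_records == 0) {
                                    /* "leaked" llog: all records were cancelled, but
                                     * the destroy was lost because cancel and destroy
                                     * happen in separate transactions */
                                    destroy_llog(cat->recs[i].fid);
                                    cancel_catalog_rec(cat, i);
                            }
                    }
            }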

            prakash Prakash Surya (Inactive) added a comment -

            given a plain llog contains up to 64K, we'd need to touch 400M/64K = ~6.1K records in the catalog at mount.

            Ah, I see! So that means roughly 6.1K random 64K reads? Why does it need to read the llogs in the first place? I can't seem to remember exactly what purpose they serve.

            "a touch" means FID->dnode lookup, taking dnode, then 8K read from that dnode.

            Why the 8K read?
            bzzz Alex Zhuravlev added a comment - - edited

            Prakash, we do not scan all the entries; at mount we scan only the headers of the plain llogs which contain the actual changelog entries. Given a plain llog contains up to 64K records, we'd need to touch 400M/64K = ~6.1K records in the catalog at mount.
            "a touch" means a FID->dnode lookup, taking the dnode, then an 8K read from that dnode.

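            A quick back-of-the-envelope check of the numbers quoted above, as an illustrative standalone C snippet (the ~400M record count and the 64K-records/8K-header sizes are taken from this ticket; nothing here is Lustre code):

            #include <stdio.h>

            int main(void)
            {
                    const double changelog_records = 400e6;      /* ~400 million entries */
                    const double records_per_llog  = 64 * 1024;  /* up to 64K per plain llog */
                    const double header_read_bytes = 8 * 1024;   /* ~8K header read per llog */

                    double llogs_to_touch = changelog_records / records_per_llog;

                    /* ~6.1K catalog entries, each costing a FID->dnode lookup plus a
                     * small random read - painful on ZFS with a cold cache */
                    printf("plain llogs to touch at mount: ~%.1fK\n", llogs_to_touch / 1e3);
                    printf("header data read (random):     ~%.0f MB\n",
                           llogs_to_touch * header_read_bytes / (1024 * 1024));
                    return 0;
            }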

            prakash Prakash Surya (Inactive) added a comment -

            Alex, just to give you some more context. Our admin has decided to turn off robinhood on the particular filesystem that we're seeing the most issues on, with the feeling that it has been the cause of some of our grief. According to our calculations, it would have taken just over a month to complete the initial scan of the filesystem; it has over 700 million files in it currently.

            It's not a good feeling to have to ditch a "feature" because it doesn't actually work when we try to use it on a production filesystem. He tried to deregister the changelog user on the MDS, letting it run overnight, and it still has not completed. Is the deregister process essentially the same as what is going on at mount time?

            Also, there were over 400 million changelog entries at the point that we tried to stop and deregister the changelog user; so definitely not a "very small" amount.

            then the whole process seems to be badly affected by slow OI population (that issue with FID->dnode and cache misses in ZAP).

            I can see why you might think that, but I'm not sure I buy it. For a large filesystem, I don't see why you could have a reasonable expectation of getting a large percentage of cache hits for the initial scan; you'll probably be pulling in information for files that haven't been touched recently, which will come from disk. Whereas changelog entries will accumulate for "new" files, so I'd expect much better cache hit rates there. Thus, at least during the initial scan of a heavily utilized filesystem, I'd expect the changelogs to accumulate much faster than the rate at which robinhood can traverse the FS.
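
            For reference, here is a minimal sketch of the per-record cancellation path Alex described earlier in this ticket: cancelling a record clears a bit in the plain llog's header, and only once every record in that llog has been cancelled can the llog be destroyed and its catalog entry dropped. All names and layouts below are invented placeholders, not the real Lustre code.

            #include <stdbool.h>

            #define RECORDS_PER_LLOG (64 * 1024)     /* up to 64K records per plain llog */
            #define BITS_PER_WORD    (8 * sizeof(unsigned long))

            struct plain_llog {
                    unsigned long bitmap[RECORDS_PER_LLOG / (8 * sizeof(unsigned long))];
                    unsigned int live_records;       /* records whose bit is still set */
            };

            /* placeholders for the two separate transactions mentioned above */
            void destroy_llog(struct plain_llog *llog);
            void cancel_catalog_entry(struct plain_llog *llog);

            /* cancel one record; returns true if the whole llog became empty */
            static bool cancel_record(struct plain_llog *llog, unsigned int idx)
            {
                    unsigned long mask = 1UL << (idx % BITS_PER_WORD);
                    unsigned long *word = &llog->bitmap[idx / BITS_PER_WORD];

                    if (!(*word & mask))
                            return false;            /* already cancelled */

                    *word &= ~mask;
                    if (--llog->live_records > 0)
                            return false;

                    /* the llog is now empty: destroy it and drop its catalog record.
                     * These two steps are not atomic, so a crash in between leaves a
                     * "leaked" empty llog for the mount-time scan to clean up later. */
                    destroy_llog(llog);
                    cancel_catalog_entry(llog);
                    return true;
            }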

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: morrone Christopher Morrone (Inactive)
              Votes: 0
              Watchers: 9
