Lustre / LU-7409

llog declares a write region that doesn't match the region actually written later for osd_zfs

Details

    • Bug
    • Resolution: Won't Fix
    • Minor

    Description

      The typical stack trace is as follows:

      Call Trace:
       [<ffffffff8156bf13>] ? panic+0xac/0x179
       [<ffffffffa0f4e5cc>] ? zio_wait+0x21c/0x3e0 [zfs]
       [<ffffffffa0e7ef87>] ? dmu_tx_dirty_buf+0x247/0x3d0 [zfs]
       [<ffffffffa0f4e2f3>] ? zio_destroy+0xb3/0x170 [zfs]
       [<ffffffffa0e5e55f>] ? dbuf_dirty+0x5f/0x16d0 [zfs]
       [<ffffffff8157156b>] ? _spin_unlock+0x2b/0x40
       [<ffffffffa0e848ea>] ? dnode_rele+0x5a/0xa0 [zfs]
       [<ffffffffa0e61501>] ? dmu_buf_will_dirty+0x91/0x100 [zfs]
       [<ffffffffa0e6cc70>] ? dmu_write+0xa0/0x230 [zfs]
       [<ffffffffa08444c1>] ? osd_write+0x1d1/0x3a0 [osd_zfs]
       [<ffffffffa06b9bdd>] ? dt_record_write+0x3d/0x130 [obdclass]
       [<ffffffffa067955a>] ? llog_osd_write_rec+0xd6a/0x1b70 [obdclass]
       [<ffffffffa06673f6>] ? llog_write_rec+0xb6/0x270 [obdclass]
       [<ffffffffa066c1b8>] ? llog_write+0x298/0x430 [obdclass]
       [<ffffffffa066c1cf>] ? llog_write+0x2af/0x430 [obdclass]
       [<ffffffffa14780a1>] ? record_marker+0x1c1/0x1e0 [mgs]
       [<ffffffffa14779ea>] ? record_start_log+0x38a/0x4a0 [mgs]
       [<ffffffffa14787cf>] ? mgs_write_log_lov+0x38f/0x6b0 [mgs]
       [<ffffffffa148a5c6>] ? mgs_write_log_mdt+0x326/0x1630 [mgs]
       [<ffffffff810c156d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffffa148d475>] ? mgs_write_log_target+0xb55/0x1980 [mgs]
       [<ffffffff810c156d>] ? trace_hardirqs_on+0xd/0x10
       [<ffffffffa057cc11>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffffa1471d89>] ? mgs_target_reg+0xa19/0xe50 [mgs]
       [<ffffffffa0940b3f>] ? tgt_request_handle+0x8cf/0x1300 [ptlrpc]
       [<ffffffffa08eb85a>] ? ptlrpc_main+0xdaa/0x18b0 [ptlrpc]
       [<ffffffffa08eaab0>] ? ptlrpc_main+0x0/0x18b0 [ptlrpc]
       [<ffffffff810a728e>] ? kthread+0x9e/0xc0
       [<ffffffff8100c38a>] ? child_rip+0xa/0x20
       [<ffffffff815714b0>] ? _spin_unlock_irq+0x30/0x40
       [<ffffffff8100bb90>] ? restore_args+0x0/0x30
       [<ffffffff810a71f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c380>] ? child_rip+0x0/0x20
      

      A patch will be submitted shortly.

          Activity


            bzzz Alex Zhuravlev added a comment - We don't plan to configure ZFS with debugging enabled, so this shouldn't be an issue.

            bzzz Alex Zhuravlev added a comment - I think it's better and simpler to modify ZFS instead, especially given that Brian B. wasn't against it.

            jay Jinshan Xiong (Inactive) added a comment - Thanks for the explanation, I see it now. It looks like ZFS temporarily reserves ARC buffer space for the data to be written. However, in our append case, the reserved space is counted multiple times in the same txg. Is it possible for us to add some code in dmu_tx_try_assign() to count it more accurately?
            bzzz Alex Zhuravlev added a comment - edited

            Nope, we can't really predict this: transactions can take different amounts of time, and while one thread is executing a single transaction, another can produce many more (each adding a new record). And we may have lots of threads like these. If we try to declare large ranges, this can result in huge credits. The credits can be multiplied by N depending on the pool's configuration, and ZFS reserves memory for all the credits promised, so a transaction that is going to modify a few llogs may want to reserve 1.5 GB of memory. And again, we usually have to run a few hundred threads. IOW, this approach doesn't work at scale, and we need proper support in ZFS's debugging code to understand the append case. As a short-term solution, we don't use debugging.
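
The scaling argument in the comment above can be put into rough arithmetic. The following is only a back-of-the-envelope sketch: the function name, the parameters, and every number passed to it are hypothetical and chosen for illustration; they are not measurements from Lustre or ZFS, and real credit accounting is more involved.

```python
MiB = 1 << 20


def reserved_bytes(threads, llogs_per_tx, declared_len, pool_multiplier):
    """Memory a 'declare a large range' scheme would ask to have reserved.

    All parameters are hypothetical knobs:
      threads         -- service threads with a transaction in flight
      llogs_per_tx    -- llogs modified by one transaction
      declared_len    -- bytes declared per llog write
      pool_multiplier -- the "N" by which pool configuration multiplies credits
    """
    per_tx = llogs_per_tx * declared_len * pool_multiplier
    return threads * per_tx


# A single small, exactly sized declaration stays cheap...
small = reserved_bytes(threads=1, llogs_per_tx=1,
                       declared_len=8 * 1024, pool_multiplier=1)

# ...while large declared ranges across hundreds of threads multiply quickly.
large = reserved_bytes(threads=300, llogs_per_tx=3,
                       declared_len=64 * MiB, pool_multiplier=3)

print(small, "bytes vs", large // MiB, "MiB")
```

The point is only that the total is a product of independent factors, so oversizing the declared range to be safe is paid for once per llog, per thread, and per pool multiplier.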
            jay Jinshan Xiong (Inactive) added a comment - edited

            This patch works reasonably well on a normal llog because I reserved a large enough buffer as a cushion, but it has problems with the cat log in the unlink case. Right now the cat log is used for unlink, changelog, and HSM, but I think the changes to these logs are predictable at the declare phase?
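
To illustrate why a cushion helps only up to a point, here is a minimal toy model of an append that is declared at one log size and written after other threads have appended in between. The fixed record size, the function names, and the numbers are assumptions made for this sketch; it does not reproduce the patch discussed in this ticket.

```python
REC_SIZE = 64  # hypothetical fixed record size in bytes


def declare_append(log_size, cushion_records):
    """Declare the byte range [start, end) for one append plus a cushion."""
    return (log_size, log_size + cushion_records * REC_SIZE)


def write_fits(declared, write_offset):
    """True if one record written at write_offset lies inside the declared range."""
    start, end = declared
    return start <= write_offset and write_offset + REC_SIZE <= end


def append_after_race(log_size, cushion_records, concurrent_appends):
    """Declare at log_size, then write after other threads have appended."""
    declared = declare_append(log_size, cushion_records)
    actual_offset = log_size + concurrent_appends * REC_SIZE
    return write_fits(declared, actual_offset)


print(append_after_race(1024, cushion_records=8, concurrent_appends=0))
print(append_after_race(1024, cushion_records=8, concurrent_appends=7))
print(append_after_race(1024, cushion_records=8, concurrent_appends=8))
```

In this model the write stays inside the declared range only while fewer records than the cushion land between declaration and write, which is the unpredictability raised in the reply above.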

            bzzz Alex Zhuravlev added a comment - Declaration is just accounting; the actual serialization happens against specific dbufs at the actual write.

            adilger Andreas Dilger added a comment - Does declaring/holding a large range of the file cause ZFS to serialize IO to that region, or is this just accounting, with serialization happening elsewhere? I'm just recalling the case of file creates where the TXG is serialized because (IIRC) you cannot modify a dnode in the same TXG as it is created in.
            bzzz Alex Zhuravlev added a comment - edited

            I didn't see the patch, but in theory, yes. It should consist of two pieces:
            1) recognize a special offset (like -1) and reserve slightly more credits, as if we were going to write at a huge offset resulting in a deep tree
            2) change the debugging code so that such an "undefined" write can be satisfied by the special declaration from (1)

            My original point was that we can't land this patch.
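
The two pieces described in the comment above can be sketched as a small userspace model. This is not ZFS code: `ToyTx`, `hold_write`, `check_write`, and the `APPEND` sentinel are hypothetical names invented here, standing in loosely for a write declaration and for the debug-time check that appears as dmu_tx_dirty_buf in the stack trace.

```python
APPEND = -1  # hypothetical sentinel offset: "an append at an unknown offset"


class ToyTx:
    """Toy transaction that records declared writes and validates real ones."""

    def __init__(self):
        self.holds = []  # list of (offset, length) declarations

    def hold_write(self, offset, length):
        """Piece 1: declare a write; offset may be the APPEND sentinel."""
        self.holds.append((offset, length))

    def check_write(self, offset, length):
        """Piece 2: debug check that a real write was covered by a declaration."""
        for h_off, h_len in self.holds:
            if h_off == APPEND:
                # An "undefined offset" declaration covers any offset,
                # as long as the write is no longer than what was declared.
                if length <= h_len:
                    return True
            elif h_off <= offset and offset + length <= h_off + h_len:
                return True
        return False


strict = ToyTx()
strict.hold_write(4096, 64)
print(strict.check_write(4096, 64))   # inside the declared range
print(strict.check_write(8192, 64))   # log grew in the meantime: rejected

relaxed = ToyTx()
relaxed.hold_write(APPEND, 64)
print(relaxed.check_write(8192, 64))  # accepted wherever the append lands
```

In this model the sentinel declaration keeps the declared length small while still letting the check pass at whatever offset the record ends up; the extra credits for a deep tree mentioned in (1) are not modeled.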

            adilger Andreas Dilger added a comment - Alex, I recall that the actual patch to the ZFS declare code is not very complex? How hard would it be to recreate that patch?
            jay Jinshan Xiong (Inactive) added a comment - edited

            It allows me to mount my MDS with debug enabled.

            People

              bzzz Alex Zhuravlev
              jay Jinshan Xiong (Inactive)