
DNE3: osd-zfs gets into a livelock if transaction is too big

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.8.0
    • 3
    • 9223372036854775807

    Description

      ONLY=300k bash sanity.sh:

      [ 89.828294] LNet: Service thread pid 4249 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 89.831356] Pid: 4249, comm: mdt01_001
      [ 89.831895]
      [ 89.831895] Call Trace:
      [ 89.832451] [<ffffffff810ac6de>] ? getrawmonotonic+0x2e/0xc0
      [ 89.833222] [<ffffffff810828c5>] __cond_resched+0x25/0x40
      [ 89.834035] [<ffffffff814e521a>] _cond_resched+0x2a/0x40
      [ 89.834810] [<ffffffff814e5f11>] mutex_lock+0x11/0x40
      [ 89.835565] [<ffffffffa07741b4>] dmu_tx_assign+0x284/0x500 [zfs]
      [ 89.836410] [<ffffffffa0a1ae32>] osd_trans_start+0xb2/0x410 [osd_zfs]
      [ 89.837282] [<ffffffffa032de15>] top_trans_start+0x255/0x9c0 [ptlrpc]
      [ 89.838090] [<ffffffffa0bb88f9>] lod_trans_start+0x59/0x60 [lod]
      [ 89.838854] [<ffffffffa0ae3cdf>] mdd_trans_start+0xf/0x20 [mdd]
      [ 89.839594] [<ffffffffa0acf1a0>] mdd_create+0x1170/0x1c70 [mdd]

          Activity

            [LU-7754] DNE3: osd-zfs gets into a livelock if transaction is too big

            adilger Andreas Dilger added a comment:

            That is 4755 MB / 512 stripes = 9 MB/stripe which seems like a lot of space to reserve? I thought we got away from O(n^2) transaction sizes for striped directories?
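
            The per-stripe figure in this comment can be checked with a few lines of Python. This is only a worked version of the arithmetic quoted in the thread; the variable names are illustrative and not taken from Lustre.

            ```python
            # Figures quoted in the comment thread.
            reserved_mb = 4755   # memory the transaction asks for, in MB
            stripes = 512        # stripe count used by sanity test 300k

            per_stripe_mb = reserved_mb / stripes
            print(f"{per_stripe_mb:.2f} MB per stripe")  # 9.29 MB per stripe
            ```

            The comment rounds this down to 9 MB/stripe.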

            bzzz Alex Zhuravlev added a comment:

            transaction calculations:
            mem 4986830848, asize 119683940352, fsize 8506441728, usize 8497152000

            It seems to fail because of insufficient memory: 4986830848 bytes (4755 MB) are needed, while the test system had 4 GB in total.
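
            A short check of the memory figures in this comment, as a sketch. It assumes that "MB" and "GB" in the thread mean binary units (MiB and GiB); the variable names are illustrative only.

            ```python
            # "mem" value copied from the transaction calculations line.
            mem_needed = 4986830848          # bytes

            MIB = 1024 * 1024
            GIB = 1024 * MIB
            total_ram = 4 * GIB              # assumption: "4GB" means 4 GiB

            mem_needed_mib = mem_needed // MIB               # truncated to whole MiB
            shortfall_mib = (mem_needed - total_ram) // MIB  # how far over total RAM

            print(f"needed: {mem_needed_mib} MiB, total RAM: {total_ram // MIB} MiB")
            print(f"over by: {shortfall_mib} MiB")
            ```

            Under that assumption the request is 4755 MiB against 4096 MiB of total RAM, so it could not be satisfied even if all memory were free.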

            adilger Andreas Dilger added a comment:

            How large is the transaction? Do we have a larger MDS size in our testing?

            I guess this is because we don't run DNE + ZFS by default.

            bzzz Alex Zhuravlev added a comment:

            sanity/300k tries to create a big striped directory:

            $LFS setdirstripe -i 0 -c512 $DIR/$tdir/striped_dir

            With the default MDSSIZE=200000, DMU fails to start such a big transaction.

            adilger Andreas Dilger added a comment:

            Your patch turns this from a hang into a failure. That is an improvement, but it doesn't explain why this test failed. Do you have an unusual config (small MDT?), or is there some regression that makes the transaction too large?

            gerrit Gerrit Updater added a comment:

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18341
            Subject: LU-7754 osd: osd-zfs should not wait indefinitely for a TXG
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ebef50ed6f5032697f05c4bcc20c7fb329423a17
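
            The patch subject says osd-zfs should not wait indefinitely for a TXG. The patch itself is not reproduced in this ticket, so the following is only a sketch of the general idea: retry a non-blocking operation a bounded number of times and then return the error to the caller. All names here (`start_with_bounded_retries`, `try_assign`, `max_attempts`) are hypothetical and are not osd-zfs or ZFS APIs.

            ```python
            import errno

            def start_with_bounded_retries(try_assign, max_attempts=3):
                """Call try_assign() until it succeeds or attempts run out.

                try_assign is a hypothetical stand-in for a non-blocking
                transaction start: it returns 0 on success, -EAGAIN when the
                caller may retry, or another negative errno on a hard failure.
                """
                rc = -errno.EINVAL  # returned only if max_attempts < 1
                for _ in range(max_attempts):
                    rc = try_assign()
                    if rc != -errno.EAGAIN:
                        return rc  # success (0) or a non-retryable error
                return rc  # still -EAGAIN: give up instead of looping forever

            # Usage: an operation that can never succeed fails after 3 tries.
            attempts = []

            def always_busy():
                attempts.append(1)
                return -errno.EAGAIN

            rc = start_with_bounded_retries(always_busy, max_attempts=3)
            print(rc == -errno.EAGAIN, len(attempts))  # True 3
            ```

            With a bound in place, a transaction that can never fit produces an error the caller can report, which matches the "hang into a failure" behavior described in the thread.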

            People

              bzzz Alex Zhuravlev
              bzzz Alex Zhuravlev