Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.8.0
    • None
    • Hyperion/SWL -
    • 3

    Description

      This bug was created to track activity from http://review.whamcloud.com/17712:
      LU-7602 zfs: reset ZFS baseline to 0.6.4.2

      ZFS 0.6.5.2 is known to introduce I/O problems.
      A typical timeout is shown below; it differs slightly from the stack traces in the Gerrit ticket.

      Dec 23 11:47:33 iws2 kernel: LNet: Service thread pid 30734 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Dec 23 11:47:33 iws2 kernel: Pid: 30734, comm: ll_ost00_000
      Dec 23 11:47:33 iws2 kernel:
      Dec 23 11:47:33 iws2 kernel: Call Trace:
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06cb330>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffff815395d3>] io_schedule+0x73/0xc0
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa05a3eaf>] cv_wait_common+0xaf/0x130 [spl]
      Dec 23 11:47:33 iws2 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa05a3f48>] __cv_wait_io+0x18/0x20 [spl]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa070c29b>] zio_wait+0x10b/0x1e0 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06638a9>] dbuf_read+0x439/0x850 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa066c168>] dmu_buf_hold+0x68/0x90 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0661fa8>] ? dbuf_rele_and_unlock+0x268/0x390 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d5e0a>] zap_lockdir+0x5a/0x770 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d797a>] zap_lookup_norm+0x4a/0x190 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d7b53>] zap_lookup+0x33/0x40 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa067bbe6>] dmu_tx_hold_zap+0x146/0x210 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa107b3b5>] osd_declare_object_create+0x2d5/0x440 [osd_zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa11bba24>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa04bc6c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa11c8bdb>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa11ab83e>] ofd_create_hdl+0x56e/0x2640 [ofd]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bbefe0>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0c294cc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bd0b41>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bcfd00>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      Dec 23 11:47:33 iws2 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Dec 23 11:47:33 iws2 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      Dec 23 11:47:33 iws2 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Dec 23 11:47:33 iws2 kernel:
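
      Since this trace has to be compared against the ones in the Gerrit ticket, it can help to reduce each trace to its list of function names first. The following is a minimal sketch, not part of the original report: the names `FRAME_RE` and `parse_frames` are hypothetical, and the pattern assumes the frame layout seen in the log above (`[<address>] function+0xoffset/0xsize [module]`). Frames prefixed with `?` are ones the kernel could not verify as part of the live call chain, so they are skipped by default.

```python
import re

# Matches frames such as "[<ffffffffa070c29b>] zio_wait+0x10b/0x1e0 [zfs]".
# The format is assumed from the log lines in this ticket.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text, include_unreliable=False):
    """Return (function, module) pairs for each stack frame in log_text.

    Frames without a trailing [module] tag are reported as "kernel".
    Frames marked with "?" are dropped unless include_unreliable is True.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a stack frame line (header, blank, etc.)
        if match.group("unreliable") and not include_unreliable:
            continue
        frames.append((match.group("func"), match.group("module") or "kernel"))
    return frames


if __name__ == "__main__":
    sample = (
        "Dec 23 11:47:33 iws2 kernel: [<ffffffffa06cb330>] ? vdev_mirror_child_done+0x0/0x30 [zfs]\n"
        "Dec 23 11:47:33 iws2 kernel: [<ffffffff815395d3>] io_schedule+0x73/0xc0\n"
        "Dec 23 11:47:33 iws2 kernel: [<ffffffffa070c29b>] zio_wait+0x10b/0x1e0 [zfs]\n"
    )
    for func, module in parse_frames(sample):
        print(f"{module}: {func}")
```

      Running the same function over two traces and comparing the resulting lists gives a quick view of where the call chains diverge.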
      

          Activity

            [LU-7602] Repeated timeouts with ZFS 0.6.5.2

            cliffw Cliff White (Inactive) added a comment - I dumped the stacks on iws2. It has been a while since the error; this file includes all the timeout stacks.

            adilger Andreas Dilger added a comment - Closing this as a duplicate of LU-7404 since that already has more information in it. The stack trace shown here is from the OSS, which is blocked on the OST object precreate. The stack trace shown in the 17712 ticket is the timeout on the MDS caused by waiting for new OST object precreation to complete, which is only a symptom of the actual deadlock problem on the OSS.

            adilger Andreas Dilger added a comment - Cliff, do you have the stack traces for all the threads on the OSS? It seems this ll_ost00_000 thread is waiting for the ZFS TXG to commit, but it would be useful to know what the other threads are doing in the meantime.
            yujian Jian Yu added a comment -

            Hi Cliff,

            Patch http://review.whamcloud.com/17712 hit a build failure on the sles11sp2 server. I created TEI-4369 to disable the build.

            In the meantime, since builds on other distros passed, could you please verify whether the timeout issue is resolved or not after resetting ZFS baseline to 0.6.4.2? Thank you.

            People

              yujian Jian Yu
              cliffw Cliff White (Inactive)