Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.8.0
    • None
    • Hyperion/SWL -
    • 3

    Description

      This bug was created to track activity from http://review.whamcloud.com/17712
      LU-7602 zfs: reset ZFS baseline to 0.6.4.2

      ZFS 0.6.5.2 is known to introduce I/O problems.
      A typical timeout is shown below; it differs slightly from the stack traces in the Gerrit ticket:

      Dec 23 11:47:33 iws2 kernel: LNet: Service thread pid 30734 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Dec 23 11:47:33 iws2 kernel: Pid: 30734, comm: ll_ost00_000
      Dec 23 11:47:33 iws2 kernel:
      Dec 23 11:47:33 iws2 kernel: Call Trace:
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06cb330>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffff815395d3>] io_schedule+0x73/0xc0
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa05a3eaf>] cv_wait_common+0xaf/0x130 [spl]
      Dec 23 11:47:33 iws2 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa05a3f48>] __cv_wait_io+0x18/0x20 [spl]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa070c29b>] zio_wait+0x10b/0x1e0 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06638a9>] dbuf_read+0x439/0x850 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa066c168>] dmu_buf_hold+0x68/0x90 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0661fa8>] ? dbuf_rele_and_unlock+0x268/0x390 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d5e0a>] zap_lockdir+0x5a/0x770 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d797a>] zap_lookup_norm+0x4a/0x190 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa06d7b53>] zap_lookup+0x33/0x40 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa067bbe6>] dmu_tx_hold_zap+0x146/0x210 [zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa107b3b5>] osd_declare_object_create+0x2d5/0x440 [osd_zfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa11bba24>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa04bc6c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa11c8bdb>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa11ab83e>] ofd_create_hdl+0x56e/0x2640 [ofd]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bbefe0>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0c294cc>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bd0b41>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffffa0bcfd00>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      Dec 23 11:47:33 iws2 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      Dec 23 11:47:33 iws2 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Dec 23 11:47:33 iws2 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      Dec 23 11:47:33 iws2 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Dec 23 11:47:33 iws2 kernel:
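
For triage, watchdog reports like the one above can be pulled out of a syslog capture with a short script. The following is a minimal Python sketch, not part of Lustre or ZFS: the function name `parse_hung_threads` and both regular expressions are assumptions based only on the log format shown in this ticket. Frames prefixed with `?` are skipped because the kernel marks them as unreliable (found on the stack but not confirmed as part of the call chain).

```python
import re

# Matches the LNet watchdog line that precedes each dumped stack.
WATCHDOG_RE = re.compile(
    r"Service thread pid (?P<pid>\d+) was inactive for (?P<secs>[\d.]+)s"
)

# Matches one frame, e.g. "[<ffffffffa070c29b>] zio_wait+0x10b/0x1e0 [zfs]".
# The trailing "[module]" is absent for functions in the core kernel.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_hung_threads(log_text):
    """Return one dict per watchdog report, holding its reliable frames."""
    reports = []
    current = None
    for line in log_text.splitlines():
        watchdog = WATCHDOG_RE.search(line)
        if watchdog:
            # A new report starts; later frames belong to it.
            current = {
                "pid": int(watchdog.group("pid")),
                "inactive_s": float(watchdog.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame.group("unreliable"):
            module = frame.group("module") or "kernel"
            current["frames"].append((frame.group("func"), module))
    return reports
```

Passing the text of a syslog file to `parse_hung_threads` yields, for each report, the pid, the inactivity time in seconds, and the frames innermost first, which makes it easier to compare this OSS trace against the ones in the Gerrit ticket.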
      

          Activity

            [LU-7602] Repeated timeouts with ZFS 0.6.5.2
            cliffw Cliff White (Inactive) made changes -
            Attachment New: iws2.stackes.txt.gz [ 19998 ]

            cliffw Cliff White (Inactive) added a comment - I dumped the stacks on iws2. It has been a while since the error; this file includes all the timeout stacks.
            adilger Andreas Dilger made changes -
            Fix Version/s Original: Lustre 2.8.0 [ 11113 ]
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            adilger Andreas Dilger added a comment - Closing this as a duplicate of LU-7404 since that already has more information in it. The stack trace shown here is from the OSS, which is blocked on the OST object precreate. The stack trace shown in the 17712 ticket is the timeout on the MDS caused by waiting for new OST object precreation to complete, which is only a symptom of the actual deadlock problem on the OSS.
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates LU-7404 [ LU-7404 ]

            adilger Andreas Dilger added a comment - Cliff, do you have the stack traces for all the threads on the OSS? It seems this ll_ost00_000 thread is waiting for the ZFS TXG to commit, but it would be useful to know what the other threads are doing in the meantime.
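
One possible way to answer that question from a full stack dump is to group threads by their innermost frames, so that many threads blocked in the same place collapse into a single entry. This is a hedged sketch only: it assumes each thread's section starts with a `Pid: N, comm: NAME` line as in the trace in the description, which may not match the layout of the attached dump, and `summarize_threads` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed per-thread header, as in "Pid: 30734, comm: ll_ost00_000".
HEADER_RE = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")

# One stack frame; a leading "?" marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<func>[\w.]+)\+0x"
)


def summarize_threads(dump_text, depth=3):
    """Count threads by their `depth` innermost reliable frames."""
    stacks = {}  # (pid, comm) -> function names, innermost first
    key = None
    for line in dump_text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            key = (int(header.group("pid")), header.group("comm"))
            stacks[key] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and key is not None and not frame.group("unreliable"):
            stacks[key].append(frame.group("func"))
    # Threads sharing the same innermost frames are counted together.
    return Counter(tuple(frames[:depth]) for frames in stacks.values())
```

Sorting the result with `most_common()` puts the largest group of identically blocked threads first; any small group with a different innermost frame is a reasonable place to start looking for whatever the others are waiting on.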
            adilger Andreas Dilger made changes -
            Summary Original: Repeated timeouts with current ZFS New: Repeated timeouts with ZFS 0.6.5.2
            adilger Andreas Dilger made changes -
            Description Original: This bug created to track activity from http://review.whamcloud.com/#/c/17712/
            LU-0000 zfs: reset ZFS baseline to 0.6.4.2

            New: This bug created to track activity from http://review.whamcloud.com/17712
            LU-7602 zfs: reset ZFS baseline to 0.6.4.2

            pjones Peter Jones made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Jian Yu [ yujian ]
            yujian Jian Yu added a comment -

            Hi Cliff,

            Patch http://review.whamcloud.com/17712 hit a build failure on sles11sp2 server. I created TEI-4369 to disable the build.

            In the meantime, since builds on other distros passed, could you please verify whether the timeout issue is resolved after resetting the ZFS baseline to 0.6.4.2? Thank you.


            People

              yujian Jian Yu
              cliffw Cliff White (Inactive)