Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.8.0

    Description

      Running SWL, the OSS has repeated timeouts:

      Nov  5 15:23:57 iws9 kernel: LNet: Service thread pid 23042 was inactive for 200.00s. The thread might be hung, or it 
      might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  5 15:23:57 iws9 kernel: Pid: 23042, comm: ll_ost00_004
      Nov  5 15:23:57 iws9 kernel:
      Nov  5 15:23:57 iws9 kernel: Call Trace:
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b2f8f>] cv_wait_common+0xaf/0x130 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b3028>] __cv_wait_io+0x18/0x20 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614939>] dbuf_read+0x439/0x850 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614ef1>] __dbuf_hold_impl+0x1a1/0x4f0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06152bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0616790>] dbuf_hold+0x20/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d0d7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d1cb>] dmu_buf_hold+0x3b/0x90 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0612fb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0686e5a>] zap_lockdir+0x5a/0x770 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff81178fcd>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06889ca>] zap_lookup_norm+0x4a/0x190 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0688ba3>] zap_lookup+0x33/0x40 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa062cc76>] dmu_tx_hold_zap+0x146/0x210 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1034255>] osd_declare_object_create+0x2a5/0x440 [osd_zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa11738e4>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa04b4b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1180a9b>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa116384e>] ofd_create_hdl+0x56e/0x2640 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c28e80>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c930ec>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c3a9e1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c39ba0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Lustre-log dump attached

    Activity

            [LU-7404] ZFS OSS - Numerous timeouts - SWL
            pjones Peter Jones added a comment -

            AFAIK no action is required for 2.9, and master is now running against a more current 0.6.5.x ZFS release.


            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/19632
            Subject: LU-7404 zfs: Reset default zfs version to 0.6.5.5
            Project: fs/lustre-release
            Branch: b2_8
            Current Patch Set: 1
            Commit: 947f37272b35ede3791616261e6a6929aebcac86


            behlendorf Brian Behlendorf added a comment -

            Andreas, the ZFS IO throttle used to work exactly as you've described it. That implementation had some significant drawbacks, so it was reworked in 2013. Specifically, it often resulted in a nasty sawtooth behavior when processes hit the calculated limit (which was often wrong). See https://github.com/zfsonlinux/zfs/commit/e8b96c6007bf97cdf34869c1ffbd0ce753873a3d for all the gory details. Bottom line, the code was updated so that `dmu_tx_assign()` now starts injecting tiny delays which gradually increase as you approach the dirty limit. This back pressure is supposed to keep the TXG from getting too large and allow the system to settle into a steady equilibrium.

            Clearly that doesn't seem to be working properly for a Lustre workload, even though `dmu_tx_assign()` was injecting large delays. I suspect it may be because of the large number of Lustre IO threads all attempting to dirty ARC buffers; that wasn't exactly the use case this was designed for. The code also makes sure to allow any caller which was delayed into the next TXG, to prevent starving them. How quickly these delays ramp up is controlled by the `zfs_delay_scale` module parameter, which we should be able to tune to control this. There's a great comment describing this in the zfs-module-parameters.5 man page which is worth reading. See https://github.com/zfsonlinux/zfs/blob/master/man/man5/zfs-module-parameters.5#L1768.

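            For reference, a minimal sketch of how the `zfs_delay_scale` throttle described above can be inspected and tuned, assuming the standard ZFS on Linux module-parameter interface under /sys/module/zfs/parameters; the value below is only an illustrative example, not a recommended setting:

              # Current delay scale (controls how quickly the injected delays ramp up, in nanoseconds)
              cat /sys/module/zfs/parameters/zfs_delay_scale

              # Ramp the injected delays up more aggressively at runtime (example value only)
              echo 1000000 > /sys/module/zfs/parameters/zfs_delay_scale

              # Persist the setting across module reloads
              echo "options zfs zfs_delay_scale=1000000" >> /etc/modprobe.d/zfs.conf
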
            jay Jinshan Xiong (Inactive) added a comment - edited

            From what I have seen in the code, ZFS starts to throttle I/O when the dirty data in the pool reaches 60% of zfs_dirty_data_max (6GB by default on Hyperion); at the same time it will wake up the quiescing thread to close the currently open txg. But there is no mechanism to adjust the TXG size based on how long the previous TXG took to complete. This pushed the txg sync time to about 100s on average.

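            Spelling out the arithmetic from the comment above, using the 6GB Hyperion default it cites:

              0.60 * zfs_dirty_data_max = 0.60 * 6 GB ≈ 3.6 GB

            i.e. roughly 3.6 GB of dirty data can accumulate before the delay throttle engages and the quiescing thread is woken to close the open TXG.
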

            adilger Andreas Dilger added a comment -

            Shouldn't ZFS limit the TXG size based on the speed of the underlying storage? I'd thought that was the main feature of the dynamic TXG sizing - record how quickly the data could be flushed to disk in the previous TXG and then use it to limit the size of the next TXG based on the desired TXG commit interval.


            jay Jinshan Xiong (Inactive) added a comment -

            My theory to explain this problem is that the ARC lock contention in old ZFS releases may throttle I/O by itself.

            jay Jinshan Xiong (Inactive) added a comment - edited

            From Brian @ LLNL - it looks like the underlying storage on Hyperion doesn't match the TXG dirty throttle settings in ZFS. The OSTs can only do ~50MB/s, but zfs_dirty_data_max was set to 6G by default. After Brian reduced zfs_dirty_data_max to 1G, he didn't see I/O timeouts any more. This has been verified by Cliff.

            The reason we don't see this problem elsewhere is that it's unusual to have such slow storage in production.

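            A back-of-envelope illustration of why the smaller limit helps, plus a sketch of how the 1G value could be applied, assuming the standard ZFS on Linux module-parameter interface (the exact mechanism used on Hyperion isn't stated in this ticket):

              # Rough time to flush a full dirty limit at ~50MB/s per OST
              #   6 GB / 50 MB/s ≈ 120 s        1 GB / 50 MB/s ≈ 20 s

              # Apply the 1G limit at runtime
              echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max

              # Persist it across reboots/module reloads
              echo "options zfs zfs_dirty_data_max=1073741824" >> /etc/modprobe.d/zfs.conf
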

            cliffw Cliff White (Inactive) added a comment -

            Ran the latest patch from Jinshan and furnished him with debug information from the SWL run.


            People

              jay Jinshan Xiong (Inactive)
              cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 17
