Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Labels: None
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3

    Description

      Running SWL, the OSS has repeated timeouts:

      Nov  5 15:23:57 iws9 kernel: LNet: Service thread pid 23042 was inactive for 200.00s. The thread might be hung, or it 
      might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  5 15:23:57 iws9 kernel: Pid: 23042, comm: ll_ost00_004
      Nov  5 15:23:57 iws9 kernel:
      Nov  5 15:23:57 iws9 kernel: Call Trace:
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b2f8f>] cv_wait_common+0xaf/0x130 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b3028>] __cv_wait_io+0x18/0x20 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614939>] dbuf_read+0x439/0x850 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614ef1>] __dbuf_hold_impl+0x1a1/0x4f0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06152bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0616790>] dbuf_hold+0x20/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d0d7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d1cb>] dmu_buf_hold+0x3b/0x90 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0612fb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0686e5a>] zap_lockdir+0x5a/0x770 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff81178fcd>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06889ca>] zap_lookup_norm+0x4a/0x190 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0688ba3>] zap_lookup+0x33/0x40 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa062cc76>] dmu_tx_hold_zap+0x146/0x210 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1034255>] osd_declare_object_create+0x2a5/0x440 [osd_zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa11738e4>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa04b4b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1180a9b>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa116384e>] ofd_create_hdl+0x56e/0x2640 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c28e80>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c930ec>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c3a9e1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c39ba0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Lustre-log dump attached

          Activity

            [LU-7404] ZFS OSS - Numerous timeouts - SWL
            pjones Peter Jones added a comment -

            AFAIK no action is required for 2.9 and master is now running against a more current 0.6.5.x ZFS release


            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/19632
            Subject: LU-7404 zfs: Reset default zfs version to 0.6.5.5
            Project: fs/lustre-release
            Branch: b2_8
            Current Patch Set: 1
            Commit: 947f37272b35ede3791616261e6a6929aebcac86


            behlendorf Brian Behlendorf added a comment -

            Andreas, the ZFS IO throttle used to work exactly as you've described it. That implementation had some significant drawbacks, so it was reworked in 2013. Specifically, it often resulted in a nasty sawtooth behavior when processes hit the calculated limit (which was often wrong). See https://github.com/zfsonlinux/zfs/commit/e8b96c6007bf97cdf34869c1ffbd0ce753873a3d for all the gory details. Bottom line: the code was updated so that `dmu_tx_assign()` now starts injecting tiny delays which gradually increase as you approach the dirty limit. This back pressure is supposed to keep the TXG from getting too large and allow the system to settle into a steady equilibrium.

            Clearly that doesn't seem to be working properly for a Lustre workload, even though `dmu_tx_assign()` was injecting large delays. I suspect it may be because of the large number of Lustre IO threads all attempting to dirty ARC buffers; that wasn't exactly the use case this was designed for. The code also makes sure to allow any caller which was delayed into the next TXG, to prevent starving them. How quickly these delays ramp up is controlled by the `zfs_delay_scale` module parameter, which we should be able to tune to control this. There's a great comment describing this in the zfs-module-parameters.5 man page which is worth reading. See https://github.com/zfsonlinux/zfs/blob/master/man/man5/zfs-module-parameters.5#L1768.
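
            For reference, those throttle knobs are exposed as ZFS module parameters and can be inspected or adjusted at runtime through sysfs. A minimal sketch (assuming a zfsonlinux OSS; run as root, and treat the written value as illustrative rather than a recommendation):

                # current delay/throttle tunables
                cat /sys/module/zfs/parameters/zfs_delay_scale              # default 500000
                cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent  # default 60 (%)
                cat /sys/module/zfs/parameters/zfs_dirty_data_max           # bytes; default scales with RAM

                # example: make the injected delays ramp up more aggressively
                echo 1000000 > /sys/module/zfs/parameters/zfs_delay_scale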

            jay Jinshan Xiong (Inactive) added a comment - - edited

            From what I have seen in the code, ZFS starts to throttle I/O when dirty data in the pool reaches 60% of zfs_dirty_data_max (6GB by default on Hyperion); at the same time it will wake up the quiescing thread to close the current open txg. But there is no mechanism to adjust the TXG size based on how long the previous TXG took to complete. This pushed the txg sync time to about 100s on average.
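
            As a rough sanity check on those numbers (using the ~50MB/s per-OST rate reported further down in this ticket, so this is only an estimate):

                # dirty data at which delays begin: 60% of a 6GB zfs_dirty_data_max
                echo $(( 6 * 1024 * 60 / 100 ))       # => 3686 (MB)
                # time to flush that backlog at ~50MB/s
                echo $(( 6 * 1024 * 60 / 100 / 50 ))  # => 73 (seconds)

            which is in the same ballpark as the ~100s txg sync times observed.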


            adilger Andreas Dilger added a comment -

            Shouldn't ZFS limit the TXG size based on the speed of the underlying storage? I'd thought that was the main feature of the dynamic TXG sizing - record how quickly the data could be flushed to disk in the previous TXG and then use it to limit the size of the next TXG based on the desired TXG commit interval.


            jay Jinshan Xiong (Inactive) added a comment -

            My theory to explain this problem is that the ARC lock contention in older ZFS releases may have been throttling I/O by itself.

            jay Jinshan Xiong (Inactive) added a comment - - edited

            From Brian @ LLNL: it looks like the underlying storage on Hyperion doesn't match the TXG dirty throttle defaults in ZFS. The OSTs can only do ~50MB/s, but zfs_dirty_data_max was set to 6G by default. After Brian reduced zfs_dirty_data_max to 1G, he no longer saw I/O timeouts. This has been verified by Cliff.

            The reason we don't see this problem elsewhere is that it's unusual to have such slow storage in production.
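
            For anyone reproducing that workaround, lowering the limit amounts to something like the following sketch (1G is simply the value that worked on Hyperion; /etc/modprobe.d/zfs.conf is the conventional place to persist ZFS module options):

                # runtime change on the OSS (root required)
                echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max

                # persist across reboots / module reloads
                echo "options zfs zfs_dirty_data_max=1073741824" >> /etc/modprobe.d/zfs.conf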


            cliffw Cliff White (Inactive) added a comment -

            Ran the latest patch from Jinshan and furnished him with debug information from the SWL run.

            jay Jinshan Xiong (Inactive) added a comment - - edited

            It turned out the root cause of this issue may be the significant drop in cache hit ratio introduced by the multilist ARC implementation commit.

            ARC Total accesses:                                     863.95k
                    Cache Hit Ratio:                99.68%  861.22k
                    Cache Miss Ratio:               0.32%   2.73k
                    Actual Hit Ratio:               99.68%  861.22k
            
                    Data Demand Efficiency:         100.00% 369.92k
            
                    CACHE HITS BY CACHE LIST:
                      Most Recently Used:           30.96%  266.66k
                      Most Frequently Used:         69.04%  594.56k
                      Most Recently Used Ghost:     0.00%   0
                      Most Frequently Used Ghost:   0.00%   0
            
                    CACHE HITS BY DATA TYPE:
                      Demand Data:                  42.95%  369.92k
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              57.05%  491.30k
                      Prefetch Metadata:            0.00%   0
            
                    CACHE MISSES BY DATA TYPE:
                      Demand Data:                  0.00%   0
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              100.00% 2.73k
                      Prefetch Metadata:            0.00%   0
            

            These are the ARC stats from before that commit landed; the cache hit ratio is close to 100%.

            And these are the ARC stats after the commit landed:

            ARC Total accesses:                                     967.83k
                    Cache Hit Ratio:                18.54%  179.46k
                    Cache Miss Ratio:               81.46%  788.37k
                    Actual Hit Ratio:               18.54%  179.46k
            
                    Data Demand Efficiency:         10.47%  868.25k
            
                    CACHE HITS BY CACHE LIST:
                      Most Recently Used:           35.70%  64.07k
                      Most Frequently Used:         64.30%  115.39k
                      Most Recently Used Ghost:     3.77%   6.77k
                      Most Frequently Used Ghost:   6.30%   11.30k
            
                    CACHE HITS BY DATA TYPE:
                      Demand Data:                  50.64%  90.89k
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              49.36%  88.58k
                      Prefetch Metadata:            0.00%   0
            
                    CACHE MISSES BY DATA TYPE:
                      Demand Data:                  98.60%  777.36k
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              1.40%   11.01k
                      Prefetch Metadata:            0.00%   0
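
            For reference, the overall hit/miss ratio in summaries like the ones above can be cross-checked against the raw ARC kstats; a minimal sketch, assuming a zfsonlinux node with the SPL proc interface available:

                # overall ARC hit ratio from the raw counters
                awk '$1 == "hits"   { hits = $3 }
                     $1 == "misses" { misses = $3 }
                     END { printf "hit ratio: %.2f%%\n", 100 * hits / (hits + misses) }' \
                    /proc/spl/kstat/zfs/arcstats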

            jay Jinshan Xiong (Inactive) added a comment -

            Hi Nathaniel,

            We've tried 0.6.5.4 before and it didn't help.

            Only ZFS master includes the patches the upstream ZFS developer mentioned; we tried that on Hyperion yesterday, but unfortunately it didn't help either.


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 17
