Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.8.0
    • 3
    • 9223372036854775807

    Description

      Running SWL, OSS has repeated timeouts

      Nov  5 15:23:57 iws9 kernel: LNet: Service thread pid 23042 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Nov  5 15:23:57 iws9 kernel: Pid: 23042, comm: ll_ost00_004
      Nov  5 15:23:57 iws9 kernel:
      Nov  5 15:23:57 iws9 kernel: Call Trace:
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b2f8f>] cv_wait_common+0xaf/0x130 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b3028>] __cv_wait_io+0x18/0x20 [spl]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614939>] dbuf_read+0x439/0x850 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614ef1>] __dbuf_hold_impl+0x1a1/0x4f0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06152bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0616790>] dbuf_hold+0x20/0x30 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d0d7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d1cb>] dmu_buf_hold+0x3b/0x90 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0612fb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0686e5a>] zap_lockdir+0x5a/0x770 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff81178fcd>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa06889ca>] zap_lookup_norm+0x4a/0x190 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0688ba3>] zap_lookup+0x33/0x40 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa062cc76>] dmu_tx_hold_zap+0x146/0x210 [zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1034255>] osd_declare_object_create+0x2a5/0x440 [osd_zfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa11738e4>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa04b4b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa1180a9b>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa116384e>] ofd_create_hdl+0x56e/0x2640 [ofd]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c28e80>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c930ec>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c3a9e1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c39ba0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
      Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Lustre-log dump attached

          Activity

            [LU-7404] ZFS OSS - Numerous timeouts - SWL
            jay Jinshan Xiong (Inactive) added a comment - - edited

            From what I have seen in the code, ZFS starts to throttle I/O when dirty data in the pool reaches 60% of zfs_dirty_data_max (6 GB by default on Hyperion); at the same time it wakes the quiescing thread to close the currently open txg. But there is no mechanism to adjust the TXG size based on how long the previous TXG took to complete. This pushed the txg sync time to about 100 s on average.
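            As a rough illustration of that mechanism (a simplified sketch, not the actual OpenZFS code; zfs_dirty_data_max and zfs_delay_min_dirty_percent mirror the real module parameters, while dirty_data_should_delay is a hypothetical helper):

            /* Sketch: ZFS delays writers once pool dirty data crosses a
             * percentage of zfs_dirty_data_max (60% by default), and at that
             * point the open txg is pushed toward quiescing.  Nothing here
             * feeds back how long the previous txg took to sync. */
            #include <stdint.h>
            #include <stdbool.h>

            static uint64_t zfs_dirty_data_max = 6ULL << 30;  /* 6 GiB default on Hyperion */
            static int zfs_delay_min_dirty_percent = 60;      /* throttle threshold */

            static bool
            dirty_data_should_delay(uint64_t pool_dirty_bytes)
            {
                    uint64_t delay_min = zfs_dirty_data_max *
                        zfs_delay_min_dirty_percent / 100;
                    return (pool_dirty_bytes > delay_min);
            }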


            adilger Andreas Dilger added a comment

            Shouldn't ZFS limit the TXG size based on the speed of the underlying storage? I'd thought that was the main feature of the dynamic TXG sizing - record how quickly the data could be flushed to disk in the previous TXG, then use that to limit the size of the next TXG based on the desired TXG commit interval.
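            In other words, the idea amounts to something like the following (a hedged sketch, not ZFS source; next_txg_dirty_limit and the 5 s target are illustrative):

            /* Sketch: cap the dirty data admitted into the next txg based on
             * the write-back rate observed while syncing the previous one,
             * so the next commit fits the desired interval. */
            #include <stdint.h>

            #define TARGET_TXG_SECONDS 5    /* desired commit interval (illustrative) */

            static uint64_t
            next_txg_dirty_limit(uint64_t prev_txg_bytes, uint64_t prev_txg_sync_secs)
            {
                    if (prev_txg_sync_secs == 0)
                            prev_txg_sync_secs = 1;
                    /* throughput observed while flushing the previous txg */
                    uint64_t bytes_per_sec = prev_txg_bytes / prev_txg_sync_secs;
                    /* admit only as much dirty data as can be flushed in the target window */
                    return (bytes_per_sec * TARGET_TXG_SECONDS);
            }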


            jay Jinshan Xiong (Inactive) added a comment

            My theory to explain this problem is that the ARC lock contention in old ZFS releases may throttle I/O by itself.

            jay Jinshan Xiong (Inactive) added a comment - - edited

            From Brian @ LLNL - it looks like the underlying storage on Hyperion doesn't match the TXG dirty-data throttle in ZFS. The OSTs can only do ~50 MB/s, but zfs_dirty_data_max was set to 6 GB by default. After Brian reduced zfs_dirty_data_max to 1 GB, he no longer saw I/O timeouts. This has been verified by Cliff.

            The reason we don't see this problem elsewhere is that it's unusual to have such slow storage in production.
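            As rough arithmetic under those numbers: flushing 6 GiB of dirty data at ~50 MB/s takes on the order of 120 s, which lines up with the ~100 s txg sync times mentioned earlier and with the 200 s service-thread watchdog firing, whereas 1 GiB at the same rate flushes in roughly 20 s.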


            cliffw Cliff White (Inactive) added a comment

            Ran latest patch from Jinshan - furnished him with debug information from SWL run.

            jay Jinshan Xiong (Inactive) added a comment - - edited

            It turned out that the root cause of this issue may be a significant drop in cache hit ratio introduced by the commit that added the multilist ARC implementation.

            ARC Total accesses:                                     863.95k
                    Cache Hit Ratio:                99.68%  861.22k
                    Cache Miss Ratio:               0.32%   2.73k
                    Actual Hit Ratio:               99.68%  861.22k
            
                    Data Demand Efficiency:         100.00% 369.92k
            
                    CACHE HITS BY CACHE LIST:
                      Most Recently Used:           30.96%  266.66k
                      Most Frequently Used:         69.04%  594.56k
                      Most Recently Used Ghost:     0.00%   0
                      Most Frequently Used Ghost:   0.00%   0
            
                    CACHE HITS BY DATA TYPE:
                      Demand Data:                  42.95%  369.92k
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              57.05%  491.30k
                      Prefetch Metadata:            0.00%   0
            
                    CACHE MISSES BY DATA TYPE:
                      Demand Data:                  0.00%   0
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              100.00% 2.73k
                      Prefetch Metadata:            0.00%   0
            

            These are the ARC stats from before the patch was introduced; the cache hit ratio is close to 100%.

            And these are the ARC stats after the commit landed:

            ARC Total accesses:                                     967.83k
                    Cache Hit Ratio:                18.54%  179.46k
                    Cache Miss Ratio:               81.46%  788.37k
                    Actual Hit Ratio:               18.54%  179.46k
            
                    Data Demand Efficiency:         10.47%  868.25k
            
                    CACHE HITS BY CACHE LIST:
                      Most Recently Used:           35.70%  64.07k
                      Most Frequently Used:         64.30%  115.39k
                      Most Recently Used Ghost:     3.77%   6.77k
                      Most Frequently Used Ghost:   6.30%   11.30k
            
                    CACHE HITS BY DATA TYPE:
                      Demand Data:                  50.64%  90.89k
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              49.36%  88.58k
                      Prefetch Metadata:            0.00%   0
            
                    CACHE MISSES BY DATA TYPE:
                      Demand Data:                  98.60%  777.36k
                      Prefetch Data:                0.00%   0
                      Demand Metadata:              1.40%   11.01k
                      Prefetch Metadata:            0.00%   0
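
            Comparing the two dumps: the overall hit ratio fell from 99.68% to 18.54%, and Data Demand Efficiency from 100% to 10.47%, i.e. roughly 777k of 868k demand-data reads now miss the ARC and go to disk, which would explain the extra read load behind the timeouts.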

            jay Jinshan Xiong (Inactive) added a comment

            Hi Nathaniel,

            We've tried 0.6.5.4 before and it didn't help.

            Only ZFS master includes the patches that the upstream ZFS developer mentioned, and we tried that on Hyperion yesterday; unfortunately it didn't help either.

            utopiabound Nathaniel Clark added a comment - - edited

            Given the discussion on zfs#4210, should I push a patch to move to 0.6.5.4?

            jay Jinshan Xiong (Inactive) added a comment - - edited

            I filed a ticket against upstream ZFS at: https://github.com/zfsonlinux/zfs/issues/4210


            jay Jinshan Xiong (Inactive) added a comment

            I'm dropping the priority of this issue because it's no longer blocking the 2.8 release. I will keep this ticket open until I find the root cause.


            gerrit Gerrit Updater added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17712/
            Subject: LU-7404 zfs: reset ZFS baseline to 0.6.4.2
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 182b30b7699858c73a990c36c51b70c40858a1fe


            People

              jay Jinshan Xiong (Inactive)
              cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 17
