[LU-7404] ZFS OSS - Numerous timeouts - SWL Created: 06/Nov/15  Updated: 20/Mar/19  Resolved: 03/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: RZ_LS, zfs
Environment:

Hyperion /SWL 2.7.61 review build 35536 (patch http://review.whamcloud.com/17053 - Revert "LU-4865 zfs: grow block size by write pattern")


Attachments: File iws9.t1.txt.gz    
Issue Links:
Duplicate
is duplicated by LU-7602 Repeated timeouts with ZFS 0.6.5.2 Resolved
Related
is related to LU-7153 Update ZFS/SPL version to 0.6.5.2 Resolved
is related to LU-4865 osd-zfs: increase object block size d... Resolved
is related to LU-6750 missing stop callback in osd-zfs Resolved
is related to LU-7987 Lustre 2.8 OSS with zfs 0.6.5 backend... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running SWL, OSS has repeated timeouts

Nov  5 15:23:57 iws9 kernel: LNet: Service thread pid 23042 was inactive for 200.00s. The thread might be hung, or it 
might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov  5 15:23:57 iws9 kernel: Pid: 23042, comm: ll_ost00_004
Nov  5 15:23:57 iws9 kernel:
Nov  5 15:23:57 iws9 kernel: Call Trace:
Nov  5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0
Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b2f8f>] cv_wait_common+0xaf/0x130 [spl]
Nov  5 15:23:57 iws9 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
Nov  5 15:23:57 iws9 kernel: [<ffffffffa05b3028>] __cv_wait_io+0x18/0x20 [spl]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614939>] dbuf_read+0x439/0x850 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0614ef1>] __dbuf_hold_impl+0x1a1/0x4f0 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa06152bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0616790>] dbuf_hold+0x20/0x30 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d0d7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa061d1cb>] dmu_buf_hold+0x3b/0x90 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0612fb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0686e5a>] zap_lockdir+0x5a/0x770 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffff81178fcd>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
Nov  5 15:23:57 iws9 kernel: [<ffffffffa06889ca>] zap_lookup_norm+0x4a/0x190 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0688ba3>] zap_lookup+0x33/0x40 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa062cc76>] dmu_tx_hold_zap+0x146/0x210 [zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa1034255>] osd_declare_object_create+0x2a5/0x440 [osd_zfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa11738e4>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa04b4b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa1180a9b>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa116384e>] ofd_create_hdl+0x56e/0x2640 [ofd]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c28e80>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c930ec>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c3a9e1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Nov  5 15:23:57 iws9 kernel: [<ffffffffa0c39ba0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Nov  5 15:23:57 iws9 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
Nov  5 15:23:57 iws9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20

Lustre-log dump attached



 Comments   
Comment by Joseph Gmitter (Inactive) [ 06/Nov/15 ]

Issues seen after reverting patch for LU-4865.

Comment by Andreas Dilger [ 10/Nov/15 ]

Current testing with DNE+ZFS on Hyperion has shown that 2.7.56 does not have this timeout problem, while 2.7.62 does have the timeouts. Testing is underway with the 2.7.59 tag to see if the timeout problem is present there as well. The LU-6750 patch was landed as v2_7_56_0-5-g27929cc (i.e. 5 patches past 2.7.56) so it wasn't present in the 2.7.56 testing that passed. The LU-4865 patch was landed as v2_7_59_0-20-g3e43691 (i.e. 20 patches past 2.7.59) so the 2.7.59 testing will give us a good half-way mark without being affected by LU-4865.

I've pushed http://review.whamcloud.com/17112 to revert the LU-6750 patch in addition to the LU-4865 patch reversion, based on the current tip of master, for the next stage of testing after 2.7.59, depending on those results.

Comment by Andreas Dilger [ 23/Dec/15 ]

This problem has been isolated to the update from ZFS 0.6.4.2 to 0.6.5.2, commit v2_7_61_0-39-ge94d375d8a, patch http://review.whamcloud.com/16399 "LU-7153 build: Update SPL/ZFS to 0.6.5.2".

One option for debugging would be to bisect the ZFS code upstream to see which ZFS patch has introduced this.

Comment by Jian Yu [ 23/Dec/15 ]

Here is the patch to reset ZFS baseline to version 0.6.4.2: http://review.whamcloud.com/17712

Comment by Nathaniel Clark [ 28/Dec/15 ]

This hang isn't the same as the deadlock in zfsonlinux/spl#484 "Disable dynamic taskqs by default to avoid deadlock", which was fixed in 0.6.5.3, is it? If it isn't, has someone opened a bug upstream for this?

Comment by Jinshan Xiong (Inactive) [ 29/Dec/15 ]

Hi Nathaniel,

Cliff has verified that the same issue can be seen on current master, where the ZFS baseline is 0.6.5.3. I will investigate whether this is a problem in osd-zfs or in the ZFS baseline itself. We will file an issue upstream if it turns out to be a ZFS problem.

Comment by Gerrit Updater [ 05/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17712/
Subject: LU-7404 zfs: reset ZFS baseline to 0.6.4.2
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 182b30b7699858c73a990c36c51b70c40858a1fe

Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ]

I'm dropping the priority of this issue because it is no longer blocking the 2.8 release. I will keep this ticket open until I find the root cause.

Comment by Jinshan Xiong (Inactive) [ 13/Jan/16 ]

I filed a ticket against upstream ZFS at https://github.com/zfsonlinux/zfs/issues/4210

Comment by Nathaniel Clark [ 21/Jan/16 ]

Given the discussion on zfs#4210, should I push a patch to move to 0.6.5.4?

Comment by Jinshan Xiong (Inactive) [ 21/Jan/16 ]

Hi Nathaniel,

We've tried 0.6.5.4 before and it didn't help.

Only ZFS master includes the patches the upstream ZFS developer mentioned; we tried that on Hyperion yesterday, and unfortunately it didn't help either.

Comment by Jinshan Xiong (Inactive) [ 23/Feb/16 ]

It turned out that the root cause of this issue may be a significant drop in the cache hit ratio introduced by the multilist ARC implementation commit.

ARC Total accesses:                                     863.95k
        Cache Hit Ratio:                99.68%  861.22k
        Cache Miss Ratio:               0.32%   2.73k
        Actual Hit Ratio:               99.68%  861.22k

        Data Demand Efficiency:         100.00% 369.92k

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           30.96%  266.66k
          Most Frequently Used:         69.04%  594.56k
          Most Recently Used Ghost:     0.00%   0
          Most Frequently Used Ghost:   0.00%   0

        CACHE HITS BY DATA TYPE:
          Demand Data:                  42.95%  369.92k
          Prefetch Data:                0.00%   0
          Demand Metadata:              57.05%  491.30k
          Prefetch Metadata:            0.00%   0

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  0.00%   0
          Prefetch Data:                0.00%   0
          Demand Metadata:              100.00% 2.73k
          Prefetch Metadata:            0.00%   0

The ARC stats above are from before the commit was introduced; the cache hit ratio is close to 100%.

These are the ARC stats after the commit landed:

ARC Total accesses:                                     967.83k
        Cache Hit Ratio:                18.54%  179.46k
        Cache Miss Ratio:               81.46%  788.37k
        Actual Hit Ratio:               18.54%  179.46k

        Data Demand Efficiency:         10.47%  868.25k

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           35.70%  64.07k
          Most Frequently Used:         64.30%  115.39k
          Most Recently Used Ghost:     3.77%   6.77k
          Most Frequently Used Ghost:   6.30%   11.30k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  50.64%  90.89k
          Prefetch Data:                0.00%   0
          Demand Metadata:              49.36%  88.58k
          Prefetch Metadata:            0.00%   0

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  98.60%  777.36k
          Prefetch Data:                0.00%   0
          Demand Metadata:              1.40%   11.01k
          Prefetch Metadata:            0.00%   0

Comment by Cliff White (Inactive) [ 04/Apr/16 ]

Ran latest patch from Jinshan - furnished him with debug information from SWL run.

Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ]

From Brian @ LLNL: it looks like the underlying storage on Hyperion can't keep up with the TXG dirty-data throttle settings in ZFS. The OSTs can only do ~50MB/s, but zfs_dirty_data_max was set to 6G by default. After Brian reduced zfs_dirty_data_max to 1G, he no longer saw I/O timeouts. This has been verified by Cliff.

The reason we don't see this problem elsewhere is that such slow storage is unusual in production.
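
As a rough back-of-the-envelope check, here is a minimal sketch (hypothetical arithmetic only), assuming the TXG sync time is simply the dirty data divided by the ~50MB/s the OSTs can sustain; the 6G default and 1G tuned values come from the comment above:

# Hypothetical arithmetic; the 50MB/s figure and 6G/1G limits are taken from the comment above.
# (On ZFS on Linux the limit is normally visible under
#  /sys/module/zfs/parameters/zfs_dirty_data_max -- stated as an assumption, not verified here.)
MB = 1000 * 1000
GiB = 1 << 30
ost_bandwidth = 50 * MB            # ~50MB/s, as observed on the Hyperion OSTs
for dirty_max in (6 * GiB, 1 * GiB):
    seconds = dirty_max / ost_bandwidth
    print("zfs_dirty_data_max = %d GiB -> ~%d s to flush a full TXG" % (dirty_max // GiB, seconds))

With the 6G default that is on the order of two minutes per fully dirty TXG, versus roughly 20s at 1G, which would explain why the smaller limit avoids the 200s service-thread warnings.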

Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ]

My theory to explain this is that the ARC lock contention in older ZFS releases may have throttled I/O by itself.

Comment by Andreas Dilger [ 12/Apr/16 ]

Shouldn't ZFS limit the TXG size based on the speed of the underlying storage? I'd thought that was the main feature of the dynamic TXG sizing - record how quickly the data could be flushed to disk in the previous TXG and then use that to limit the size of the next TXG based on the desired TXG commit interval.

Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ]

From what I have seen in the code, ZFS starts to throttle I/O when the dirty data in the pool reaches 60% of zfs_dirty_data_max (6GB by default on Hyperion); at the same time it wakes up the quiescing thread to close the currently open TXG. But there is no mechanism to adjust the TXG size based on how long the previous TXG took to complete. This pushed the TXG sync time to about 100s on average.
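
To make that concrete, a minimal sketch using the numbers above (the 60% threshold and ~50MB/s figures are from these comments; treating sync time as dirty data divided by bandwidth is a simplification):

# Hypothetical arithmetic: how long the dirty data already present when
# throttling begins would take to flush at Hyperion's observed OST bandwidth.
GiB = 1 << 30
dirty_data_max = 6 * GiB                      # default on Hyperion
throttle_threshold = 0.60 * dirty_data_max    # throttling starts at 60% of the limit
ost_bandwidth = 50 * 1000 * 1000              # ~50MB/s
print("throttling starts at %.1f GiB dirty" % (throttle_threshold / GiB))
print("flushing that backlog takes ~%d s" % (throttle_threshold / ost_bandwidth))

That works out to ~3.6 GiB and roughly 75s just to flush the data present when throttling begins, which is in the same ballpark as the ~100s average TXG sync times seen here.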

Comment by Brian Behlendorf [ 12/Apr/16 ]

Andreas, the ZFS I/O throttle used to work exactly as you've described it. That implementation had some significant drawbacks, so it was reworked in 2013. Specifically, it often resulted in a nasty sawtooth behavior when processes hit the calculated limit (which was often wrong). See https://github.com/zfsonlinux/zfs/commit/e8b96c6007bf97cdf34869c1ffbd0ce753873a3d for all the gory details. Bottom line, the code was updated so that `dmu_tx_assign()` now starts injecting tiny delays which gradually increase as you approach the dirty limit. This back pressure is supposed to keep the TXG from getting too large and allow the system to settle into a steady equilibrium.

Clearly that doesn't seem to be working properly for a Lustre workload, even though `dmu_tx_assign()` was injecting large delays. I suspect it may be because of the large number of Lustre I/O threads all attempting to dirty ARC buffers; that wasn't exactly the use case this was designed for. The code also makes sure to allow any caller which was delayed into the next TXG, to prevent starving them. How quickly these delays ramp up is controlled by the `zfs_delay_scale` module parameter, which we should be able to tune to control this. There's a great comment describing this in the zfs-module-parameters.5 man page which is worth reading. See https://github.com/zfsonlinux/zfs/blob/master/man/man5/zfs-module-parameters.5#L1768.
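
For reference, here is a minimal sketch of that delay curve as I read it from the commit and man page linked above (the formula and the 60% / 500000ns defaults are paraphrased from that documentation and should be treated as assumptions, not checked against this exact ZFS version):

# Approximate per-dmu_tx_assign() delay (nanoseconds) as dirty data approaches
# zfs_dirty_data_max: zero below zfs_delay_min_dirty_percent, then growing
# sharply as the limit is approached.
def tx_delay_ns(dirty, dirty_max, delay_scale=500000, min_dirty_pct=60):
    min_dirty = dirty_max * min_dirty_pct // 100
    if dirty <= min_dirty:
        return 0
    return delay_scale * (dirty - min_dirty) // (dirty_max - dirty)

GiB = 1 << 30
for pct in (60, 70, 80, 90, 95, 99):
    dirty = 6 * GiB * pct // 100
    print("%d%% dirty -> %d ns delay per tx assignment" % (pct, tx_delay_ns(dirty, 6 * GiB)))

Raising zfs_delay_scale makes this curve steeper, so writers are pushed back harder well before the limit is reached; that is presumably the knob worth tuning for the many-OST-thread case described above.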

Comment by Gerrit Updater [ 18/Apr/16 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/19632
Subject: LU-7404 zfs: Reset default zfs version to 0.6.5.5
Project: fs/lustre-release
Branch: b2_8
Current Patch Set: 1
Commit: 947f37272b35ede3791616261e6a6929aebcac86

Comment by Peter Jones [ 25/Jul/16 ]

AFAIK no action is required for 2.9, and master is now running against a more current 0.6.5.x ZFS release.
