[LU-7404] ZFS OSS - Numerous timeouts - SWL Created: 06/Nov/15 Updated: 20/Mar/19 Resolved: 03/Nov/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | RZ_LS, zfs | ||
| Environment: |
Hyperion /SWL 2.7.61 review build 35536 (patch http://review.whamcloud.com/17053 - Revert " |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running SWL, OSS has repeated timeouts:
Nov 5 15:23:57 iws9 kernel: LNet: Service thread pid 23042 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 5 15:23:57 iws9 kernel: Pid: 23042, comm: ll_ost00_004
Nov 5 15:23:57 iws9 kernel:
Nov 5 15:23:57 iws9 kernel: Call Trace:
Nov 5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0
Nov 5 15:23:57 iws9 kernel: [<ffffffffa05b2f8f>] cv_wait_common+0xaf/0x130 [spl]
Nov 5 15:23:57 iws9 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
Nov 5 15:23:57 iws9 kernel: [<ffffffffa05b3028>] __cv_wait_io+0x18/0x20 [spl]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0614939>] dbuf_read+0x439/0x850 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0614ef1>] __dbuf_hold_impl+0x1a1/0x4f0 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa06152bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0616790>] dbuf_hold+0x20/0x30 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa061d0d7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa061d1cb>] dmu_buf_hold+0x3b/0x90 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0612fb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0686e5a>] zap_lockdir+0x5a/0x770 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffff81178fcd>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
Nov 5 15:23:57 iws9 kernel: [<ffffffffa06889ca>] zap_lookup_norm+0x4a/0x190 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0688ba3>] zap_lookup+0x33/0x40 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa062cc76>] dmu_tx_hold_zap+0x146/0x210 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa1034255>] osd_declare_object_create+0x2a5/0x440 [osd_zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa11738e4>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa04b4b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa1180a9b>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa116384e>] ofd_create_hdl+0x56e/0x2640 [ofd]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c28e80>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c930ec>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c3a9e1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c39ba0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
Nov 5 15:23:57 iws9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Nov 5 15:23:57 iws9 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
Nov 5 15:23:57 iws9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Lustre-log dump attached |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 06/Nov/15 ] |
|
Issues seen after reverting patch for |
| Comment by Andreas Dilger [ 10/Nov/15 ] |
|
Current testing with DNE+ZFS on Hyperion has shown that 2.7.56 does not have this timeout problem, while 2.7.62 does. Testing is underway with the 2.7.59 tag to see if the timeout problem is present there as well. I've pushed http://review.whamcloud.com/17112 to revert the |
| Comment by Andreas Dilger [ 23/Dec/15 ] |
|
This problem has been isolated to the update from ZFS 0.6.4.2 to 0.6.5.2, commit v2_7_61_0-39-ge94d375d8a, patch http://review.whamcloud.com/16399. One option for debugging would be to bisect the ZFS code upstream to see which ZFS patch introduced this. |
| Comment by Jian Yu [ 23/Dec/15 ] |
|
Here is the patch to reset ZFS baseline to version 0.6.4.2: http://review.whamcloud.com/17712 |
| Comment by Nathaniel Clark [ 28/Dec/15 ] |
|
This hang isn't the same deadlock as zfsonlinux/spl#484 "Disable dynamic taskqs by default to avoid deadlock" that was fixed in 0.6.5.3, is it? If it isn't, has someone opened a bug upstream for this? |
| Comment by Jinshan Xiong (Inactive) [ 29/Dec/15 ] |
|
Hi Nathaniel, Cliff has verified that the same issue can be seen on current master, where the ZFS baseline is 0.6.5.3. I will investigate whether this is a problem in osd-zfs or in the ZFS baseline itself. We will file an issue upstream if it turns out to be a ZFS problem. |
| Comment by Gerrit Updater [ 05/Jan/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17712/ |
| Comment by Jinshan Xiong (Inactive) [ 06/Jan/16 ] |
|
I'm dropping the priority of this issue because it's no longer blocking the 2.8 release. I will keep this ticket open until I find the root cause. |
| Comment by Jinshan Xiong (Inactive) [ 13/Jan/16 ] |
|
I filed a ticket against upstream ZFS at: https://github.com/zfsonlinux/zfs/issues/4210 |
| Comment by Nathaniel Clark [ 21/Jan/16 ] |
|
Given the discussion on zfs#4210, should I push a patch to move to 0.6.5.4? |
| Comment by Jinshan Xiong (Inactive) [ 21/Jan/16 ] |
|
Hi Nathaniel, we've tried 0.6.5.4 before and it didn't help. Only ZFS master includes the patches the upstream ZFS developer mentioned; we tried that on Hyperion yesterday and, unfortunately, it didn't help either. |
| Comment by Jinshan Xiong (Inactive) [ 23/Feb/16 ] |
|
It turned out that the root cause of this issue may be a significant drop in cache hit ratio introduced by the multilist ARC implementation commit.
ARC Total accesses: 863.95k
Cache Hit Ratio: 99.68% 861.22k
Cache Miss Ratio: 0.32% 2.73k
Actual Hit Ratio: 99.68% 861.22k
Data Demand Efficiency: 100.00% 369.92k
CACHE HITS BY CACHE LIST:
Most Recently Used: 30.96% 266.66k
Most Frequently Used: 69.04% 594.56k
Most Recently Used Ghost: 0.00% 0
Most Frequently Used Ghost: 0.00% 0
CACHE HITS BY DATA TYPE:
Demand Data: 42.95% 369.92k
Prefetch Data: 0.00% 0
Demand Metadata: 57.05% 491.30k
Prefetch Metadata: 0.00% 0
CACHE MISSES BY DATA TYPE:
Demand Data: 0.00% 0
Prefetch Data: 0.00% 0
Demand Metadata: 100.00% 2.73k
Prefetch Metadata: 0.00% 0
These are the ARC stats from before that commit was introduced; the cache hit ratio is close to 100%. Here are the ARC stats after the commit landed:
ARC Total accesses: 967.83k
Cache Hit Ratio: 18.54% 179.46k
Cache Miss Ratio: 81.46% 788.37k
Actual Hit Ratio: 18.54% 179.46k
Data Demand Efficiency: 10.47% 868.25k
CACHE HITS BY CACHE LIST:
Most Recently Used: 35.70% 64.07k
Most Frequently Used: 64.30% 115.39k
Most Recently Used Ghost: 3.77% 6.77k
Most Frequently Used Ghost: 6.30% 11.30k
CACHE HITS BY DATA TYPE:
Demand Data: 50.64% 90.89k
Prefetch Data: 0.00% 0
Demand Metadata: 49.36% 88.58k
Prefetch Metadata: 0.00% 0
CACHE MISSES BY DATA TYPE:
Demand Data: 98.60% 777.36k
Prefetch Data: 0.00% 0
Demand Metadata: 1.40% 11.01k
Prefetch Metadata: 0.00% 0
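For reference, the hit/miss percentages above can be reproduced from the raw ARC kstat counters (the same source arc_summary reads). A minimal Python sketch, assuming a ZFS on Linux node where the kstats are exposed at /proc/spl/kstat/zfs/arcstats:

#!/usr/bin/env python
# Minimal sketch: derive the overall and demand-data hit ratios shown above
# from the raw ARC kstat counters.

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:      # skip the two kstat header lines
            name, _type, data = line.split()
            stats[name] = int(data)
    return stats

s = read_arcstats()
total = s["hits"] + s["misses"]
demand = s["demand_data_hits"] + s["demand_data_misses"]
print("ARC Total accesses:     %d" % total)
print("Cache Hit Ratio:        %6.2f%%" % (100.0 * s["hits"] / total))
print("Cache Miss Ratio:       %6.2f%%" % (100.0 * s["misses"] / total))
print("Data Demand Efficiency: %6.2f%%" % (100.0 * s["demand_data_hits"] / demand))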
|
| Comment by Cliff White (Inactive) [ 04/Apr/16 ] |
|
Ran latest patch from Jinshan - furnished him with debug information from SWL run. |
| Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ] |
|
From Brian @ LLNL: it looks like the underlying storage on Hyperion doesn't match the TXG dirty-data throttle in ZFS. The OSTs can only do ~50MB/s, but zfs_dirty_data_max was set to 6G by default. After Brian reduced zfs_dirty_data_max to 1G, he no longer saw I/O timeouts. This has been verified by Cliff. The reason we don't see this problem elsewhere is that it's unusual to have such slow storage in production.
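A minimal sketch of that tuning for anyone reproducing this, assuming a ZFS on Linux node where module parameters are exposed under /sys/module/zfs/parameters (run as root; a persistent setting would instead go through an "options zfs zfs_dirty_data_max=..." line in /etc/modprobe.d/zfs.conf):

#!/usr/bin/env python
# Minimal sketch: cap the per-pool dirty data at 1 GiB so a single TXG cannot
# grow far beyond what these ~50 MB/s OSTs can flush in a reasonable time.
# The runtime change takes effect immediately but does not survive a reboot.

PARAM = "/sys/module/zfs/parameters/zfs_dirty_data_max"
ONE_GIB = 1 << 30

with open(PARAM) as f:
    print("zfs_dirty_data_max was: %s bytes" % f.read().strip())

with open(PARAM, "w") as f:
    f.write(str(ONE_GIB))

with open(PARAM) as f:
    print("zfs_dirty_data_max now: %s bytes" % f.read().strip())
|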
| Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ] |
|
My theory for this problem is that the ARC lock contention in older ZFS releases may have been throttling I/O by itself. |
| Comment by Andreas Dilger [ 12/Apr/16 ] |
|
Shouldn't ZFS limit the TXG size based on the speed of the underlying storage? I'd thought that was the main feature of the dynamic TXG sizing - record how quickly the data could be flushed to disk in the previous TXG and then use that to limit the size of the next TXG based on the desired TXG commit interval. |
| Comment by Jinshan Xiong (Inactive) [ 12/Apr/16 ] |
|
From what I have seen in the code, ZFS starts to throttle I/O when the dirty data in a pool reaches 60% of zfs_dirty_data_max (6GB by default on Hyperion); at the same time it wakes up the quiescing thread to close the currently open txg. But there is no mechanism to adjust the TXG size based on how long the previous TXG took to complete. At 60% of 6GB a TXG can already hold roughly 3.6GB of dirty data, which takes on the order of a minute or more to sync at the ~50MB/s these OSTs can sustain; this pushed the txg sync time to about 100s on average. |
| Comment by Brian Behlendorf [ 12/Apr/16 ] |
|
Andreas, the ZFS I/O throttle used to work exactly as you've described it. That implementation had some significant drawbacks, so it was reworked in 2013. Specifically, it often resulted in a nasty sawtooth behavior when processes hit the calculated limit (which was often wrong). See https://github.com/zfsonlinux/zfs/commit/e8b96c6007bf97cdf34869c1ffbd0ce753873a3d for all the gory details. Bottom line, the code was updated so that `dmu_tx_assign()` now starts injecting tiny delays which gradually increase as you approach the dirty limit. This back pressure is supposed to keep the TXG from getting too large and allow the system to settle into a steady equilibrium. Clearly that doesn't seem to be working properly for a Lustre workload, even though `dmu_tx_assign()` was injecting large delays. I suspect it may be because of the large number of Lustre I/O threads all attempting to dirty ARC buffers; that wasn't exactly the use case this was designed for. The code also makes sure to allow any caller which was delayed into the next TXG to prevent starving them. How quickly these delays ramp up is controlled by the `zfs_delay_scale` module parameter, which we should be able to tune to control this. There's a great comment describing this in the zfs-module-parameters.5 man page which is worth reading. See https://github.com/zfsonlinux/zfs/blob/master/man/man5/zfs-module-parameters.5#L1768.
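To make the shape of that back pressure concrete, here is a rough sketch of the delay curve as documented in the man page comment linked above (delay = zfs_delay_scale * (dirty - min) / (dirty_max - dirty), with min = zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100). The parameter values below are assumptions: upstream 0.6.5 defaults plus the 6 GiB zfs_dirty_data_max that was in effect on Hyperion.

#!/usr/bin/env python
# Rough sketch of the dmu_tx_assign() delay curve described above, using the
# formula from the zfs-module-parameters(5) man page. Parameter values are
# assumptions, not measured Hyperion settings beyond the 6 GiB dirty-data max.

ZFS_DIRTY_DATA_MAX      = 6 << 30       # bytes (Hyperion default at the time)
ZFS_DELAY_MIN_DIRTY_PCT = 60            # delays start at 60% dirty
ZFS_DELAY_SCALE         = 500 * 1000    # ns, upstream default

def tx_delay_ns(dirty_bytes):
    """Delay injected per transaction for a given amount of dirty pool data."""
    min_dirty = ZFS_DIRTY_DATA_MAX * ZFS_DELAY_MIN_DIRTY_PCT // 100
    if dirty_bytes <= min_dirty:
        return 0
    return (ZFS_DELAY_SCALE * (dirty_bytes - min_dirty) //
            (ZFS_DIRTY_DATA_MAX - dirty_bytes))

for pct in (50, 60, 70, 80, 90, 95, 99):
    dirty = ZFS_DIRTY_DATA_MAX * pct // 100
    print("%2d%% dirty -> %9.3f ms delay per tx" % (pct, tx_delay_ns(dirty) / 1e6))
|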
| Comment by Gerrit Updater [ 18/Apr/16 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/19632 |
| Comment by Peter Jones [ 25/Jul/16 ] |
|
AFAIK no action is required for 2.9 and master is now running against a more current 0.6.5.x ZFS release |