Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Lustre 2.8.0
-
Hyperion/SWL 2.7.61 review build 35536 (patch http://review.whamcloud.com/17053 - Revert "LU-4865 zfs: grow block size by write pattern")
-
3
Description
While running SWL, the OSS has repeated service thread timeouts:
Nov 5 15:23:57 iws9 kernel: LNet: Service thread pid 23042 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 5 15:23:57 iws9 kernel: Pid: 23042, comm: ll_ost00_004
Nov 5 15:23:57 iws9 kernel:
Nov 5 15:23:57 iws9 kernel: Call Trace:
Nov 5 15:23:57 iws9 kernel: [<ffffffffa067c380>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffff815395c3>] io_schedule+0x73/0xc0
Nov 5 15:23:57 iws9 kernel: [<ffffffffa05b2f8f>] cv_wait_common+0xaf/0x130 [spl]
Nov 5 15:23:57 iws9 kernel: [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
Nov 5 15:23:57 iws9 kernel: [<ffffffffa05b3028>] __cv_wait_io+0x18/0x20 [spl]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa06bd2eb>] zio_wait+0x10b/0x1e0 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0614939>] dbuf_read+0x439/0x850 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0614ef1>] __dbuf_hold_impl+0x1a1/0x4f0 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa06152bd>] dbuf_hold_impl+0x7d/0xb0 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0616790>] dbuf_hold+0x20/0x30 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa061d0d7>] dmu_buf_hold_noread+0x87/0x140 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa061d1cb>] dmu_buf_hold+0x3b/0x90 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0612fb8>] ? dbuf_rele_and_unlock+0x268/0x400 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0686e5a>] zap_lockdir+0x5a/0x770 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffff81178fcd>] ? kmem_cache_alloc_node_trace+0x1cd/0x200
Nov 5 15:23:57 iws9 kernel: [<ffffffffa06889ca>] zap_lookup_norm+0x4a/0x190 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0688ba3>] zap_lookup+0x33/0x40 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa062cc76>] dmu_tx_hold_zap+0x146/0x210 [zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa1034255>] osd_declare_object_create+0x2a5/0x440 [osd_zfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa11738e4>] ofd_precreate_objects+0x4e4/0x19d0 [ofd]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa04b4b61>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa1180a9b>] ? ofd_grant_create+0x23b/0x3e0 [ofd]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa116384e>] ofd_create_hdl+0x56e/0x2640 [ofd]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c28e80>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c930ec>] tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c3a9e1>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffffa0c39ba0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
Nov 5 15:23:57 iws9 kernel: [<ffffffff810a0fce>] kthread+0x9e/0xc0
Nov 5 15:23:57 iws9 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Nov 5 15:23:57 iws9 kernel: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
Nov 5 15:23:57 iws9 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Lustre-log dump attached
Issue Links
- is duplicated by
  LU-7602 Repeated timeouts with ZFS 0.6.5.2 (Resolved)
- is related to
  LU-6750 missing stop callback in osd-zfs (Resolved)
  LU-7987 Lustre 2.8 OSS with zfs 0.6.5 backend hitting most schedule_timeout (Closed)
- is related to
  LU-7153 Update ZFS/SPL version to 0.6.5.2 (Resolved)
  LU-4865 osd-zfs: increase object block size dynamically as object grows (Resolved)
From what I have seen in the code, ZFS starts to throttle I/O when the dirty data in the pool reaches 60% of zfs_dirty_data_max (6 GB by default on Hyperion); at the same time it wakes up the quiescing thread to close the currently open TXG. However, there is no mechanism to adjust the TXG size based on how long the previous TXG took to complete. As a result, the TXG sync time was pushed to about 100s on average.
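
To put rough numbers on the above, here is a small Python sketch (not ZFS code). It assumes the 60% figure corresponds to the zfs_delay_min_dirty_percent tunable and takes "6 GB" as 6 GiB. The second function, next_txg_target_bytes, is purely hypothetical: it illustrates the kind of feedback from the previous TXG's sync time that appears to be missing, and the 5 s target used in the example is an illustrative value only.

```python
GIB = 1024 ** 3


def delay_threshold_bytes(dirty_data_max: int, delay_min_dirty_percent: int = 60) -> int:
    """Dirty-data level (bytes) at which write throttling would begin.

    Assumes the 60% figure from the comment above; integer math avoids
    floating-point rounding.
    """
    return dirty_data_max * delay_min_dirty_percent // 100


def next_txg_target_bytes(prev_txg_bytes: int,
                          prev_sync_seconds: float,
                          target_sync_seconds: float) -> int:
    """HYPOTHETICAL feedback rule, not something ZFS implements.

    Estimates the pool's drain rate from the previous TXG and returns
    the amount of dirty data that would sync in target_sync_seconds.
    """
    if prev_sync_seconds <= 0:
        # No usable measurement; keep the previous size.
        return prev_txg_bytes
    drain_rate = prev_txg_bytes / prev_sync_seconds  # bytes per second
    return int(drain_rate * target_sync_seconds)


if __name__ == "__main__":
    threshold = delay_threshold_bytes(6 * GIB)
    print(f"throttle starts at {threshold / GIB:.1f} GiB of dirty data")

    # If a TXG of that size took ~100 s to sync, a 5 s target would
    # call for a much smaller TXG under this hypothetical rule.
    target = next_txg_target_bytes(threshold, 100.0, 5.0)
    print(f"hypothetical TXG target: {target / (1024 ** 2):.0f} MiB")
```

With zfs_dirty_data_max at 6 GiB, throttling only begins at about 3.6 GiB of dirty data, so a single TXG can grow far larger than the backend can sync quickly, which is consistent with the ~100s sync times reported above.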