Description
sanity-benchmark test_iozone hangs in txg_sync for ZFS/DNE testing.
Looking at a recent failure (logs at https://testing.whamcloud.com/test_sets/04682612-66aa-11e9-a6f2-52540065bddc), the last thing seen in the suite_log is:
running as uid/gid/euid/egid 500/500/500/500, groups: [iozone] [-i] [0] [-i] [1] [-i] [2] [-e] [-+d] [-r] [512] [-s] [1719368] [-t] [2] [-F] [/mnt/lustre/d0.iozone/iozone.1] [/mnt/lustre/d0.iozone/iozone.2]
Iozone: Performance Test of File I/O
        Version $Revision: 3.373 $
        Compiled for 64 bit mode.
        Build: linux-AMD64
Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins
        Al Slater, Scott Rhine, Mike Wisner, Ken Goss
        Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
        Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
        Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
        Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
        Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
Run began: Tue Apr 23 10:44:20 2019
Include fsync in write timing
>>> I/O Diagnostic mode enabled. <<<
Performance measurements are invalid in this mode.
Record Size 512 KB
File size set to 1719368 KB
Command line used: iozone -i 0 -i 1 -i 2 -e -+d -r 512 -s 1719368 -t 2 -F /mnt/lustre/d0.iozone/iozone.1 /mnt/lustre/d0.iozone/iozone.2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 2 processes
Each process writes a 1719368 Kbyte file in 512 Kbyte records
Children see throughput for 2 initial writers = 5465.87 KB/sec
Parent sees throughput for 2 initial writers  = 5007.29 KB/sec
Min throughput per process                    = 2523.98 KB/sec
Max throughput per process                    = 2941.89 KB/sec
Avg throughput per process                    = 2732.94 KB/sec
Min xfer                                      = 1475072.00 KB
Children see throughput for 2 rewriters       = 5791.13 KB/sec
Parent sees throughput for 2 rewriters        = 5787.53 KB/sec
Min throughput per process                    = 2895.57 KB/sec
Max throughput per process                    = 2895.57 KB/sec
Avg throughput per process                    = 2895.57 KB/sec
Min xfer                                      = 1719296.00 KB
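For reference, the iozone invocation the test was running when it stopped can be re-run by hand on a client. This is only a sketch reconstructed from the suite_log above; the mkdir step and the test user (uid/gid 500 in this run) are assumptions about how sanity-benchmark set up the run, and the file size is whatever the test computed for this particular node:

# run as the unprivileged test user on a Lustre client
mkdir -p /mnt/lustre/d0.iozone    # hypothetical setup step; the test creates this directory itself
iozone -i 0 -i 1 -i 2 -e -+d -r 512 -s 1719368 -t 2 \
    -F /mnt/lustre/d0.iozone/iozone.1 /mnt/lustre/d0.iozone/iozone.2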
The OSS console shows the following call traces:
[38923.077282] Lustre: lustre-OST0001: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[38923.077284] Lustre: lustre-OST0000: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[38931.596843] Lustre: lustre-OST0002: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[39020.749857] LNet: Service thread pid 7985 was inactive for 64.24s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[39020.752951] Pid: 7985, comm: ll_ost_io00_048 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
[39020.754687] Call Trace:
[39020.755302] [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
[39020.756438] [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
[39020.757446] [<ffffffffc053e4d3>] txg_wait_open+0xc3/0x110 [zfs]
[39020.758847] [<ffffffffc04f3dca>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[39020.759930] [<ffffffffc04f3e72>] dmu_tx_assign+0x92/0x490 [zfs]
[39020.761024] [<ffffffffc1184009>] osd_trans_start+0x199/0x440 [osd_zfs]
[39020.762239] [<ffffffffc12c1c85>] ofd_trans_start+0x75/0xf0 [ofd]
[39020.763368] [<ffffffffc12c8881>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[39020.764542] [<ffffffffc12ccc6c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[39020.765635] [<ffffffffc0fb747c>] tgt_brw_write+0x10cc/0x1cf0 [ptlrpc]
[39020.767164] [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[39020.768414] [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[39020.769802] [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[39020.771023] [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[39020.772060] [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
[39020.773194] [<ffffffffffffffff>] 0xffffffffffffffff
[39020.774176] LustreError: dumping log to /tmp/lustre-log.1556016359.7985
[39025.011155] LNet: Service thread pid 7985 completed after 68.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[39501.389765] LNet: Service thread pid 16546 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[39501.393077] Pid: 16546, comm: ll_ost00_009 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
[39501.394929] Call Trace:
[39501.395582] [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
[39501.396880] [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
[39501.397884] [<ffffffffc053e2bf>] txg_wait_synced+0xef/0x140 [zfs]
[39501.399302] [<ffffffffc118b69e>] osd_object_sync+0x16e/0x180 [osd_zfs]
[39501.400686] [<ffffffffc0fad8a7>] tgt_sync+0xb7/0x270 [ptlrpc]
[39501.402075] [<ffffffffc12af731>] ofd_sync_hdl+0x111/0x530 [ofd]
[39501.403303] [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[39501.404733] [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[39501.406112] [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[39501.407350] [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[39501.408407] [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
[39501.409609] [<ffffffffffffffff>] 0xffffffffffffffff
[39501.410651] LustreError: dumping log to /tmp/lustre-log.1556016839.16546
…
[40140.279045] in:imjournal    D ffff911dbbed5140     0  3670      1 0x00000080
[40140.280809] Call Trace:
[40140.281623] [<ffffffff98366d60>] ? bit_wait+0x50/0x50
[40140.282901] [<ffffffff98368c49>] schedule+0x29/0x70
[40140.283998] [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
[40140.285254] [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
[40140.286833] [<ffffffff97d01092>] ? ktime_get_ts64+0x52/0xf0
[40140.288092] [<ffffffff98366d60>] ? bit_wait+0x50/0x50
[40140.289171] [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
[40140.290445] [<ffffffff98368388>] io_schedule+0x18/0x20
[40140.291769] [<ffffffff98366d71>] bit_wait_io+0x11/0x50
[40140.293061] [<ffffffff98366897>] __wait_on_bit+0x67/0x90
[40140.294200] [<ffffffff97db93de>] ? __find_get_pages+0x11e/0x1c0
[40140.295549] [<ffffffff97db5881>] wait_on_page_bit+0x81/0xa0
[40140.296959] [<ffffffff97cc2e00>] ? wake_bit_function+0x40/0x40
[40140.298307] [<ffffffff97dc706b>] truncate_inode_pages_range+0x42b/0x700
[40140.342038] [<ffffffffc02a5dbc>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
[40140.343859] [<ffffffffc0281d58>] ? ext4_rename+0x168/0x890 [ext4]
[40140.345266] [<ffffffff97e51151>] ? link_path_walk+0x81/0x8b0
[40140.346503] [<ffffffff97eaf38a>] ? __dquot_initialize+0x3a/0x240
[40140.347746] [<ffffffff97e6fe5a>] ? __inode_wait_for_writeback+0x7a/0xf0
[40140.349584] [<ffffffff97dc73af>] truncate_inode_pages_final+0x4f/0x60
[40140.350965] [<ffffffffc027841f>] ext4_evict_inode+0x10f/0x480 [ext4]
[40140.352310] [<ffffffff97e5eeb4>] evict+0xb4/0x180
[40140.353476] [<ffffffff97e5f7bc>] iput+0xfc/0x190
[40140.354709] [<ffffffff97e5a020>] __dentry_kill+0x120/0x180
[40140.356039] [<ffffffff97e5a130>] dput+0xb0/0x160
[40140.357170] [<ffffffff97e53f58>] SYSC_renameat2+0x518/0x5a0
[40140.358364] [<ffffffff97defa61>] ? __vma_rb_erase+0x121/0x220
[40140.359563] [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
[40140.361085] [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
[40140.362608] [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
[40140.364012] [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
[40140.365472] [<ffffffff97e54e5e>] SyS_renameat2+0xe/0x10
[40140.366695] [<ffffffff97e54e9e>] SyS_rename+0x1e/0x20
[40140.367909] [<ffffffff98375ddb>] system_call_fastpath+0x22/0x27
[40140.369258] [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
…
[40141.471161] txg_sync        D ffff911dbc145140     0 27944      2 0x00000080
[40141.472432] Call Trace:
[40141.472873] [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
[40141.473770] [<ffffffff98368c49>] schedule+0x29/0x70
[40141.474622] [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
[40141.475635] [<ffffffffc052c57e>] ? spa_taskq_dispatch_ent+0x8e/0xc0 [zfs]
[40141.476797] [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
[40141.477853] [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
[40141.478872] [<ffffffff97cc28c6>] ? prepare_to_wait_exclusive+0x56/0x90
[40141.479986] [<ffffffff98368388>] io_schedule+0x18/0x20
[40141.480884] [<ffffffffc03c7262>] cv_wait_common+0xb2/0x150 [spl]
[40141.481929] [<ffffffff97cc2d40>] ? wake_up_atomic_t+0x30/0x30
[40141.482928] [<ffffffffc03c7338>] __cv_wait_io+0x18/0x20 [spl]
[40141.483963] [<ffffffffc0596a3b>] zio_wait+0x11b/0x1c0 [zfs]
[40141.484952] [<ffffffffc050cb50>] dsl_pool_sync+0x3e0/0x440 [zfs]
[40141.486020] [<ffffffffc052a907>] spa_sync+0x437/0xd90 [zfs]
[40141.487015] [<ffffffff97cd6802>] ? default_wake_function+0x12/0x20
[40141.488130] [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
[40141.489066] [<ffffffffc053f321>] txg_sync_thread+0x301/0x510 [zfs]
[40141.490151] [<ffffffffc053f020>] ? txg_fini+0x2a0/0x2a0 [zfs]
[40141.491152] [<ffffffffc03c2063>] thread_generic_wrapper+0x73/0x80 [spl]
[40141.492288] [<ffffffffc03c1ff0>] ? __thread_exit+0x20/0x20 [spl]
[40141.493328] [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[40141.494186] [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
[40141.495223] [<ffffffff98375c37>] ret_from_fork_nospec_begin+0x21/0x21
[40141.496325] [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
…
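If this reproduces again, it would help to capture the ZFS transaction-group state on the OSS while the threads are stuck, to see whether txg_sync is making any progress at all. A minimal sketch, assuming the OST pool is named lustre-ost1 (the actual pool name on the test node will differ and is an assumption here):

# per-pool txg history and sync times, via the SPL kstat interface
cat /proc/spl/kstat/zfs/lustre-ost1/txgs
# internal ZFS debug messages, which often show what the sync thread is waiting on
cat /proc/spl/kstat/zfs/dbgmsg
# dump all blocked kernel tasks to the console for comparison with the traces above
echo w > /proc/sysrq-trigger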
This hang looks similar to LU-5575, but LU-5575 was closed as a duplicate of LU-4950. In his comment of 13/Mar/17 on that ticket, Alex essentially says that if ofd_destroy() does not appear in the traces, the hang is probably not LU-4950. ofd_destroy() does not appear in the traces above, so I've opened this ticket to capture a current failure.
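A quick way to make that check on future occurrences (a sketch only; oss-console.log is a hypothetical local copy of the OSS console log, not a file produced by the test):

# LU-4950 stacks contain ofd_destroy(); no matches suggests this is a different hang
grep -n 'ofd_destroy' oss-console.log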
Issue Links
- is duplicated by
  - LU-13230 sanity-benchmark test fsx hangs (Closed)
- is related to
  - LU-10009 sanity-benchmark test_iozone: test failed to respond and timed out (Open)
  - LU-12258 sanity test_101d timeout when doing rolling upgrade OSS from 2.10.7 to 2.12.1 with ZFS (Open)
  - LU-4950 sanity-benchmark test fsx hung: txg_sync was stuck on OSS (Closed)
  - LU-5575 Failure on test suite replay-ost-single test_5 (Closed)