LU-12234: sanity-benchmark test iozone hangs in txg_sync

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.1, Lustre 2.10.8, Lustre 2.12.4, Lustre 2.12.6, Lustre 2.12.8
    • Severity: 3

    Description

      sanity-benchmark test_iozone hangs in txg_sync for ZFS/DNE testing.
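
      For reference, a minimal sketch of how such a ZFS/DNE run is typically set up with the Lustre test framework is shown below; the specific values (FSTYPE=zfs, MDSCOUNT=2) and the tests directory path are assumptions based on a typical autotest configuration, not values recorded in this ticket:

        # Assumed reproduction sketch, not taken from this report: back the
        # targets with ZFS, enable DNE with a second MDT, and run only the
        # iozone subtest of sanity-benchmark.
        export FSTYPE=zfs          # use osd-zfs for the MDTs/OSTs
        export MDSCOUNT=2          # DNE: more than one MDT
        cd /usr/lib64/lustre/tests
        ONLY=iozone bash sanity-benchmark.sh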

      Looking at a recent failure, with logs at https://testing.whamcloud.com/test_sets/04682612-66aa-11e9-a6f2-52540065bddc, the last thing seen in the suite_log is

      running as uid/gid/euid/egid 500/500/500/500, groups:
       [iozone] [-i] [0] [-i] [1] [-i] [2] [-e] [-+d] [-r] [512] [-s] [1719368] [-t] [2] [-F] [/mnt/lustre/d0.iozone/iozone.1] [/mnt/lustre/d0.iozone/iozone.2]
      	Iozone: Performance Test of File I/O
      	        Version $Revision: 3.373 $
      		Compiled for 64 bit mode.
      		Build: linux-AMD64 
      
      	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
      	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
      	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
      	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
      	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
      	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
      	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
      
      	Run began: Tue Apr 23 10:44:20 2019
      
      	Include fsync in write timing
      	>>> I/O Diagnostic mode enabled. <<<
      	Performance measurements are invalid in this mode.
      	Record Size 512 KB
      	File size set to 1719368 KB
      	Command line used: iozone -i 0 -i 1 -i 2 -e -+d -r 512 -s 1719368 -t 2 -F /mnt/lustre/d0.iozone/iozone.1 /mnt/lustre/d0.iozone/iozone.2
      	Output is in Kbytes/sec
      	Time Resolution = 0.000001 seconds.
      	Processor cache size set to 1024 Kbytes.
      	Processor cache line size set to 32 bytes.
      	File stride size set to 17 * record size.
      	Throughput test with 2 processes
      	Each process writes a 1719368 Kbyte file in 512 Kbyte records
      
      	Children see throughput for  2 initial writers 	=    5465.87 KB/sec
      	Parent sees throughput for  2 initial writers 	=    5007.29 KB/sec
      	Min throughput per process 			=    2523.98 KB/sec 
      	Max throughput per process 			=    2941.89 KB/sec
      	Avg throughput per process 			=    2732.94 KB/sec
      	Min xfer 					= 1475072.00 KB
      
      	Children see throughput for  2 rewriters 	=    5791.13 KB/sec
      	Parent sees throughput for  2 rewriters 	=    5787.53 KB/sec
      	Min throughput per process 			=    2895.57 KB/sec 
      	Max throughput per process 			=    2895.57 KB/sec
      	Avg throughput per process 			=    2895.57 KB/sec
      	Min xfer 					= 1719296.00 KB
      

      The OSS console contains the following call traces:

      [38923.077282] Lustre: lustre-OST0001: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
      [38923.077284] Lustre: lustre-OST0000: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
      [38931.596843] Lustre: lustre-OST0002: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
      [39020.749857] LNet: Service thread pid 7985 was inactive for 64.24s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [39020.752951] Pid: 7985, comm: ll_ost_io00_048 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
      [39020.754687] Call Trace:
      [39020.755302]  [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
      [39020.756438]  [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
      [39020.757446]  [<ffffffffc053e4d3>] txg_wait_open+0xc3/0x110 [zfs]
      [39020.758847]  [<ffffffffc04f3dca>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
      [39020.759930]  [<ffffffffc04f3e72>] dmu_tx_assign+0x92/0x490 [zfs]
      [39020.761024]  [<ffffffffc1184009>] osd_trans_start+0x199/0x440 [osd_zfs]
      [39020.762239]  [<ffffffffc12c1c85>] ofd_trans_start+0x75/0xf0 [ofd]
      [39020.763368]  [<ffffffffc12c8881>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
      [39020.764542]  [<ffffffffc12ccc6c>] ofd_commitrw+0x48c/0x9e0 [ofd]
      [39020.765635]  [<ffffffffc0fb747c>] tgt_brw_write+0x10cc/0x1cf0 [ptlrpc]
      [39020.767164]  [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [39020.768414]  [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39020.769802]  [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [39020.771023]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
      [39020.772060]  [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
      [39020.773194]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39020.774176] LustreError: dumping log to /tmp/lustre-log.1556016359.7985
      [39025.011155] LNet: Service thread pid 7985 completed after 68.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [39501.389765] LNet: Service thread pid 16546 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [39501.393077] Pid: 16546, comm: ll_ost00_009 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
      [39501.394929] Call Trace:
      [39501.395582]  [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
      [39501.396880]  [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
      [39501.397884]  [<ffffffffc053e2bf>] txg_wait_synced+0xef/0x140 [zfs]
      [39501.399302]  [<ffffffffc118b69e>] osd_object_sync+0x16e/0x180 [osd_zfs]
      [39501.400686]  [<ffffffffc0fad8a7>] tgt_sync+0xb7/0x270 [ptlrpc]
      [39501.402075]  [<ffffffffc12af731>] ofd_sync_hdl+0x111/0x530 [ofd]
      [39501.403303]  [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [39501.404733]  [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39501.406112]  [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [39501.407350]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
      [39501.408407]  [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
      [39501.409609]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39501.410651] LustreError: dumping log to /tmp/lustre-log.1556016839.16546
      …
      [40140.279045] in:imjournal    D ffff911dbbed5140     0  3670      1 0x00000080
      [40140.280809] Call Trace:
      [40140.281623]  [<ffffffff98366d60>] ? bit_wait+0x50/0x50
      [40140.282901]  [<ffffffff98368c49>] schedule+0x29/0x70
      [40140.283998]  [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
      [40140.285254]  [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
      [40140.286833]  [<ffffffff97d01092>] ? ktime_get_ts64+0x52/0xf0
      [40140.288092]  [<ffffffff98366d60>] ? bit_wait+0x50/0x50
      [40140.289171]  [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
      [40140.290445]  [<ffffffff98368388>] io_schedule+0x18/0x20
      [40140.291769]  [<ffffffff98366d71>] bit_wait_io+0x11/0x50
      [40140.293061]  [<ffffffff98366897>] __wait_on_bit+0x67/0x90
      [40140.294200]  [<ffffffff97db93de>] ? __find_get_pages+0x11e/0x1c0
      [40140.295549]  [<ffffffff97db5881>] wait_on_page_bit+0x81/0xa0
      [40140.296959]  [<ffffffff97cc2e00>] ? wake_bit_function+0x40/0x40
      [40140.298307]  [<ffffffff97dc706b>] truncate_inode_pages_range+0x42b/0x700
      [40140.342038]  [<ffffffffc02a5dbc>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
      [40140.343859]  [<ffffffffc0281d58>] ? ext4_rename+0x168/0x890 [ext4]
      [40140.345266]  [<ffffffff97e51151>] ? link_path_walk+0x81/0x8b0
      [40140.346503]  [<ffffffff97eaf38a>] ? __dquot_initialize+0x3a/0x240
      [40140.347746]  [<ffffffff97e6fe5a>] ? __inode_wait_for_writeback+0x7a/0xf0
      [40140.349584]  [<ffffffff97dc73af>] truncate_inode_pages_final+0x4f/0x60
      [40140.350965]  [<ffffffffc027841f>] ext4_evict_inode+0x10f/0x480 [ext4]
      [40140.352310]  [<ffffffff97e5eeb4>] evict+0xb4/0x180
      [40140.353476]  [<ffffffff97e5f7bc>] iput+0xfc/0x190
      [40140.354709]  [<ffffffff97e5a020>] __dentry_kill+0x120/0x180
      [40140.356039]  [<ffffffff97e5a130>] dput+0xb0/0x160
      [40140.357170]  [<ffffffff97e53f58>] SYSC_renameat2+0x518/0x5a0
      [40140.358364]  [<ffffffff97defa61>] ? __vma_rb_erase+0x121/0x220
      [40140.359563]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
      [40140.361085]  [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
      [40140.362608]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
      [40140.364012]  [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
      [40140.365472]  [<ffffffff97e54e5e>] SyS_renameat2+0xe/0x10
      [40140.366695]  [<ffffffff97e54e9e>] SyS_rename+0x1e/0x20
      [40140.367909]  [<ffffffff98375ddb>] system_call_fastpath+0x22/0x27
      [40140.369258]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
      …
      [40141.471161] txg_sync        D ffff911dbc145140     0 27944      2 0x00000080
      [40141.472432] Call Trace:
      [40141.472873]  [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
      [40141.473770]  [<ffffffff98368c49>] schedule+0x29/0x70
      [40141.474622]  [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
      [40141.475635]  [<ffffffffc052c57e>] ? spa_taskq_dispatch_ent+0x8e/0xc0 [zfs]
      [40141.476797]  [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
      [40141.477853]  [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
      [40141.478872]  [<ffffffff97cc28c6>] ? prepare_to_wait_exclusive+0x56/0x90
      [40141.479986]  [<ffffffff98368388>] io_schedule+0x18/0x20
      [40141.480884]  [<ffffffffc03c7262>] cv_wait_common+0xb2/0x150 [spl]
      [40141.481929]  [<ffffffff97cc2d40>] ? wake_up_atomic_t+0x30/0x30
      [40141.482928]  [<ffffffffc03c7338>] __cv_wait_io+0x18/0x20 [spl]
      [40141.483963]  [<ffffffffc0596a3b>] zio_wait+0x11b/0x1c0 [zfs]
      [40141.484952]  [<ffffffffc050cb50>] dsl_pool_sync+0x3e0/0x440 [zfs]
      [40141.486020]  [<ffffffffc052a907>] spa_sync+0x437/0xd90 [zfs]
      [40141.487015]  [<ffffffff97cd6802>] ? default_wake_function+0x12/0x20
      [40141.488130]  [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
      [40141.489066]  [<ffffffffc053f321>] txg_sync_thread+0x301/0x510 [zfs]
      [40141.490151]  [<ffffffffc053f020>] ? txg_fini+0x2a0/0x2a0 [zfs]
      [40141.491152]  [<ffffffffc03c2063>] thread_generic_wrapper+0x73/0x80 [spl]
      [40141.492288]  [<ffffffffc03c1ff0>] ? __thread_exit+0x20/0x20 [spl]
      [40141.493328]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
      [40141.494186]  [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
      [40141.495223]  [<ffffffff98375c37>] ret_from_fork_nospec_begin+0x21/0x21
      [40141.496325]  [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
      …
      
      

      This hang looks similar to LU-5575, but LU-5575 was closed as a duplicate of LU-4950. In his comment of 13/Mar/17, Alex essentially says that if you don't see ofd_destroy() in the traces, then the hang is probably not LU-4950. Since ofd_destroy() does not appear in any of the traces above, I've opened this ticket to capture the current failure.
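
      As a quick way to apply that check to the dumped console logs, something like the following would do; "oss-console.log" is a placeholder name for the OSS console log collected by autotest, not a path recorded in this ticket:

        # Hypothetical triage helper: LU-4950 hangs show ofd_destroy() in the
        # stuck service threads' stack traces, so its absence points elsewhere.
        if grep -q 'ofd_destroy' oss-console.log; then
            echo "ofd_destroy present in traces - compare against LU-4950"
        else
            echo "no ofd_destroy in traces - likely not LU-4950 (this ticket)"
        fi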

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (Inactive) (jamesanunez)
              Votes: 0
              Watchers: 4
