[LU-12234] sanity-benchmark test iozone hangs in txg_sync Created: 26/Apr/19  Updated: 03/Dec/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.1, Lustre 2.10.8, Lustre 2.12.4, Lustre 2.12.6, Lustre 2.12.8
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: dne, zfs

Issue Links:
Duplicate
is duplicated by LU-13230 sanity-benchmark test fsx hangs Closed
Related
is related to LU-4950 sanity-benchmark test fsx hung: txg_s... Closed
is related to LU-5575 Failure on test suite replay-ost-sing... Closed
is related to LU-10009 sanity-benchmark test_iozone: test fa... Open
is related to LU-12258 sanity test_101d timeout when doing r... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-benchmark test_iozone hangs in txg_sync for ZFS/DNE testing.

Looking at a recent failure (logs at https://testing.whamcloud.com/test_sets/04682612-66aa-11e9-a6f2-52540065bddc ), the last output seen in the suite_log is:

running as uid/gid/euid/egid 500/500/500/500, groups:
 [iozone] [-i] [0] [-i] [1] [-i] [2] [-e] [-+d] [-r] [512] [-s] [1719368] [-t] [2] [-F] [/mnt/lustre/d0.iozone/iozone.1] [/mnt/lustre/d0.iozone/iozone.2]
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.373 $
		Compiled for 64 bit mode.
		Build: linux-AMD64 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.

	Run began: Tue Apr 23 10:44:20 2019

	Include fsync in write timing
	>>> I/O Diagnostic mode enabled. <<<
	Performance measurements are invalid in this mode.
	Record Size 512 KB
	File size set to 1719368 KB
	Command line used: iozone -i 0 -i 1 -i 2 -e -+d -r 512 -s 1719368 -t 2 -F /mnt/lustre/d0.iozone/iozone.1 /mnt/lustre/d0.iozone/iozone.2
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 2 processes
	Each process writes a 1719368 Kbyte file in 512 Kbyte records

	Children see throughput for  2 initial writers 	=    5465.87 KB/sec
	Parent sees throughput for  2 initial writers 	=    5007.29 KB/sec
	Min throughput per process 			=    2523.98 KB/sec 
	Max throughput per process 			=    2941.89 KB/sec
	Avg throughput per process 			=    2732.94 KB/sec
	Min xfer 					= 1475072.00 KB

	Children see throughput for  2 rewriters 	=    5791.13 KB/sec
	Parent sees throughput for  2 rewriters 	=    5787.53 KB/sec
	Min throughput per process 			=    2895.57 KB/sec 
	Max throughput per process 			=    2895.57 KB/sec
	Avg throughput per process 			=    2895.57 KB/sec
	Min xfer 					= 1719296.00 KB

The OSS console log contains the following call traces:

[38923.077282] Lustre: lustre-OST0001: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[38923.077284] Lustre: lustre-OST0000: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[38931.596843] Lustre: lustre-OST0002: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[39020.749857] LNet: Service thread pid 7985 was inactive for 64.24s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[39020.752951] Pid: 7985, comm: ll_ost_io00_048 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
[39020.754687] Call Trace:
[39020.755302]  [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
[39020.756438]  [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
[39020.757446]  [<ffffffffc053e4d3>] txg_wait_open+0xc3/0x110 [zfs]
[39020.758847]  [<ffffffffc04f3dca>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[39020.759930]  [<ffffffffc04f3e72>] dmu_tx_assign+0x92/0x490 [zfs]
[39020.761024]  [<ffffffffc1184009>] osd_trans_start+0x199/0x440 [osd_zfs]
[39020.762239]  [<ffffffffc12c1c85>] ofd_trans_start+0x75/0xf0 [ofd]
[39020.763368]  [<ffffffffc12c8881>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[39020.764542]  [<ffffffffc12ccc6c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[39020.765635]  [<ffffffffc0fb747c>] tgt_brw_write+0x10cc/0x1cf0 [ptlrpc]
[39020.767164]  [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[39020.768414]  [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[39020.769802]  [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[39020.771023]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[39020.772060]  [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
[39020.773194]  [<ffffffffffffffff>] 0xffffffffffffffff
[39020.774176] LustreError: dumping log to /tmp/lustre-log.1556016359.7985
[39025.011155] LNet: Service thread pid 7985 completed after 68.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[39501.389765] LNet: Service thread pid 16546 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[39501.393077] Pid: 16546, comm: ll_ost00_009 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
[39501.394929] Call Trace:
[39501.395582]  [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
[39501.396880]  [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
[39501.397884]  [<ffffffffc053e2bf>] txg_wait_synced+0xef/0x140 [zfs]
[39501.399302]  [<ffffffffc118b69e>] osd_object_sync+0x16e/0x180 [osd_zfs]
[39501.400686]  [<ffffffffc0fad8a7>] tgt_sync+0xb7/0x270 [ptlrpc]
[39501.402075]  [<ffffffffc12af731>] ofd_sync_hdl+0x111/0x530 [ofd]
[39501.403303]  [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[39501.404733]  [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[39501.406112]  [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[39501.407350]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[39501.408407]  [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
[39501.409609]  [<ffffffffffffffff>] 0xffffffffffffffff
[39501.410651] LustreError: dumping log to /tmp/lustre-log.1556016839.16546
…
[40140.279045] in:imjournal    D ffff911dbbed5140     0  3670      1 0x00000080
[40140.280809] Call Trace:
[40140.281623]  [<ffffffff98366d60>] ? bit_wait+0x50/0x50
[40140.282901]  [<ffffffff98368c49>] schedule+0x29/0x70
[40140.283998]  [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
[40140.285254]  [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
[40140.286833]  [<ffffffff97d01092>] ? ktime_get_ts64+0x52/0xf0
[40140.288092]  [<ffffffff98366d60>] ? bit_wait+0x50/0x50
[40140.289171]  [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
[40140.290445]  [<ffffffff98368388>] io_schedule+0x18/0x20
[40140.291769]  [<ffffffff98366d71>] bit_wait_io+0x11/0x50
[40140.293061]  [<ffffffff98366897>] __wait_on_bit+0x67/0x90
[40140.294200]  [<ffffffff97db93de>] ? __find_get_pages+0x11e/0x1c0
[40140.295549]  [<ffffffff97db5881>] wait_on_page_bit+0x81/0xa0
[40140.296959]  [<ffffffff97cc2e00>] ? wake_bit_function+0x40/0x40
[40140.298307]  [<ffffffff97dc706b>] truncate_inode_pages_range+0x42b/0x700
[40140.342038]  [<ffffffffc02a5dbc>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
[40140.343859]  [<ffffffffc0281d58>] ? ext4_rename+0x168/0x890 [ext4]
[40140.345266]  [<ffffffff97e51151>] ? link_path_walk+0x81/0x8b0
[40140.346503]  [<ffffffff97eaf38a>] ? __dquot_initialize+0x3a/0x240
[40140.347746]  [<ffffffff97e6fe5a>] ? __inode_wait_for_writeback+0x7a/0xf0
[40140.349584]  [<ffffffff97dc73af>] truncate_inode_pages_final+0x4f/0x60
[40140.350965]  [<ffffffffc027841f>] ext4_evict_inode+0x10f/0x480 [ext4]
[40140.352310]  [<ffffffff97e5eeb4>] evict+0xb4/0x180
[40140.353476]  [<ffffffff97e5f7bc>] iput+0xfc/0x190
[40140.354709]  [<ffffffff97e5a020>] __dentry_kill+0x120/0x180
[40140.356039]  [<ffffffff97e5a130>] dput+0xb0/0x160
[40140.357170]  [<ffffffff97e53f58>] SYSC_renameat2+0x518/0x5a0
[40140.358364]  [<ffffffff97defa61>] ? __vma_rb_erase+0x121/0x220
[40140.359563]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
[40140.361085]  [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
[40140.362608]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
[40140.364012]  [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
[40140.365472]  [<ffffffff97e54e5e>] SyS_renameat2+0xe/0x10
[40140.366695]  [<ffffffff97e54e9e>] SyS_rename+0x1e/0x20
[40140.367909]  [<ffffffff98375ddb>] system_call_fastpath+0x22/0x27
[40140.369258]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
…
[40141.471161] txg_sync        D ffff911dbc145140     0 27944      2 0x00000080
[40141.472432] Call Trace:
[40141.472873]  [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
[40141.473770]  [<ffffffff98368c49>] schedule+0x29/0x70
[40141.474622]  [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
[40141.475635]  [<ffffffffc052c57e>] ? spa_taskq_dispatch_ent+0x8e/0xc0 [zfs]
[40141.476797]  [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
[40141.477853]  [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
[40141.478872]  [<ffffffff97cc28c6>] ? prepare_to_wait_exclusive+0x56/0x90
[40141.479986]  [<ffffffff98368388>] io_schedule+0x18/0x20
[40141.480884]  [<ffffffffc03c7262>] cv_wait_common+0xb2/0x150 [spl]
[40141.481929]  [<ffffffff97cc2d40>] ? wake_up_atomic_t+0x30/0x30
[40141.482928]  [<ffffffffc03c7338>] __cv_wait_io+0x18/0x20 [spl]
[40141.483963]  [<ffffffffc0596a3b>] zio_wait+0x11b/0x1c0 [zfs]
[40141.484952]  [<ffffffffc050cb50>] dsl_pool_sync+0x3e0/0x440 [zfs]
[40141.486020]  [<ffffffffc052a907>] spa_sync+0x437/0xd90 [zfs]
[40141.487015]  [<ffffffff97cd6802>] ? default_wake_function+0x12/0x20
[40141.488130]  [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
[40141.489066]  [<ffffffffc053f321>] txg_sync_thread+0x301/0x510 [zfs]
[40141.490151]  [<ffffffffc053f020>] ? txg_fini+0x2a0/0x2a0 [zfs]
[40141.491152]  [<ffffffffc03c2063>] thread_generic_wrapper+0x73/0x80 [spl]
[40141.492288]  [<ffffffffc03c1ff0>] ? __thread_exit+0x20/0x20 [spl]
[40141.493328]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[40141.494186]  [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
[40141.495223]  [<ffffffff98375c37>] ret_from_fork_nospec_begin+0x21/0x21
[40141.496325]  [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
…

This hang looks similar to LU-5575, but LU-5575 was closed as a duplicate of LU-4950. In his comment of 13/Mar/17 on that ticket, Alex essentially says that if ofd_destroy() does not appear in the traces, the hang is probably not LU-4950. The traces above do not show ofd_destroy(), so I've opened this ticket to capture a current failure.
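Alex's triage rule can be captured in a small sketch for sorting future occurrences. This is a hypothetical helper (the function name and log path are not part of any test suite); it only applies the "look for ofd_destroy() in the traces" distinction described above:

```shell
#!/bin/sh
# Hypothetical triage helper: per the LU-4950 discussion, a txg_sync hang
# is probably NOT LU-4950 unless ofd_destroy() appears in the hung-thread
# traces. Given a collected OSS console log, report the likelier match.
classify_hang() {
    log="$1"   # path to the OSS console log (assumed already collected)
    if grep -q 'ofd_destroy' "$log"; then
        echo "ofd_destroy present: likely LU-4950"
    else
        echo "no ofd_destroy: likely LU-12234"
    fi
}
```

For example, `classify_hang /tmp/oss-console.log` against the traces in this ticket would report the LU-12234 case, since they show txg_wait_open/ofd_commitrw_write but no ofd_destroy.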



 Comments   
Comment by Sarah Liu [ 20/May/19 ]

Hit a similar issue in 2.10.8 ZFS testing, non-DNE:
https://testing.whamcloud.com/test_sets/206ba5a2-7602-11e9-aeec-52540065bddc

Comment by Jian Yu [ 15/Aug/19 ]

obdfilter-survey test 1c hung in ZFS testing on master branch:

[47508.054761] txg_sync        D ffff8f857a62c1c0     0 18730      2 0x00000080
[47508.055558] Call Trace:
[47508.055822]  [<ffffffffc04e7017>] ? taskq_dispatch_ent+0x57/0x170 [spl]
[47508.056531]  [<ffffffffc06a7900>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[47508.057368]  [<ffffffffaeb7f1c9>] schedule+0x29/0x70
[47508.057871]  [<ffffffffaeb7cb51>] schedule_timeout+0x221/0x2d0
[47508.058507]  [<ffffffffc06a691f>] ? zio_taskq_dispatch+0x8f/0xa0 [zfs]
[47508.059211]  [<ffffffffae46c27e>] ? kvm_clock_get_cycles+0x1e/0x20
[47508.059826]  [<ffffffffaeb7e73d>] io_schedule_timeout+0xad/0x130
[47508.060455]  [<ffffffffae4c5d46>] ? prepare_to_wait_exclusive+0x56/0x90
[47508.061121]  [<ffffffffaeb7e7d8>] io_schedule+0x18/0x20
[47508.061667]  [<ffffffffc04eb262>] cv_wait_common+0xb2/0x150 [spl]
[47508.062306]  [<ffffffffae4c61c0>] ? wake_up_atomic_t+0x30/0x30
[47508.062892]  [<ffffffffc04eb338>] __cv_wait_io+0x18/0x20 [spl]
[47508.063546]  [<ffffffffc06aaa6b>] zio_wait+0x11b/0x1c0 [zfs]
[47508.064163]  [<ffffffffc062085f>] dsl_pool_sync+0xbf/0x440 [zfs]
[47508.064801]  [<ffffffffc063e937>] spa_sync+0x437/0xd90 [zfs]
[47508.065428]  [<ffffffffc0653351>] txg_sync_thread+0x301/0x510 [zfs]
[47508.066090]  [<ffffffffc0653050>] ? txg_fini+0x2a0/0x2a0 [zfs]
[47508.066700]  [<ffffffffc04e6063>] thread_generic_wrapper+0x73/0x80 [spl]
[47508.067394]  [<ffffffffc04e5ff0>] ? __thread_exit+0x20/0x20 [spl]
[47508.068021]  [<ffffffffae4c50d1>] kthread+0xd1/0xe0
[47508.068528]  [<ffffffffae4c5000>] ? insert_kthread_work+0x40/0x40
[47508.069141]  [<ffffffffaeb8bd37>] ret_from_fork_nospec_begin+0x21/0x21
[47508.069805]  [<ffffffffae4c5000>] ? insert_kthread_work+0x40/0x40
[47511.400931] lctl            D ffff8f8576ad0000     0 13993  13986 0x00000080
[47511.401695] Call Trace:
[47511.401957]  [<ffffffffaeb7f1c9>] schedule+0x29/0x70
[47511.402484]  [<ffffffffc04eb2d5>] cv_wait_common+0x125/0x150 [spl]
[47511.403128]  [<ffffffffae4c61c0>] ? wake_up_atomic_t+0x30/0x30
[47511.403724]  [<ffffffffc04eb315>] __cv_wait+0x15/0x20 [spl]
[47511.404343]  [<ffffffffc0652503>] txg_wait_open+0xc3/0x110 [zfs]
[47511.404975]  [<ffffffffc0607dfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[47511.405612]  [<ffffffffc0607ea2>] dmu_tx_assign+0x92/0x490 [zfs]
[47511.406272]  [<ffffffffc1373fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[47511.406956]  [<ffffffffc1498c35>] ofd_trans_start+0x75/0xf0 [ofd]
[47511.407589]  [<ffffffffc149f821>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[47511.408283]  [<ffffffffc14a3c2c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[47511.408942]  [<ffffffffc153dfb0>] echo_client_prep_commit.isra.50+0x5b0/0xea0 [obdecho]
[47511.409768]  [<ffffffffc1540994>] echo_client_iocontrol+0x914/0x1c50 [obdecho]
[47511.410546]  [<ffffffffc0e2e4aa>] class_handle_ioctl+0x192a/0x1e30 [obdclass]
[47511.411302]  [<ffffffffae701cbe>] ? security_capable+0x1e/0x20
[47511.411907]  [<ffffffffc0e2ea25>] obd_class_ioctl+0x75/0x170 [obdclass]
[47511.412585]  [<ffffffffae65d9e0>] do_vfs_ioctl+0x3a0/0x5a0
[47511.413167]  [<ffffffffae65dc81>] SyS_ioctl+0xa1/0xc0
[47511.413676]  [<ffffffffaeb8be15>] ? system_call_after_swapgs+0xa2/0x146
[47511.414356]  [<ffffffffaeb8bede>] system_call_fastpath+0x25/0x2a
[47511.414966]  [<ffffffffaeb8be21>] ? system_call_after_swapgs+0xae/0x146

https://testing.whamcloud.com/test_sets/a8573658-bf1c-11e9-98c8-52540065bddc

Comment by Alena Nikitenko [ 30/Nov/21 ]

Hit something similar on 2.12.8: https://testing.whamcloud.com/test_sets/0531f6af-85a8-4730-b9e3-d80acc3b6639

 

Generated at Sat Feb 10 02:50:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.