[LU-12234] sanity-benchmark test iozone hangs in txg_sync Created: 26/Apr/19 Updated: 03/Dec/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.1, Lustre 2.10.8, Lustre 2.12.4, Lustre 2.12.6, Lustre 2.12.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | dne, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-benchmark test_iozone hangs in txg_sync for ZFS/DNE testing. Looking at a recent failure (logs at https://testing.whamcloud.com/test_sets/04682612-66aa-11e9-a6f2-52540065bddc ), the last thing seen in the suite_log is:

running as uid/gid/euid/egid 500/500/500/500, groups:
 [iozone] [-i] [0] [-i] [1] [-i] [2] [-e] [-+d] [-r] [512] [-s] [1719368] [-t] [2] [-F] [/mnt/lustre/d0.iozone/iozone.1] [/mnt/lustre/d0.iozone/iozone.2]
Iozone: Performance Test of File I/O
Version $Revision: 3.373 $
Compiled for 64 bit mode.
Build: linux-AMD64
Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.

Run began: Tue Apr 23 10:44:20 2019

Include fsync in write timing
>>> I/O Diagnostic mode enabled. <<<
Performance measurements are invalid in this mode.
Record Size 512 KB
File size set to 1719368 KB
Command line used: iozone -i 0 -i 1 -i 2 -e -+d -r 512 -s 1719368 -t 2 -F /mnt/lustre/d0.iozone/iozone.1 /mnt/lustre/d0.iozone/iozone.2
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 2 processes
Each process writes a 1719368 Kbyte file in 512 Kbyte records

Children see throughput for 2 initial writers = 5465.87 KB/sec
Parent sees throughput for 2 initial writers = 5007.29 KB/sec
Min throughput per process = 2523.98 KB/sec
Max throughput per process = 2941.89 KB/sec
Avg throughput per process = 2732.94 KB/sec
Min xfer = 1475072.00 KB

Children see throughput for 2 rewriters = 5791.13 KB/sec
Parent sees throughput for 2 rewriters = 5787.53 KB/sec
Min throughput per process = 2895.57 KB/sec
Max throughput per process = 2895.57 KB/sec
Avg throughput per process = 2895.57 KB/sec
Min xfer = 1719296.00 KB

The OSS console has the following call trace:

[38923.077282] Lustre: lustre-OST0001: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[38923.077284] Lustre: lustre-OST0000: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[38931.596843] Lustre: lustre-OST0002: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
[39020.749857] LNet: Service thread pid 7985 was inactive for 64.24s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[39020.752951] Pid: 7985, comm: ll_ost_io00_048 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
[39020.754687] Call Trace:
[39020.755302] [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
[39020.756438] [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
[39020.757446] [<ffffffffc053e4d3>] txg_wait_open+0xc3/0x110 [zfs]
[39020.758847] [<ffffffffc04f3dca>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[39020.759930] [<ffffffffc04f3e72>] dmu_tx_assign+0x92/0x490 [zfs]
[39020.761024] [<ffffffffc1184009>] osd_trans_start+0x199/0x440 [osd_zfs]
[39020.762239] [<ffffffffc12c1c85>] ofd_trans_start+0x75/0xf0 [ofd]
[39020.763368] [<ffffffffc12c8881>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[39020.764542] [<ffffffffc12ccc6c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[39020.765635] [<ffffffffc0fb747c>] tgt_brw_write+0x10cc/0x1cf0 [ptlrpc]
[39020.767164] [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[39020.768414] [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[39020.769802] [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[39020.771023] [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[39020.772060] [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
[39020.773194] [<ffffffffffffffff>] 0xffffffffffffffff
[39020.774176] LustreError: dumping log to /tmp/lustre-log.1556016359.7985
[39025.011155] LNet: Service thread pid 7985 completed after 68.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[39501.389765] LNet: Service thread pid 16546 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[39501.393077] Pid: 16546, comm: ll_ost00_009 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
[39501.394929] Call Trace:
[39501.395582] [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
[39501.396880] [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
[39501.397884] [<ffffffffc053e2bf>] txg_wait_synced+0xef/0x140 [zfs]
[39501.399302] [<ffffffffc118b69e>] osd_object_sync+0x16e/0x180 [osd_zfs]
[39501.400686] [<ffffffffc0fad8a7>] tgt_sync+0xb7/0x270 [ptlrpc]
[39501.402075] [<ffffffffc12af731>] ofd_sync_hdl+0x111/0x530 [ofd]
[39501.403303] [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[39501.404733] [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[39501.406112] [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[39501.407350] [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[39501.408407] [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
[39501.409609] [<ffffffffffffffff>] 0xffffffffffffffff
[39501.410651] LustreError: dumping log to /tmp/lustre-log.1556016839.16546
…
[40140.279045] in:imjournal D ffff911dbbed5140 0 3670 1 0x00000080
[40140.280809] Call Trace:
[40140.281623] [<ffffffff98366d60>] ? bit_wait+0x50/0x50
[40140.282901] [<ffffffff98368c49>] schedule+0x29/0x70
[40140.283998] [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
[40140.285254] [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
[40140.286833] [<ffffffff97d01092>] ? ktime_get_ts64+0x52/0xf0
[40140.288092] [<ffffffff98366d60>] ? bit_wait+0x50/0x50
[40140.289171] [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
[40140.290445] [<ffffffff98368388>] io_schedule+0x18/0x20
[40140.291769] [<ffffffff98366d71>] bit_wait_io+0x11/0x50
[40140.293061] [<ffffffff98366897>] __wait_on_bit+0x67/0x90
[40140.294200] [<ffffffff97db93de>] ? __find_get_pages+0x11e/0x1c0
[40140.295549] [<ffffffff97db5881>] wait_on_page_bit+0x81/0xa0
[40140.296959] [<ffffffff97cc2e00>] ? wake_bit_function+0x40/0x40
[40140.298307] [<ffffffff97dc706b>] truncate_inode_pages_range+0x42b/0x700
[40140.342038] [<ffffffffc02a5dbc>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
[40140.343859] [<ffffffffc0281d58>] ? ext4_rename+0x168/0x890 [ext4]
[40140.345266] [<ffffffff97e51151>] ? link_path_walk+0x81/0x8b0
[40140.346503] [<ffffffff97eaf38a>] ? __dquot_initialize+0x3a/0x240
[40140.347746] [<ffffffff97e6fe5a>] ? __inode_wait_for_writeback+0x7a/0xf0
[40140.349584] [<ffffffff97dc73af>] truncate_inode_pages_final+0x4f/0x60
[40140.350965] [<ffffffffc027841f>] ext4_evict_inode+0x10f/0x480 [ext4]
[40140.352310] [<ffffffff97e5eeb4>] evict+0xb4/0x180
[40140.353476] [<ffffffff97e5f7bc>] iput+0xfc/0x190
[40140.354709] [<ffffffff97e5a020>] __dentry_kill+0x120/0x180
[40140.356039] [<ffffffff97e5a130>] dput+0xb0/0x160
[40140.357170] [<ffffffff97e53f58>] SYSC_renameat2+0x518/0x5a0
[40140.358364] [<ffffffff97defa61>] ? __vma_rb_erase+0x121/0x220
[40140.359563] [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
[40140.361085] [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
[40140.362608] [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
[40140.364012] [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
[40140.365472] [<ffffffff97e54e5e>] SyS_renameat2+0xe/0x10
[40140.366695] [<ffffffff97e54e9e>] SyS_rename+0x1e/0x20
[40140.367909] [<ffffffff98375ddb>] system_call_fastpath+0x22/0x27
[40140.369258] [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
…
[40141.471161] txg_sync D ffff911dbc145140 0 27944 2 0x00000080
[40141.472432] Call Trace:
[40141.472873] [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
[40141.473770] [<ffffffff98368c49>] schedule+0x29/0x70
[40141.474622] [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
[40141.475635] [<ffffffffc052c57e>] ? spa_taskq_dispatch_ent+0x8e/0xc0 [zfs]
[40141.476797] [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
[40141.477853] [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
[40141.478872] [<ffffffff97cc28c6>] ? prepare_to_wait_exclusive+0x56/0x90
[40141.479986] [<ffffffff98368388>] io_schedule+0x18/0x20
[40141.480884] [<ffffffffc03c7262>] cv_wait_common+0xb2/0x150 [spl]
[40141.481929] [<ffffffff97cc2d40>] ? wake_up_atomic_t+0x30/0x30
[40141.482928] [<ffffffffc03c7338>] __cv_wait_io+0x18/0x20 [spl]
[40141.483963] [<ffffffffc0596a3b>] zio_wait+0x11b/0x1c0 [zfs]
[40141.484952] [<ffffffffc050cb50>] dsl_pool_sync+0x3e0/0x440 [zfs]
[40141.486020] [<ffffffffc052a907>] spa_sync+0x437/0xd90 [zfs]
[40141.487015] [<ffffffff97cd6802>] ? default_wake_function+0x12/0x20
[40141.488130] [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
[40141.489066] [<ffffffffc053f321>] txg_sync_thread+0x301/0x510 [zfs]
[40141.490151] [<ffffffffc053f020>] ? txg_fini+0x2a0/0x2a0 [zfs]
[40141.491152] [<ffffffffc03c2063>] thread_generic_wrapper+0x73/0x80 [spl]
[40141.492288] [<ffffffffc03c1ff0>] ? __thread_exit+0x20/0x20 [spl]
[40141.493328] [<ffffffff97cc1c71>] kthread+0xd1/0xe0
[40141.494186] [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
[40141.495223] [<ffffffff98375c37>] ret_from_fork_nospec_begin+0x21/0x21
[40141.496325] [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
…

This hang looks similar to |
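If this reproduces again, it may help to snapshot the ZFS transaction-group state on the affected OSS at the same time the watchdog stack dumps appear, since the traces above show service threads waiting in dmu_tx_assign -> txg_wait_open while txg_sync is in zio_wait. Below is a minimal collection sketch; it is not part of sanity-benchmark, the output paths are illustrative, and it assumes root shell access on the OSS with a ZFS-on-Linux build that exposes the /proc/spl/kstat/zfs kstats.

```bash
#!/bin/bash
# Hypothetical helper (not part of the test suite): gather ZFS txg state and
# thread stacks on an OSS that logs "Service thread ... was inactive" while
# txg_sync is blocked in zio_wait.
out=/tmp/txg-hang-$(date +%s)
mkdir -p "$out"

# Per-pool txg history: which txg is open/quiescing/syncing and for how long.
for f in /proc/spl/kstat/zfs/*/txgs; do
    cp "$f" "$out/$(echo "$f" | tr / _)"
done

# Global dmu_tx counters: delay/throttle counts hint at the ZFS write throttle.
cp /proc/spl/kstat/zfs/dmu_tx "$out/dmu_tx"

# Stacks of txg_sync and any ll_ost* service threads still blocked.
for pid in $(pgrep 'txg_sync|ll_ost'); do
    {
        echo "== pid $pid ($(cat /proc/$pid/comm))"
        cat /proc/$pid/stack
    } >> "$out/stacks" 2>/dev/null
done

# Lustre debug buffer, the same data the watchdog writes to /tmp/lustre-log.*
lctl dk "$out/lustre-dk.log"
echo "collected in $out"
```

The txgs kstat shows how long the currently syncing txg has been open, which would help distinguish a genuinely stuck txg_sync from one that is merely slow under load, which is what the "completed after 68.51s" message suggests for pid 7985.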
| Comments |
| Comment by Sarah Liu [ 20/May/19 ] |
|
Hit a similar issue in 2.10.8 ZFS testing, non-DNE. |
| Comment by Jian Yu [ 15/Aug/19 ] |
|
obdfilter-survey test 1c hung in ZFS testing on master branch:

[47508.054761] txg_sync D ffff8f857a62c1c0 0 18730 2 0x00000080
[47508.055558] Call Trace:
[47508.055822] [<ffffffffc04e7017>] ? taskq_dispatch_ent+0x57/0x170 [spl]
[47508.056531] [<ffffffffc06a7900>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[47508.057368] [<ffffffffaeb7f1c9>] schedule+0x29/0x70
[47508.057871] [<ffffffffaeb7cb51>] schedule_timeout+0x221/0x2d0
[47508.058507] [<ffffffffc06a691f>] ? zio_taskq_dispatch+0x8f/0xa0 [zfs]
[47508.059211] [<ffffffffae46c27e>] ? kvm_clock_get_cycles+0x1e/0x20
[47508.059826] [<ffffffffaeb7e73d>] io_schedule_timeout+0xad/0x130
[47508.060455] [<ffffffffae4c5d46>] ? prepare_to_wait_exclusive+0x56/0x90
[47508.061121] [<ffffffffaeb7e7d8>] io_schedule+0x18/0x20
[47508.061667] [<ffffffffc04eb262>] cv_wait_common+0xb2/0x150 [spl]
[47508.062306] [<ffffffffae4c61c0>] ? wake_up_atomic_t+0x30/0x30
[47508.062892] [<ffffffffc04eb338>] __cv_wait_io+0x18/0x20 [spl]
[47508.063546] [<ffffffffc06aaa6b>] zio_wait+0x11b/0x1c0 [zfs]
[47508.064163] [<ffffffffc062085f>] dsl_pool_sync+0xbf/0x440 [zfs]
[47508.064801] [<ffffffffc063e937>] spa_sync+0x437/0xd90 [zfs]
[47508.065428] [<ffffffffc0653351>] txg_sync_thread+0x301/0x510 [zfs]
[47508.066090] [<ffffffffc0653050>] ? txg_fini+0x2a0/0x2a0 [zfs]
[47508.066700] [<ffffffffc04e6063>] thread_generic_wrapper+0x73/0x80 [spl]
[47508.067394] [<ffffffffc04e5ff0>] ? __thread_exit+0x20/0x20 [spl]
[47508.068021] [<ffffffffae4c50d1>] kthread+0xd1/0xe0
[47508.068528] [<ffffffffae4c5000>] ? insert_kthread_work+0x40/0x40
[47508.069141] [<ffffffffaeb8bd37>] ret_from_fork_nospec_begin+0x21/0x21
[47508.069805] [<ffffffffae4c5000>] ? insert_kthread_work+0x40/0x40
[47511.400931] lctl D ffff8f8576ad0000 0 13993 13986 0x00000080
[47511.401695] Call Trace:
[47511.401957] [<ffffffffaeb7f1c9>] schedule+0x29/0x70
[47511.402484] [<ffffffffc04eb2d5>] cv_wait_common+0x125/0x150 [spl]
[47511.403128] [<ffffffffae4c61c0>] ? wake_up_atomic_t+0x30/0x30
[47511.403724] [<ffffffffc04eb315>] __cv_wait+0x15/0x20 [spl]
[47511.404343] [<ffffffffc0652503>] txg_wait_open+0xc3/0x110 [zfs]
[47511.404975] [<ffffffffc0607dfa>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
[47511.405612] [<ffffffffc0607ea2>] dmu_tx_assign+0x92/0x490 [zfs]
[47511.406272] [<ffffffffc1373fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[47511.406956] [<ffffffffc1498c35>] ofd_trans_start+0x75/0xf0 [ofd]
[47511.407589] [<ffffffffc149f821>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
[47511.408283] [<ffffffffc14a3c2c>] ofd_commitrw+0x48c/0x9e0 [ofd]
[47511.408942] [<ffffffffc153dfb0>] echo_client_prep_commit.isra.50+0x5b0/0xea0 [obdecho]
[47511.409768] [<ffffffffc1540994>] echo_client_iocontrol+0x914/0x1c50 [obdecho]
[47511.410546] [<ffffffffc0e2e4aa>] class_handle_ioctl+0x192a/0x1e30 [obdclass]
[47511.411302] [<ffffffffae701cbe>] ? security_capable+0x1e/0x20
[47511.411907] [<ffffffffc0e2ea25>] obd_class_ioctl+0x75/0x170 [obdclass]
[47511.412585] [<ffffffffae65d9e0>] do_vfs_ioctl+0x3a0/0x5a0
[47511.413167] [<ffffffffae65dc81>] SyS_ioctl+0xa1/0xc0
[47511.413676] [<ffffffffaeb8be15>] ? system_call_after_swapgs+0xa2/0x146
[47511.414356] [<ffffffffaeb8bede>] system_call_fastpath+0x25/0x2a
[47511.414966] [<ffffffffaeb8be21>] ? system_call_after_swapgs+0xae/0x146

https://testing.whamcloud.com/test_sets/a8573658-bf1c-11e9-98c8-52540065bddc |
| Comment by Alena Nikitenko [ 30/Nov/21 ] |
|
Hit something similar on 2.12.8: https://testing.whamcloud.com/test_sets/0531f6af-85a8-4730-b9e3-d80acc3b6639
|