Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 2.10.7
-
ZFS
-
3
-
9223372036854775807
Description
sanity-benchmark test_iozone hangs. So far, we’re only seeing this hang for ZFS.
Looking at the suite_log for the failure at https://testing.whamcloud.com/test_sets/464d518e-43e5-11e9-a256-52540065bddc, we can see iozone hangs in DirectIO testing
== sanity-benchmark test iozone: iozone ============================================================== 15:04:32 (1552143872) min OST has 1922432kB available, using 3438744kB file size debug=0 running as uid/gid/euid/egid 500/500/500/500, groups: [touch] [/mnt/lustre/d0_runas_test/f27932] running as uid/gid/euid/egid 500/500/500/500, groups: [iozone] [-i] [0] [-i] [1] [-i] [2] [-e] [-+d] [-r] [512] [-s] [3438744] [-f] [/mnt/lustre/d0.iozone/iozone] Iozone: Performance Test of File I/O Version $Revision: 3.373 $ Compiled for 64 bit mode. Build: linux-AMD64 Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone, Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root, Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer. Run began: Sat Mar 9 15:04:38 2019 Include fsync in write timing >>> I/O Diagnostic mode enabled. <<< Performance measurements are invalid in this mode. Record Size 512 KB File size set to 3438744 KB Command line used: iozone -i 0 -i 1 -i 2 -e -+d -r 512 -s 3438744 -f /mnt/lustre/d0.iozone/iozone Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 3438744 512 2826 2713 32747 35474 19214 3473 iozone test complete. debug=0x33f0484 directio on /mnt/lustre/f.iozone for 1x4194304 bytes PASS debug=0 running as uid/gid/euid/egid 500/500/500/500, groups: [iozone] [-I] [-i] [0] [-i] [1] [-i] [2] [-e] [-+d] [-r] [512] [-s] [3438744] [-f] [/mnt/lustre/d0.iozone/iozone.odir] Iozone: Performance Test of File I/O Version $Revision: 3.373 $ Compiled for 64 bit mode. Build: linux-AMD64 Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins Al Slater, Scott Rhine, Mike Wisner, Ken Goss Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner, Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone, Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root, Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer. Run began: Sat Mar 9 16:08:59 2019 O_DIRECT feature enabled Include fsync in write timing >>> I/O Diagnostic mode enabled. <<< Performance measurements are invalid in this mode. Record Size 512 KB File size set to 3438744 KB Command line used: iozone -I -i 0 -i 1 -i 2 -e -+d -r 512 -s 3438744 -f /mnt/lustre/d0.iozone/iozone.odir Output is in Kbytes/sec Time Resolution = 0.000001 seconds. Processor cache size set to 1024 Kbytes. Processor cache line size set to 32 bytes. File stride size set to 17 * record size. random random bkwd record stride KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread 3438744 512
Looking at the console logs, there’s not a lot to go on. On the OSS (vm3) we do see an LNET too many service threads warning before the hang and looking through the call stacks, there are some processes in the D state
[22345.851895] Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone ============================================================== 15:04:32 (1552143872) [22349.818428] Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1922432kB available, using 3438744kB file size [22350.076310] Lustre: DEBUG MARKER: min OST has 1922432kB available, using 3438744kB file size [22402.060402] LNet: Service thread pid 32137 completed after 41.38s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [22402.065241] LNet: Skipped 7 previous similar messages [22410.807172] LNet: Service thread pid 1390 completed after 50.25s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [22410.812202] LNet: Skipped 4 previous similar messages [22431.196168] LNet: Service thread pid 10091 completed after 70.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [22431.199270] LNet: Skipped 21 previous similar messages [23561.656188] LNet: Service thread pid 19365 completed after 50.37s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [23561.659104] LNet: Skipped 23 previous similar messages [23566.276500] LNet: Service thread pid 1643 completed after 54.99s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [23566.280050] LNet: Skipped 1 previous similar message [24830.125602] LNet: Service thread pid 1385 completed after 44.33s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [24830.129505] LNet: Skipped 4 previous similar messages [24832.949732] LNet: Service thread pid 23024 completed after 47.15s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). [24836.845387] LNet: Service thread pid 30496 completed after 51.05s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). … [36836.448221] jbd2/vda1-8 D ffff8c47b92ac100 0 1197 2 0x00000000 [36836.449552] Call Trace: [36836.450017] [<ffffffffa2d65d60>] ? bit_wait+0x50/0x50 [36836.450882] [<ffffffffa2d67c49>] schedule+0x29/0x70 [36836.451724] [<ffffffffa2d65721>] schedule_timeout+0x221/0x2d0 [36836.452739] [<ffffffffa266a0ce>] ? kvm_clock_get_cycles+0x1e/0x20 [36836.453789] [<ffffffffa2d65d60>] ? bit_wait+0x50/0x50 [36836.454659] [<ffffffffa2d672ed>] io_schedule_timeout+0xad/0x130 [36836.455683] [<ffffffffa2d67388>] io_schedule+0x18/0x20 [36836.456574] [<ffffffffa2d65d71>] bit_wait_io+0x11/0x50 [36836.457463] [<ffffffffa2d65897>] __wait_on_bit+0x67/0x90 [36836.458367] [<ffffffffa2d65d60>] ? bit_wait+0x50/0x50 [36836.459228] [<ffffffffa2d65a01>] out_of_line_wait_on_bit+0x81/0xb0 [36836.460283] [<ffffffffa26c2dc0>] ? wake_bit_function+0x40/0x40 [36836.461282] [<ffffffffa2878f1a>] __wait_on_buffer+0x2a/0x30 [36836.462460] [<ffffffffc01e4801>] jbd2_journal_commit_transaction+0x1781/0x19b0 [jbd2] [36836.463787] [<ffffffffa26d0880>] ? finish_task_switch+0x50/0x1c0 [36836.464819] [<ffffffffc01e9e89>] kjournald2+0xc9/0x260 [jbd2] [36836.465805] [<ffffffffa26c2d00>] ? wake_up_atomic_t+0x30/0x30 [36836.466785] [<ffffffffc01e9dc0>] ? commit_timeout+0x10/0x10 [jbd2] [36836.467836] [<ffffffffa26c1c31>] kthread+0xd1/0xe0 [36836.468658] [<ffffffffa26c1b60>] ? insert_kthread_work+0x40/0x40 [36836.469680] [<ffffffffa2d74c37>] ret_from_fork_nospec_begin+0x21/0x21 [36836.470778] [<ffffffffa26c1b60>] ? insert_kthread_work+0x40/0x40 … [36836.877778] auditd D ffff8c47b9094100 0 3085 1 0x00000000 [36836.879028] Call Trace: [36836.879461] [<ffffffffa2d67c49>] schedule+0x29/0x70 [36836.880296] [<ffffffffc01e97c5>] jbd2_log_wait_commit+0xc5/0x140 [jbd2] [36836.881405] [<ffffffffa26c2d00>] ? wake_up_atomic_t+0x30/0x30 [36836.882391] [<ffffffffc01eae52>] jbd2_complete_transaction+0x52/0xa0 [jbd2] [36836.883626] [<ffffffffc0213782>] ext4_sync_file+0x292/0x320 [ext4] [36836.884681] [<ffffffffa2875ef7>] do_fsync+0x67/0xb0 [36836.885533] [<ffffffffa2d74d15>] ? system_call_after_swapgs+0xa2/0x146 [36836.886642] [<ffffffffa28761e0>] SyS_fsync+0x10/0x20 [36836.887505] [<ffffffffa2d74ddb>] system_call_fastpath+0x22/0x27 [36836.888528] [<ffffffffa2d74d21>] ? system_call_after_swapgs+0xae/0x146 … [36837.709648] in:imjournal D ffff8c474007e180 0 3675 1 0x00000080 [36837.711149] Call Trace: [36837.711590] [<ffffffffa2d65d60>] ? bit_wait+0x50/0x50 [36837.712459] [<ffffffffa2d67c49>] schedule+0x29/0x70 [36837.713287] [<ffffffffa2d65721>] schedule_timeout+0x221/0x2d0 [36837.714256] [<ffffffffa266a0ce>] ? kvm_clock_get_cycles+0x1e/0x20 [36837.715294] [<ffffffffa2701052>] ? ktime_get_ts64+0x52/0xf0 [36837.716247] [<ffffffffa2d65d60>] ? bit_wait+0x50/0x50 [36837.717120] [<ffffffffa2d672ed>] io_schedule_timeout+0xad/0x130 [36837.718128] [<ffffffffa2d67388>] io_schedule+0x18/0x20 [36837.719016] [<ffffffffa2d65d71>] bit_wait_io+0x11/0x50 [36837.719907] [<ffffffffa2d65897>] __wait_on_bit+0x67/0x90 [36837.720823] [<ffffffffa27b92be>] ? __find_get_pages+0x11e/0x1c0 [36837.721834] [<ffffffffa27b5761>] wait_on_page_bit+0x81/0xa0 [36837.722787] [<ffffffffa26c2dc0>] ? wake_bit_function+0x40/0x40 [36837.723784] [<ffffffffa27c6f4b>] truncate_inode_pages_range+0x42b/0x700 [36837.724937] [<ffffffffc024bdbc>] ? __ext4_journal_stop+0x3c/0xb0 [ext4] [36837.726062] [<ffffffffc0227d58>] ? ext4_rename+0x168/0x890 [ext4] [36837.727101] [<ffffffffa2850d51>] ? link_path_walk+0x81/0x8b0 [36837.728084] [<ffffffffa28aee8a>] ? __dquot_initialize+0x3a/0x240 [36837.729106] [<ffffffffa286fa5a>] ? __inode_wait_for_writeback+0x7a/0xf0 [36837.730234] [<ffffffffa27c728f>] truncate_inode_pages_final+0x4f/0x60 [36837.731330] [<ffffffffc021e41f>] ext4_evict_inode+0x10f/0x480 [ext4] [36837.732411] [<ffffffffa285eab4>] evict+0xb4/0x180 [36837.733242] [<ffffffffa285f3bc>] iput+0xfc/0x190 [36837.734048] [<ffffffffa2859c20>] __dentry_kill+0x120/0x180 [36837.734987] [<ffffffffa2859d30>] dput+0xb0/0x160 [36837.735793] [<ffffffffa2853b58>] SYSC_renameat2+0x518/0x5a0 [36837.736751] [<ffffffffa27ef841>] ? __vma_rb_erase+0x121/0x220 [36837.737731] [<ffffffffa2d74d21>] ? system_call_after_swapgs+0xae/0x146 [36837.738835] [<ffffffffa2d74d15>] ? system_call_after_swapgs+0xa2/0x146 [36837.739945] [<ffffffffa2d74d21>] ? system_call_after_swapgs+0xae/0x146 [36837.741050] [<ffffffffa2d74d15>] ? system_call_after_swapgs+0xa2/0x146 [36837.742152] [<ffffffffa2854a5e>] SyS_renameat2+0xe/0x10 [36837.743078] [<ffffffffa2854a9e>] SyS_rename+0x1e/0x20 [36837.743951] [<ffffffffa2d74ddb>] system_call_fastpath+0x22/0x27 [36837.744961] [<ffffffffa2d74d21>] ? system_call_after_swapgs+0xae/0x146 … [36842.351359] txg_sync D ffff8c478a3d5140 0 11265 2 0x00000080 [36842.352621] Call Trace: [36842.353049] [<ffffffffa287e565>] ? bio_alloc_bioset+0x115/0x310 [36842.354043] [<ffffffffa2d67c49>] schedule+0x29/0x70 [36842.354892] [<ffffffffa2d65721>] schedule_timeout+0x221/0x2d0 [36842.355886] [<ffffffffa266a0ce>] ? kvm_clock_get_cycles+0x1e/0x20 [36842.356935] [<ffffffffa2701052>] ? ktime_get_ts64+0x52/0xf0 [36842.357893] [<ffffffffa2d672ed>] io_schedule_timeout+0xad/0x130 [36842.358912] [<ffffffffa26c2886>] ? prepare_to_wait_exclusive+0x56/0x90 [36842.360046] [<ffffffffa2d67388>] io_schedule+0x18/0x20 [36842.360935] [<ffffffffc0367242>] cv_wait_common+0xb2/0x150 [spl] [36842.361962] [<ffffffffa26c2d00>] ? wake_up_atomic_t+0x30/0x30 [36842.362944] [<ffffffffc0367318>] __cv_wait_io+0x18/0x20 [spl] [36842.363963] [<ffffffffc053b073>] zio_wait+0x113/0x1c0 [zfs] [36842.364943] [<ffffffffc04efbf1>] vdev_config_sync+0xf1/0x180 [zfs] [36842.366027] [<ffffffffc04cfa9c>] spa_sync+0xa1c/0xd90 [zfs] [36842.366985] [<ffffffffa26d67c2>] ? default_wake_function+0x12/0x20 [36842.368064] [<ffffffffc04e3c71>] txg_sync_thread+0x301/0x510 [zfs] [36842.369139] [<ffffffffc04e3970>] ? txg_fini+0x2a0/0x2a0 [zfs] [36842.370141] [<ffffffffc0362063>] thread_generic_wrapper+0x73/0x80 [spl] [36842.371269] [<ffffffffc0361ff0>] ? __thread_exit+0x20/0x20 [spl] [36842.372345] [<ffffffffa26c1c31>] kthread+0xd1/0xe0 [36842.373182] [<ffffffffa26c1b60>] ? insert_kthread_work+0x40/0x40 [36842.374210] [<ffffffffa2d74c37>] ret_from_fork_nospec_begin+0x21/0x21 [36842.375312] [<ffffffffa26c1b60>] ? insert_kthread_work+0x40/0x40
Could this be LU-8215 ?
Other sanity-benchmark test iozone that fail in this way
https://testing.whamcloud.com/test_sets/8d4b2322-4269-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/c6afa400-4280-11e9-92fe-52540065bddc
https://testing.whamcloud.com/test_sets/c681c26c-40a8-11e9-8e92-52540065bddc
https://testing.whamcloud.com/test_sets/2e66b0c6-40aa-11e9-b98a-52540065bddc