LU-12234: sanity-benchmark test iozone hangs in txg_sync

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.1, Lustre 2.10.8, Lustre 2.12.4, Lustre 2.12.6, Lustre 2.12.8
    • Severity: 3

    Description

      sanity-benchmark test_iozone hangs in txg_sync for ZFS/DNE testing.
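
      For reference, a minimal sketch of how such a ZFS/DNE run is typically set up with the Lustre test framework is shown below; the specific values (FSTYPE=zfs, MDSCOUNT=2) and the tests directory path are assumptions based on a typical autotest configuration, not values recorded in this ticket:

        # Assumed reproduction sketch, not taken from this report: back the
        # targets with ZFS, enable DNE with a second MDT, and run only the
        # iozone subtest of sanity-benchmark.
        export FSTYPE=zfs          # use osd-zfs for the MDTs/OSTs
        export MDSCOUNT=2          # DNE: more than one MDT
        cd /usr/lib64/lustre/tests
        ONLY=iozone bash sanity-benchmark.sh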

      Looking at a recent failure, with logs at https://testing.whamcloud.com/test_sets/04682612-66aa-11e9-a6f2-52540065bddc, the last thing seen in the suite_log is

      running as uid/gid/euid/egid 500/500/500/500, groups:
       [iozone] [-i] [0] [-i] [1] [-i] [2] [-e] [-+d] [-r] [512] [-s] [1719368] [-t] [2] [-F] [/mnt/lustre/d0.iozone/iozone.1] [/mnt/lustre/d0.iozone/iozone.2]
      	Iozone: Performance Test of File I/O
      	        Version $Revision: 3.373 $
      		Compiled for 64 bit mode.
      		Build: linux-AMD64 
      
      	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
      	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
      	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
      	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
      	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
      	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
      	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer.
      
      	Run began: Tue Apr 23 10:44:20 2019
      
      	Include fsync in write timing
      	>>> I/O Diagnostic mode enabled. <<<
      	Performance measurements are invalid in this mode.
      	Record Size 512 KB
      	File size set to 1719368 KB
      	Command line used: iozone -i 0 -i 1 -i 2 -e -+d -r 512 -s 1719368 -t 2 -F /mnt/lustre/d0.iozone/iozone.1 /mnt/lustre/d0.iozone/iozone.2
      	Output is in Kbytes/sec
      	Time Resolution = 0.000001 seconds.
      	Processor cache size set to 1024 Kbytes.
      	Processor cache line size set to 32 bytes.
      	File stride size set to 17 * record size.
      	Throughput test with 2 processes
      	Each process writes a 1719368 Kbyte file in 512 Kbyte records
      
      	Children see throughput for  2 initial writers 	=    5465.87 KB/sec
      	Parent sees throughput for  2 initial writers 	=    5007.29 KB/sec
      	Min throughput per process 			=    2523.98 KB/sec 
      	Max throughput per process 			=    2941.89 KB/sec
      	Avg throughput per process 			=    2732.94 KB/sec
      	Min xfer 					= 1475072.00 KB
      
      	Children see throughput for  2 rewriters 	=    5791.13 KB/sec
      	Parent sees throughput for  2 rewriters 	=    5787.53 KB/sec
      	Min throughput per process 			=    2895.57 KB/sec 
      	Max throughput per process 			=    2895.57 KB/sec
      	Avg throughput per process 			=    2895.57 KB/sec
      	Min xfer 					= 1719296.00 KB
      

      The OSS console contains the following call traces:

      [38923.077282] Lustre: lustre-OST0001: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
      [38923.077284] Lustre: lustre-OST0000: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
      [38931.596843] Lustre: lustre-OST0002: Connection restored to 2011b398-4625-e110-f053-c52f3747dc69 (at 10.9.5.215@tcp)
      [39020.749857] LNet: Service thread pid 7985 was inactive for 64.24s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [39020.752951] Pid: 7985, comm: ll_ost_io00_048 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
      [39020.754687] Call Trace:
      [39020.755302]  [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
      [39020.756438]  [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
      [39020.757446]  [<ffffffffc053e4d3>] txg_wait_open+0xc3/0x110 [zfs]
      [39020.758847]  [<ffffffffc04f3dca>] dmu_tx_wait+0x3aa/0x3c0 [zfs]
      [39020.759930]  [<ffffffffc04f3e72>] dmu_tx_assign+0x92/0x490 [zfs]
      [39020.761024]  [<ffffffffc1184009>] osd_trans_start+0x199/0x440 [osd_zfs]
      [39020.762239]  [<ffffffffc12c1c85>] ofd_trans_start+0x75/0xf0 [ofd]
      [39020.763368]  [<ffffffffc12c8881>] ofd_commitrw_write+0xa31/0x1d40 [ofd]
      [39020.764542]  [<ffffffffc12ccc6c>] ofd_commitrw+0x48c/0x9e0 [ofd]
      [39020.765635]  [<ffffffffc0fb747c>] tgt_brw_write+0x10cc/0x1cf0 [ptlrpc]
      [39020.767164]  [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [39020.768414]  [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39020.769802]  [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [39020.771023]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
      [39020.772060]  [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
      [39020.773194]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39020.774176] LustreError: dumping log to /tmp/lustre-log.1556016359.7985
      [39025.011155] LNet: Service thread pid 7985 completed after 68.51s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [39501.389765] LNet: Service thread pid 16546 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [39501.393077] Pid: 16546, comm: ll_ost00_009 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Mon Apr 22 22:25:47 UTC 2019
      [39501.394929] Call Trace:
      [39501.395582]  [<ffffffffc03c72d5>] cv_wait_common+0x125/0x150 [spl]
      [39501.396880]  [<ffffffffc03c7315>] __cv_wait+0x15/0x20 [spl]
      [39501.397884]  [<ffffffffc053e2bf>] txg_wait_synced+0xef/0x140 [zfs]
      [39501.399302]  [<ffffffffc118b69e>] osd_object_sync+0x16e/0x180 [osd_zfs]
      [39501.400686]  [<ffffffffc0fad8a7>] tgt_sync+0xb7/0x270 [ptlrpc]
      [39501.402075]  [<ffffffffc12af731>] ofd_sync_hdl+0x111/0x530 [ofd]
      [39501.403303]  [<ffffffffc0fb31da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [39501.404733]  [<ffffffffc0f5880b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [39501.406112]  [<ffffffffc0f5c13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [39501.407350]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
      [39501.408407]  [<ffffffff98375c37>] ret_from_fork_nospec_end+0x0/0x39
      [39501.409609]  [<ffffffffffffffff>] 0xffffffffffffffff
      [39501.410651] LustreError: dumping log to /tmp/lustre-log.1556016839.16546
      …
      [40140.279045] in:imjournal    D ffff911dbbed5140     0  3670      1 0x00000080
      [40140.280809] Call Trace:
      [40140.281623]  [<ffffffff98366d60>] ? bit_wait+0x50/0x50
      [40140.282901]  [<ffffffff98368c49>] schedule+0x29/0x70
      [40140.283998]  [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
      [40140.285254]  [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
      [40140.286833]  [<ffffffff97d01092>] ? ktime_get_ts64+0x52/0xf0
      [40140.288092]  [<ffffffff98366d60>] ? bit_wait+0x50/0x50
      [40140.289171]  [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
      [40140.290445]  [<ffffffff98368388>] io_schedule+0x18/0x20
      [40140.291769]  [<ffffffff98366d71>] bit_wait_io+0x11/0x50
      [40140.293061]  [<ffffffff98366897>] __wait_on_bit+0x67/0x90
      [40140.294200]  [<ffffffff97db93de>] ? __find_get_pages+0x11e/0x1c0
      [40140.295549]  [<ffffffff97db5881>] wait_on_page_bit+0x81/0xa0
      [40140.296959]  [<ffffffff97cc2e00>] ? wake_bit_function+0x40/0x40
      [40140.298307]  [<ffffffff97dc706b>] truncate_inode_pages_range+0x42b/0x700
      [40140.342038]  [<ffffffffc02a5dbc>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
      [40140.343859]  [<ffffffffc0281d58>] ? ext4_rename+0x168/0x890 [ext4]
      [40140.345266]  [<ffffffff97e51151>] ? link_path_walk+0x81/0x8b0
      [40140.346503]  [<ffffffff97eaf38a>] ? __dquot_initialize+0x3a/0x240
      [40140.347746]  [<ffffffff97e6fe5a>] ? __inode_wait_for_writeback+0x7a/0xf0
      [40140.349584]  [<ffffffff97dc73af>] truncate_inode_pages_final+0x4f/0x60
      [40140.350965]  [<ffffffffc027841f>] ext4_evict_inode+0x10f/0x480 [ext4]
      [40140.352310]  [<ffffffff97e5eeb4>] evict+0xb4/0x180
      [40140.353476]  [<ffffffff97e5f7bc>] iput+0xfc/0x190
      [40140.354709]  [<ffffffff97e5a020>] __dentry_kill+0x120/0x180
      [40140.356039]  [<ffffffff97e5a130>] dput+0xb0/0x160
      [40140.357170]  [<ffffffff97e53f58>] SYSC_renameat2+0x518/0x5a0
      [40140.358364]  [<ffffffff97defa61>] ? __vma_rb_erase+0x121/0x220
      [40140.359563]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
      [40140.361085]  [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
      [40140.362608]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
      [40140.364012]  [<ffffffff98375d15>] ? system_call_after_swapgs+0xa2/0x146
      [40140.365472]  [<ffffffff97e54e5e>] SyS_renameat2+0xe/0x10
      [40140.366695]  [<ffffffff97e54e9e>] SyS_rename+0x1e/0x20
      [40140.367909]  [<ffffffff98375ddb>] system_call_fastpath+0x22/0x27
      [40140.369258]  [<ffffffff98375d21>] ? system_call_after_swapgs+0xae/0x146
      …
      [40141.471161] txg_sync        D ffff911dbc145140     0 27944      2 0x00000080
      [40141.472432] Call Trace:
      [40141.472873]  [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
      [40141.473770]  [<ffffffff98368c49>] schedule+0x29/0x70
      [40141.474622]  [<ffffffff98366721>] schedule_timeout+0x221/0x2d0
      [40141.475635]  [<ffffffffc052c57e>] ? spa_taskq_dispatch_ent+0x8e/0xc0 [zfs]
      [40141.476797]  [<ffffffff97c6a0fe>] ? kvm_clock_get_cycles+0x1e/0x20
      [40141.477853]  [<ffffffff983682ed>] io_schedule_timeout+0xad/0x130
      [40141.478872]  [<ffffffff97cc28c6>] ? prepare_to_wait_exclusive+0x56/0x90
      [40141.479986]  [<ffffffff98368388>] io_schedule+0x18/0x20
      [40141.480884]  [<ffffffffc03c7262>] cv_wait_common+0xb2/0x150 [spl]
      [40141.481929]  [<ffffffff97cc2d40>] ? wake_up_atomic_t+0x30/0x30
      [40141.482928]  [<ffffffffc03c7338>] __cv_wait_io+0x18/0x20 [spl]
      [40141.483963]  [<ffffffffc0596a3b>] zio_wait+0x11b/0x1c0 [zfs]
      [40141.484952]  [<ffffffffc050cb50>] dsl_pool_sync+0x3e0/0x440 [zfs]
      [40141.486020]  [<ffffffffc052a907>] spa_sync+0x437/0xd90 [zfs]
      [40141.487015]  [<ffffffff97cd6802>] ? default_wake_function+0x12/0x20
      [40141.488130]  [<ffffffff97cceca4>] ? __wake_up+0x44/0x50
      [40141.489066]  [<ffffffffc053f321>] txg_sync_thread+0x301/0x510 [zfs]
      [40141.490151]  [<ffffffffc053f020>] ? txg_fini+0x2a0/0x2a0 [zfs]
      [40141.491152]  [<ffffffffc03c2063>] thread_generic_wrapper+0x73/0x80 [spl]
      [40141.492288]  [<ffffffffc03c1ff0>] ? __thread_exit+0x20/0x20 [spl]
      [40141.493328]  [<ffffffff97cc1c71>] kthread+0xd1/0xe0
      [40141.494186]  [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
      [40141.495223]  [<ffffffff98375c37>] ret_from_fork_nospec_begin+0x21/0x21
      [40141.496325]  [<ffffffff97cc1ba0>] ? insert_kthread_work+0x40/0x40
      …
      
      

      This hang looks similar to LU-5575, but LU-5575 was closed as a duplicate of LU-4950. In his comment of 13/Mar/17, Alex essentially says that if you don't see ofd_destroy() in the traces, then the hang is probably not LU-4950. Since ofd_destroy() does not appear in any of the traces above, I've opened this ticket to capture the current failure.
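
      As a quick way to apply that check to the dumped console logs, something like the following would do; "oss-console.log" is a placeholder name for the OSS console log collected by autotest, not a path recorded in this ticket:

        # Hypothetical triage helper: LU-4950 hangs show ofd_destroy() in the
        # stuck service threads' stack traces, so its absence points elsewhere.
        if grep -q 'ofd_destroy' oss-console.log; then
            echo "ofd_destroy present in traces - compare against LU-4950"
        else
            echo "no ofd_destroy in traces - likely not LU-4950 (this ticket)"
        fi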

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (Inactive) (jamesanunez)
              Votes: 0
              Watchers: 4
