Details
Type: Bug
Resolution: Not a Bug
Priority: Minor
Affects Version/s: Lustre 2.10.6
Environment: ZFS 0.7.9-1 and RHEL 7.5
Severity: 3
Description
sanity-lfsck test_2a hangs for 2.10.6 RC2; logs are at https://testing.whamcloud.com/test_sets/4b667296-ee58-11e8-b67f-52540065bddc . The last thing seen in the client (vm5) test_log is an OST being mounted:
CMD: trevis-37vm7 zfs get -H -o value lustre:svname lustre-ost1/ost1
Starting ost1: lustre-ost1/ost1 /mnt/lustre-ost1
CMD: trevis-37vm7 mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
In the OSS (vm7) console log, we see many errors:
[19184.117195] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
[19214.033512] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.208@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[19214.036476] LustreError: Skipped 24 previous similar messages
[19265.130894] WARNING: MMP writes to pool 'lustre-ost1' have not succeeded in over 20s; suspending pool
[19265.132534] WARNING: Pool 'lustre-ost1' has encountered an uncorrectable I/O failure and has been suspended.
[19265.193259] WARNING: MMP writes to pool 'lustre-ost7' have not succeeded in over 20s; suspending pool
[19265.193297] WARNING: MMP writes to pool 'lustre-ost5' have not succeeded in over 20s; suspending pool
[19265.193298] WARNING: Pool 'lustre-ost5' has encountered an uncorrectable I/O failure and has been suspended.
[19265.193320] WARNING: MMP writes to pool 'lustre-ost6' have not succeeded in over 20s; suspending pool
[19265.193320] WARNING: Pool 'lustre-ost6' has encountered an uncorrectable I/O failure and has been suspended.
[19265.201515] WARNING: Pool 'lustre-ost7' has encountered an uncorrectable I/O failure and has been suspended.
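The MMP warnings indicate the OST pools could not complete their multihost (MMP) heartbeat writes, so ZFS suspended them. As a rough triage sketch (not part of the original test run; the pool name lustre-ost1 is taken from the log above, and the exact values depend on the ZFS 0.7.x configuration), one could check the MMP settings and pool state on the OSS with:

# Is multihost/MMP enabled on the suspended pool?
zpool get multihost lustre-ost1
# MMP heartbeat interval in milliseconds (module parameter)
cat /sys/module/zfs/parameters/zfs_multihost_interval
# Show the suspended state and any vdev errors
zpool status -v lustre-ost1
# Once the underlying device is responsive again, resume the pool
zpool clear lustre-ost1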
In the MDS (vm8) console log, we see errors, call traces for blocked processes, and buffer I/O errors:
[19181.922927] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[19203.638907] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 20s; suspending pool
[19203.640572] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.
[19273.862749] LustreError: 11-0: lustre-OST0003-osc-MDT0000: operation ost_connect to node 10.9.5.207@tcp failed: rc = -19
[19320.390856] INFO: task jbd2/vda1-8:264 blocked for more than 120 seconds.
[19320.392208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19320.393530] jbd2/vda1-8 D ffff9ed240013f40 0 264 2 0x00000000
[19320.394880] Call Trace:
[19320.395379] [<ffffffffb3329774>] ? blk_mq_run_hw_queue+0x14/0x20
[19320.396412] [<ffffffffb3330914>] ? blk_mq_sched_insert_requests+0x64/0x80
[19320.397586] [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
[19320.398511] [<ffffffffb3718f39>] schedule+0x29/0x70
[19320.399374] [<ffffffffb37168a9>] schedule_timeout+0x239/0x2c0
[19320.400389] [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
[19320.401483] [<ffffffffb30fa982>] ? ktime_get_ts64+0x52/0xf0
[19320.402442] [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
[19320.403477] [<ffffffffb30fa982>] ? ktime_get_ts64+0x52/0xf0
[19320.404441] [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
[19320.405317] [<ffffffffb371844d>] io_schedule_timeout+0xad/0x130
[19320.406335] [<ffffffffb37184e8>] io_schedule+0x18/0x20
[19320.407240] [<ffffffffb3716ed1>] bit_wait_io+0x11/0x50
[19320.408143] [<ffffffffb37169f7>] __wait_on_bit+0x67/0x90
[19320.409072] [<ffffffffb31962a1>] wait_on_page_bit+0x81/0xa0
[19320.410054] [<ffffffffb30befd0>] ? wake_bit_function+0x40/0x40
[19320.411063] [<ffffffffb31963d1>] __filemap_fdatawait_range+0x111/0x190
[19320.412194] [<ffffffffb3196464>] filemap_fdatawait_range+0x14/0x30
[19320.413258] [<ffffffffb31964a7>] filemap_fdatawait+0x27/0x30
[19320.414288] [<ffffffffc0375af1>] jbd2_journal_commit_transaction+0xa81/0x19b0 [jbd2]
[19320.415623] [<ffffffffb302a59e>] ? __switch_to+0xce/0x580
[19320.416569] [<ffffffffb30cc7f0>] ? finish_task_switch+0x50/0x170
[19320.417623] [<ffffffffc037bab9>] kjournald2+0xc9/0x260 [jbd2]
[19320.418672] [<ffffffffb30bef10>] ? wake_up_atomic_t+0x30/0x30
[19320.419683] [<ffffffffc037b9f0>] ? commit_timeout+0x10/0x10 [jbd2]
[19320.420756] [<ffffffffb30bdf21>] kthread+0xd1/0xe0
[19320.421606] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.422658] [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
[19320.423767] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.424806] INFO: task kworker/u4:2:22918 blocked for more than 120 seconds.
[19320.425988] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19320.427306] kworker/u4:2 D ffff9ed274490fd0 0 22918 2 0x00000080
[19320.428596] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[19320.429724] Call Trace:
[19320.430179] [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
[19320.431054] [<ffffffffb3718f39>] schedule+0x29/0x70
[19320.431914] [<ffffffffb37168a9>] schedule_timeout+0x239/0x2c0
[19320.432929] [<ffffffffb31fbd1e>] ? __kmalloc+0x2e/0x230
[19320.433853] [<ffffffffc02b60a4>] ? virtqueue_add+0x1c4/0x4d0 [virtio_ring]
[19320.435081] [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
[19320.436139] [<ffffffffb30fa982>] ? ktime_get_ts64+0x52/0xf0
[19320.437099] [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
[19320.437983] [<ffffffffb371844d>] io_schedule_timeout+0xad/0x130
[19320.439000] [<ffffffffb37184e8>] io_schedule+0x18/0x20
[19320.439905] [<ffffffffb3716ed1>] bit_wait_io+0x11/0x50
[19320.440788] [<ffffffffb37169f7>] __wait_on_bit+0x67/0x90
[19320.441703] [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
[19320.442586] [<ffffffffb3716b61>] out_of_line_wait_on_bit+0x81/0xb0
[19320.443646] [<ffffffffb30befd0>] ? wake_bit_function+0x40/0x40
[19320.444662] [<ffffffffc0373ce5>] do_get_write_access+0x285/0x4d0 [jbd2]
[19320.445800] [<ffffffffb325649d>] ? __getblk+0x2d/0x300
[19320.446745] [<ffffffffc0373f57>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
[19320.448075] [<ffffffffc03c7beb>] __ext4_journal_get_write_access+0x3b/0x80 [ext4]
[19320.449355] [<ffffffffc0398f30>] ext4_reserve_inode_write+0x70/0xa0 [ext4]
[19320.450534] [<ffffffffc039c850>] ? ext4_dirty_inode+0x40/0x60 [ext4]
[19320.451628] [<ffffffffc0398fb3>] ext4_mark_inode_dirty+0x53/0x220 [ext4]
[19320.452779] [<ffffffffc039c850>] ext4_dirty_inode+0x40/0x60 [ext4]
[19320.453841] [<ffffffffb324d55d>] __mark_inode_dirty+0x16d/0x270
[19320.454868] [<ffffffffc039588d>] ext4_da_update_reserve_space+0x13d/0x190 [ext4]
[19320.456142] [<ffffffffc03c5994>] ext4_ext_map_blocks+0xaa4/0xf60 [ext4]
[19320.457277] [<ffffffffc0395a75>] ext4_map_blocks+0x155/0x6e0 [ext4]
[19320.458369] [<ffffffffc03995ac>] ? ext4_writepages+0x42c/0xd40 [ext4]
[19320.459476] [<ffffffffc039984a>] ext4_writepages+0x6ca/0xd40 [ext4]
[19320.460558] [<ffffffffb31a3b81>] do_writepages+0x21/0x50
[19320.461476] [<ffffffffb324cfd0>] __writeback_single_inode+0x40/0x260
[19320.462559] [<ffffffffb324da64>] writeback_sb_inodes+0x1c4/0x490
[19320.463584] [<ffffffffb324ddcf>] __writeback_inodes_wb+0x9f/0xd0
[19320.464611] [<ffffffffb324e603>] wb_writeback+0x263/0x2f0
[19320.465541] [<ffffffffb323ab0c>] ? get_nr_inodes+0x4c/0x70
[19320.466486] [<ffffffffb324ef8b>] bdi_writeback_workfn+0x2cb/0x460
[19320.467529] [<ffffffffb30b613f>] process_one_work+0x17f/0x440
[19320.468516] [<ffffffffb30b71d6>] worker_thread+0x126/0x3c0
[19320.469458] [<ffffffffb30b70b0>] ? manage_workers.isra.24+0x2a0/0x2a0
[19320.470550] [<ffffffffb30bdf21>] kthread+0xd1/0xe0
[19320.471384] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.472408] [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
[19320.473497] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.474524] INFO: task txg_sync:7615 blocked for more than 120 seconds.
[19320.475631] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19320.476931] txg_sync D ffff9ed29c024f10 0 7615 2 0x00000080
[19320.478183] Call Trace:
[19320.478645] [<ffffffffc0478fc7>] ? taskq_dispatch_ent+0x57/0x170 [spl]
[19320.479755] [<ffffffffb3718f39>] schedule+0x29/0x70
[19320.480598] [<ffffffffb37168a9>] schedule_timeout+0x239/0x2c0
[19320.481836] [<ffffffffc0694e7f>] ? zio_taskq_dispatch+0x8f/0xa0 [zfs]
[19320.482943] [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
[19320.483985] [<ffffffffb371844d>] io_schedule_timeout+0xad/0x130
[19320.485018] [<ffffffffb30beac6>] ? prepare_to_wait_exclusive+0x56/0x90
[19320.486137] [<ffffffffb37184e8>] io_schedule+0x18/0x20
[19320.487035] [<ffffffffc047d192>] cv_wait_common+0xb2/0x150 [spl]
[19320.488073] [<ffffffffb30bef10>] ? wake_up_atomic_t+0x30/0x30
[19320.489066] [<ffffffffc047d268>] __cv_wait_io+0x18/0x20 [spl]
[19320.490126] [<ffffffffc0699023>] zio_wait+0x113/0x1c0 [zfs]
[19320.491201] [<ffffffffc060f40f>] dsl_pool_sync+0xbf/0x440 [zfs]
[19320.492246] [<ffffffffc062d4b7>] spa_sync+0x437/0xd90 [zfs]
[19320.493211] [<ffffffffb30d2022>] ? default_wake_function+0x12/0x20
[19320.494259] [<ffffffffb30caba4>] ? __wake_up+0x44/0x50
[19320.495176] [<ffffffffc0641c41>] txg_sync_thread+0x301/0x510 [zfs]
[19320.496257] [<ffffffffc0641940>] ? txg_fini+0x2a0/0x2a0 [zfs]
[19320.497245] [<ffffffffc0478013>] thread_generic_wrapper+0x73/0x80 [spl]
[19320.498374] [<ffffffffc0477fa0>] ? __thread_exit+0x20/0x20 [spl]
[19320.499401] [<ffffffffb30bdf21>] kthread+0xd1/0xe0
[19320.500253] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.501286] [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
[19320.502385] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.503415] INFO: task ll_mgs_0001:7835 blocked for more than 120 seconds.
[19320.504564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19320.505860] ll_mgs_0001 D ffff9ed294508000 0 7835 2 0x00000080
[19320.507100] Call Trace:
[19320.507537] [<ffffffffb3718f39>] schedule+0x29/0x70
[19320.508381] [<ffffffffc047d205>] cv_wait_common+0x125/0x150 [spl]
[19320.509429] [<ffffffffb30bef10>] ? wake_up_atomic_t+0x30/0x30
[19320.510423] [<ffffffffc047d245>] __cv_wait+0x15/0x20 [spl]
[19320.511401] [<ffffffffc0640bdf>] txg_wait_synced+0xef/0x140 [zfs]
[19320.512482] [<ffffffffc124834e>] osd_trans_stop+0x4be/0x5a0 [osd_zfs]
[19320.513667] [<ffffffffc12d3cdb>] mgs_ir_update+0x2eb/0xb70 [mgs]
[19320.514701] [<ffffffffc12b3d57>] mgs_target_reg+0x787/0x1370 [mgs]
[19320.516108] [<ffffffffc0ffd24f>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[19320.517377] [<ffffffffc0ffd3d1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[19320.518624] [<ffffffffc106038a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[19320.519804] [<ffffffffc1008e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[19320.521082] [<ffffffffb30c7c4b>] ? __wake_up_common+0x5b/0x90
[19320.522095] [<ffffffffc100c592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[19320.523182] [<ffffffffc100bb00>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[19320.524423] [<ffffffffb30bdf21>] kthread+0xd1/0xe0
[19320.525254] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.526276] [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
[19320.527373] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.528395] INFO: task ll_mgs_0002:7836 blocked for more than 120 seconds.
[19320.529543] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19320.530836] ll_mgs_0002 D ffff9ed276b3eeb0 0 7836 2 0x00000080
[19320.532123] Call Trace:
[19320.532554] [<ffffffffb3719e59>] schedule_preempt_disabled+0x29/0x70
[19320.533622] [<ffffffffb3717c17>] __mutex_lock_slowpath+0xc7/0x1d0
[19320.534664] [<ffffffffb3716fff>] mutex_lock+0x1f/0x2f
[19320.535545] [<ffffffffc12d4af9>] mgs_get_ir_logs+0x599/0x10c0 [mgs]
[19320.536647] [<ffffffffc1035196>] ? null_alloc_rs+0x186/0x340 [ptlrpc]
[19320.537770] [<ffffffffc0ffea20>] ? lustre_swab_mgs_target_info+0x70/0x70 [ptlrpc]
[19320.539030] [<ffffffffc12af810>] mgs_config_read+0xc0/0x1c0 [mgs]
[19320.540130] [<ffffffffc106038a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[19320.541310] [<ffffffffc1008e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[19320.542589] [<ffffffffb30c7c4b>] ? __wake_up_common+0x5b/0x90
[19320.543606] [<ffffffffc100c592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[19320.544689] [<ffffffffc100bb00>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[19320.545929] [<ffffffffb30bdf21>] kthread+0xd1/0xe0
[19320.546754] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.547771] [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
[19320.548863] [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
[19320.571934] blk_update_request: I/O error, dev vda, sector 1586528
[19320.573557] EXT4-fs warning (device vda1): ext4_end_bio:316: I/O error -5 writing to inode 394026 (offset 1753088 size 12288 starting block 198319)
[19320.575691] Buffer I/O error on device vda1, logical block 198060
[19320.576701] Buffer I/O error on device vda1, logical block 198061
[19320.577708] Buffer I/O error on device vda1, logical block 198062
[19320.578726] blk_update_request: I/O error, dev vda, sector 1658040
[19320.578734] blk_update_request: I/O error, dev vda, sector 49363312
[19320.580787] EXT4-fs warning (device vda1): ext4_end_bio:316: I/O error -5 writing to inode 394025 (offset 9007104 size 73728 starting block 207273)
[19320.582923] Buffer I/O error on device vda1, logical block 206999
[19320.583959] Buffer I/O error on device vda1, logical block 207000
[19320.584973] Buffer I/O error on device vda1, logical block 207001
[19320.585982] Buffer I/O error on device vda1, logical block 207002
[19320.586984] Buffer I/O error on device vda1, logical block 207003
[19320.588012] Buffer I/O error on device vda1, logical block 207004
[19320.589028] Buffer I/O error on device vda1, logical block 207005
[19320.590074] blk_update_request: I/O error, dev vda, sector 1586520
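The blocked txg_sync thread together with the buffer I/O errors on vda1 suggests the MDS VM's virtual disk itself stopped servicing writes, which would explain both the suspended lustre-mdt1 pool and the ext4 errors on the root filesystem. As a minimal, hedged sketch (device, pool name, and the txg_sync pid 7615 are taken from the log above; these commands were not part of the original run), one could confirm the state on the MDS with:

# Collect the hung-task and disk error messages in one place
dmesg | grep -E 'blocked for more than|I/O error|suspending pool'
# Current stack of the blocked txg_sync thread (pid from the trace above)
cat /proc/7615/stack
# Pool state after the suspension
zpool status -v lustre-mdt1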
For this run, we are using ZFS 0.7.9-1 and RHEL 7.5.
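For completeness, a small sketch of how those versions could be confirmed on the test nodes (package names assume the kmod-style ZFS packaging on RHEL; a DKMS install would use zfs-dkms instead):

# Installed ZFS/SPL packages and the loaded module version
rpm -q zfs spl kmod-zfs kmod-spl
cat /sys/module/zfs/version
# Distribution release
cat /etc/redhat-release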
So far, this failure has only been seen once. It looks like LU-9845, but since we haven't seen LU-9845 in a while, I've opened a new ticket to track this issue.