LU-11704: sanity-lfsck test 2a hangs with 'Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.'


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.6
    • Component/s: None
    • Environment: ZFS 0.7.9-1 and RHEL 7.5
    • Severity: 3

    Description

      sanity-lfsck test_2a hangs for 2.10.6 RC2; logs are at https://testing.whamcloud.com/test_sets/4b667296-ee58-11e8-b67f-52540065bddc . The last thing seen in the client (vm5) test_log is the mount of an OST:

      CMD: trevis-37vm7 zfs get -H -o value 						lustre:svname lustre-ost1/ost1
      Starting ost1:   lustre-ost1/ost1 /mnt/lustre-ost1
      CMD: trevis-37vm7 mkdir -p /mnt/lustre-ost1; mount -t lustre   		                   lustre-ost1/ost1 /mnt/lustre-ost1
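
      As a point of reference, this is roughly how the same mount could be sanity-checked by hand from the OSS (trevis-37vm7), assuming shell access and the pool/dataset names shown above; a minimal sketch, not part of the test framework:

      # Check that the backing pool is healthy and that the dataset carries a Lustre
      # service name, then mount it as an OST (same command the test framework issues).
      zpool status -x lustre-ost1                         # prints "pool 'lustre-ost1' is healthy" when all is well
      zfs get -H -o value lustre:svname lustre-ost1/ost1  # should print the OST service name
      mkdir -p /mnt/lustre-ost1
      mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1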
      

      In the OSS (vm7) console log, we see many errors; MMP writes to the OST pools stop succeeding and ZFS suspends the pools:

      [19184.117195] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost1; mount -t lustre   		                   lustre-ost1/ost1 /mnt/lustre-ost1
      [19214.033512] LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 10.9.5.208@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [19214.036476] LustreError: Skipped 24 previous similar messages
      [19265.130894] WARNING: MMP writes to pool 'lustre-ost1' have not succeeded in over 20s; suspending pool
      [19265.132534] WARNING: Pool 'lustre-ost1' has encountered an uncorrectable I/O failure and has been suspended.
      
      [19265.193259] WARNING: MMP writes to pool 'lustre-ost7' have not succeeded in over 20s; suspending pool
      [19265.193297] WARNING: MMP writes to pool 'lustre-ost5' have not succeeded in over 20s; suspending pool
      [19265.193298] WARNING: Pool 'lustre-ost5' has encountered an uncorrectable I/O failure and has been suspended.
      
      [19265.193320] WARNING: MMP writes to pool 'lustre-ost6' have not succeeded in over 20s; suspending pool
      [19265.193320] WARNING: Pool 'lustre-ost6' has encountered an uncorrectable I/O failure and has been suspended.
      
      [19265.201515] WARNING: Pool 'lustre-ost7' has encountered an uncorrectable I/O failure and has been suspended.
      

      In the MDS (vm8) console log, we see the same MMP warning for 'lustre-mdt1', call traces for blocked processes, and buffer I/O errors:

      [19181.922927] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
      [19203.638907] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 20s; suspending pool
      [19203.640572] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.
      
      [19273.862749] LustreError: 11-0: lustre-OST0003-osc-MDT0000: operation ost_connect to node 10.9.5.207@tcp failed: rc = -19
      [19320.390856] INFO: task jbd2/vda1-8:264 blocked for more than 120 seconds.
      [19320.392208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [19320.393530] jbd2/vda1-8     D ffff9ed240013f40     0   264      2 0x00000000
      [19320.394880] Call Trace:
      [19320.395379]  [<ffffffffb3329774>] ? blk_mq_run_hw_queue+0x14/0x20
      [19320.396412]  [<ffffffffb3330914>] ? blk_mq_sched_insert_requests+0x64/0x80
      [19320.397586]  [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
      [19320.398511]  [<ffffffffb3718f39>] schedule+0x29/0x70
      [19320.399374]  [<ffffffffb37168a9>] schedule_timeout+0x239/0x2c0
      [19320.400389]  [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
      [19320.401483]  [<ffffffffb30fa982>] ? ktime_get_ts64+0x52/0xf0
      [19320.402442]  [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
      [19320.403477]  [<ffffffffb30fa982>] ? ktime_get_ts64+0x52/0xf0
      [19320.404441]  [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
      [19320.405317]  [<ffffffffb371844d>] io_schedule_timeout+0xad/0x130
      [19320.406335]  [<ffffffffb37184e8>] io_schedule+0x18/0x20
      [19320.407240]  [<ffffffffb3716ed1>] bit_wait_io+0x11/0x50
      [19320.408143]  [<ffffffffb37169f7>] __wait_on_bit+0x67/0x90
      [19320.409072]  [<ffffffffb31962a1>] wait_on_page_bit+0x81/0xa0
      [19320.410054]  [<ffffffffb30befd0>] ? wake_bit_function+0x40/0x40
      [19320.411063]  [<ffffffffb31963d1>] __filemap_fdatawait_range+0x111/0x190
      [19320.412194]  [<ffffffffb3196464>] filemap_fdatawait_range+0x14/0x30
      [19320.413258]  [<ffffffffb31964a7>] filemap_fdatawait+0x27/0x30
      [19320.414288]  [<ffffffffc0375af1>] jbd2_journal_commit_transaction+0xa81/0x19b0 [jbd2]
      [19320.415623]  [<ffffffffb302a59e>] ? __switch_to+0xce/0x580
      [19320.416569]  [<ffffffffb30cc7f0>] ? finish_task_switch+0x50/0x170
      [19320.417623]  [<ffffffffc037bab9>] kjournald2+0xc9/0x260 [jbd2]
      [19320.418672]  [<ffffffffb30bef10>] ? wake_up_atomic_t+0x30/0x30
      [19320.419683]  [<ffffffffc037b9f0>] ? commit_timeout+0x10/0x10 [jbd2]
      [19320.420756]  [<ffffffffb30bdf21>] kthread+0xd1/0xe0
      [19320.421606]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.422658]  [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [19320.423767]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.424806] INFO: task kworker/u4:2:22918 blocked for more than 120 seconds.
      [19320.425988] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [19320.427306] kworker/u4:2    D ffff9ed274490fd0     0 22918      2 0x00000080
      [19320.428596] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
      [19320.429724] Call Trace:
      [19320.430179]  [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
      [19320.431054]  [<ffffffffb3718f39>] schedule+0x29/0x70
      [19320.431914]  [<ffffffffb37168a9>] schedule_timeout+0x239/0x2c0
      [19320.432929]  [<ffffffffb31fbd1e>] ? __kmalloc+0x2e/0x230
      [19320.433853]  [<ffffffffc02b60a4>] ? virtqueue_add+0x1c4/0x4d0 [virtio_ring]
      [19320.435081]  [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
      [19320.436139]  [<ffffffffb30fa982>] ? ktime_get_ts64+0x52/0xf0
      [19320.437099]  [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
      [19320.437983]  [<ffffffffb371844d>] io_schedule_timeout+0xad/0x130
      [19320.439000]  [<ffffffffb37184e8>] io_schedule+0x18/0x20
      [19320.439905]  [<ffffffffb3716ed1>] bit_wait_io+0x11/0x50
      [19320.440788]  [<ffffffffb37169f7>] __wait_on_bit+0x67/0x90
      [19320.441703]  [<ffffffffb3716ec0>] ? bit_wait+0x50/0x50
      [19320.442586]  [<ffffffffb3716b61>] out_of_line_wait_on_bit+0x81/0xb0
      [19320.443646]  [<ffffffffb30befd0>] ? wake_bit_function+0x40/0x40
      [19320.444662]  [<ffffffffc0373ce5>] do_get_write_access+0x285/0x4d0 [jbd2]
      [19320.445800]  [<ffffffffb325649d>] ? __getblk+0x2d/0x300
      [19320.446745]  [<ffffffffc0373f57>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
      [19320.448075]  [<ffffffffc03c7beb>] __ext4_journal_get_write_access+0x3b/0x80 [ext4]
      [19320.449355]  [<ffffffffc0398f30>] ext4_reserve_inode_write+0x70/0xa0 [ext4]
      [19320.450534]  [<ffffffffc039c850>] ? ext4_dirty_inode+0x40/0x60 [ext4]
      [19320.451628]  [<ffffffffc0398fb3>] ext4_mark_inode_dirty+0x53/0x220 [ext4]
      [19320.452779]  [<ffffffffc039c850>] ext4_dirty_inode+0x40/0x60 [ext4]
      [19320.453841]  [<ffffffffb324d55d>] __mark_inode_dirty+0x16d/0x270
      [19320.454868]  [<ffffffffc039588d>] ext4_da_update_reserve_space+0x13d/0x190 [ext4]
      [19320.456142]  [<ffffffffc03c5994>] ext4_ext_map_blocks+0xaa4/0xf60 [ext4]
      [19320.457277]  [<ffffffffc0395a75>] ext4_map_blocks+0x155/0x6e0 [ext4]
      [19320.458369]  [<ffffffffc03995ac>] ? ext4_writepages+0x42c/0xd40 [ext4]
      [19320.459476]  [<ffffffffc039984a>] ext4_writepages+0x6ca/0xd40 [ext4]
      [19320.460558]  [<ffffffffb31a3b81>] do_writepages+0x21/0x50
      [19320.461476]  [<ffffffffb324cfd0>] __writeback_single_inode+0x40/0x260
      [19320.462559]  [<ffffffffb324da64>] writeback_sb_inodes+0x1c4/0x490
      [19320.463584]  [<ffffffffb324ddcf>] __writeback_inodes_wb+0x9f/0xd0
      [19320.464611]  [<ffffffffb324e603>] wb_writeback+0x263/0x2f0
      [19320.465541]  [<ffffffffb323ab0c>] ? get_nr_inodes+0x4c/0x70
      [19320.466486]  [<ffffffffb324ef8b>] bdi_writeback_workfn+0x2cb/0x460
      [19320.467529]  [<ffffffffb30b613f>] process_one_work+0x17f/0x440
      [19320.468516]  [<ffffffffb30b71d6>] worker_thread+0x126/0x3c0
      [19320.469458]  [<ffffffffb30b70b0>] ? manage_workers.isra.24+0x2a0/0x2a0
      [19320.470550]  [<ffffffffb30bdf21>] kthread+0xd1/0xe0
      [19320.471384]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.472408]  [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [19320.473497]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.474524] INFO: task txg_sync:7615 blocked for more than 120 seconds.
      [19320.475631] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [19320.476931] txg_sync        D ffff9ed29c024f10     0  7615      2 0x00000080
      [19320.478183] Call Trace:
      [19320.478645]  [<ffffffffc0478fc7>] ? taskq_dispatch_ent+0x57/0x170 [spl]
      [19320.479755]  [<ffffffffb3718f39>] schedule+0x29/0x70
      [19320.480598]  [<ffffffffb37168a9>] schedule_timeout+0x239/0x2c0
      [19320.481836]  [<ffffffffc0694e7f>] ? zio_taskq_dispatch+0x8f/0xa0 [zfs]
      [19320.482943]  [<ffffffffb306a14e>] ? kvm_clock_get_cycles+0x1e/0x20
      [19320.483985]  [<ffffffffb371844d>] io_schedule_timeout+0xad/0x130
      [19320.485018]  [<ffffffffb30beac6>] ? prepare_to_wait_exclusive+0x56/0x90
      [19320.486137]  [<ffffffffb37184e8>] io_schedule+0x18/0x20
      [19320.487035]  [<ffffffffc047d192>] cv_wait_common+0xb2/0x150 [spl]
      [19320.488073]  [<ffffffffb30bef10>] ? wake_up_atomic_t+0x30/0x30
      [19320.489066]  [<ffffffffc047d268>] __cv_wait_io+0x18/0x20 [spl]
      [19320.490126]  [<ffffffffc0699023>] zio_wait+0x113/0x1c0 [zfs]
      [19320.491201]  [<ffffffffc060f40f>] dsl_pool_sync+0xbf/0x440 [zfs]
      [19320.492246]  [<ffffffffc062d4b7>] spa_sync+0x437/0xd90 [zfs]
      [19320.493211]  [<ffffffffb30d2022>] ? default_wake_function+0x12/0x20
      [19320.494259]  [<ffffffffb30caba4>] ? __wake_up+0x44/0x50
      [19320.495176]  [<ffffffffc0641c41>] txg_sync_thread+0x301/0x510 [zfs]
      [19320.496257]  [<ffffffffc0641940>] ? txg_fini+0x2a0/0x2a0 [zfs]
      [19320.497245]  [<ffffffffc0478013>] thread_generic_wrapper+0x73/0x80 [spl]
      [19320.498374]  [<ffffffffc0477fa0>] ? __thread_exit+0x20/0x20 [spl]
      [19320.499401]  [<ffffffffb30bdf21>] kthread+0xd1/0xe0
      [19320.500253]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.501286]  [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [19320.502385]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.503415] INFO: task ll_mgs_0001:7835 blocked for more than 120 seconds.
      [19320.504564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [19320.505860] ll_mgs_0001     D ffff9ed294508000     0  7835      2 0x00000080
      [19320.507100] Call Trace:
      [19320.507537]  [<ffffffffb3718f39>] schedule+0x29/0x70
      [19320.508381]  [<ffffffffc047d205>] cv_wait_common+0x125/0x150 [spl]
      [19320.509429]  [<ffffffffb30bef10>] ? wake_up_atomic_t+0x30/0x30
      [19320.510423]  [<ffffffffc047d245>] __cv_wait+0x15/0x20 [spl]
      [19320.511401]  [<ffffffffc0640bdf>] txg_wait_synced+0xef/0x140 [zfs]
      [19320.512482]  [<ffffffffc124834e>] osd_trans_stop+0x4be/0x5a0 [osd_zfs]
      [19320.513667]  [<ffffffffc12d3cdb>] mgs_ir_update+0x2eb/0xb70 [mgs]
      [19320.514701]  [<ffffffffc12b3d57>] mgs_target_reg+0x787/0x1370 [mgs]
      [19320.516108]  [<ffffffffc0ffd24f>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
      [19320.517377]  [<ffffffffc0ffd3d1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [19320.518624]  [<ffffffffc106038a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [19320.519804]  [<ffffffffc1008e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [19320.521082]  [<ffffffffb30c7c4b>] ? __wake_up_common+0x5b/0x90
      [19320.522095]  [<ffffffffc100c592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [19320.523182]  [<ffffffffc100bb00>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
      [19320.524423]  [<ffffffffb30bdf21>] kthread+0xd1/0xe0
      [19320.525254]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.526276]  [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [19320.527373]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.528395] INFO: task ll_mgs_0002:7836 blocked for more than 120 seconds.
      [19320.529543] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [19320.530836] ll_mgs_0002     D ffff9ed276b3eeb0     0  7836      2 0x00000080
      [19320.532123] Call Trace:
      [19320.532554]  [<ffffffffb3719e59>] schedule_preempt_disabled+0x29/0x70
      [19320.533622]  [<ffffffffb3717c17>] __mutex_lock_slowpath+0xc7/0x1d0
      [19320.534664]  [<ffffffffb3716fff>] mutex_lock+0x1f/0x2f
      [19320.535545]  [<ffffffffc12d4af9>] mgs_get_ir_logs+0x599/0x10c0 [mgs]
      [19320.536647]  [<ffffffffc1035196>] ? null_alloc_rs+0x186/0x340 [ptlrpc]
      [19320.537770]  [<ffffffffc0ffea20>] ? lustre_swab_mgs_target_info+0x70/0x70 [ptlrpc]
      [19320.539030]  [<ffffffffc12af810>] mgs_config_read+0xc0/0x1c0 [mgs]
      [19320.540130]  [<ffffffffc106038a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [19320.541310]  [<ffffffffc1008e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [19320.542589]  [<ffffffffb30c7c4b>] ? __wake_up_common+0x5b/0x90
      [19320.543606]  [<ffffffffc100c592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [19320.544689]  [<ffffffffc100bb00>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
      [19320.545929]  [<ffffffffb30bdf21>] kthread+0xd1/0xe0
      [19320.546754]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.547771]  [<ffffffffb37255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [19320.548863]  [<ffffffffb30bde50>] ? insert_kthread_work+0x40/0x40
      [19320.571934] blk_update_request: I/O error, dev vda, sector 1586528
      [19320.573557] EXT4-fs warning (device vda1): ext4_end_bio:316: I/O error -5 writing to inode 394026 (offset 1753088 size 12288 starting block 198319)
      [19320.575691] Buffer I/O error on device vda1, logical block 198060
      [19320.576701] Buffer I/O error on device vda1, logical block 198061
      [19320.577708] Buffer I/O error on device vda1, logical block 198062
      [19320.578726] blk_update_request: I/O error, dev vda, sector 1658040
      [19320.578734] blk_update_request: I/O error, dev vda, sector 49363312
      [19320.580787] EXT4-fs warning (device vda1): ext4_end_bio:316: I/O error -5 writing to inode 394025 (offset 9007104 size 73728 starting block 207273)
      [19320.582923] Buffer I/O error on device vda1, logical block 206999
      [19320.583959] Buffer I/O error on device vda1, logical block 207000
      [19320.584973] Buffer I/O error on device vda1, logical block 207001
      [19320.585982] Buffer I/O error on device vda1, logical block 207002
      [19320.586984] Buffer I/O error on device vda1, logical block 207003
      [19320.588012] Buffer I/O error on device vda1, logical block 207004
      [19320.589028] Buffer I/O error on device vda1, logical block 207005
      [19320.590074] blk_update_request: I/O error, dev vda, sector 1586520
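
      The blocked-task reports above come from the kernel hung-task watchdog and are a symptom of the suspended/failing storage rather than a cause; as the messages themselves note, the watchdog can be tuned or silenced via sysctl. A minimal sketch, assuming the standard RHEL 7 knobs:

      # Kernel hung-task watchdog referenced by the "blocked for more than 120 seconds" messages.
      sysctl kernel.hung_task_timeout_secs       # reporting threshold, 120s by default on RHEL 7
      sysctl -w kernel.hung_task_timeout_secs=0  # 0 disables the reports (same as the echo shown in the log)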
      

      For this run, we are using ZFS 0.7.9-1 and RHEL 7.5.
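
      For completeness, a small sketch of how the ZFS/SPL and OS versions could be confirmed on the affected nodes; package names are assumed for an RPM-based install and may differ:

      # Loaded module versions and installed packages (package names may vary by packaging).
      cat /sys/module/zfs/version /sys/module/spl/version   # e.g. 0.7.9-1
      rpm -q zfs spl                                        # installed ZFS/SPL package versions
      cat /etc/redhat-release                               # e.g. Red Hat Enterprise Linux Server release 7.5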

      So far, this has only been seen once. This looks like LU-9845, but we haven't seen LU-9845 in a while, so I've opened a new ticket to track this issue.


            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez) (Inactive)
              Votes: 0
              Watchers: 3
