Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical

    Description

      This issue was created by maloo for Alexey Lyashkov <c17817@cray.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/34a5b227-36f5-441f-94d6-31914d7b4004

      [14277.488692] WARNING: MMP writes to pool 'lustre-ost5' have not succeeded in over 60019 ms; suspending pool. Hrtime 14277488675560
      [14277.490967] Kernel panic - not syncing: Pool 'lustre-ost5' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
      [14277.493640] CPU: 1 PID: 519418 Comm: mmp Kdump: loaded Tainted: P OE --------- - - 4.18.0-240.22.1.el8_lustre.x86_64 #1
      [14277.495797] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [14277.496854] Call Trace:
      [14277.497397] dump_stack+0x5c/0x80
      [14277.498052] panic+0xe7/0x2a9
      [14277.499014] zio_suspend+0x103/0x110 [zfs]
      [14277.499843] mmp_thread+0x61c/0x710 [zfs]
      [14277.500651] ? mmp_write_uberblock+0x700/0x700 [zfs]
      [14277.501615] ? __thread_exit+0x20/0x20 [spl]
      [14277.502438] thread_generic_wrapper+0x6f/0x80 [spl]
      [14277.503383] kthread+0x112/0x130
      [14277.504000] ? kthread_flush_work_fn+0x10/0x10
      [14277.504827] ret_from_fork+0x35/0x40
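
      For context: the ZFS multihost (MMP) thread suspends the pool once no MMP write has succeeded for roughly zfs_multihost_fail_intervals x zfs_multihost_interval milliseconds, and because the pool's failmode property is set to panic, that suspension escalates into the kernel panic above. The following is a minimal sketch of how those knobs could be inspected and relaxed on a test node; the specific values are illustrative assumptions, not the settings actually used on this cluster.

      # Current MMP write interval (ms) and the number of missed intervals
      # that trigger suspension (example values only; both are runtime-tunable).
      cat /sys/module/zfs/parameters/zfs_multihost_interval
      cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals

      # Widen the tolerance window, e.g. suspend only after ~120 s of failed
      # MMP writes (fail_intervals * interval ms).
      echo 1000 > /sys/module/zfs/parameters/zfs_multihost_interval
      echo 120  > /sys/module/zfs/parameters/zfs_multihost_fail_intervals

      # The panic itself comes from the pool's failmode property; 'wait' would
      # leave the pool suspended instead of panicking the node, so it could be
      # resumed with 'zpool clear' once the storage recovers.
      zpool get failmode lustre-ost5
      zpool set failmode=wait lustre-ost5
      zpool clear lustre-ost5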

          Activity

            [LU-15261] sanity: 160j - storage timeout

            adilger Andreas Dilger added a comment:

            This happens intermittently with ZFS-based systems when the VM is stalled, possibly because other VMs are doing heavy I/O to the host. It looks like the tuning to increase the fail retry count is missing on the new test cluster and needs to be applied; a sketch of such a tuning follows below.

            It would be better if ZFS MMP handled this more gracefully, by resuming (and verifying MMP has not been modified) if the I/O completes.
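
            A hypothetical sketch of how that missing tuning could be applied persistently on the test nodes, assuming the "fail retry count" refers to the zfs_multihost_fail_intervals module parameter (the file name and values are illustrative; the exact settings used on the existing clusters are not stated in this ticket):

            # /etc/modprobe.d/zfs-mmp.conf
            # Tolerate long VM stalls: suspend the pool only after
            # fail_intervals * interval ms without a successful MMP write.
            options zfs zfs_multihost_fail_intervals=120 zfs_multihost_interval=1000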

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: