[LU-15261] sanity: 160j - storage timeout
Created: 22/Nov/21  Updated: 23/Nov/21  Resolved: 23/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-10956 sanity-pfl test_3: Kernel panic - not... Open
Related
Severity: 3

Description

This issue was created by Maloo for Alexey Lyashkov <c17817@cray.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/34a5b227-36f5-441f-94d6-31914d7b4004

[14277.488692] WARNING: MMP writes to pool 'lustre-ost5' have not succeeded in over 60019 ms; suspending pool. Hrtime 14277488675560
[14277.490967] Kernel panic - not syncing: Pool 'lustre-ost5' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
[14277.493640] CPU: 1 PID: 519418 Comm: mmp Kdump: loaded Tainted: P OE --------- - - 4.18.0-240.22.1.el8_lustre.x86_64 #1
[14277.495797] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[14277.496854] Call Trace:
[14277.497397] dump_stack+0x5c/0x80
[14277.498052] panic+0xe7/0x2a9
[14277.499014] zio_suspend+0x103/0x110 [zfs]
[14277.499843] mmp_thread+0x61c/0x710 [zfs]
[14277.500651] ? mmp_write_uberblock+0x700/0x700 [zfs]
[14277.501615] ? __thread_exit+0x20/0x20 [spl]
[14277.502438] thread_generic_wrapper+0x6f/0x80 [spl]
[14277.503383] kthread+0x112/0x130
[14277.504000] ? kthread_flush_work_fn+0x10/0x10
[14277.504827] ret_from_fork+0x35/0x40



Comments
Comment by Andreas Dilger [ 23/Nov/21 ]

This happens intermittently on ZFS-based systems when the VM is stalled, possibly because other VMs are doing heavy I/O on the host. It looks like the tuning that increases the MMP fail retry count is missing on the new test cluster and needs to be applied.

It would be better if ZFS MMP handled this more gracefully, by resuming the pool if the I/O eventually completes (after verifying that the MMP data has not been modified in the meantime).
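
As a rough illustration of the tuning mentioned above: in OpenZFS, a pool is suspended once zfs_multihost_fail_intervals * zfs_multihost_interval milliseconds pass without a successful MMP write, and a fail-intervals value of 0 disables the suspend. The sketch below assumes that the "fail retry count" in the comment refers to zfs_multihost_fail_intervals, which the ticket does not state. The helper names are hypothetical, and the values in the examples are illustrative, not the test cluster's actual settings.

```python
from pathlib import Path

# Standard sysfs location for ZFS module parameters on Linux.
ZFS_PARAMS = Path("/sys/module/zfs/parameters")


def mmp_suspend_timeout_ms(interval_ms, fail_intervals):
    """Milliseconds without a successful MMP write before the pool is suspended.

    Returns None when fail_intervals is 0, because MMP write failures are
    then ignored and the pool is never suspended for this reason.
    """
    if fail_intervals == 0:
        return None
    return interval_ms * fail_intervals


def read_param(name, base=ZFS_PARAMS):
    """Read one integer ZFS module parameter (hypothetical helper)."""
    return int((base / name).read_text().strip())


if __name__ == "__main__":
    try:
        interval = read_param("zfs_multihost_interval")
        fails = read_param("zfs_multihost_fail_intervals")
    except (OSError, ValueError):
        # The zfs module is not loaded, or the parameter is unreadable.
        print("ZFS module parameters not available on this host")
    else:
        print("MMP suspend timeout (ms):", mmp_suspend_timeout_ms(interval, fails))
```

Raising zfs_multihost_fail_intervals lengthens the window a stalled VM has to complete an MMP write before the pool is suspended. The parameter can typically be set at runtime through the same sysfs directory, or persistently with an "options zfs" line under /etc/modprobe.d/. The right value for the test cluster is not given in this ticket.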

Generated at Sat Feb 10 03:16:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.