Lustre / LU-10956

sanity-pfl test_3: Kernel panic - not syncing: Pool has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.12.8, Lustre 2.15.3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/252abdaa-477b-11e8-95c0-52540065bddc

      test_3 failed with the following error:

      Test crashed during sanity-pfl test_3
      

      env: RHEL7 zfs DNE tag-2.11.51

      This is the trace found in kernel-crash.log:

      [34408.762645] Lustre: DEBUG MARKER: dmesg
      [34409.519801] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 \(1524545030\)
      [34409.734904] Lustre: DEBUG MARKER: == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 (1524545030)
      [34434.509312] Lustre: lustre-OST0006: Client lustre-MDT0001-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
      [34434.512144] Lustre: lustre-OST0006: Client lustre-MDT0003-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
      [34434.512149] Lustre: Skipped 7 previous similar messages
      [34434.516050] WARNING: MMP writes to pool 'lustre-ost2' have not succeeded in over 20s; suspending pool
      [34434.516059] Kernel panic - not syncing: Pool 'lustre-ost2' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
      [34434.516071] CPU: 0 PID: 16454 Comm: mmp Tainted: P OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
      [34434.516072] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [34434.516077] Call Trace:
      [34434.516133] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      [34434.516137] [<ffffffff816a8634>] panic+0xe8/0x21f
      [34434.516443] [<ffffffffc05734a6>] zio_suspend+0x106/0x110 [zfs]
      [34434.516470] [<ffffffffc04fa322>] mmp_thread+0x322/0x4a0 [zfs]
      [34434.516491] [<ffffffffc04fa000>] ? mmp_write_done+0x1d0/0x1d0 [zfs]
      [34434.516528] [<ffffffffc03aefc3>] thread_generic_wrapper+0x73/0x80 [spl]
      [34434.516532] [<ffffffffc03aef50>] ? __thread_exit+0x20/0x20 [spl]
      [34434.516555] [<ffffffff810b4031>] kthread+0xd1/0xe0
      [34434.516558] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [34434.516574] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [34434.516577] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
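      For context, the "failure mode property" in the panic message is the ZFS pool failmode property, and the suspension is triggered by MMP (multihost) protection when MMP writes stall past the configured window. A minimal sketch for confirming how a test pool is configured, assuming shell access to the OSS and the standard zpool CLI:

      # Check the failure-mode and multihost settings on the affected pool
      zpool get failmode,multihost lustre-ost2
      # Show overall pool health and whether it is currently suspended
      zpool status lustre-ost2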
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-pfl test_3 - Test crashed during sanity-pfl test_3

    Attachments

    Issue Links

    Activity

            sarah Sarah Liu added a comment -

            The same failure happened in a rolling upgrade from 2.10.8 zfs to 2.12.8 zfs (all servers and clients were upgraded). During sanity test_101d, both the MDS and OSS crashed. Crash dumps can be found at:
            /scratch/dumps/onyx-20vm6.onyx.whamcloud.com/10.240.22.5-2021-12-15-00:43:14
            /scratch/dumps/onyx-20vm7.onyx.whamcloud.com/10.240.22.6-2021-12-15-00:43:14
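            A sketch for opening one of those dumps with the crash utility; the vmcore file name and the debuginfo vmlinux path are assumptions based on a typical RHEL7 layout:

            # Load the kernel debuginfo and the captured vmcore
            crash /usr/lib/debug/lib/modules/3.10.0-1160.45.1.el7_lustre.x86_64/vmlinux \
                  /scratch/dumps/onyx-20vm6.onyx.whamcloud.com/10.240.22.5-2021-12-15-00:43:14/vmcore
            # Then e.g. 'bt' for the panicking task or 'log' for the console buffer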

            [ 6189.196087] Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================================== 00:42:59 (1639528979)
            [ 6210.315899] Lustre: 31346:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528993/real 1639528993]  req@ffff94cc84058480 x1719168217710592/t0(0) o400->MGC10.240.22.5@tcp@10.240.22.5@tcp:26/25 lens 224/224 e 0 to 1 dl 1639529000 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 6210.315901] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528993/real 1639528993]  req@ffff94cc8405b180 x1719168217710656/t0(0) o400->lustre-MDT0000-mdc-ffff94cc847e2800@10.240.22.5@tcp:12/10 lens 224/224 e 0 to 1 dl 1639529000 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 6210.320698] Lustre: lustre-MDT0000-mdc-ffff94cc847e2800: Connection to lustre-MDT0000 (at 10.240.22.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [ 6210.333787] Lustre: 31346:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [ 6210.335541] LustreError: 166-1: MGC10.240.22.5@tcp: Connection to MGS (at 10.240.22.5@tcp) was lost; in progress operations using this service will fail
            [ 6217.320480] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1639529000/real 0]  req@ffff94cc8405b600 x1719168217710976/t0(0) o400->lustre-OST0000-osc-ffff94cc847e2800@10.240.22.6@tcp:28/4 lens 224/224 e 0 to 1 dl 1639529007 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 6217.325616] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [ 6223.326129] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528981/real 1639528981]  req@ffff94cc849ad680 x1719168217708864/t0(0) o4->lustre-OST0000-osc-ffff94cc847e2800@10.240.22.6@tcp:6/4 lens 488/448 e 1 to 1 dl 1639529013 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [ 6360.063126] INFO: task lctl:32059 blocked for more than 120 seconds.
            [ 6360.064419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [ 6360.065900] lctl            D ffff94ccfaae1080     0 32059  31925 0x00000080
            [ 6360.067313] Call Trace:
            [ 6360.067853]  [<ffffffffa3789179>] schedule+0x29/0x70
            [ 6360.068778]  [<ffffffffa3786e41>] schedule_timeout+0x221/0x2d0
            [ 6360.069878]  [<ffffffffa30dae02>] ? default_wake_function+0x12/0x20
            [ 6360.071048]  [<ffffffffa30d30c2>] ? __wake_up_common+0x82/0x120
            [ 6360.072146]  [<ffffffffa378952d>] wait_for_completion+0xfd/0x140
            [ 6360.073285]  [<ffffffffa30dadf0>] ? wake_up_state+0x20/0x20
            [ 6360.074527]  [<ffffffffc0a4b5fd>] __ldlm_bl_to_thread+0xad/0x150 [ptlrpc]
            [ 6360.075790]  [<ffffffffc0a4bc2b>] ldlm_bl_to_thread+0x33b/0x510 [ptlrpc]
            [ 6360.077046]  [<ffffffffc0a50689>] ldlm_bl_to_thread_list+0x19/0x20 [ptlrpc]
            [ 6360.078325]  [<ffffffffc0a49cb6>] ldlm_cancel_lru+0x76/0x180 [ptlrpc]
            [ 6360.079527]  [<ffffffffc0a3ab97>] lru_size_store+0x67/0x430 [ptlrpc]
            [ 6360.080871]  [<ffffffffc0866e7a>] lustre_attr_store+0x1a/0x20 [obdclass]
            [ 6360.082113]  [<ffffffffa32dba62>] sysfs_kf_write+0x42/0x50
            [ 6360.083134]  [<ffffffffa32db04b>] kernfs_fop_write+0xeb/0x160
            [ 6360.084200]  [<ffffffffa324e590>] vfs_write+0xc0/0x1f0
            [ 6360.085161]  [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
            [ 6360.086367]  [<ffffffffa324f36f>] SyS_write+0x7f/0xf0
            [ 6360.087314]  [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
            [ 6360.088535]  [<ffffffffa3795f92>] system_call_fastpath+0x25/0x2a
            [ 6360.089640]  [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
            [ 6480.083025] INFO: task lctl:32059 blocked for more than 120 seconds.
            [ 6480.084299] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            
            [14969.401446] Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================================== 00:42:59 (1639528979)
            [14979.793364] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
            [14979.795218] Kernel panic - not syncing: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
            [14979.797904] CPU: 0 PID: 17374 Comm: mmp Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.45.1.el7_lustre.x86_64 #1
            [14979.800067] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [14979.801131] Call Trace:
            [14979.801670]  [<ffffffffbab83539>] dump_stack+0x19/0x1b
            [14979.802662]  [<ffffffffbab7d241>] panic+0xe8/0x21f
            [14979.803818]  [<ffffffffc0829456>] zio_suspend+0x116/0x120 [zfs]
            [14979.804952]  [<ffffffffc07afa4c>] mmp_thread+0x41c/0x4c0 [zfs]
            [14979.806072]  [<ffffffffc07af630>] ? mmp_write_done+0x140/0x140 [zfs]
            [14979.807287]  [<ffffffffc0672063>] thread_generic_wrapper+0x73/0x80 [spl]
            [14979.808531]  [<ffffffffc0671ff0>] ? __thread_exit+0x20/0x20 [spl]
            [14979.809705]  [<ffffffffba4c5e61>] kthread+0xd1/0xe0
            [14979.810612]  [<ffffffffba4c5d90>] ? insert_kthread_work+0x40/0x40
            [14979.811761]  [<ffffffffbab95df7>] ret_from_fork_nospec_begin+0x21/0x21
            [14979.812969]  [<ffffffffba4c5d90>] ? insert_kthread_work+0x40/0x40
            
            sarah Sarah Liu added a comment -

            Hit the same error when doing clean downgrade testing from 2.12.8 back to 2.12.7 zfs.
            The system was first formatted as 2.12.7 zfs and cleanly upgraded to 2.12.8 with no issue; after a clean downgrade back to 2.12.7, the MDS showed the following error while doing some I/O:

            [  819.523684] Lustre: MGS: Connection restored to c1610ca8-7c1f-280d-38e2-6e7a30dd0376 (at 10.240.22.6@tcp)
            [  826.309413] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  828.924608] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  851.395455] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  876.483447] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  901.571491] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  926.659577] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  951.747496] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [ 1001.924417] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [ 1001.926432] LustreError: Skipped 1 previous similar message
            [ 1077.188415] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [ 1077.190448] LustreError: Skipped 2 previous similar messages
            [ 1253.326971] Lustre: MGS: Connection restored to 6e2a8795-6a65-b3f1-be54-38d6f84216b5 (at 10.240.22.7@tcp)
            [ 1253.335182] Lustre: Skipped 1 previous similar message
            [ 1253.383388] Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
            [ 1253.385279] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
            [ 1278.436952] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:34
            [ 1303.459633] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:09
            [ 1328.546430] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:44
            [ 1353.633076] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:19
            [ 1378.715611] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:54
            [ 1403.797922] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp
            [ 1529.217734] Lustre: Skipped 2 previous similar messages
            [ 1553.300393] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
            [ 1553.301864] Lustre: lustre-MDT0000: disconnecting 1 stale clients
            [ 1553.443706] Lustre: lustre-MDT0000: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
            [ 1555.298604] Lustre: lustre-MDT0000: Connection restored to 6e2a8795-6a65-b3f1-be54-38d6f84216b5 (at 10.240.22.7@tcp)
            [ 5227.763272] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
            [ 5227.768866] Kernel panic - not syncing: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
            [ 5227.771388] CPU: 0 PID: 20720 Comm: mmp Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.25.1.el7_lustre.x86_64 #1
            [ 5227.773400] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [ 5227.774396] Call Trace:
            [ 5227.774911]  [<ffffffffb8b8311a>] dump_stack+0x19/0x1b
            [ 5227.775815]  [<ffffffffb8b7c672>] panic+0xe8/0x21f
            [ 5227.777042]  [<ffffffffc0902446>] zio_suspend+0x116/0x120 [zfs]
            [ 5227.778109]  [<ffffffffc0888a4c>] mmp_thread+0x41c/0x4c0 [zfs]
            [ 5227.779164]  [<ffffffffc0888630>] ? mmp_write_done+0x140/0x140 [zfs]
            [ 5227.780316]  [<ffffffffc0755063>] thread_generic_wrapper+0x73/0x80 [spl]
            [ 5227.781478]  [<ffffffffc0754ff0>] ? __thread_exit+0x20/0x20 [spl]
            [ 5227.782565]  [<ffffffffb84c5da1>] kthread+0xd1/0xe0
            [ 5227.783424]  [<ffffffffb84c5cd0>] ? insert_kthread_work+0x40/0x40
            [ 5227.784487]  [<ffffffffb8b95df7>] ret_from_fork_nospec_begin+0x21/0x21
            [ 5227.785622]  [<ffffffffb84c5cd0>] ? insert_kthread_work+0x40/0x40
            

            Tried a second time; this time, with no I/O, the MDS cannot be mounted and dmesg shows "pool suspended", similar to LU-9845:

            [  100.065432] Key type id_resolver registered
            [  100.066277] Key type id_legacy registered
            [  358.013062] Rounding down aligned max_sectors from 4294967295 to 4294967288
            [  358.039021] Loading iSCSI transport class v2.0-870.
            [  358.050683] iscsi: registered transport (iser)
            [  358.107129] RPC: Registered rdma transport module.
            [  358.108114] RPC: Registered rdma backchannel transport module.
            [  574.759980] spl: loading out-of-tree module taints kernel.
            [  574.763735] spl: module verification failed: signature and/or required key missing - tainting kernel
            [  574.770148] SPL: Loaded module v0.7.13-1
            [  574.771122] znvpair: module license 'CDDL' taints kernel.
            [  574.772226] Disabling lock debugging due to kernel taint
            [  576.704935] ZFS: Loaded module v0.7.13-1, ZFS pool version 5000, ZFS filesystem version 5
            [ 3621.200468] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
            [ 3621.202323] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.
            
            
            [root@onyx-20vm6 ~]# zpool list
            NAME          SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
            lustre-mdt1  5.97G  99.9M  5.87G         -    40%     1%  1.00x  SUSPENDED  -
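            If the underlying devices are healthy again, a suspended pool can usually be resumed without a reboot; a minimal sketch, assuming the MMP stall was transient and the standard zpool CLI:

            # Clear the error state and resume I/O on the suspended pool
            zpool clear lustre-mdt1
            zpool status lustre-mdt1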
            
            bzzz Alex Zhuravlev added a comment -

            [ 2868.632037] WARNING: MMP writes to pool 'lustre-mdt3' have not succeeded in over 60004 ms; suspending pool. Hrtime 2868632010566

            this is 60 seconds, right?
            https://testing.whamcloud.com/test_sets/719b272d-26d8-491c-9853-1291a8d4ede6
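            For reference, with MMP enabled the pool is suspended after roughly zfs_multihost_fail_intervals * zfs_multihost_interval milliseconds without a successful MMP write, so a 60004 ms stall does correspond to a ~60 s window. A sketch for checking those module parameters on the server, assuming they are exposed under /sys/module/zfs/parameters:

            # MMP write interval (ms) and the number of missed intervals that suspend the pool
            cat /sys/module/zfs/parameters/zfs_multihost_interval
            cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals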

            sebastien Sebastien Buisson added a comment - +1 in recovery-small test_17a https://testing.whamcloud.com/test_sets/90191f3b-00b4-45e9-b6b7-be55b45bbedc
            lixi_wc Li Xi added a comment - +1 https://testing.whamcloud.com/test_sets/7494598f-5f66-4f53-99c2-a42fd4422b99
            lixi_wc Li Xi added a comment - +1 for sanity:413a https://testing.whamcloud.com/test_sets/91ccf494-32bd-4502-bff5-c9cba0cbc108
            scherementsev Sergey Cheremencev added a comment - replay-dual test_21a https://testing.whamcloud.com/test_sets/6e31af1a-3eb5-484a-8aca-000eb45c172c

            jamesanunez James Nunez (Inactive) added a comment -

            We’ve seen this crash for several test suites:
            Lustre 2.13.56.23 - mds-survey test_1 - https://testing.whamcloud.com/test_sets/edd10cda-ed14-4cd3-a1b7-6df26a6cc734
            Lustre 2.13.56.44 - parallel-scale-nfsv3 test_compilebench - https://testing.whamcloud.com/test_sets/546e48a1-cac0-4c05-866b-5bbe4fd0925f
            Lustre 2.13.57.53 - recovery-mds-scale test_failover_mds - https://testing.whamcloud.com/test_sets/c3d24d75-de7f-497b-af46-452764362666
            Lustre 2.13.57.53 - recovery-random-scale test_fail_client_mds - https://testing.whamcloud.com/test_sets/b9f869f9-3b10-4d3e-9c43-98c5d333eb46

            sarah Sarah Liu added a comment - hit the same issue in master zfs failover https://testing.whamcloud.com/test_sets/7e12ed8b-b695-46d6-9103-a35b55b378a0
            sarah Sarah Liu added a comment -

            Hit the problem in a rolling upgrade from 2.10.8 EL7.6 to 2.12.5 EL7.8 zfs.
            After rolling upgrade of all servers and clients to 2.12.5, sanity tests 42e and 180c hit the same crash on the OSS:

            [  907.517100] Lustre: DEBUG MARKER: == sanity test 42e: verify sub-RPC writes are not done synchronously ================================= 06:16:17 (1591337777)
            [  910.460672] Lustre: lustre-OST0000: Connection restored to 47f23d53-3404-b4c5-e304-97df136e115c (at 10.9.6.157@tcp)
            [  910.462673] Lustre: Skipped 1 previous similar message
            [  922.630198] WARNING: MMP writes to pool 'lustre-ost1' have not succeeded in over 5s; suspending pool
            [  922.631786] Kernel panic - not syncing: Pool 'lustre-ost1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
            [  922.634113] CPU: 1 PID: 2823 Comm: mmp Kdump: loaded Tainted: P           OE  ------------   3.10.0-1127.8.2.el7_lustre.x86_64 #1
            [  922.635942] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [  922.636859] Call Trace:
            [  922.637279]  [<ffffffffa717ffa5>] dump_stack+0x19/0x1b
            [  922.638104]  [<ffffffffa7179541>] panic+0xe8/0x21f
            [  922.638963]  [<ffffffffc05b7446>] zio_suspend+0x116/0x120 [zfs]
            [  922.639937]  [<ffffffffc053da4c>] mmp_thread+0x41c/0x4c0 [zfs]
            [  922.640900]  [<ffffffffc053d630>] ? mmp_write_done+0x140/0x140 [zfs]
            [  922.641928]  [<ffffffffc02f6063>] thread_generic_wrapper+0x73/0x80 [spl]
            [  922.642999]  [<ffffffffc02f5ff0>] ? __thread_exit+0x20/0x20 [spl]
            [  922.643977]  [<ffffffffa6ac6691>] kthread+0xd1/0xe0
            [  922.644765]  [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
            [  922.645735]  [<ffffffffa7192d37>] ret_from_fork_nospec_begin+0x21/0x21
            [  922.646777]  [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            
            

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
