[LU-10956] sanity-pfl test_3: Kernel panic - not syncing: Pool has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic Created: 25/Apr/18 Updated: 16/Jun/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.12.8, Lustre 2.15.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/252abdaa-477b-11e8-95c0-52540065bddc

test_3 failed with the following error: Test crashed during sanity-pfl test_3

env: RHEL7 zfs DNE tag-2.11.51

This is the trace found in kernel-crash.log:

[34408.762645] Lustre: DEBUG MARKER: dmesg
[34409.519801] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 \(1524545030\)
[34409.734904] Lustre: DEBUG MARKER: == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 (1524545030)
[34434.509312] Lustre: lustre-OST0006: Client lustre-MDT0001-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
[34434.512144] Lustre: lustre-OST0006: Client lustre-MDT0003-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
[34434.512149] Lustre: Skipped 7 previous similar messages
[34434.516050] WARNING: MMP writes to pool 'lustre-ost2' have not succeeded in over 20s; suspending pool
[34434.516059] Kernel panic - not syncing: Pool 'lustre-ost2' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
[34434.516071] CPU: 0 PID: 16454 Comm: mmp Tainted: P OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
[34434.516072] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[34434.516077] Call Trace:
[34434.516133] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[34434.516137] [<ffffffff816a8634>] panic+0xe8/0x21f
[34434.516443] [<ffffffffc05734a6>] zio_suspend+0x106/0x110 [zfs]
[34434.516470] [<ffffffffc04fa322>] mmp_thread+0x322/0x4a0 [zfs]
[34434.516491] [<ffffffffc04fa000>] ? mmp_write_done+0x1d0/0x1d0 [zfs]
[34434.516528] [<ffffffffc03aefc3>] thread_generic_wrapper+0x73/0x80 [spl]
[34434.516532] [<ffffffffc03aef50>] ? __thread_exit+0x20/0x20 [spl]
[34434.516555] [<ffffffff810b4031>] kthread+0xd1/0xe0
[34434.516558] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[34434.516574] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
[34434.516577] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40 |
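Note on the mechanism: the panic itself is ZFS behavior rather than a Lustre crash. The MMP (multihost) writer could not update the pool within its timeout, and because the pool's failmode property is set to panic, zio_suspend() panics the node instead of only suspending I/O. A minimal sketch of how to inspect these settings on a test node (the pool name lustre-ost2 is taken from the log above; whether the test framework sets failmode explicitly is an assumption to verify):

# Show the failure-mode and multihost (MMP) settings for the pool
zpool get failmode,multihost lustre-ost2

# In a throwaway test setup, one could make MMP stalls suspend the pool
# instead of panicking the node (this only changes how the failure is
# reported, not why the MMP writes stalled)
zpool set failmode=wait lustre-ost2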
| Comments |
| Comment by Sarah Liu [ 05/Jun/20 ] |
|
Hit the problem in rolling upgrade from 2.10.8 EL7.6 to 2.12.5 EL7.8 zfs:

[ 907.517100] Lustre: DEBUG MARKER: == sanity test 42e: verify sub-RPC writes are not done synchronously ================================= 06:16:17 (1591337777)
[ 910.460672] Lustre: lustre-OST0000: Connection restored to 47f23d53-3404-b4c5-e304-97df136e115c (at 10.9.6.157@tcp)
[ 910.462673] Lustre: Skipped 1 previous similar message
[ 922.630198] WARNING: MMP writes to pool 'lustre-ost1' have not succeeded in over 5s; suspending pool
[ 922.631786] Kernel panic - not syncing: Pool 'lustre-ost1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
[ 922.634113] CPU: 1 PID: 2823 Comm: mmp Kdump: loaded Tainted: P OE ------------ 3.10.0-1127.8.2.el7_lustre.x86_64 #1
[ 922.635942] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 922.636859] Call Trace:
[ 922.637279] [<ffffffffa717ffa5>] dump_stack+0x19/0x1b
[ 922.638104] [<ffffffffa7179541>] panic+0xe8/0x21f
[ 922.638963] [<ffffffffc05b7446>] zio_suspend+0x116/0x120 [zfs]
[ 922.639937] [<ffffffffc053da4c>] mmp_thread+0x41c/0x4c0 [zfs]
[ 922.640900] [<ffffffffc053d630>] ? mmp_write_done+0x140/0x140 [zfs]
[ 922.641928] [<ffffffffc02f6063>] thread_generic_wrapper+0x73/0x80 [spl]
[ 922.642999] [<ffffffffc02f5ff0>] ? __thread_exit+0x20/0x20 [spl]
[ 922.643977] [<ffffffffa6ac6691>] kthread+0xd1/0xe0
[ 922.644765] [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
[ 922.645735] [<ffffffffa7192d37>] ret_from_fork_nospec_begin+0x21/0x21
[ 922.646777] [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu |
| Comment by Sarah Liu [ 14/Jan/21 ] |
|
hit the same issue in master zfs failover |
| Comment by James Nunez (Inactive) [ 15/Jan/21 ] |
|
We’ve seen this crash for several test suites: |
| Comment by Sergey Cheremencev [ 11/Nov/21 ] |
|
replay-dual test_21a https://testing.whamcloud.com/test_sets/6e31af1a-3eb5-484a-8aca-000eb45c172c |
| Comment by Li Xi [ 15/Nov/21 ] |
|
+1 for sanity:413a https://testing.whamcloud.com/test_sets/91ccf494-32bd-4502-bff5-c9cba0cbc108 |
| Comment by Li Xi [ 16/Nov/21 ] |
|
+1 https://testing.whamcloud.com/test_sets/7494598f-5f66-4f53-99c2-a42fd4422b99 |
| Comment by Sebastien Buisson [ 17/Nov/21 ] |
|
+1 in recovery-small test_17a |
| Comment by Alex Zhuravlev [ 23/Nov/21 ] |
[ 2868.632037] WARNING: MMP writes to pool 'lustre-mdt3' have not succeeded in over 60004 ms; suspending pool. Hrtime 2868632010566
this is 60 seconds, right? |
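For reference (an observation about the message format, not about how these test nodes were tuned): newer ZFS reports the elapsed time in milliseconds, so 60004 ms is roughly 60 seconds, while older releases printed seconds (the "over 5s"/"over 20s" messages earlier in this ticket). The suspend threshold is roughly zfs_multihost_interval multiplied by zfs_multihost_fail_intervals, and can be inspected or raised on a test node like this (values are illustrative, not necessarily what the autotest VMs use):

# Inspect the MMP tuning knobs (sysfs paths exist on ZFS 0.7+)
cat /sys/module/zfs/parameters/zfs_multihost_interval
cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals

# Give slow VM storage more headroom before the pool is suspended
# (assumption: acceptable in a test environment; 0 disables the suspend check)
echo 20 > /sys/module/zfs/parameters/zfs_multihost_fail_intervals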
| Comment by Sarah Liu [ 08/Dec/21 ] |
|
Hit the same error when doing clean downgrade testing from 2.12.8 back to 2.12.7 zfs:

[ 819.523684] Lustre: MGS: Connection restored to c1610ca8-7c1f-280d-38e2-6e7a30dd0376 (at 10.240.22.6@tcp)
[ 826.309413] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 828.924608] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 851.395455] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 876.483447] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 901.571491] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 926.659577] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 951.747496] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 1001.924417] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 1001.926432] LustreError: Skipped 1 previous similar message
[ 1077.188415] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
[ 1077.190448] LustreError: Skipped 2 previous similar messages
[ 1253.326971] Lustre: MGS: Connection restored to 6e2a8795-6a65-b3f1-be54-38d6f84216b5 (at 10.240.22.7@tcp)
[ 1253.335182] Lustre: Skipped 1 previous similar message
[ 1253.383388] Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
[ 1253.385279] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
[ 1278.436952] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:34
[ 1303.459633] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:09
[ 1328.546430] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:44
[ 1353.633076] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:19
[ 1378.715611] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:54
[ 1403.797922] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp
[ 1529.217734] Lustre: Skipped 2 previous similar messages
[ 1553.300393] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
[ 1553.301864] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[ 1553.443706] Lustre: lustre-MDT0000: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
[ 1555.298604] Lustre: lustre-MDT0000: Connection restored to 6e2a8795-6a65-b3f1-be54-38d6f84216b5 (at 10.240.22.7@tcp)
[ 5227.763272] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
[ 5227.768866] Kernel panic - not syncing: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
[ 5227.771388] CPU: 0 PID: 20720 Comm: mmp Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.25.1.el7_lustre.x86_64 #1
[ 5227.773400] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5227.774396] Call Trace:
[ 5227.774911] [<ffffffffb8b8311a>] dump_stack+0x19/0x1b
[ 5227.775815] [<ffffffffb8b7c672>] panic+0xe8/0x21f
[ 5227.777042] [<ffffffffc0902446>] zio_suspend+0x116/0x120 [zfs]
[ 5227.778109] [<ffffffffc0888a4c>] mmp_thread+0x41c/0x4c0 [zfs]
[ 5227.779164] [<ffffffffc0888630>] ? mmp_write_done+0x140/0x140 [zfs]
[ 5227.780316] [<ffffffffc0755063>] thread_generic_wrapper+0x73/0x80 [spl]
[ 5227.781478] [<ffffffffc0754ff0>] ? __thread_exit+0x20/0x20 [spl]
[ 5227.782565] [<ffffffffb84c5da1>] kthread+0xd1/0xe0
[ 5227.783424] [<ffffffffb84c5cd0>] ? insert_kthread_work+0x40/0x40
[ 5227.784487] [<ffffffffb8b95df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 5227.785622] [<ffffffffb84c5cd0>] ? insert_kthread_work+0x40/0x40

Tried a second time; this time there was no I/O and the MDS could not be mounted. dmesg shows the pool was suspended, similar to the following:

[ 100.065432] Key type id_resolver registered
[ 100.066277] Key type id_legacy registered
[ 358.013062] Rounding down aligned max_sectors from 4294967295 to 4294967288
[ 358.039021] Loading iSCSI transport class v2.0-870.
[ 358.050683] iscsi: registered transport (iser)
[ 358.107129] RPC: Registered rdma transport module.
[ 358.108114] RPC: Registered rdma backchannel transport module.
[ 574.759980] spl: loading out-of-tree module taints kernel.
[ 574.763735] spl: module verification failed: signature and/or required key missing - tainting kernel
[ 574.770148] SPL: Loaded module v0.7.13-1
[ 574.771122] znvpair: module license 'CDDL' taints kernel.
[ 574.772226] Disabling lock debugging due to kernel taint
[ 576.704935] ZFS: Loaded module v0.7.13-1, ZFS pool version 5000, ZFS filesystem version 5
[ 3621.200468] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
[ 3621.202323] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.

[root@onyx-20vm6 ~]# zpool list
NAME          SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH     ALTROOT
lustre-mdt1  5.97G  99.9M  5.87G         -    40%     1%  1.00x  SUSPENDED  - |
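In the suspended (non-panic) case shown above, the pool does not normally need to be recreated; once the backing devices respond again, I/O can usually be resumed. A minimal sketch, assuming the vdevs are healthy again and using the pool name from this log:

# Attempt to resume a suspended pool once its devices are reachable again
zpool clear lustre-mdt1

# Confirm the pool has left the SUSPENDED state before retrying the mount
zpool list lustre-mdt1
zpool status lustre-mdt1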
| Comment by Sarah Liu [ 15/Dec/21 ] |
|
The same failure happened in rolling upgrade from 2.10.8 zfs to 2.12.8 zfs (all servers and clients were upgraded), sanity test_101d; MDS and OSS both crashed. Crash dumps can be found on /scratch/dumps/onyx-20vm6.onyx.whamcloud.com/10.240.22.5-2021-12-15-00:43:14

[ 6189.196087] Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================================== 00:42:59 (1639528979)
[ 6210.315899] Lustre: 31346:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528993/real 1639528993] req@ffff94cc84058480 x1719168217710592/t0(0) o400->MGC10.240.22.5@tcp@10.240.22.5@tcp:26/25 lens 224/224 e 0 to 1 dl 1639529000 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 6210.315901] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528993/real 1639528993] req@ffff94cc8405b180 x1719168217710656/t0(0) o400->lustre-MDT0000-mdc-ffff94cc847e2800@10.240.22.5@tcp:12/10 lens 224/224 e 0 to 1 dl 1639529000 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[ 6210.320698] Lustre: lustre-MDT0000-mdc-ffff94cc847e2800: Connection to lustre-MDT0000 (at 10.240.22.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6210.333787] Lustre: 31346:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 6210.335541] LustreError: 166-1: MGC10.240.22.5@tcp: Connection to MGS (at 10.240.22.5@tcp) was lost; in progress operations using this service will fail
[ 6217.320480] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1639529000/real 0] req@ffff94cc8405b600 x1719168217710976/t0(0) o400->lustre-OST0000-osc-ffff94cc847e2800@10.240.22.6@tcp:28/4 lens 224/224 e 0 to 1 dl 1639529007 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 6217.325616] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[ 6223.326129] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528981/real 1639528981] req@ffff94cc849ad680 x1719168217708864/t0(0) o4->lustre-OST0000-osc-ffff94cc847e2800@10.240.22.6@tcp:6/4 lens 488/448 e 1 to 1 dl 1639529013 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 6360.063126] INFO: task lctl:32059 blocked for more than 120 seconds.
[ 6360.064419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6360.065900] lctl D ffff94ccfaae1080 0 32059 31925 0x00000080
[ 6360.067313] Call Trace:
[ 6360.067853] [<ffffffffa3789179>] schedule+0x29/0x70
[ 6360.068778] [<ffffffffa3786e41>] schedule_timeout+0x221/0x2d0
[ 6360.069878] [<ffffffffa30dae02>] ? default_wake_function+0x12/0x20
[ 6360.071048] [<ffffffffa30d30c2>] ? __wake_up_common+0x82/0x120
[ 6360.072146] [<ffffffffa378952d>] wait_for_completion+0xfd/0x140
[ 6360.073285] [<ffffffffa30dadf0>] ? wake_up_state+0x20/0x20
[ 6360.074527] [<ffffffffc0a4b5fd>] __ldlm_bl_to_thread+0xad/0x150 [ptlrpc]
[ 6360.075790] [<ffffffffc0a4bc2b>] ldlm_bl_to_thread+0x33b/0x510 [ptlrpc]
[ 6360.077046] [<ffffffffc0a50689>] ldlm_bl_to_thread_list+0x19/0x20 [ptlrpc]
[ 6360.078325] [<ffffffffc0a49cb6>] ldlm_cancel_lru+0x76/0x180 [ptlrpc]
[ 6360.079527] [<ffffffffc0a3ab97>] lru_size_store+0x67/0x430 [ptlrpc]
[ 6360.080871] [<ffffffffc0866e7a>] lustre_attr_store+0x1a/0x20 [obdclass]
[ 6360.082113] [<ffffffffa32dba62>] sysfs_kf_write+0x42/0x50
[ 6360.083134] [<ffffffffa32db04b>] kernfs_fop_write+0xeb/0x160
[ 6360.084200] [<ffffffffa324e590>] vfs_write+0xc0/0x1f0
[ 6360.085161] [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
[ 6360.086367] [<ffffffffa324f36f>] SyS_write+0x7f/0xf0
[ 6360.087314] [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
[ 6360.088535] [<ffffffffa3795f92>] system_call_fastpath+0x25/0x2a
[ 6360.089640] [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
[ 6480.083025] INFO: task lctl:32059 blocked for more than 120 seconds.
[ 6480.084299] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14969.401446] Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================================== 00:42:59 (1639528979)
[14979.793364] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
[14979.795218] Kernel panic - not syncing: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
[14979.797904] CPU: 0 PID: 17374 Comm: mmp Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.45.1.el7_lustre.x86_64 #1
[14979.800067] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[14979.801131] Call Trace:
[14979.801670] [<ffffffffbab83539>] dump_stack+0x19/0x1b
[14979.802662] [<ffffffffbab7d241>] panic+0xe8/0x21f
[14979.803818] [<ffffffffc0829456>] zio_suspend+0x116/0x120 [zfs]
[14979.804952] [<ffffffffc07afa4c>] mmp_thread+0x41c/0x4c0 [zfs]
[14979.806072] [<ffffffffc07af630>] ? mmp_write_done+0x140/0x140 [zfs]
[14979.807287] [<ffffffffc0672063>] thread_generic_wrapper+0x73/0x80 [spl]
[14979.808531] [<ffffffffc0671ff0>] ? __thread_exit+0x20/0x20 [spl]
[14979.809705] [<ffffffffba4c5e61>] kthread+0xd1/0xe0
[14979.810612] [<ffffffffba4c5d90>] ? insert_kthread_work+0x40/0x40
[14979.811761] [<ffffffffbab95df7>] ret_from_fork_nospec_begin+0x21/0x21
[14979.812969] [<ffffffffba4c5d90>] ? insert_kthread_work+0x40/0x40 |