Lustre / LU-10956

sanity-pfl test_3: Kernel panic - not syncing: Pool has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.12.8, Lustre 2.15.3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/252abdaa-477b-11e8-95c0-52540065bddc

      test_3 failed with the following error:

      Test crashed during sanity-pfl test_3
      

      env: RHEL7 zfs DNE tag-2.11.51

      This is the trace found in kernel-crash.log:

      [34408.762645] Lustre: DEBUG MARKER: dmesg
      [34409.519801] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 \(1524545030\)
      [34409.734904] Lustre: DEBUG MARKER: == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 (1524545030)
      [34434.509312] Lustre: lustre-OST0006: Client lustre-MDT0001-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
      [34434.512144] Lustre: lustre-OST0006: Client lustre-MDT0003-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
      [34434.512149] Lustre: Skipped 7 previous similar messages
      [34434.516050] WARNING: MMP writes to pool 'lustre-ost2' have not succeeded in over 20s; suspending pool
      [34434.516059] Kernel panic - not syncing: Pool 'lustre-ost2' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
      [34434.516071] CPU: 0 PID: 16454 Comm: mmp Tainted: P OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
      [34434.516072] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [34434.516077] Call Trace:
      [34434.516133] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      [34434.516137] [<ffffffff816a8634>] panic+0xe8/0x21f
      [34434.516443] [<ffffffffc05734a6>] zio_suspend+0x106/0x110 [zfs]
      [34434.516470] [<ffffffffc04fa322>] mmp_thread+0x322/0x4a0 [zfs]
      [34434.516491] [<ffffffffc04fa000>] ? mmp_write_done+0x1d0/0x1d0 [zfs]
      [34434.516528] [<ffffffffc03aefc3>] thread_generic_wrapper+0x73/0x80 [spl]
      [34434.516532] [<ffffffffc03aef50>] ? __thread_exit+0x20/0x20 [spl]
      [34434.516555] [<ffffffff810b4031>] kthread+0xd1/0xe0
      [34434.516558] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [34434.516574] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [34434.516577] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
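      For context, the "failure mode property" in the panic message is the ZFS pool failmode property, and the suspension is triggered by MMP (multihost) protection when MMP writes stall past the configured window. A minimal sketch for confirming how a test pool is configured, assuming shell access to the OSS and the standard zpool CLI:

      # Check the failure-mode and multihost settings on the affected pool
      zpool get failmode,multihost lustre-ost2
      # Show overall pool health and whether it is currently suspended
      zpool status lustre-ost2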
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-pfl test_3 - Test crashed during sanity-pfl test_3

    Attachments

    Issue Links

    Activity

            sarah Sarah Liu added a comment -

            The same failure happened in a rolling upgrade from 2.10.8 zfs to 2.12.8 zfs (all servers and clients were upgraded). During sanity test_101d, both the MDS and OSS crashed. Crash dumps can be found at:
            /scratch/dumps/onyx-20vm6.onyx.whamcloud.com/10.240.22.5-2021-12-15-00:43:14
            /scratch/dumps/onyx-20vm7.onyx.whamcloud.com/10.240.22.6-2021-12-15-00:43:14
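            A sketch for opening one of those dumps with the crash utility; the vmcore file name and the debuginfo vmlinux path are assumptions based on a typical RHEL7 layout:

            # Load the kernel debuginfo and the captured vmcore
            crash /usr/lib/debug/lib/modules/3.10.0-1160.45.1.el7_lustre.x86_64/vmlinux \
                  /scratch/dumps/onyx-20vm6.onyx.whamcloud.com/10.240.22.5-2021-12-15-00:43:14/vmcore
            # Then e.g. 'bt' for the panicking task or 'log' for the console buffer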

            [ 6189.196087] Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================================== 00:42:59 (1639528979)
            [ 6210.315899] Lustre: 31346:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528993/real 1639528993]  req@ffff94cc84058480 x1719168217710592/t0(0) o400->MGC10.240.22.5@tcp@10.240.22.5@tcp:26/25 lens 224/224 e 0 to 1 dl 1639529000 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 6210.315901] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528993/real 1639528993]  req@ffff94cc8405b180 x1719168217710656/t0(0) o400->lustre-MDT0000-mdc-ffff94cc847e2800@10.240.22.5@tcp:12/10 lens 224/224 e 0 to 1 dl 1639529000 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 6210.320698] Lustre: lustre-MDT0000-mdc-ffff94cc847e2800: Connection to lustre-MDT0000 (at 10.240.22.5@tcp) was lost; in progress operations using this service will wait for recovery to complete
            [ 6210.333787] Lustre: 31346:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [ 6210.335541] LustreError: 166-1: MGC10.240.22.5@tcp: Connection to MGS (at 10.240.22.5@tcp) was lost; in progress operations using this service will fail
            [ 6217.320480] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1639529000/real 0]  req@ffff94cc8405b600 x1719168217710976/t0(0) o400->lustre-OST0000-osc-ffff94cc847e2800@10.240.22.6@tcp:28/4 lens 224/224 e 0 to 1 dl 1639529007 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            [ 6217.325616] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            [ 6223.326129] Lustre: 31347:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639528981/real 1639528981]  req@ffff94cc849ad680 x1719168217708864/t0(0) o4->lustre-OST0000-osc-ffff94cc847e2800@10.240.22.6@tcp:6/4 lens 488/448 e 1 to 1 dl 1639529013 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            [ 6360.063126] INFO: task lctl:32059 blocked for more than 120 seconds.
            [ 6360.064419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [ 6360.065900] lctl            D ffff94ccfaae1080     0 32059  31925 0x00000080
            [ 6360.067313] Call Trace:
            [ 6360.067853]  [<ffffffffa3789179>] schedule+0x29/0x70
            [ 6360.068778]  [<ffffffffa3786e41>] schedule_timeout+0x221/0x2d0
            [ 6360.069878]  [<ffffffffa30dae02>] ? default_wake_function+0x12/0x20
            [ 6360.071048]  [<ffffffffa30d30c2>] ? __wake_up_common+0x82/0x120
            [ 6360.072146]  [<ffffffffa378952d>] wait_for_completion+0xfd/0x140
            [ 6360.073285]  [<ffffffffa30dadf0>] ? wake_up_state+0x20/0x20
            [ 6360.074527]  [<ffffffffc0a4b5fd>] __ldlm_bl_to_thread+0xad/0x150 [ptlrpc]
            [ 6360.075790]  [<ffffffffc0a4bc2b>] ldlm_bl_to_thread+0x33b/0x510 [ptlrpc]
            [ 6360.077046]  [<ffffffffc0a50689>] ldlm_bl_to_thread_list+0x19/0x20 [ptlrpc]
            [ 6360.078325]  [<ffffffffc0a49cb6>] ldlm_cancel_lru+0x76/0x180 [ptlrpc]
            [ 6360.079527]  [<ffffffffc0a3ab97>] lru_size_store+0x67/0x430 [ptlrpc]
            [ 6360.080871]  [<ffffffffc0866e7a>] lustre_attr_store+0x1a/0x20 [obdclass]
            [ 6360.082113]  [<ffffffffa32dba62>] sysfs_kf_write+0x42/0x50
            [ 6360.083134]  [<ffffffffa32db04b>] kernfs_fop_write+0xeb/0x160
            [ 6360.084200]  [<ffffffffa324e590>] vfs_write+0xc0/0x1f0
            [ 6360.085161]  [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
            [ 6360.086367]  [<ffffffffa324f36f>] SyS_write+0x7f/0xf0
            [ 6360.087314]  [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
            [ 6360.088535]  [<ffffffffa3795f92>] system_call_fastpath+0x25/0x2a
            [ 6360.089640]  [<ffffffffa3795ed5>] ? system_call_after_swapgs+0xa2/0x13a
            [ 6480.083025] INFO: task lctl:32059 blocked for more than 120 seconds.
            [ 6480.084299] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            
            [14969.401446] Lustre: DEBUG MARKER: == sanity test 101d: file read with and without read-ahead enabled =================================== 00:42:59 (1639528979)
            [14979.793364] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
            [14979.795218] Kernel panic - not syncing: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
            [14979.797904] CPU: 0 PID: 17374 Comm: mmp Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.45.1.el7_lustre.x86_64 #1
            [14979.800067] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [14979.801131] Call Trace:
            [14979.801670]  [<ffffffffbab83539>] dump_stack+0x19/0x1b
            [14979.802662]  [<ffffffffbab7d241>] panic+0xe8/0x21f
            [14979.803818]  [<ffffffffc0829456>] zio_suspend+0x116/0x120 [zfs]
            [14979.804952]  [<ffffffffc07afa4c>] mmp_thread+0x41c/0x4c0 [zfs]
            [14979.806072]  [<ffffffffc07af630>] ? mmp_write_done+0x140/0x140 [zfs]
            [14979.807287]  [<ffffffffc0672063>] thread_generic_wrapper+0x73/0x80 [spl]
            [14979.808531]  [<ffffffffc0671ff0>] ? __thread_exit+0x20/0x20 [spl]
            [14979.809705]  [<ffffffffba4c5e61>] kthread+0xd1/0xe0
            [14979.810612]  [<ffffffffba4c5d90>] ? insert_kthread_work+0x40/0x40
            [14979.811761]  [<ffffffffbab95df7>] ret_from_fork_nospec_begin+0x21/0x21
            [14979.812969]  [<ffffffffba4c5d90>] ? insert_kthread_work+0x40/0x40
            
            sarah Sarah Liu added a comment -

            Hit the same error when doing clean downgrade testing from 2.12.8 back to 2.12.7 zfs.
            The system was first formatted as 2.12.7 zfs and cleanly upgraded to 2.12.8 with no issue; after a clean downgrade back to 2.12.7, the MDS showed the following error while doing some I/O:

            [  819.523684] Lustre: MGS: Connection restored to c1610ca8-7c1f-280d-38e2-6e7a30dd0376 (at 10.240.22.6@tcp)
            [  826.309413] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  828.924608] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  851.395455] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  876.483447] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  901.571491] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  926.659577] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [  951.747496] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [ 1001.924417] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [ 1001.926432] LustreError: Skipped 1 previous similar message
            [ 1077.188415] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_connect to node 10.240.22.6@tcp failed: rc = -16
            [ 1077.190448] LustreError: Skipped 2 previous similar messages
            [ 1253.326971] Lustre: MGS: Connection restored to 6e2a8795-6a65-b3f1-be54-38d6f84216b5 (at 10.240.22.7@tcp)
            [ 1253.335182] Lustre: Skipped 1 previous similar message
            [ 1253.383388] Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
            [ 1253.385279] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:59
            [ 1278.436952] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:34
            [ 1303.459633] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 4:09
            [ 1328.546430] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:44
            [ 1353.633076] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 3:19
            [ 1378.715611] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp), waiting for 1 known clients (0 recovered, 0 in progress, and 0 evicted) to recover in 2:54
            [ 1403.797922] Lustre: lustre-MDT0000: Denying connection for new client 09f93b4c-bd9c-dc16-257f-001e9b331e63 (at 10.240.22.7@tcp
            [ 1529.217734] Lustre: Skipped 2 previous similar messages
            [ 1553.300393] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
            [ 1553.301864] Lustre: lustre-MDT0000: disconnecting 1 stale clients
            [ 1553.443706] Lustre: lustre-MDT0000: Recovery over after 5:00, of 1 clients 0 recovered and 1 was evicted.
            [ 1555.298604] Lustre: lustre-MDT0000: Connection restored to 6e2a8795-6a65-b3f1-be54-38d6f84216b5 (at 10.240.22.7@tcp)
            [ 5227.763272] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
            [ 5227.768866] Kernel panic - not syncing: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
            [ 5227.771388] CPU: 0 PID: 20720 Comm: mmp Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.25.1.el7_lustre.x86_64 #1
            [ 5227.773400] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [ 5227.774396] Call Trace:
            [ 5227.774911]  [<ffffffffb8b8311a>] dump_stack+0x19/0x1b
            [ 5227.775815]  [<ffffffffb8b7c672>] panic+0xe8/0x21f
            [ 5227.777042]  [<ffffffffc0902446>] zio_suspend+0x116/0x120 [zfs]
            [ 5227.778109]  [<ffffffffc0888a4c>] mmp_thread+0x41c/0x4c0 [zfs]
            [ 5227.779164]  [<ffffffffc0888630>] ? mmp_write_done+0x140/0x140 [zfs]
            [ 5227.780316]  [<ffffffffc0755063>] thread_generic_wrapper+0x73/0x80 [spl]
            [ 5227.781478]  [<ffffffffc0754ff0>] ? __thread_exit+0x20/0x20 [spl]
            [ 5227.782565]  [<ffffffffb84c5da1>] kthread+0xd1/0xe0
            [ 5227.783424]  [<ffffffffb84c5cd0>] ? insert_kthread_work+0x40/0x40
            [ 5227.784487]  [<ffffffffb8b95df7>] ret_from_fork_nospec_begin+0x21/0x21
            [ 5227.785622]  [<ffffffffb84c5cd0>] ? insert_kthread_work+0x40/0x40
            

            Tried a second time; this time, with no I/O, the MDS cannot be mounted and dmesg shows "pool suspended", similar to LU-9845:

            [  100.065432] Key type id_resolver registered
            [  100.066277] Key type id_legacy registered
            [  358.013062] Rounding down aligned max_sectors from 4294967295 to 4294967288
            [  358.039021] Loading iSCSI transport class v2.0-870.
            [  358.050683] iscsi: registered transport (iser)
            [  358.107129] RPC: Registered rdma transport module.
            [  358.108114] RPC: Registered rdma backchannel transport module.
            [  574.759980] spl: loading out-of-tree module taints kernel.
            [  574.763735] spl: module verification failed: signature and/or required key missing - tainting kernel
            [  574.770148] SPL: Loaded module v0.7.13-1
            [  574.771122] znvpair: module license 'CDDL' taints kernel.
            [  574.772226] Disabling lock debugging due to kernel taint
            [  576.704935] ZFS: Loaded module v0.7.13-1, ZFS pool version 5000, ZFS filesystem version 5
            [ 3621.200468] WARNING: MMP writes to pool 'lustre-mdt1' have not succeeded in over 5s; suspending pool
            [ 3621.202323] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.
            
            
            [root@onyx-20vm6 ~]# zpool list
            NAME          SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
            lustre-mdt1  5.97G  99.9M  5.87G         -    40%     1%  1.00x  SUSPENDED  -
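            If the underlying devices are healthy again, a suspended pool can usually be resumed without a reboot; a minimal sketch, assuming the MMP stall was transient and the standard zpool CLI:

            # Clear the error state and resume I/O on the suspended pool
            zpool clear lustre-mdt1
            zpool status lustre-mdt1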
            
            bzzz Alex Zhuravlev added a comment -

            [ 2868.632037] WARNING: MMP writes to pool 'lustre-mdt3' have not succeeded in over 60004 ms; suspending pool. Hrtime 2868632010566

            this is 60 seconds, right?
            https://testing.whamcloud.com/test_sets/719b272d-26d8-491c-9853-1291a8d4ede6
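            For reference, with MMP enabled the pool is suspended after roughly zfs_multihost_fail_intervals * zfs_multihost_interval milliseconds without a successful MMP write, so a 60004 ms stall does correspond to a ~60 s window. A sketch for checking those module parameters on the server, assuming they are exposed under /sys/module/zfs/parameters:

            # MMP write interval (ms) and the number of missed intervals that suspend the pool
            cat /sys/module/zfs/parameters/zfs_multihost_interval
            cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals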

            sebastien Sebastien Buisson added a comment - +1 in recovery-small test_17a https://testing.whamcloud.com/test_sets/90191f3b-00b4-45e9-b6b7-be55b45bbedc
            lixi_wc Li Xi added a comment - +1 https://testing.whamcloud.com/test_sets/7494598f-5f66-4f53-99c2-a42fd4422b99
            lixi_wc Li Xi added a comment - +1 for sanity:413a https://testing.whamcloud.com/test_sets/91ccf494-32bd-4502-bff5-c9cba0cbc108
            scherementsev Sergey Cheremencev added a comment - replay-dual test_21a https://testing.whamcloud.com/test_sets/6e31af1a-3eb5-484a-8aca-000eb45c172c

            jamesanunez James Nunez (Inactive) added a comment -

            We’ve seen this crash for several test suites:
            Lustre 2.13.56.23 - mds-survey test_1 - https://testing.whamcloud.com/test_sets/edd10cda-ed14-4cd3-a1b7-6df26a6cc734
            Lustre 2.13.56.44 - parallel-scale-nfsv3 test_compilebench - https://testing.whamcloud.com/test_sets/546e48a1-cac0-4c05-866b-5bbe4fd0925f
            Lustre 2.13.57.53 - recovery-mds-scale test_failover_mds - https://testing.whamcloud.com/test_sets/c3d24d75-de7f-497b-af46-452764362666
            Lustre 2.13.57.53 - recovery-random-scale test_fail_client_mds - https://testing.whamcloud.com/test_sets/b9f869f9-3b10-4d3e-9c43-98c5d333eb46

            sarah Sarah Liu added a comment - hit the same issue in master zfs failover https://testing.whamcloud.com/test_sets/7e12ed8b-b695-46d6-9103-a35b55b378a0
            sarah Sarah Liu added a comment -

            Hit the problem in a rolling upgrade from 2.10.8 EL7.6 to 2.12.5 EL7.8 zfs.
            After rolling upgrade of all servers and clients to 2.12.5, sanity tests 42e and 180c hit the same crash on the OSS:

            [  907.517100] Lustre: DEBUG MARKER: == sanity test 42e: verify sub-RPC writes are not done synchronously ================================= 06:16:17 (1591337777)
            [  910.460672] Lustre: lustre-OST0000: Connection restored to 47f23d53-3404-b4c5-e304-97df136e115c (at 10.9.6.157@tcp)
            [  910.462673] Lustre: Skipped 1 previous similar message
            [  922.630198] WARNING: MMP writes to pool 'lustre-ost1' have not succeeded in over 5s; suspending pool
            [  922.631786] Kernel panic - not syncing: Pool 'lustre-ost1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
            [  922.634113] CPU: 1 PID: 2823 Comm: mmp Kdump: loaded Tainted: P           OE  ------------   3.10.0-1127.8.2.el7_lustre.x86_64 #1
            [  922.635942] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [  922.636859] Call Trace:
            [  922.637279]  [<ffffffffa717ffa5>] dump_stack+0x19/0x1b
            [  922.638104]  [<ffffffffa7179541>] panic+0xe8/0x21f
            [  922.638963]  [<ffffffffc05b7446>] zio_suspend+0x116/0x120 [zfs]
            [  922.639937]  [<ffffffffc053da4c>] mmp_thread+0x41c/0x4c0 [zfs]
            [  922.640900]  [<ffffffffc053d630>] ? mmp_write_done+0x140/0x140 [zfs]
            [  922.641928]  [<ffffffffc02f6063>] thread_generic_wrapper+0x73/0x80 [spl]
            [  922.642999]  [<ffffffffc02f5ff0>] ? __thread_exit+0x20/0x20 [spl]
            [  922.643977]  [<ffffffffa6ac6691>] kthread+0xd1/0xe0
            [  922.644765]  [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
            [  922.645735]  [<ffffffffa7192d37>] ret_from_fork_nospec_begin+0x21/0x21
            [  922.646777]  [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            
            

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
