LU-10956

sanity-pfl test_3: Kernel panic - not syncing: Pool has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.12.8, Lustre 2.15.3
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/252abdaa-477b-11e8-95c0-52540065bddc

      test_3 failed with the following error:

      Test crashed during sanity-pfl test_3
      

      env: RHEL7 zfs DNE tag-2.11.51

      This is the trace found in kernel-crash.log (see the note on the failmode/MMP settings directly after the trace):

      [34408.762645] Lustre: DEBUG MARKER: dmesg
      [34409.519801] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 \(1524545030\)
      [34409.734904] Lustre: DEBUG MARKER: == sanity-pfl test 3: Delete component from existing file ============================================ 04:43:50 (1524545030)
      [34434.509312] Lustre: lustre-OST0006: Client lustre-MDT0001-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
      [34434.512144] Lustre: lustre-OST0006: Client lustre-MDT0003-mdtlov_UUID (at 10.9.4.25@tcp) reconnecting
      [34434.512149] Lustre: Skipped 7 previous similar messages
      [34434.516050] WARNING: MMP writes to pool 'lustre-ost2' have not succeeded in over 20s; suspending pool
      [34434.516059] Kernel panic - not syncing: Pool 'lustre-ost2' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
      [34434.516071] CPU: 0 PID: 16454 Comm: mmp Tainted: P OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
      [34434.516072] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [34434.516077] Call Trace:
      [34434.516133] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
      [34434.516137] [<ffffffff816a8634>] panic+0xe8/0x21f
      [34434.516443] [<ffffffffc05734a6>] zio_suspend+0x106/0x110 [zfs]
      [34434.516470] [<ffffffffc04fa322>] mmp_thread+0x322/0x4a0 [zfs]
      [34434.516491] [<ffffffffc04fa000>] ? mmp_write_done+0x1d0/0x1d0 [zfs]
      [34434.516528] [<ffffffffc03aefc3>] thread_generic_wrapper+0x73/0x80 [spl]
      [34434.516532] [<ffffffffc03aef50>] ? __thread_exit+0x20/0x20 [spl]
      [34434.516555] [<ffffffff810b4031>] kthread+0xd1/0xe0
      [34434.516558] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
      [34434.516574] [<ffffffff816c0577>] ret_from_fork+0x77/0xb0
      [34434.516577] [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
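
      For context, the panic is the configured behavior rather than a ZFS crash per se: the ZFS multi-mount protection (MMP) heartbeat writes to 'lustre-ost2' stalled past their tolerance window, the pool was suspended, and the pool's failmode=panic property turned the suspension into a kernel panic. A minimal sketch for inspecting the relevant settings on a test node (the pool name comes from the log above; the properties and module parameters are standard ZFS, not taken from this ticket):

      # failmode=panic makes any uncorrectable pool I/O failure,
      # including an MMP suspension, panic the node
      zpool get failmode lustre-ost2

      # multihost=on enables the MMP heartbeat writes seen in the log
      zpool get multihost lustre-ost2

      # MMP suspends the pool roughly after
      #   zfs_multihost_interval (ms) * zfs_multihost_fail_intervals
      # without a successful heartbeat write
      cat /sys/module/zfs/parameters/zfs_multihost_interval
      cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals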
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-pfl test_3 - Test crashed during sanity-pfl test_3

    Attachments

    Issue Links

    Activity

            bzzz Alex Zhuravlev added a comment -

            [ 2868.632037] WARNING: MMP writes to pool 'lustre-mdt3' have not succeeded in over 60004 ms; suspending pool. Hrtime 2868632010566

            This is 60 seconds, right?
            https://testing.whamcloud.com/test_sets/719b272d-26d8-491c-9853-1291a8d4ede6
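
            For reference on the question above: 60004 ms is indeed just over 60 seconds. MMP suspends a pool once heartbeat writes have failed for the whole tolerance window, roughly zfs_multihost_interval * zfs_multihost_fail_intervals; the actual values on the failing node are not in the log, so the split below is only an illustration:

            # hypothetical tunables that would yield a ~60 s window
            interval_ms=6000      # zfs_multihost_interval (assumed, not from the log)
            fail_intervals=10     # zfs_multihost_fail_intervals (assumed)
            echo $(( interval_ms * fail_intervals ))   # 60000 ms, matching the ~60004 ms above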
            sebastien Sebastien Buisson added a comment - +1 in recovery-small test_17a https://testing.whamcloud.com/test_sets/90191f3b-00b4-45e9-b6b7-be55b45bbedc
            lixi_wc Li Xi added a comment - +1 https://testing.whamcloud.com/test_sets/7494598f-5f66-4f53-99c2-a42fd4422b99
            lixi_wc Li Xi added a comment - +1 for sanity:413a https://testing.whamcloud.com/test_sets/91ccf494-32bd-4502-bff5-c9cba0cbc108
            scherementsev Sergey Cheremencev added a comment - replay-dual test_21a https://testing.whamcloud.com/test_sets/6e31af1a-3eb5-484a-8aca-000eb45c172c

            jamesanunez James Nunez (Inactive) added a comment -

            We’ve seen this crash for several test suites:
            Lustre 2.13.56.23 - mds-survey test_1 - https://testing.whamcloud.com/test_sets/edd10cda-ed14-4cd3-a1b7-6df26a6cc734
            Lustre 2.13.56.44 - parallel-scale-nfsv3 test_compilebench - https://testing.whamcloud.com/test_sets/546e48a1-cac0-4c05-866b-5bbe4fd0925f
            Lustre 2.13.57.53 - recovery-mds-scale test_failover_mds - https://testing.whamcloud.com/test_sets/c3d24d75-de7f-497b-af46-452764362666
            Lustre 2.13.57.53 - recovery-random-scale test_fail_client_mds - https://testing.whamcloud.com/test_sets/b9f869f9-3b10-4d3e-9c43-98c5d333eb46
            sarah Sarah Liu added a comment - hit the same issue in master zfs failover https://testing.whamcloud.com/test_sets/7e12ed8b-b695-46d6-9103-a35b55b378a0
            sarah Sarah Liu added a comment -

            Hit the problem in a rolling upgrade from 2.10.8 EL7.6 to 2.12.5 EL7.8 with zfs.
            After rolling upgrade of all servers and clients to 2.12.5, sanity tests 42e and 180c hit the same crash on the OSS:

            [  907.517100] Lustre: DEBUG MARKER: == sanity test 42e: verify sub-RPC writes are not done synchronously ================================= 06:16:17 (1591337777)
            [  910.460672] Lustre: lustre-OST0000: Connection restored to 47f23d53-3404-b4c5-e304-97df136e115c (at 10.9.6.157@tcp)
            [  910.462673] Lustre: Skipped 1 previous similar message
            [  922.630198] WARNING: MMP writes to pool 'lustre-ost1' have not succeeded in over 5s; suspending pool
            [  922.631786] Kernel panic - not syncing: Pool 'lustre-ost1' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic.
            [  922.634113] CPU: 1 PID: 2823 Comm: mmp Kdump: loaded Tainted: P           OE  ------------   3.10.0-1127.8.2.el7_lustre.x86_64 #1
            [  922.635942] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [  922.636859] Call Trace:
            [  922.637279]  [<ffffffffa717ffa5>] dump_stack+0x19/0x1b
            [  922.638104]  [<ffffffffa7179541>] panic+0xe8/0x21f
            [  922.638963]  [<ffffffffc05b7446>] zio_suspend+0x116/0x120 [zfs]
            [  922.639937]  [<ffffffffc053da4c>] mmp_thread+0x41c/0x4c0 [zfs]
            [  922.640900]  [<ffffffffc053d630>] ? mmp_write_done+0x140/0x140 [zfs]
            [  922.641928]  [<ffffffffc02f6063>] thread_generic_wrapper+0x73/0x80 [spl]
            [  922.642999]  [<ffffffffc02f5ff0>] ? __thread_exit+0x20/0x20 [spl]
            [  922.643977]  [<ffffffffa6ac6691>] kthread+0xd1/0xe0
            [  922.644765]  [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
            [  922.645735]  [<ffffffffa7192d37>] ret_from_fork_nospec_begin+0x21/0x21
            [  922.646777]  [<ffffffffa6ac65c0>] ? insert_kthread_work+0x40/0x40
            [    0.000000] Initializing cgroup subsys cpuset
            [    0.000000] Initializing cgroup subsys cpu
            
            
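
            The 5 s window in the log above is much tighter than the 20 s and 60 s windows reported earlier, so slow VM storage is even more likely to trip it. If the goal is only to keep autotest VMs from panicking on transient I/O stalls, one option (standard ZFS multihost tunables; this ticket has not settled on a fix) is to widen the MMP tolerance:

            # runtime: tolerate more missed heartbeats before suspending the pool
            echo 20 > /sys/module/zfs/parameters/zfs_multihost_fail_intervals

            # persistent: apply the same setting at module load
            echo "options zfs zfs_multihost_fail_intervals=20" > /etc/modprobe.d/zfs-mmp.conf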

    People

      Assignee: wc-triage WC Triage
      Reporter: maloo Maloo
      Votes: 0
      Watchers: 9