Lustre / LU-5780

Corrupted OST and very long running e2fsck

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version: Lustre 2.5.2
    • Environment: CentOS 6.5, kernel 2.6.32-431.17.1.el6_lustre.x86_64, e2fsprogs 1.42.12.wc1-7

    Description

      We had an OST on Stampede experience corruption and e2fsck does not appear to be making much progress in the repairs.

      The situation: a drive failed, and the mdadm software RAID (RAID-6, 8+2 3TB drives) detected the failure and automatically added the hot spare into the raid-6 device. The rebuild of the device started, then paused shortly after the resync began, at 0.3% progress, and the kernel started to issue handle_stripe errors as shown in the attached system log. The next morning, the system administrator came in and saw that the drive had failed and that the rebuild was not making progress (it was still stuck at 0.3%). The admin then removed the failed drive from the array using the mdadm -r command and the array resumed the rebuild process. At that point everything looked fine with the filesystem: the rebuild looked like it was doing the right thing and the filesystem was operating normally. Then at 16:49 ldiskfs detected corruption and remounted the filesystem read-only. At this point the rebuild of the array was still in progress at about 50% and appeared to have been working correctly.
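      As an editorial sketch of the remove-and-resume sequence described above (device names here are placeholders, not the actual Stampede devices):

```shell
# Hedged sketch of the recovery steps described above.
# /dev/md0 and /dev/sdj are hypothetical device names.

cat /proc/mdstat            # a stuck resync shows a static percentage here
mdadm --detail /dev/md0     # per-member state: faulty / spare / rebuilding

# Remove the failed member so the resync can proceed (the "mdadm -r" step):
mdadm /dev/md0 --remove /dev/sdj

cat /proc/mdstat            # the resync should now make progress again
```

      The commands are standard mdadm manage-mode usage; they require a live md array and root privileges, so they are shown only to make the narrative concrete.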

      The main issue now is that e2fsck on the device is taking far too long and requiring too much memory. We have now been running e2fsck for more than 5 days and it keeps emitting errors similar to:

      Inode 76812 is too big. Truncate? yes

      Block #1 (3115984703) causes symlink to be too big. CLEARED.
      Block #2 (1071886003) causes symlink to be too big. CLEARED.
      Block #3 (318283451) causes symlink to be too big. CLEARED.
      Block #4 (1071886789) causes symlink to be too big. CLEARED.
      Block #5 (3173575420) causes symlink to be too big. CLEARED.
      Block #6 (1071898601) causes symlink to be too big. CLEARED.
      Block #7 (742927844) causes symlink to be too big. CLEARED.
      Block #8 (1071900730) causes symlink to be too big. CLEARED.
      Block #9 (2126297851) causes symlink to be too big. CLEARED.
      Block #10 (1071885849) causes symlink to be too big. CLEARED.
      Block #11 (3395849213) causes symlink to be too big. CLEARED.
      Too many illegal blocks in inode 76812.
      Clear inode? yes

      Inode 76865 has INDEX_FL flag set but is not a directory.
      Clear HTree index? yes

      Inode 76865 has INDEX_FL flag set on filesystem without htree support.
      Clear HTree index? yes

      Inode 76865, i_size is 2450935918319985952, should be 0. Fix? yes

      Inode 76865, i_blocks is 144761049163443, should be 0. Fix? yes

      Inode 76906 has compression flag set on filesystem without compression support. Clear? yes

      Inode 76906 is too big. Truncate? yes

      Block #797810728 (827805746) causes file to be too big. CLEARED.
      Block #797810729 (1482179393) causes file to be too big. CLEARED.
      Block #797810730 (842676282) causes file to be too big. CLEARED.
      Block #797810731 (976236851) causes file to be too big. CLEARED.
      Block #797810732 (842084665) causes file to be too big. CLEARED.
      Block #797810733 (959657786) causes file to be too big. CLEARED.
      Block #797810734 (828322868) causes file to be too big. CLEARED.
      Block #797810735 (809127482) causes file to be too big. CLEARED.
      Block #797810736 (1413567546) causes file to be too big. CLEARED.
      Block #797810737 (156516673) causes file to be too big. CLEARED.
      Block #797810738 (1917258029) causes file to be too big. CLEARED.
      Too many illegal blocks in inode 76906.
      Clear inode? yes

      At this point we suspect that the array experienced corruption too severe to recover from. Any suggestions on how to get e2fsck to complete in a timely manner, or on how to get the OST into a state where we could mount it and copy off what files we can?
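      On the salvage question, one approach sometimes used with damaged ldiskfs OSTs is a read-only mount that skips journal replay, then copying off whatever object data is still reachable. This is a sketch under assumptions (the device and mount point names are hypothetical), not advice specific to this array:

```shell
# Hedged sketch: mount the OST read-only without replaying the journal,
# then inspect what is still reachable.
# /dev/md0 and /mnt/ost-salvage are hypothetical names.

mkdir -p /mnt/ost-salvage
mount -t ldiskfs -o ro,noload /dev/md0 /mnt/ost-salvage

# On an ldiskfs OST, object data lives under the O/ directory:
ls /mnt/ost-salvage/O
```

      The noload option avoids replaying a possibly-corrupt journal; whether a read-only mount succeeds at all depends on how badly the superblock and group descriptors are damaged.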

      It would be helpful if the e2fsck process showed some type of progress, as it does for ext3/4 filesystems, to give us some indication of how much longer it might take. The last e2fsck message indicates that it is on about inode 76906 and is using about 70GB of memory. debugfs reports the following about the filesystem:

      debugfs: stats
      Filesystem volume name: scratch-OST0010
      Last mounted on: /
      Filesystem UUID: 2989173c-9d2b-4fe5-8f45-99c0eebfda57
      Filesystem magic number: 0xEF53
      Filesystem revision #: 1 (dynamic)
      Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota
      Filesystem flags: signed_directory_hash
      Default mount options: user_xattr acl
      Filesystem state: not clean with errors
      Errors behavior: Continue
      Filesystem OS type: Linux
      Inode count: 22892800
      Block count: 5860530816
      Reserved block count: 0
      Free blocks: 1588593434
      Free inodes: 18496224
      First block: 0
      Block size: 4096
      Fragment size: 4096
      Group descriptor size: 64
      Reserved GDT blocks: 1024
      Blocks per group: 32768
      Fragments per group: 32768
      Inodes per group: 128
      Inode blocks per group: 8
      RAID stride: 8
      RAID stripe width: 64
      Flex block group size: 256
      Filesystem created: Tue Dec 11 08:50:05 2012
      Last mount time: Tue Jun 17 18:29:25 2014
      Last write time: Wed Oct 8 16:57:09 2014
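      On the progress and memory points: e2fsprogs does have a progress indicator (-C), and e2fsck.conf(5) documents an on-disk scratch-file mode that can bound memory use. Whether either helps at this scale is untested here; the sketch below demonstrates the flags on a small scratch image rather than on the real OST device:

```shell
# Hedged sketch -- paths are placeholders; on the real system the target
# would be the OST block device, not a scratch image.

# Create a small scratch ext4 image just to demonstrate the flags:
dd if=/dev/zero of=/tmp/e2fsck-demo.img bs=1M count=16 2>/dev/null
mke2fs -q -F -t ext4 /tmp/e2fsck-demo.img

# -C 0 prints a progress bar to stdout (the progress reporting asked for
# above); -f forces a full check and -y answers yes to every question.
e2fsck -C 0 -fy /tmp/e2fsck-demo.img
```

      For the memory growth, e2fsck.conf(5) describes a [scratch_files] section (e.g. directory = /var/cache/e2fsck) that moves some in-memory inode-tracking structures to disk, trading speed for a bounded footprint; whether that is practical for a 5.8-billion-block OST is untested here.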

          Activity


            bogl Bob Glossman (Inactive) added a comment -

            Interesting. After producing errors like those shown above for many minutes, the node I was testing on appears to have crashed: it stopped outputting anything to the console and stopped responding to pings. It seems entirely dead.
            bogl Bob Glossman (Inactive) added a comment - - edited

            Similar errors are seen on a recent upstream Linux kernel, version 3.19.0-rc7.
            Example errors:

            Feb 12 17:53:57 shadow-33 kernel: WARNING: CPU: 1 PID: 2352 at drivers/md/raid5.c:3893 handle_stripe+0x2622/0x27a0 [raid456]()
            Feb 12 17:53:57 shadow-33 kernel: Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache cfg80211 x86_pkg_temp_thermal coretemp kvm_intel rfkill kvm crct10dif_pclmul crc32_pclmul ib_ipoib rdma_ucm crc32c_intel ib_ucm ioatdma ib_uverbs ib_umad rdma_cm ghash_clmulni_intel sb_edac edac_core ib_cm iTCO_wdt iw_cm iTCO_vendor_support microcode ipmi_devintf tpm_tis shpchp pcspkr joydev ipmi_si tpm ipmi_msghandler mei_me lpc_ich i2c_i801 wmi mei mfd_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc mlx4_ib ib_sa mlx4_en ib_mad ib_core ib_addr mgag200 drm_kms_helper ttm isci drm igb mlx4_core libsas ptp pps_core dca scsi_transport_sas i2c_algo_bit
            Feb 12 17:53:57 shadow-33 kernel: CPU: 1 PID: 2352 Comm: md1_raid5 Tainted: G        W I    3.19.0-rc7 #1
            Feb 12 17:53:57 shadow-33 kernel: Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.99.99.x045.022820121209 02/28/2012
            Feb 12 17:53:57 shadow-33 kernel: 0000000000000000 00000000b47fa71a ffff88082512fa88 ffffffff818496fe
            Feb 12 17:53:57 shadow-33 kernel: 0000000000000000 0000000000000000 ffff88082512fac8 ffffffff810a705a
            Feb 12 17:53:57 shadow-33 kernel: 0000000000000000 ffff880822c25200 0000000000000001 0000000000000001
            Feb 12 17:53:57 shadow-33 kernel: Call Trace:
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff818496fe>] dump_stack+0x4c/0x65
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810a705a>] warn_slowpath_common+0x8a/0xc0
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810a718a>] warn_slowpath_null+0x1a/0x20
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffffa06d4922>] handle_stripe+0x2622/0x27a0 [raid456]
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffffa06d4e0d>] handle_active_stripes.isra.39+0x36d/0x540 [raid456]
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81679d09>] ? md_wakeup_thread+0x39/0x70
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffffa06d54b7>] raid5d+0x4d7/0x7b0 [raid456]
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81102445>] ? trace_hardirqs_on_caller+0x105/0x1d0
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff8167a0f7>] md_thread+0x157/0x170
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810f87c0>] ? prepare_to_wait_event+0x110/0x110
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81679fa0>] ? md_stop+0x40/0x40
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810cdf7d>] kthread+0x10d/0x130
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810e4535>] ? local_clock+0x25/0x30
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810cde70>] ? kthread_create_on_node+0x250/0x250
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81853cfc>] ret_from_fork+0x7c/0xb0
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810cde70>] ? kthread_create_on_node+0x250/0x250
            Feb 12 17:53:57 shadow-33 kernel: ---[ end trace e66438a9d130171c ]---
            Feb 12 17:53:57 shadow-33 kernel: ------------[ cut here ]------------
            Feb 12 17:53:57 shadow-33 kernel: WARNING: CPU: 1 PID: 2352 at drivers/md/raid5.c:3893 handle_stripe+0x2622/0x27a0 [raid456]()
            Feb 12 17:53:57 shadow-33 kernel: Modules linked in: raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache cfg80211 x86_pkg_temp_thermal coretemp kvm_intel rfkill kvm crct10dif_pclmul crc32_pclmul ib_ipoib rdma_ucm crc32c_intel ib_ucm ioatdma ib_uverbs ib_umad rdma_cm ghash_clmulni_intel sb_edac edac_core ib_cm iTCO_wdt iw_cm iTCO_vendor_support microcode ipmi_devintf tpm_tis shpchp pcspkr joydev ipmi_si tpm ipmi_msghandler mei_me lpc_ich i2c_i801 wmi mei mfd_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc mlx4_ib ib_sa mlx4_en ib_mad ib_core ib_addr mgag200 drm_kms_helper ttm isci drm igb mlx4_core libsas ptp pps_core dca scsi_transport_sas i2c_algo_bit
            Feb 12 17:53:57 shadow-33 kernel: CPU: 1 PID: 2352 Comm: md1_raid5 Tainted: G        W I    3.19.0-rc7 #1
            Feb 12 17:53:57 shadow-33 kernel: Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.99.99.x045.022820121209 02/28/2012
            Feb 12 17:53:57 shadow-33 kernel: 0000000000000000 00000000b47fa71a ffff88082512fa88 ffffffff818496fe
            Feb 12 17:53:57 shadow-33 kernel: 0000000000000000 0000000000000000 ffff88082512fac8 ffffffff810a705a
            Feb 12 17:53:57 shadow-33 kernel: 0000000000000000 ffff88081a072e20 0000000000000001 0000000000000001
            Feb 12 17:53:57 shadow-33 kernel: Call Trace:
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff818496fe>] dump_stack+0x4c/0x65
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810a705a>] warn_slowpath_common+0x8a/0xc0
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810a718a>] warn_slowpath_null+0x1a/0x20
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffffa06d4922>] handle_stripe+0x2622/0x27a0 [raid456]
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffffa06d4e0d>] handle_active_stripes.isra.39+0x36d/0x540 [raid456]
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81679d09>] ? md_wakeup_thread+0x39/0x70
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffffa06d54b7>] raid5d+0x4d7/0x7b0 [raid456]
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81102445>] ? trace_hardirqs_on_caller+0x105/0x1d0
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff8167a0f7>] md_thread+0x157/0x170
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810f87c0>] ? prepare_to_wait_event+0x110/0x110
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81679fa0>] ? md_stop+0x40/0x40
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810cdf7d>] kthread+0x10d/0x130
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810e4535>] ? local_clock+0x25/0x30
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810cde70>] ? kthread_create_on_node+0x250/0x250
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff81853cfc>] ret_from_fork+0x7c/0xb0
            Feb 12 17:53:57 shadow-33 kernel: [<ffffffff810cde70>] ? kthread_create_on_node+0x250/0x250
            Feb 12 17:53:57 shadow-33 kernel: ---[ end trace e66438a9d130171d ]---
            
            
            green Oleg Drokin added a comment -

            Bob, I wonder if you can try on a recent Linus tree? I wonder if this was ever fixed upstream, since I never saw a resolution to that message on the mailing list.

            bogl Bob Glossman (Inactive) added a comment - - edited

            I see very similar, repeated errors on the latest el6.6 kernel version, 2.6.32-504.8.1, with raid5,
            so I'm pretty sure it's no better in el6.6.

            example errors:

            Feb  6 16:41:43 onyx-25 kernel: WARNING: at drivers/md/raid5.c:3863 handle_stripe+0x2920/0x2a10 [raid456]() (Tainted: G        W  ---------------   )
            Feb  6 16:41:43 onyx-25 kernel: Hardware name: S2600GZ
            Feb  6 16:41:43 onyx-25 kernel: Modules linked in: raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iTCO_wdt iTCO_vendor_support sg microcode sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Feb  6 16:41:43 onyx-25 kernel: Pid: 11359, comm: md1_raid5 Tainted: G        W  ---------------    2.6.32-504.8.1.el6_lustre.g8b9b97b.x86_64 #1
            Feb  6 16:41:43 onyx-25 kernel: Call Trace:
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffffa0452ea0>] ? handle_stripe+0x2920/0x2a10 [raid456]
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8126fd90>] ? blk_delay_work+0x0/0x60
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8152a27e>] ? thread_return+0x4e/0x7d0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81413d6b>] ? md_super_wait+0x7b/0x90
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffffa0453158>] ? handle_active_stripes+0x1c8/0x4f0 [raid456]
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffffa0453a11>] ? raid5d+0x441/0x590 [raid456]
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81416275>] ? md_thread+0x115/0x150
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81416160>] ? md_thread+0x0/0x150
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
            Feb  6 16:41:43 onyx-25 kernel: ---[ end trace 92f481b646f009d3 ]---
            Feb  6 16:41:43 onyx-25 kernel: ------------[ cut here ]------------
            Feb  6 16:41:43 onyx-25 kernel: WARNING: at drivers/md/raid5.c:3863 handle_stripe+0x2920/0x2a10 [raid456]() (Tainted: G        W  ---------------   )
            Feb  6 16:41:43 onyx-25 kernel: Hardware name: S2600GZ
            Feb  6 16:41:43 onyx-25 kernel: Modules linked in: raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iTCO_wdt iTCO_vendor_support sg microcode sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_core wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Feb  6 16:41:43 onyx-25 kernel: Pid: 11359, comm: md1_raid5 Tainted: G        W  ---------------    2.6.32-504.8.1.el6_lustre.g8b9b97b.x86_64 #1
            Feb  6 16:41:43 onyx-25 kernel: Call Trace:
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffffa0452ea0>] ? handle_stripe+0x2920/0x2a10 [raid456]
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8126fd90>] ? blk_delay_work+0x0/0x60
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8152a27e>] ? thread_return+0x4e/0x7d0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81413d6b>] ? md_super_wait+0x7b/0x90
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffffa0453158>] ? handle_active_stripes+0x1c8/0x4f0 [raid456]
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffffa0453a11>] ? raid5d+0x441/0x590 [raid456]
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81416275>] ? md_thread+0x115/0x150
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff81416160>] ? md_thread+0x0/0x150
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
            Feb  6 16:41:43 onyx-25 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
            Feb  6 16:41:43 onyx-25 kernel: ---[ end trace 92f481b646f009d4 ]---
            
            green Oleg Drokin added a comment - - edited

            When I spoke with Sarah, she was able to hit the bug with raid5 100% of the time using the steps suggested in that email; that was using the rhel6.5 kernel. No rhel6.6 testing was performed, due to an unrelated problem as I understand it.


            minyard Tommy Minyard added a comment -

            Peter, we've put some measures in place to avoid the problem for now. We've disabled all the hot spares for the time being and have been adding drives in manually to make sure the rebuilds do the right thing. It seems that manually adding the new drive goes through a different code path than the automatic hot-spare resync, so we don't trigger the bug.

            Oleg, we are trying to reproduce the raid bug on our test filesystem, but it does not seem to occur reliably every time on raid6. I'm not sure if Sarah was able to get it to happen every time with raid5.

            Also, any news on whether the 2.6.32-504 version of the kernel for rhel6.6 has the same problem?

            Thanks for the assistance with this bug.
            green Oleg Drokin added a comment -

            Actually, a newer rhel6.5 kernel seems to reproduce the same problem on raid5. I know raid6 is used at TACC, but possibly your configuration is just more complex than the simplistic one Sarah tried, so the same code path is not executed?

            It would still be interesting for Sarah to try the current rhel6.6 kernel for completeness - I see they included many more raid patches in that kernel, even compared to rhel6.5.

            Additionally, it might be a good idea to ask Red Hat about the issue at this point, since a demonstrably similar problem is visible in the unmodified kernel using the reproduction steps, and those steps involve a hot spare and other similar elements.

            Tommy, if you can try on a raid array configured similarly to yours to see whether it hits there, that would be even more solid evidence, I think.

            pjones Peter Jones added a comment -

            Tommy

            When might you be able to schedule an upgrade to a newer kernel, to avoid being exposed to the issue identified above?

            Peter

            sarah Sarah Liu added a comment - - edited

            I tried with a newer 6.5 kernel. It turns out it hit the same error on raid5, while raid6 passed.

            [root@onyx-25 ~]# uname -a
            Linux onyx-25.onyx.hpdd.intel.com 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            

            on raid5

            ------------[ cut here ]------------
            WARNING: at drivers/md/raid5.c:3626 handle_stripe+0x2969/0x2980 [raid456]() (Tainted: G        W  ---------------   )
            Hardware name: S2600GZ
            Modules linked in: raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 iTCO_wdt iTCO_vendor_support microcode sg sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Pid: 3758, comm: md1_raid5 Tainted: G        W  ---------------    2.6.32-431.29.2.el6.x86_64 #1
            Call Trace:
             [<ffffffff81071b37>] ? warn_slowpath_common+0x87/0xc0
             [<ffffffff81071b8a>] ? warn_slowpath_null+0x1a/0x20
             [<ffffffffa035a269>] ? handle_stripe+0x2969/0x2980 [raid456]
             [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
             [<ffffffffa035a850>] ? raid5d+0x5d0/0x850 [raid456]
             [<ffffffff81417cc5>] ? md_thread+0x115/0x150
             [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
             [<ffffffff81417bb0>] ? md_thread+0x0/0x150
             [<ffffffff8109abf6>] ? kthread+0x96/0xa0
             [<ffffffff8100c20a>] ? child_rip+0xa/0x20
             [<ffffffff8109ab60>] ? kthread+0x0/0xa0
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            ---[ end trace 6a6e28ca438d065f ]---
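            Traces like the one above repeat once per affected stripe, which quickly floods the console. A small, hypothetical helper (not part of any tool mentioned in this ticket) can collapse a dmesg capture to one count per WARN_ON site, making it easy to see that every trace comes from the same raid5.c:3626 check:

```shell
#!/bin/sh
# Sketch only: count identical kernel WARNINGs by source location
# (file:line function), so thousands of repeated handle_stripe traces
# reduce to a single summary line. Reads dmesg-style text on stdin.
warn_sites() {
    grep -o 'WARNING: at [^ ]*:[0-9]* [^+]*' | sort | uniq -c
}

# Demonstration on two copies of the warning seen in this ticket:
warn_sites <<'EOF'
WARNING: at drivers/md/raid5.c:3626 handle_stripe+0x2969/0x2980 [raid456]()
WARNING: at drivers/md/raid5.c:3626 handle_stripe+0x2969/0x2980 [raid456]()
EOF
```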
            
            

            on raid6

            [root@onyx-25 ~]# mdadm -C /dev/md1 --force -l 6 -n 4 /dev/sd[b-e]
            mdadm: /dev/sdb appears to be part of a raid array:
                level=raid0 devices=0 ctime=Wed Dec 31 16:00:00 1969
            mdadm: partition table exists on /dev/sdb but will be lost or
                   meaningless after creating array
            mdadm: /dev/sdc appears to be part of a raid array:
                level=raid0 devices=0 ctime=Wed Dec 31 16:00:00 1969
            mdadm: partition table exists on /dev/sdc but will be lost or
                   meaningless after creating array
            mdadm: /dev/sdd appears to be part of a raid array:
                level=raid0 devices=0 ctime=Wed Dec 31 16:00:00 1969
            mdadm: partition table exists on /dev/sdd but will be lost or
                   meaningless after creating array
            mdadm: /dev/sde appears to be part of a raid array:
                level=raid0 devices=0 ctime=Wed Dec 31 16:00:00 1969
            mdadm: partition table exists on /dev/sde but will be lost or
                   meaningless after creating array
            Continue creating array? y
            mdadm: Defaulting to version 1.2 metadata
            md: bind<sdb>
            md: bind<sdc>
            md: bind<sdd>
            md: bind<sde>
            async_tx: api initialized (async)
            xor: automatically using best checksumming function: avx
               avx       : 19776.000 MB/sec
            xor: using function: avx (19776.000 MB/sec)
            raid6: sse2x1    8753 MB/s
            raid6: sse2x2   10832 MB/s
            raid6: sse2x4   12441 MB/s
            raid6: using algorithm sse2x4 (12441 MB/s)
            raid6: using ssse3x2 recovery algorithm
            md: raid6 personality registered for level 6
            md: raid5 personality registered for level 5
            md: raid4 personality registered for level 4
            bio: create slab <bio-1> at 1
            md/raid:md1: not clean -- starting background reconstruction
            md/raid:md1: device sde operational as raid disk 3
            md/raid:md1: device sdd operational as raid disk 2
            md/raid:md1: device sdc operational as raid disk 1
            md/raid:md1: device sdb operational as raid disk 0
            md/raid:md1: allocated 4314kB
            md/raid:md1: raid level 6 active with 4 out of 4 devices, algorithm 2
            md1: detected capacity change from 0 to 999946190848
            md: resync of RAID array md1
            mdadm: array /dev/md1 started.
            
            md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
            md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
            md: using 128k window, over a total of 488254976k.
             md1: unknown partition table
            [root@onyx-25 ~]#
            md: bind<sdf>
            md: md_do_sync() got signal ... exiting
            md: checkpointing resync of md1.
            md: recovery of RAID array md1
            md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
            md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
            md: using 128k window, over a total of 488254976k.
            
            green Oleg Drokin added a comment -

            OK, so it is susceptible after all. Does the raid resync ever complete, I wonder?

            Can you try the same thing on a more modern rhel6.5 kernel, e.g. 2.6.32-431.29.2.el6 ?

            Thanks!

            sarah Sarah Liu added a comment - - edited

            I did the following, as Oleg suggested in the previous comment:

            1. On node onyx-23, I installed the 2.6.32-431.17.1.el6 kernel:

            [root@onyx-23 ~]# uname -a
            Linux onyx-23.onyx.hpdd.intel.com 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 23:32:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            

            2. Ran the following commands:

            [root@onyx-23 ~]# mdadm -C /dev/md1 --force -l 5 -n 3 /dev/sdf[1-3]
            
            [root@onyx-23 ~]# mdadm /dev/md1 -a /dev/sdf5
            mdadm: added /dev/sdf5
            [root@onyx-23 ~]# echo want_replacement > /sys/block/md1/md/dev-sdf2/state
            [root@onyx-23 ~]# echo "idle" >  /sys/block/md1/md/sync_action
            

            Then keep seeing these warnings:

            ------------[ cut here ]------------
            WARNING: at drivers/md/raid5.c:3626 handle_stripe+0x2969/0x2980 [raid456]() (Tainted: G        W  ---------------   )
            Hardware name: S2600GZ
            Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx iTCO_wdt iTCO_vendor_support microcode sg sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Pid: 3763, comm: md1_raid5 Tainted: G        W  ---------------    2.6.32-431.17.1.el6.x86_64 #1
            Call Trace:
             [<ffffffff81071a97>] ? warn_slowpath_common+0x87/0xc0
             [<ffffffff81071aea>] ? warn_slowpath_null+0x1a/0x20
             [<ffffffffa026c269>] ? handle_stripe+0x2969/0x2980 [raid456]
             [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
             [<ffffffffa026c850>] ? raid5d+0x5d0/0x850 [raid456]
             [<ffffffff81416f65>] ? md_thread+0x115/0x150
             [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
             [<ffffffff81416e50>] ? md_thread+0x0/0x150
             [<ffffffff8109ab56>] ? kthread+0x96/0xa0
             [<ffffffff8100c20a>] ? child_rip+0xa/0x20
             [<ffffffff8109aac0>] ? kthread+0x0/0xa0
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            ---[ end trace 36abf499708cf848 ]---
            ------------[ cut here ]------------
            WARNING: at drivers/md/raid5.c:3626 handle_stripe+0x2969/0x2980 [raid456]() (Tainted: G        W  ---------------   )
            Hardware name: S2600GZ
            Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx iTCO_wdt iTCO_vendor_support microcode sg sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            Pid: 3763, comm: md1_raid5 Tainted: G        W  ---------------    2.6.32-431.17.1.el6.x86_64 #1
            Call Trace:
             [<ffffffff81071a97>] ? warn_slowpath_common+0x87/0xc0
             [<ffffffff81071aea>] ? warn_slowpath_null+0x1a/0x20
             [<ffffffffa026c269>] ? handle_stripe+0x2969/0x2980 [raid456]
             [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
             [<ffffffffa026c850>] ? raid5d+0x5d0/0x850 [raid456]
             [<ffffffff81416f65>] ? md_thread+0x115/0x150
             [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
             [<ffffffff81416e50>] ? md_thread+0x0/0x150
             [<ffffffff8109ab56>] ? kthread+0x96/0xa0
             [<ffffffff8100c20a>] ? child_rip+0xa/0x20
             [<ffffffff8109aac0>] ? kthread+0x0/0xa0
             [<ffffffff8100c200>] ? child_rip+0x0/0x20
            ---[ end trace 36abf499708cf849 ]---
            ------------[ cut here ]------------
            
            sarah Sarah Liu added a comment - - edited I did this as Oleg mentioned in the previous comment 1 on node onyx-23 I install 2.6.32-431.17.1.el6 kernel [root@onyx-23 ~]# uname -a Linux onyx-23.onyx.hpdd.intel.com 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 23:32:49 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux 2 did following commands [root@onyx-23 ~]# mdadm -C /dev/md1 --force -l 5 -n 3 /dev/sdf[1-3] [root@onyx-23 ~]# mdadm /dev/md1 -a /dev/sdf5 mdadm: added /dev/sdf5 [root@onyx-23 ~]# echo want_replacement > /sys/block/md1/md/dev-sdf2/state [root@onyx-23 ~]# echo "idle" > /sys/block/md1/md/sync_action Then keep seeing these warnings: ------------[ cut here ]------------ WARNING: at drivers/md/raid5.c:3626 handle_stripe+0x2969/0x2980 [raid456]() (Tainted: G W --------------- ) Hardware name: S2600GZ Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx iTCO_wdt iTCO_vendor_support microcode sg sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 3763, comm: md1_raid5 Tainted: G W --------------- 2.6.32-431.17.1.el6.x86_64 #1 Call Trace: [<ffffffff81071a97>] ? warn_slowpath_common+0x87/0xc0 [<ffffffff81071aea>] ? warn_slowpath_null+0x1a/0x20 [<ffffffffa026c269>] ? handle_stripe+0x2969/0x2980 [raid456] [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70 [<ffffffffa026c850>] ? raid5d+0x5d0/0x850 [raid456] [<ffffffff81416f65>] ? md_thread+0x115/0x150 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [<ffffffff81416e50>] ? md_thread+0x0/0x150 [<ffffffff8109ab56>] ? 
kthread+0x96/0xa0 [<ffffffff8100c20a>] ? child_rip+0xa/0x20 [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 ---[ end trace 36abf499708cf848 ]--- ------------[ cut here ]------------ WARNING: at drivers/md/raid5.c:3626 handle_stripe+0x2969/0x2980 [raid456]() (Tainted: G W --------------- ) Hardware name: S2600GZ Modules linked in: nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx iTCO_wdt iTCO_vendor_support microcode sg sb_edac edac_core i2c_i801 lpc_ich mfd_core ioatdma mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 3763, comm: md1_raid5 Tainted: G W --------------- 2.6.32-431.17.1.el6.x86_64 #1 Call Trace: [<ffffffff81071a97>] ? warn_slowpath_common+0x87/0xc0 [<ffffffff81071aea>] ? warn_slowpath_null+0x1a/0x20 [<ffffffffa026c269>] ? handle_stripe+0x2969/0x2980 [raid456] [<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70 [<ffffffffa026c850>] ? raid5d+0x5d0/0x850 [raid456] [<ffffffff81416f65>] ? md_thread+0x115/0x150 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40 [<ffffffff81416e50>] ? md_thread+0x0/0x150 [<ffffffff8109ab56>] ? kthread+0x96/0xa0 [<ffffffff8100c20a>] ? child_rip+0xa/0x20 [<ffffffff8109aac0>] ? kthread+0x0/0xa0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 ---[ end trace 36abf499708cf849 ]--- ------------[ cut here ]------------
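            For reference, Sarah's four reproduction commands can be collected into a dry-run script. This is a sketch, not part of the ticket: MDADM defaults to `echo mdadm` so the commands are printed rather than executed, the sysfs writes are likewise only printed, and the device names (/dev/sdf1..3, /dev/sdf5) are examples from her test node.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps from this comment thread.
# MDADM defaults to "echo mdadm" so commands are printed, not run; the
# two sysfs writes below are also only printed. Set MDADM=mdadm and
# redirect the sysfs lines yourself to run for real on scratch disks.
MDADM="${MDADM:-echo mdadm}"

reproduce() {
    # create a 3-device raid5 array
    $MDADM -C /dev/md1 --force -l 5 -n 3 /dev/sdf1 /dev/sdf2 /dev/sdf3
    # add a spare device
    $MDADM /dev/md1 -a /dev/sdf5
    # request replacement of a member, then kick the sync thread
    echo "want_replacement > /sys/block/md1/md/dev-sdf2/state"
    echo "idle > /sys/block/md1/md/sync_action"
}

reproduce
```

            On the affected kernels, the real equivalents of these steps produced the repeating raid5.c:3626 handle_stripe warnings shown above.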

            People

              niu Niu Yawei (Inactive)
              minyard Tommy Minyard
              Votes:
              0
              Watchers:
              11