
software raid6 related BUG in fs/bio.c:222 when raid chunk > 64k

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.1.4, Lustre 1.8.9
    • Affects Version/s: Lustre 1.8.7
    • Labels: None
    • Environment: x86_64, centos5/rhel5, server, software raid 8+2 raid6 with 128k chunks
    • Severity: 3
    • 6451

    Description

      RedHat changed drivers/md/raid5.c between kernels 2.6.18-238.12.1.el5 (used by Lustre 1.8.6) and 2.6.18-274.3.1.el5 (used by Lustre 1.8.7) (see attached diff), and I think those changes might be interacting with the Lustre md raid5/6 patches and causing the kernel to BUG.

      The 2.6.18-274.3.1.el5 + lustre 1.8.7 kernel works fine with an md raid6 8+2 setup with 64k raid chunks, but with 128k raid chunks it BUGs almost immediately when the first Lustre traffic starts. Another site has seen the same problem with 256k raid chunks and the stock 1.8.7 server rpm.

      One data point: if I revert RedHat's raid5.c back to the previous version (e.g. from 2.6.18-238.12.1.el5 as used with lustre 1.8.6) then everything seems ok - 128k chunks work, and I'm told 256k does as well. I don't understand enough of the bio and raid5 logic to know why this helps, but maybe it's a hint.

      LU-489 looks somewhat similar to this bug, but that one is in the raid10 code (which Lustre doesn't patch) and also occurs with the 238 kernel, so I don't think it is related to this problem.
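
      For context, some back-of-the-envelope arithmetic (consistent with the analysis further down in the comments) on why 64k chunks survive but 128k chunks do not, assuming the RHEL5.7 raid5.c change leaves only 8 usable bits (max 255) in the bio segment counter:

      8+2 raid6, 64k chunks:  full data stripe = 8 x 64k  = 512k -> 512k / 4k pages = 128 counts (fits under 255)
      8+2 raid6, 128k chunks: full data stripe = 8 x 128k = 1M   -> 1M / 4k pages   = 256 counts (overflows 255)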

      A typical BUG looks like:

      2012-02-13 16:55:10 ----------- [cut here ] --------- [please bite here ] ---------
      2012-02-13 16:55:10 Kernel BUG at fs/bio.c:222
      2012-02-13 16:55:10 invalid opcode: 0000 [1] SMP
      2012-02-13 16:55:10 last sysfs file: /block/md0/md/stripe_cache_size
      2012-02-13 16:55:10 CPU 0
      2012-02-13 16:55:10 Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid1(U) raid456(U) xor(U) coretemp(U) mptsas(U) mptscsih(U) mptbase(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) sd_mod(U) sg(U) usb_storage(U) joydev(U) shpchp(U) i7core_edac(U) edac_mc(U) pcspkr(U) mlx4_en(U) scsi_transport_sas(U) i2c_i801(U) i2c_core(U) uhci_hcd(U) qla2xxx(U) ehci_hcd(U) scsi_transport_fc(U) tpm_tis(U) tpm(U) tpm_bios(U) ahci(U) libata(U) scsi_mod(U) rdma_cm(U) ib_addr(U) iw_cm(U) ib_umad(U) ib_uverbs(U) ib_ipoib(U) ipoib_helper(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_cm(U) ib_sa(U) mlx4_ib(U) mlx4_core(U) ib_mad(U) ib_core(U) igb(U) 8021q(U) dca(U)
      2012-02-13 16:55:10 Pid: 4532, comm: md0_raid5 Tainted: G ---- 2.6.18-274.3.1.el5-1.8.7-wc1.a #1
      2012-02-13 16:55:10 RIP: 0010:[<ffffffff8002dcda>] [<ffffffff8002dcda>] bio_put+0xa/0x31
      2012-02-13 16:55:10 RSP: 0018:ffff810306973ca8 EFLAGS: 00010246
      2012-02-13 16:55:10 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
      2012-02-13 16:55:10 RDX: ffff8103013977c0 RSI: 0000000000000001 RDI: ffff8103013977c0
      2012-02-13 16:55:10 RBP: ffff81032e6dc280 R08: 0000000000000000 R09: ffff81067bce7e00
      2012-02-13 16:55:10 R10: ffff8103070aa600 R11: 0000000000000080 R12: ffff8103013977c0
      2012-02-13 16:55:10 R13: ffff8103070aa600 R14: ffff8102fefd5b40 R15: 0000000000000000
      2012-02-13 16:55:10 FS: 0000000000000000(0000) GS:ffffffff803fd000(0000) knlGS:0000000000000000
      2012-02-13 16:55:10 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      2012-02-13 16:55:10 CR2: 00002aaaab54d020 CR3: 000000067fc35000 CR4: 00000000000006a0
      2012-02-13 16:55:10 Process md0_raid5 (pid: 4532, threadinfo ffff810306972000, task ffff81067ebfd100)
      2012-02-13 16:55:10 Stack: ffffffff888e3f10 ffff81067ebfd100 ffff81067ebfd100 ffff81067ebfd100
      2012-02-13 16:55:10 ffff81067ebfd138 ffff81067b742900 ffffffff8008da86 000006e4100b0d07
      2012-02-13 16:55:10 ffff810001025e20 ffff810306973d20 ffffffff8008daf1 00000001100b0d07
      2012-02-13 16:55:10 Call Trace:
      2012-02-13 16:55:10 [<ffffffff888e3f10>] :obdfilter:dio_complete_routine+0x238/0x249
      2012-02-13 16:55:10 [<ffffffff8008da86>] enqueue_task+0x41/0x56
      2012-02-13 16:55:10 [<ffffffff8008daf1>] __activate_task+0x56/0x6d
      2012-02-13 16:55:10 [<ffffffff884292f6>] :raid456:handle_stripe+0x103c/0x25c9
      2012-02-13 16:55:10 [<ffffffff8002de67>] __wake_up+0x38/0x4f
      2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
      2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
      2012-02-13 16:55:10 [<ffffffff8842a9db>] :raid456:raid5d+0x158/0x18b
      2012-02-13 16:55:10 [<ffffffff8003aa36>] prepare_to_wait+0x34/0x61
      2012-02-13 16:55:10 [<ffffffff8021f422>] md_thread+0xf8/0x10e
      2012-02-13 16:55:10 [<ffffffff800a1fca>] autoremove_wake_function+0x0/0x2e
      2012-02-13 16:55:10 [<ffffffff8021f32a>] md_thread+0x0/0x10e
      2012-02-13 16:55:10 [<ffffffff80032548>] kthread+0xd4/0x106
      2012-02-13 16:55:10 [<ffffffff8005dfb1>] child_rip+0xa/0x11
      2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
      2012-02-13 16:55:10 [<ffffffff80032474>] kthread+0x0/0x106
      2012-02-13 16:55:10 [<ffffffff8005dfa7>] child_rip+0x0/0x11
      2012-02-13 16:55:10
      2012-02-13 16:55:10
      2012-02-13 16:55:10 Code: 0f 0b 68 d1 cd 2b 80 c2 de 00 eb fe f0 ff 4f 50 0f 94 c0 84
      2012-02-13 16:55:10 RIP [<ffffffff8002dcda>] bio_put+0xa/0x31
      2012-02-13 16:55:10 RSP <ffff810306973ca8>
      2012-02-13 16:55:10 <0>Kernel panic - not syncing: Fatal exception

      Attachments

        md_raid5_fix_rhel5.7.patch
        md_raid5_2.6.18-238.12.1.el5_to_2.6.18-274.3.1.el5.diff


          Activity

            emoly.liu Emoly Liu added a comment -

            Port for b2_1 has been successfully cherry-picked as 96af312f068b642417cf1bba079822f4abb5723d.

            emoly.liu Emoly Liu added a comment -

            Port for b2_1 is here: http://review.whamcloud.com/#change,4526
            ys Yang Sheng added a comment -

            Patch landed. Closing bug.

            ys Yang Sheng added a comment -

            Patch committed to: http://review.whamcloud.com/#change,2625
            ys Yang Sheng added a comment -

            Looks like this issue still exists in the latest rhel5.8 kernel. As Robin pointed out, we may carry http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b99c2ffa980528a197f26c7d876cceeccce8dd5 in our series as a solution. Then we can simply remove it once RedHat also includes this change.
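
            For reference, the heart of that mainline commit is widening the field in include/linux/bio.h (a sketch of the commit's gist, not the RHEL5 source):

            -       unsigned short          bi_phys_segments;
            +       unsigned int            bi_phys_segments;

            with matching raid5.c accessor updates so each of the two counters packed into that field gets 16 bits instead of 8.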

            pjones Peter Jones added a comment -

            Yangsheng

            Could you please check whether this problem still exists in the latest kernel update?

            Thanks

            Peter


            rjh Robin Humble (Inactive) added a comment -

            After looking at this some more, I think RedHat just made a mistake.

            The diff that RedHat cherry-picked from mainline for RHEL5.7 is basically this commit:
            http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=960e739d9e9f1c2346d8bdc65299ee2e1ed42218

            and the very next commit is:
            http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b99c2ffa980528a197f26c7d876cceeccce8dd5

            ""block: make bi_phys_segments an unsigned int instead of short
            raid5 can overflow with more than 255 stripes, ... ""

            which reverts the behaviour so that bio->bi_phys_segments has a usable 4 bytes again.
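
            For reference, after that second commit mainline raid5.c packs the two counters into the now-32-bit field with helpers along these lines (paraphrased from the upstream commit; the RHEL5.7 backport instead splits a 16-bit short, leaving only 8 bits / max 255 per counter):

            /* low 16 bits: active stripe count; high 16 bits: processed count */
            static inline int raid5_bi_phys_segments(struct bio *bio)
            {
                    return bio->bi_phys_segments & 0xffff;
            }

            static inline int raid5_bi_hw_segments(struct bio *bio)
            {
                    return (bio->bi_phys_segments >> 16) & 0xffff;
            }

            static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
            {
                    bio->bi_phys_segments = raid5_bi_phys_segments(bio) | (cnt << 16);
            }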

            So I think RedHat's patch to raid5.c:
            a) breaks any bio with bi_phys_segments > 255, which stops all large i/o to md raid5/6 arrays that have a stripe >= 1M (256 bios) in size
            b) changes no other behaviour in raid5.c, so does nothing to fix any bugs
            c) omits the 2nd patch in the series, which fixes a regression

            So IMHO it is safe to revert all or part of the RedHat patch in order to let bio->bi_phys_segments use all 4 bytes again. Nothing in raid5.c uses the *_bi_hw_segments functions, or the high-order bytes that are squirreled away in bi_phys_segments.

            md_raid5_fix_rhel5.7.patch is an attempt to revert part of RedHat's patch so that > 255 bios are available again; alternatively the whole thing can be reverted as per md_raid5_2.6.18-238.12.1.el5_to_2.6.18-274.3.1.el5.diff.


            rjh Robin Humble (Inactive) added a comment -

            I've bisected the problem to these two patches:

            raid5-large-io-rhel5.patch
            raid5-maxsectors-rhel5.patch

            If I apply all the standard rhel5 server patches except these two then md raid6 works. The second patch above is a refactoring of the first. If I apply just the first patch then the kernel BUGs as before.

            I wrote the first patch, but it was a long time ago now. I can't remember where I got the idea/justification for it. I'll try to figure it out, but would appreciate any help.

            These patches allow 1M i/os from lustre to get through to the raid code without being split up. Write performance suffers considerably if they are omitted; a sketch of what they do follows.
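
            The gist of the large-io patch is raising the md queue's request size limit to a full data stripe. A minimal sketch, assuming the 2.6.18-era block API (blk_queue_max_sectors), not the literal patch text:

            /* sketch: let a full data stripe (e.g. 8 x 128k = 1M) through as one bio */
            int chunk_sectors = mddev->chunk_size >> 9;                /* 128k -> 256 sectors */
            int data_disks = conf->raid_disks - conf->max_degraded;    /* 10 - 2 = 8 for raid6 8+2 */
            blk_queue_max_sectors(mddev->queue, chunk_sectors * data_disks); /* 2048 sectors = 1M */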

            Depending on raid chunk size, some percentage of all software raid users will simply see a crashed kernel with stock 1.8.7-wc1 lustre where everything worked fine in 1.8.6-wc1, so I don't understand why more folks haven't reported this problem. Perhaps they've just gone back to 1.8.6-wc1...


            People

              Assignee: ys Yang Sheng
              Reporter: rjh Robin Humble (Inactive)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: