[LU-1115] software raid6 related BUG in fs/bio.c:222 when raid chunk > 64k Created: 17/Feb/12  Updated: 22/Feb/13  Resolved: 22/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: Lustre 2.1.4, Lustre 1.8.9

Type: Bug Priority: Critical
Reporter: Robin Humble Assignee: Yang Sheng
Resolution: Fixed Votes: 0
Labels: None
Environment:

x86_64, centos5/rhel5, server, software raid 8+2 raid6 with 128k chunks


Attachments: File md_raid5_2.6.18-238.12.1.el5_to_2.6.18-274.3.1.el5.diff     File md_raid5_fix_rhel5.7.patch    
Severity: 3
Rank (Obsolete): 6451

 Description   

RedHat have changed drivers/md/raid5.c between kernels 2.6.18-238.12.1.el5 (1.8.6) and 2.6.18-274.3.1.el5 (1.8.7) (see attached diff) and I think those changes might be interacting with the Lustre md raid5/6 patches and causing the kernel to BUG.

the 2.6.18-274.3.1.el5 + lustre 1.8.7 kernel works fine with a md raid6 8+2 setup with 64k raid chunks, but with 128k raid chunks it BUG's pretty much immediately when the first Lustre traffic starts. another site has seen the same problem with 256k raid chunks and the stock 1.8.7 server rpm.

one data point is that if I revert RedHat's raid5.c back to the previous version (eg. from 2.6.18-238.12.1.el5 as used with lustre 1.8.6) then everything seems ok - 128k chunk works, and I'm told 256k does as well. I don't understand enough of the bio and raid5 logic to know why this helps, but maybe it's a hint.

LU-489 looks somewhat similar to this bug, but that's in raid10 code (that Lustre doesn't patch) and also with the 238 kernel, so I don't think it is related to this problem.

a typical BUG looks like:

2012-02-13 16:55:10 ----------- [cut here ] --------- [please bite here ] ---------
2012-02-13 16:55:10 Kernel BUG at fs/bio.c:222
2012-02-13 16:55:10 invalid opcode: 0000 [1] SMP
2012-02-13 16:55:10 last sysfs file: /block/md0/md/stripe_cache_size
2012-02-13 16:55:10 CPU 0
2012-02-13 16:55:10 Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid1(U) raid456(U) xor(U) coretemp(U) mptsas(U) mptscsih(U) mptbase(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) sd_mod(U) sg(U) usb_storage(U) joydev(U) shpchp(U) i7core_edac(U) edac_mc(U) pcspkr(U) mlx4_en(U) scsi_transport_sas(U) i2c_i801(U) i2c_core(U) uhci_hcd(U) qla2xxx(U) ehci_hcd(U) scsi_transport_fc(U) tpm_tis(U) tpm(U) tpm_bios(U) ahci(U) libata(U) scsi_mod(U) rdma_cm(U) ib_addr(U) iw_cm(U) ib_umad(U) ib_uverbs(U) ib_ipoib(U) ipoib_helper(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_cm(U) ib_sa(U) mlx4_ib(U) mlx4_core(U) ib_mad(U) ib_core(U) igb(U) 8021q(U) dca(U)
2012-02-13 16:55:10 Pid: 4532, comm: md0_raid5 Tainted: G ---- 2.6.18-274.3.1.el5-1.8.7-wc1.a #1
2012-02-13 16:55:10 RIP: 0010:[<ffffffff8002dcda>] [<ffffffff8002dcda>] bio_put+0xa/0x31
2012-02-13 16:55:10 RSP: 0018:ffff810306973ca8 EFLAGS: 00010246
2012-02-13 16:55:10 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
2012-02-13 16:55:10 RDX: ffff8103013977c0 RSI: 0000000000000001 RDI: ffff8103013977c0
2012-02-13 16:55:10 RBP: ffff81032e6dc280 R08: 0000000000000000 R09: ffff81067bce7e00
2012-02-13 16:55:10 R10: ffff8103070aa600 R11: 0000000000000080 R12: ffff8103013977c0
2012-02-13 16:55:10 R13: ffff8103070aa600 R14: ffff8102fefd5b40 R15: 0000000000000000
2012-02-13 16:55:10 FS: 0000000000000000(0000) GS:ffffffff803fd000(0000) knlGS:0000000000000000
2012-02-13 16:55:10 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
2012-02-13 16:55:10 CR2: 00002aaaab54d020 CR3: 000000067fc35000 CR4: 00000000000006a0
2012-02-13 16:55:10 Process md0_raid5 (pid: 4532, threadinfo ffff810306972000, task ffff81067ebfd100)
2012-02-13 16:55:10 Stack: ffffffff888e3f10 ffff81067ebfd100 ffff81067ebfd100 ffff81067ebfd100
2012-02-13 16:55:10 ffff81067ebfd138 ffff81067b742900 ffffffff8008da86 000006e4100b0d07
2012-02-13 16:55:10 ffff810001025e20 ffff810306973d20 ffffffff8008daf1 00000001100b0d07
2012-02-13 16:55:10 Call Trace:
2012-02-13 16:55:10 [<ffffffff888e3f10>] :obdfilter:dio_complete_routine+0x238/0x249
2012-02-13 16:55:10 [<ffffffff8008da86>] enqueue_task+0x41/0x56
2012-02-13 16:55:10 [<ffffffff8008daf1>] __activate_task+0x56/0x6d
2012-02-13 16:55:10 [<ffffffff884292f6>] :raid456:handle_stripe+0x103c/0x25c9
2012-02-13 16:55:10 [<ffffffff8002de67>] __wake_up+0x38/0x4f
2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
2012-02-13 16:55:10 [<ffffffff8842a9db>] :raid456:raid5d+0x158/0x18b
2012-02-13 16:55:10 [<ffffffff8003aa36>] prepare_to_wait+0x34/0x61
2012-02-13 16:55:10 [<ffffffff8021f422>] md_thread+0xf8/0x10e
2012-02-13 16:55:10 [<ffffffff800a1fca>] autoremove_wake_function+0x0/0x2e
2012-02-13 16:55:10 [<ffffffff8021f32a>] md_thread+0x0/0x10e
2012-02-13 16:55:10 [<ffffffff80032548>] kthread+0xd4/0x106
2012-02-13 16:55:10 [<ffffffff8005dfb1>] child_rip+0xa/0x11
2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
2012-02-13 16:55:10 [<ffffffff80032474>] kthread+0x0/0x106
2012-02-13 16:55:10 [<ffffffff8005dfa7>] child_rip+0x0/0x11
2012-02-13 16:55:10
2012-02-13 16:55:10
2012-02-13 16:55:10 Code: 0f 0b 68 d1 cd 2b 80 c2 de 00 eb fe f0 ff 4f 50 0f 94 c0 84
2012-02-13 16:55:10 RIP [<ffffffff8002dcda>] bio_put+0xa/0x31
2012-02-13 16:55:10 RSP <ffff810306973ca8>
2012-02-13 16:55:10 <0>Kernel panic - not syncing: Fatal exception



 Comments   
Comment by Robin Humble [ 20/Feb/12 ]

I've bisected the problem to these two patches:

raid5-large-io-rhel5.patch
raid5-maxsectors-rhel5.patch

if I apply all the standard rhel5 server patches except these two then md raid6 works. the second patch above is a refactoring of the first. if I apply just the first patch above then the kernel BUG's as before.

I wrote the first patch, but it was a long time ago now. I can't remember where I got the idea/justification for it. I'll try to figure it out, but would appreciate any help.

these patches allow 1M i/o's from lustre to get through to the raid code without being split up. write performance suffers considerably if they are omitted.

depending on raid chunk size, some % of all software raid users will simply see a crashed kernel with stock 1.8.7-wc1 lustre whereas it all worked fine in 1.8.6-wc1, so I don't understand why more folks haven't reported this problem. perhaps they've just gone back to 1.8.6-wc1...
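
To make the chunk-size dependence concrete, here is a minimal arithmetic sketch for the 8+2 raid6 layout from the Environment field. It assumes 4 KiB pages (x86_64); the function names are made up for illustration and are not from the kernel or Lustre sources.

```python
# Full-stripe data width for an 8+2 raid6 array at various chunk sizes.
PAGE_SIZE = 4096   # assumption: 4 KiB pages, as on x86_64
DATA_DISKS = 8     # 8+2 raid6: 8 data disks plus 2 parity disks


def full_stripe_bytes(chunk_kib, data_disks=DATA_DISKS):
    """Bytes of data (excluding parity) in one full stripe."""
    return chunk_kib * 1024 * data_disks


def full_stripe_pages(chunk_kib, data_disks=DATA_DISKS, page_size=PAGE_SIZE):
    """Number of pages needed to cover one full stripe of data."""
    return full_stripe_bytes(chunk_kib, data_disks) // page_size


for chunk_kib in (64, 128, 256):
    print(chunk_kib, "KiB chunk:",
          full_stripe_bytes(chunk_kib) // 1024, "KiB of data per stripe,",
          full_stripe_pages(chunk_kib), "pages")
```

With 64k chunks a full stripe holds 512 KiB of data (128 pages). With 128k chunks it holds exactly 1 MiB (256 pages), the same size as the 1M Lustre i/o's mentioned above, and with 256k chunks it holds 2 MiB (512 pages).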

Comment by Robin Humble [ 01/Mar/12 ]

after looking at this some more, I think RedHat just made a mistake.

the diff that RedHat cherry picked from mainline for RHEL5.7 is basically this commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=960e739d9e9f1c2346d8bdc65299ee2e1ed42218

and the very next commit is:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b99c2ffa980528a197f26c7d876cceeccce8dd5

"block: make bi_phys_segments an unsigned int instead of short
raid5 can overflow with more than 255 stripes, ..."

which reverts the behaviour so that bio->bi_phys_segments has a usable 4 bytes again.

so I think RedHat's patch to raid5.c
a) breaks any bio with bi_phys_segments > 255, which stops all large i/o to md raid5/6 arrays whose stripe is >= 1M (256 bios) in size
b) changes no other behaviour in raid5.c, so does nothing to fix any bugs
c) omits the 2nd patch in the series which fixes a regression

so IMHO it is safe to revert all or part of the RedHat patch in order to let bio->bi_phys_segments use all 4 bytes again. nothing in raid5.c uses the *_bi_hw_segments functions, or the high order bytes that are squirreled away in bi_phys_segments.

md_raid5_fix_rhel5.7.patch is an attempt to revert part of RedHat's patch so that > 255 bio's are available again, or the whole thing can be reverted as per md_raid5_2.6.18-238.12.1.el5_to_2.6.18-274.3.1.el5.diff
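
The "> 255" limit can be illustrated with a small sketch of a reference counter of limited width. This is only a model: the 8-bit width is an assumption taken from the "255" figure quoted above, the helper name is hypothetical, and the ticket does not show the exact bit layout used in the RedHat patch or how the wrapped value leads to the BUG in bio_put.

```python
def count_refs(n_refs, bits):
    """Count n_refs increments in a counter that is only `bits` wide.

    Hypothetical model: the counter silently wraps on overflow,
    as a masked bit-field would.
    """
    mask = (1 << bits) - 1
    count = 0
    for _ in range(n_refs):
        count = (count + 1) & mask
    return count


# 255 references still fit in an 8-bit counter.
print(count_refs(255, bits=8))
# The 256th reference wraps the 8-bit counter back to zero.
print(count_refs(256, bits=8))
# A 32-bit counter holds 256 references without trouble.
print(count_refs(256, bits=32))
```

Under these assumptions, a 1M i/o made of 256 4k pieces is the smallest request that wraps the narrow counter, which is consistent with 64k chunks working and 128k chunks failing on an 8+2 array.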

Comment by Peter Jones [ 05/Apr/12 ]

Yangsheng

Could you please check whether this problem still exists in the latest kernel update?

Thanks

Peter

Comment by Yang Sheng [ 06/Apr/12 ]

It looks like this issue still exists in the latest rhel5.8 kernel. As Robin pointed out, we can carry http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5b99c2ffa980528a197f26c7d876cceeccce8dd5 in our patch series as a solution. We can then simply remove it once RedHat also includes this change.

Comment by Yang Sheng [ 02/May/12 ]

Patch committed to: http://review.whamcloud.com/#change,2625

Comment by Yang Sheng [ 23/Aug/12 ]

Patch landed. Closing bug.

Comment by Emoly Liu [ 13/Nov/12 ]

Port for b2_1 is here: http://review.whamcloud.com/#change,4526

Comment by Emoly Liu [ 21/Nov/12 ]

Port for b2_1 has been successfully cherry-picked as 96af312f068b642417cf1bba079822f4abb5723d.

Generated at Sat Feb 10 01:13:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.