LU-1115

software raid6 related BUG in fs/bio.c:222 when raid chunk > 64k

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.1.4, Lustre 1.8.9
    • Affects Version/s: Lustre 1.8.7
    • Labels: None
    • Environment: x86_64, centos5/rhel5, server, software raid 8+2 raid6 with 128k chunks
    • Severity: 3
    • Rank: 6451

    Description

      Red Hat changed drivers/md/raid5.c between kernels 2.6.18-238.12.1.el5 (used with Lustre 1.8.6) and 2.6.18-274.3.1.el5 (used with Lustre 1.8.7); see the attached diff. I think those changes might be interacting with the Lustre md raid5/6 patches and causing the kernel to BUG.

      The 2.6.18-274.3.1.el5 + Lustre 1.8.7 kernel works fine with an md raid6 8+2 setup using 64k raid chunks, but with 128k raid chunks it BUGs almost immediately when the first Lustre traffic starts. Another site has seen the same problem with 256k raid chunks and the stock 1.8.7 server rpm.
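
For reference, the stripe geometry of the three chunk sizes mentioned above can be worked out with a few lines of Python. This is only arithmetic on the figures in the report (8 data disks, 64k/128k/256k chunks); the 4 KiB page size is an assumption (the usual x86_64 value, not stated in the report), and the helper names are made up for illustration. It does not explain the BUG, it just shows how the failing setups differ from the working one.

```python
KIB = 1024
DATA_DISKS = 8            # "8+2" raid6: 8 data chunks + 2 parity chunks per stripe
ASSUMED_PAGE_KIB = 4      # assumption: 4 KiB pages (typical for x86_64)


def full_stripe_kib(data_disks, chunk_kib):
    """KiB of file data in one full stripe (parity chunks excluded)."""
    return data_disks * chunk_kib


def pages_per_chunk(chunk_kib, page_kib=ASSUMED_PAGE_KIB):
    """Number of pages that make up one chunk on a single member disk."""
    return chunk_kib // page_kib


for chunk_kib in (64, 128, 256):
    print("chunk %3dk: full stripe = %4d KiB, pages per chunk = %d"
          % (chunk_kib,
             full_stripe_kib(DATA_DISKS, chunk_kib),
             pages_per_chunk(chunk_kib)))
# chunk  64k: full stripe =  512 KiB, pages per chunk = 16
# chunk 128k: full stripe = 1024 KiB, pages per chunk = 32
# chunk 256k: full stripe = 2048 KiB, pages per chunk = 64
```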

      One data point: if I revert Red Hat's raid5.c to the previous version (e.g. from 2.6.18-238.12.1.el5, as used with Lustre 1.8.6), everything seems OK. 128k chunks work, and I'm told 256k does as well. I don't understand enough of the bio and raid5 logic to know why this helps, but maybe it's a hint.

      LU-489 looks somewhat similar to this bug, but that is in the raid10 code (which Lustre doesn't patch) and also occurs with the 238 kernel, so I don't think it is related to this problem.

      a typical BUG looks like:

      2012-02-13 16:55:10 ----------- [cut here ] --------- [please bite here ] ---------
      2012-02-13 16:55:10 Kernel BUG at fs/bio.c:222
      2012-02-13 16:55:10 invalid opcode: 0000 [1] SMP
      2012-02-13 16:55:10 last sysfs file: /block/md0/md/stripe_cache_size
      2012-02-13 16:55:10 CPU 0
      2012-02-13 16:55:10 Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid1(U) raid456(U) xor(U) coretemp(U) mptsas(U) mptscsih(U) mptbase(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) sd_mod(U) sg(U) usb_storage(U) joydev(U) shpchp(U) i7core_edac(U) edac_mc(U) pcspkr(U) mlx4_en(U) scsi_transport_sas(U) i2c_i801(U) i2c_core(U) uhci_hcd(U) qla2xxx(U) ehci_hcd(U) scsi_transport_fc(U) tpm_tis(U) tpm(U) tpm_bios(U) ahci(U) libata(U) scsi_mod(U) rdma_cm(U) ib_addr(U) iw_cm(U) ib_umad(U) ib_uverbs(U) ib_ipoib(U) ipoib_helper(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_cm(U) ib_sa(U) mlx4_ib(U) mlx4_core(U) ib_mad(U) ib_core(U) igb(U) 8021q(U) dca(U)
      2012-02-13 16:55:10 Pid: 4532, comm: md0_raid5 Tainted: G ---- 2.6.18-274.3.1.el5-1.8.7-wc1.a #1
      2012-02-13 16:55:10 RIP: 0010:[<ffffffff8002dcda>] [<ffffffff8002dcda>] bio_put+0xa/0x31
      2012-02-13 16:55:10 RSP: 0018:ffff810306973ca8 EFLAGS: 00010246
      2012-02-13 16:55:10 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
      2012-02-13 16:55:10 RDX: ffff8103013977c0 RSI: 0000000000000001 RDI: ffff8103013977c0
      2012-02-13 16:55:10 RBP: ffff81032e6dc280 R08: 0000000000000000 R09: ffff81067bce7e00
      2012-02-13 16:55:10 R10: ffff8103070aa600 R11: 0000000000000080 R12: ffff8103013977c0
      2012-02-13 16:55:10 R13: ffff8103070aa600 R14: ffff8102fefd5b40 R15: 0000000000000000
      2012-02-13 16:55:10 FS: 0000000000000000(0000) GS:ffffffff803fd000(0000) knlGS:0000000000000000
      2012-02-13 16:55:10 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      2012-02-13 16:55:10 CR2: 00002aaaab54d020 CR3: 000000067fc35000 CR4: 00000000000006a0
      2012-02-13 16:55:10 Process md0_raid5 (pid: 4532, threadinfo ffff810306972000, task ffff81067ebfd100)
      2012-02-13 16:55:10 Stack: ffffffff888e3f10 ffff81067ebfd100 ffff81067ebfd100 ffff81067ebfd100
      2012-02-13 16:55:10 ffff81067ebfd138 ffff81067b742900 ffffffff8008da86 000006e4100b0d07
      2012-02-13 16:55:10 ffff810001025e20 ffff810306973d20 ffffffff8008daf1 00000001100b0d07
      2012-02-13 16:55:10 Call Trace:
      2012-02-13 16:55:10 [<ffffffff888e3f10>] :obdfilter:dio_complete_routine+0x238/0x249
      2012-02-13 16:55:10 [<ffffffff8008da86>] enqueue_task+0x41/0x56
      2012-02-13 16:55:10 [<ffffffff8008daf1>] __activate_task+0x56/0x6d
      2012-02-13 16:55:10 [<ffffffff884292f6>] :raid456:handle_stripe+0x103c/0x25c9
      2012-02-13 16:55:10 [<ffffffff8002de67>] __wake_up+0x38/0x4f
      2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
      2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
      2012-02-13 16:55:10 [<ffffffff8842a9db>] :raid456:raid5d+0x158/0x18b
      2012-02-13 16:55:10 [<ffffffff8003aa36>] prepare_to_wait+0x34/0x61
      2012-02-13 16:55:10 [<ffffffff8021f422>] md_thread+0xf8/0x10e
      2012-02-13 16:55:10 [<ffffffff800a1fca>] autoremove_wake_function+0x0/0x2e
      2012-02-13 16:55:10 [<ffffffff8021f32a>] md_thread+0x0/0x10e
      2012-02-13 16:55:10 [<ffffffff80032548>] kthread+0xd4/0x106
      2012-02-13 16:55:10 [<ffffffff8005dfb1>] child_rip+0xa/0x11
      2012-02-13 16:55:10 [<ffffffff800a1dde>] keventd_create_kthread+0x0/0x98
      2012-02-13 16:55:10 [<ffffffff80032474>] kthread+0x0/0x106
      2012-02-13 16:55:10 [<ffffffff8005dfa7>] child_rip+0x0/0x11
      2012-02-13 16:55:10
      2012-02-13 16:55:10
      2012-02-13 16:55:10 Code: 0f 0b 68 d1 cd 2b 80 c2 de 00 eb fe f0 ff 4f 50 0f 94 c0 84
      2012-02-13 16:55:10 RIP [<ffffffff8002dcda>] bio_put+0xa/0x31
      2012-02-13 16:55:10 RSP <ffff810306973ca8>
      2012-02-13 16:55:10 <0>Kernel panic - not syncing: Fatal exception
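
As a cross-check on the trace above, the `Code:` bytes can be decoded by hand. The sketch below assumes the x86_64 BUG() encoding used by 2.6.18-era kernels (`ud2`, then `pushq $file`, then `ret $line`); that layout is an assumption about this particular kernel build, and `decode_bug_trap` is a hypothetical helper written for this note, not part of any tool. If the assumption holds, the embedded line number should match the "fs/bio.c:222" in the BUG message.

```python
# Bytes copied from the "Code:" line of the oops above.
CODE_LINE = "Code: 0f 0b 68 d1 cd 2b 80 c2 de 00 eb fe f0 ff 4f 50 0f 94 c0 84"


def decode_bug_trap(code_line):
    """Return (file_string_address, source_line) if the bytes begin with
    ud2 / push imm32 / ret imm16, otherwise None."""
    hex_part = code_line.split("Code:", 1)[1]
    # Some kernels mark the faulting byte with <..>; drop the markers if present.
    hex_part = hex_part.replace("<", "").replace(">", "").replace(" ", "")
    raw = bytes.fromhex(hex_part)

    if len(raw) < 10:
        return None
    if raw[0:2] != b"\x0f\x0b":      # ud2
        return None
    if raw[2] != 0x68 or raw[7] != 0xC2:   # push imm32 ... ret imm16
        return None

    # The push immediate is a sign-extended 32-bit pointer to the file name string.
    file_addr = int.from_bytes(raw[3:7], "little", signed=True) & 0xFFFFFFFFFFFFFFFF
    line = int.from_bytes(raw[8:10], "little")
    return file_addr, line


addr, line = decode_bug_trap(CODE_LINE)
print("file string at %#x, line %d" % (addr, line))
# file string at 0xffffffff802bcdd1, line 222
```

The trap sits at bio_put+0xa, reached from obdfilter's dio_complete_routine via raid456's handle_stripe. The bytes that follow the trap (`f0 ff 4f 50`) look like a locked decrement of a field in the bio, which would be consistent with a reference-count check at the top of bio_put firing on a bio whose count had already dropped to zero. That reading is an inference from the bytes, not something the report states.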

            People

              Assignee: Yang Sheng
              Reporter: Robin Humble (Inactive)