
[LU-489] Hyperion-mds1 - swraid crash in mkfs.lustre

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 1.8.6
    • Component/s: None
    • Severity: 4
    • Rank (Obsolete): 10607

    Description

      Ran command:
      #mkfs.lustre --reformat --mgs --mdt --fsname lustre /dev/md0

      Result:
      ---------------

      2011-07-05 16:25:34 hyperion-mds1 login: ----------- [cut here ] --------- [please bite here ] ---------
      2011-07-05 16:26:50 Kernel BUG at fs/bio.c:222
      2011-07-05 16:26:50 invalid opcode: 0000 [1] SMP
      2011-07-05 16:26:50 last sysfs file: /devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:03.1/irq
      2011-07-05 16:26:50 CPU 11
      2011-07-05 16:26:50 Modules linked in: ext4(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) ib_srp(U) ib_ipoib(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) raid10(U) video(U) backlight(U) sbs(U) power_meter(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sd_mod(U) sg(U) floppy(U) mptsas(U) mptscsih(U) sata_nv(U) mptbase(U) i2c_nforce2(U) pcspkr(U) libata(U) ohci_hcd(U) i2c_core(U) k10temp(U) scsi_transport_sas(U) amd64_edac_mod(U) hwmon(U) tpm_tis(U) scsi_mod(U) edac_mc(U) shpchp(U) tpm(U) tpm_bios(U) serio_raw(U) ide_cd(U) cdrom(U) nfs(U) nfs_acl(U) lockd(U) fscache(U) sunrpc(U) e1000(U)
      2011-07-05 16:26:50 Pid: 36, comm: ksoftirqd/11 Tainted: G 2.6.18-238.12.1.el5_lustre.g266a955 #1
      2011-07-05 16:26:50 RIP: 0010:[<ffffffff8002e266>] [<ffffffff8002e266>] bio_put+0xa/0x32
      2011-07-05 16:26:50 RSP: 0000:ffff810138df3db0 EFLAGS: 00010246
      2011-07-05 16:26:50 RAX: 0000000000000000 RBX: ffff811032cb7d80 RCX: ffff813770b8bec0
      2011-07-05 16:26:50 RDX: ffff8124f07f7d40 RSI: ffff810f6dab3b40 RDI: ffff8124f07f7d40
      2011-07-05 16:26:50 RBP: ffff813770b8bf18 R08: 0000000000001000 R09: ffff811032cb7e10
      2011-07-05 16:26:50 R10: ffff81206d101d88 R11: ffffffff800452a8 R12: ffff813770b8bec0
      2011-07-05 16:26:50 R13: 0000000000000001 R14: 0000000000001000 R15: 00000000000c3000
      2011-07-05 16:26:50 FS: 00002aaaabc8fb50(0000) GS:ffff8120381ab2c0(0000) knlGS:0000000000000000
      2011-07-05 16:26:50 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      2011-07-05 16:26:50 CR2: 00002aaaaace2b80 CR3: 0000002036af0000 CR4: 00000000000006e0
      2011-07-05 16:26:50 Process ksoftirqd/11 (pid: 36, threadinfo ffff8120381c6000, task ffff81103819c040)
      2011-07-05 16:26:50 Stack: ffffffff88383abc 0000000000000002 ffff813770b8bec0 0000000000000002
      2011-07-05 16:26:50 ffff811032cb7d80 0000000013730d08 ffffffff883857bb ffff8125a30636c0
      2011-07-05 16:26:50 ffff811037f2edc0 ffff8124507f7d40 ffff8110334066b0 ffff8130326104e8
      2011-07-05 16:26:50 Call Trace:
      2011-07-05 16:26:50 <IRQ> [<ffffffff88383abc>] :raid10:raid_end_bio_io+0x59/0x80
      2011-07-05 16:26:50 [<ffffffff883857bb>] :raid10:raid10_end_write_request+0xe6/0x126
      2011-07-05 16:26:50 [<ffffffff8002cecb>] __end_that_request_first+0x23c/0x5bf
      2011-07-05 16:26:50 [<ffffffff8005c444>] blk_run_queue+0x41/0x72
      2011-07-05 16:26:50 [<ffffffff881491f2>] :scsi_mod:scsi_end_request+0x27/0xcd
      2011-07-05 16:26:50 [<ffffffff881493e6>] :scsi_mod:scsi_io_completion+0x14e/0x324
      2011-07-05 16:26:50 [<ffffffff882b10f0>] :sd_mod:sd_rw_intr+0x25a/0x294
      2011-07-05 16:26:50 [<ffffffff8814967b>] :scsi_mod:scsi_device_unbusy+0x67/0x81
      2011-07-05 16:26:50 [<ffffffff80037fa0>] blk_done_softirq+0x5f/0x6d
      2011-07-05 16:26:50 [<ffffffff80012515>] __do_softirq+0x89/0x133
      2011-07-05 16:26:50 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
      2011-07-05 16:26:50 <EOI> [<ffffffff800963b1>] ksoftirqd+0x0/0xbf
      2011-07-05 16:26:50 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
      2011-07-05 16:26:50 [<ffffffff80096410>] ksoftirqd+0x5f/0xbf
      2011-07-05 16:26:50 [<ffffffff80032b1e>] kthread+0xfe/0x132
      2011-07-05 16:26:50 [<ffffffff8005dfb1>] child_rip+0xa/0x11
      2011-07-05 16:26:50 [<ffffffff80032a20>] kthread+0x0/0x132
      2011-07-05 16:26:50 [<ffffffff8005dfa7>] child_rip+0x0/0x11
      2011-07-05 16:26:50
      2011-07-05 16:26:50
      2011-07-05 16:26:50 Code: 0f 0b 68 24 a9 2b 80 c2 de 00 f0 ff 4a 50 0f 94 c0 84 c0 74
      2011-07-05 16:26:50 RIP [<ffffffff8002e266>] bio_put+0xa/0x32
      2011-07-05 16:26:50 RSP <ffff810138df3db0>
      2011-07-05 16:26:50 REWRITING MCP55 CFG REG
      2011-07-05 16:26:50 CFG = c1
      2011-07-05 16:26:50 Linux version 2.6.18-238.12.1.el5_lustre.g266a955 (jenkins@rhel5-64-build.lab.whamcloud.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Fri Jun 10 16:39:27 PDT 2011

      Has occurred 6 times now; easy to reproduce.
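
      A note on the BUG location: in kernels of this vintage, fs/bio.c:222 falls inside bio_put(), whose first statement is a reference-count sanity check, and the faulting RIP of bio_put+0xa is consistent with that check firing. Paraphrased from the stock 2.6.18 fs/bio.c (line numbers in the patched RHEL5 tree may differ slightly):

        void bio_put(struct bio *bio)
        {
                /* fires when the refcount is already zero, i.e. a double put */
                BIO_BUG_ON(!atomic_read(&bio->bi_cnt));

                /* last put frees it */
                if (atomic_dec_and_test(&bio->bi_cnt)) {
                        bio->bi_next = NULL;
                        bio->bi_destructor(bio);
                }
        }

      Given the raid10 frames directly above the fault, a bio completed (and put) twice on the raid10 write-completion path would match this signature.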

    Activity

      cliffw Cliff White (Inactive) added a comment -
      Might as well close, we haven't hit it again.

      ys Yang Sheng added a comment (edited) -
      Can we close this one? It looks like it only hits on the RHEL5 kernel.

      johann Johann Lombardi (Inactive) added a comment -
      Have you tried with a simple dd? In any case, mkfs.lustre does not require loading the kernel modules, so you should be able to run it on an unpatched kernel.
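
      For reference, a dd check of the following shape (a sketch - the sizes here are assumptions, and /dev/md0 must hold nothing worth keeping) would drive the same raid10 write path with no Lustre piece involved, which is roughly what mkfs amounts to at the block layer: a burst of large sequential writes to the array:

        # 4 GiB of zeros straight to the md device, bypassing the page cache
        dd if=/dev/zero of=/dev/md0 bs=1M count=4096 oflag=direct
        # the same again through the page cache, so the writeback path is covered too
        dd if=/dev/zero of=/dev/md0 bs=1M count=4096 conv=fsync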

      cliffw Cliff White (Inactive) added a comment -
      I don't know what I would test with a stock kernel - the issue is a failure triggered by running mkfs.lustre, and I cannot do this with a stock kernel. mkfs -t ext3 and mkfs -t ext4 have been tested on all these kernels and do not fail. Please explain what tests you wish run with a stock kernel, and I'll see about finding the bits.

      ys Yang Sheng added a comment -
      I can't be sure whether our patches cause this kind of issue, but I think we can test without our RAID patches to confirm they aren't crashing the kernel.
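
      To see which patches those are, the kernel patch series shipped in the Lustre source tree can be searched directly (a sketch - the exact series file name for this kernel may differ in the 1.8 tree):

        # list the RAID-related patches in the RHEL5 kernel patch series
        grep -i raid lustre/kernel_patches/series/2.6-rhel5.series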

      johann Johann Lombardi (Inactive) added a comment -
      All those kernels should be the same. The version string changed just because I enabled/disabled slab debugging.
      Yang Sheng, do we patch any common code that could be used by RAID10 too?
      Cliff, any chance to try with a stock Red Hat kernel?

      pjones Peter Jones added a comment -
      Yang Sheng,
      Do you see anything in the raid patches in our patch series for the latest RHEL kernel that might explain this?
      Thanks,
      Peter

      cliffw Cliff White (Inactive) added a comment -
      Sorry, pasted the wrong version - the non-crashing kernel is vmlinuz-2.6.18-238.12.1.el5_lustre.g529529a.

      cliffw Cliff White (Inactive) added a comment -
      I built a new image based on chaos 4.4-2 and installed the same RPMs; it had the same crash. I repeated the test with the image from last week, with kernel vmlinuz-2.6.18-238.12.1.el5_lustre.g266a955, and the crash did not repeat.

      cliffw Cliff White (Inactive) added a comment -
      Also worth noting - the MDS is the only node using the mptbase and mptsas drivers. The OSSs are on hardware RAID (DDN) and do not have those cards.

    People

      Assignee: ys Yang Sheng
      Reporter: cliffw Cliff White (Inactive)
      Votes: 0
      Watchers: 6
