[LU-489] Hyperion-mds1 - swraid crash in mkfs.lustre Created: 05/Jul/11  Updated: 01/Jul/15  Resolved: 01/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: Yang Sheng
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Hyperion chaos distribute Linux version 2.6.18-238.12.1.el5_lustre.g266a955 (jenkins@rhel5-64-build.lab.whamcloud.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Fri Jun 10 16:39:27 PDT 2011


Severity: 4
Rank (Obsolete): 10607

 Description   

Ran command:
#mkfs.lustre --reformat --mgs --mdt --fsname lustre /dev/md0

Result:
---------------

2011-07-05 16:25:34 hyperion-mds1 login: ----------- [cut here ] --------- [please bite here ] ---------
2011-07-05 16:26:50 Kernel BUG at fs/bio.c:222
2011-07-05 16:26:50 invalid opcode: 0000 [1] SMP
2011-07-05 16:26:50 last sysfs file: /devices/pci0000:00/0000:00:0d.0/0000:05:00.0/0000:06:03.1/irq
2011-07-05 16:26:50 CPU 11
2011-07-05 16:26:50 Modules linked in: ext4(U) ldiskfs(U) jbd2(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) ib_srp(U) ib_ipoib(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ucm(U) ib_uverbs(U) ib_umad(U) mlx4_ib(U) mlx4_core(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) raid10(U) video(U) backlight(U) sbs(U) power_meter(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sd_mod(U) sg(U) floppy(U) mptsas(U) mptscsih(U) sata_nv(U) mptbase(U) i2c_nforce2(U) pcspkr(U) libata(U) ohci_hcd(U) i2c_core(U) k10temp(U) scsi_transport_sas(U) amd64_edac_mod(U) hwmon(U) tpm_tis(U) scsi_mod(U) edac_mc(U) shpchp(U) tpm(U) tpm_bios(U) serio_raw(U) ide_cd(U) cdrom(U) nfs(U) nfs_acl(U) lockd(U) fscache(U) sunrpc(U) e1000(U)
2011-07-05 16:26:50 Pid: 36, comm: ksoftirqd/11 Tainted: G 2.6.18-238.12.1.el5_lustre.g266a955 #1
2011-07-05 16:26:50 RIP: 0010:[<ffffffff8002e266>] [<ffffffff8002e266>] bio_put+0xa/0x32
2011-07-05 16:26:50 RSP: 0000:ffff810138df3db0 EFLAGS: 00010246
2011-07-05 16:26:50 RAX: 0000000000000000 RBX: ffff811032cb7d80 RCX: ffff813770b8bec0
2011-07-05 16:26:50 RDX: ffff8124f07f7d40 RSI: ffff810f6dab3b40 RDI: ffff8124f07f7d40
2011-07-05 16:26:50 RBP: ffff813770b8bf18 R08: 0000000000001000 R09: ffff811032cb7e10
2011-07-05 16:26:50 R10: ffff81206d101d88 R11: ffffffff800452a8 R12: ffff813770b8bec0
2011-07-05 16:26:50 R13: 0000000000000001 R14: 0000000000001000 R15: 00000000000c3000
2011-07-05 16:26:50 FS: 00002aaaabc8fb50(0000) GS:ffff8120381ab2c0(0000) knlGS:0000000000000000
2011-07-05 16:26:50 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
2011-07-05 16:26:50 CR2: 00002aaaaace2b80 CR3: 0000002036af0000 CR4: 00000000000006e0
2011-07-05 16:26:50 Process ksoftirqd/11 (pid: 36, threadinfo ffff8120381c6000, task ffff81103819c040)
2011-07-05 16:26:50 Stack: ffffffff88383abc 0000000000000002 ffff813770b8bec0 0000000000000002
2011-07-05 16:26:50 ffff811032cb7d80 0000000013730d08 ffffffff883857bb ffff8125a30636c0
2011-07-05 16:26:50 ffff811037f2edc0 ffff8124507f7d40 ffff8110334066b0 ffff8130326104e8
2011-07-05 16:26:50 Call Trace:
2011-07-05 16:26:50 <IRQ> [<ffffffff88383abc>] :raid10:raid_end_bio_io+0x59/0x80
2011-07-05 16:26:50 [<ffffffff883857bb>] :raid10:raid10_end_write_request+0xe6/0x126
2011-07-05 16:26:50 [<ffffffff8002cecb>] __end_that_request_first+0x23c/0x5bf
2011-07-05 16:26:50 [<ffffffff8005c444>] blk_run_queue+0x41/0x72
2011-07-05 16:26:50 [<ffffffff881491f2>] :scsi_mod:scsi_end_request+0x27/0xcd
2011-07-05 16:26:50 [<ffffffff881493e6>] :scsi_mod:scsi_io_completion+0x14e/0x324
2011-07-05 16:26:50 [<ffffffff882b10f0>] :sd_mod:sd_rw_intr+0x25a/0x294
2011-07-05 16:26:50 [<ffffffff8814967b>] :scsi_mod:scsi_device_unbusy+0x67/0x81
2011-07-05 16:26:50 [<ffffffff80037fa0>] blk_done_softirq+0x5f/0x6d
2011-07-05 16:26:50 [<ffffffff80012515>] __do_softirq+0x89/0x133
2011-07-05 16:26:50 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
2011-07-05 16:26:50 <EOI> [<ffffffff800963b1>] ksoftirqd+0x0/0xbf
2011-07-05 16:26:50 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
2011-07-05 16:26:50 [<ffffffff80096410>] ksoftirqd+0x5f/0xbf
2011-07-05 16:26:50 [<ffffffff80032b1e>] kthread+0xfe/0x132
2011-07-05 16:26:50 [<ffffffff8005dfb1>] child_rip+0xa/0x11
2011-07-05 16:26:50 [<ffffffff80032a20>] kthread+0x0/0x132
2011-07-05 16:26:50 [<ffffffff8005dfa7>] child_rip+0x0/0x11
2011-07-05 16:26:50
2011-07-05 16:26:50
2011-07-05 16:26:50 Code: 0f 0b 68 24 a9 2b 80 c2 de 00 f0 ff 4a 50 0f 94 c0 84 c0 74
2011-07-05 16:26:50 RIP [<ffffffff8002e266>] bio_put+0xa/0x32
2011-07-05 16:26:50 RSP <ffff810138df3db0>
2011-07-05 16:26:50 REWRITING MCP55 CFG REG
2011-07-05 16:26:50 CFG = c1
2011-07-05 16:26:50 Linux version 2.6.18-238.12.1.el5_lustre.g266a955 (jenkins@rhel5-64-build.lab.whamcloud.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Fri Jun 10 16:39:27 PDT 2011

This has now occurred 6 times and is easy to reproduce.



 Comments   
Comment by Cliff White (Inactive) [ 05/Jul/11 ]

Also worth noting: the MDS is the only node using the mptbase and mptsas drivers. The OSSs are HW (DDN) and do not have those cards.

Comment by Cliff White (Inactive) [ 05/Jul/11 ]

I built a new image based on chaos 4.4-2 and installed the same RPMs; it had the same crash. I repeated the test with the image from last week, with kernel vmlinuz-2.6.18-238.12.1.el5_lustre.g266a955, and the crash did not repeat.

Comment by Cliff White (Inactive) [ 05/Jul/11 ]

Sorry, I pasted the wrong version - the non-crashing kernel is vmlinuz-2.6.18-238.12.1.el5_lustre.g529529a.

Comment by Peter Jones [ 06/Jul/11 ]

Yang Sheng

Do you see anything in the RAID patches in our patch series for the latest RHEL kernel that might explain this?

Thanks

Peter

Comment by Johann Lombardi (Inactive) [ 07/Jul/11 ]

All those kernels should be the same. The version string changed only because I enabled/disabled slab debugging.
Yang Sheng, do we patch some common code which could be used by RAID10 too?
Cliff, any chance to try with a stock Red Hat kernel?

Comment by Yang Sheng [ 07/Jul/11 ]

I cannot be sure whether our patches cause this kind of issue, but I think we can test without our RAID patches to make sure they aren't crashing the kernel.

Comment by Cliff White (Inactive) [ 07/Jul/11 ]

I don't know what I would test with a stock kernel - the issue is a failure triggered by running mkfs.lustre, and I cannot do this with a stock kernel. mkfs -t ext3 and mkfs -t ext4 have been tested on all these kernels and do not fail. Please explain what tests you wish run with a stock kernel, and I'll see about finding the bits.

Comment by Johann Lombardi (Inactive) [ 07/Jul/11 ]

Have you tried with a simple dd? In any case, mkfs.lustre does not require loading the kernel module, so you should be able to run it on an unpatched kernel.
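
A minimal sketch of the kind of unpatched-kernel check being suggested here (the device path /dev/md0 and mkfs.lustre options are taken from the description; the dd block size and count are illustrative assumptions, not from this ticket):

# Boot the stock RHEL5 kernel, then exercise the same md device.
# WARNING: both steps overwrite /dev/md0 (illustrative values - adjust bs/count to taste).
dd if=/dev/zero of=/dev/md0 bs=1M count=4096 oflag=direct
# mkfs.lustre itself does not need the Lustre kernel modules loaded:
mkfs.lustre --reformat --mgs --mdt --fsname lustre /dev/md0

If the same bio_put() BUG appears on the stock kernel, that would point away from the Lustre patch series; if it does not, the patched RAID/block code becomes the prime suspect.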

Comment by Yang Sheng [ 01/Jul/15 ]

Can we close this one? It looks like it was only hit on the RHEL5 kernel.

Comment by Cliff White (Inactive) [ 01/Jul/15 ]

Might as well close it; we haven't hit it again.
