Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.5.0
-
None
-
SLES 11 SP2 and SP3 servers
-
3
-
11760
Description
One of the base kernel patches for SLES11 has a serious flaw: it leaves local variables in __dquot_initialize() uninitialized on some execution paths. This can cause a kernel oops during calls to dquot_initialize(), which are made from very common low-level ldiskfs primitives. It is often seen during a user-level mount or mkfs.lustre of Lustre devices. Example seen during a mount:
Nov 20 11:56:25 suse2 kernel: [222946.909459] BUG: unable to handle kernel paging request at 00000000000b012b
Nov 20 11:56:25 suse2 kernel: [222946.909767] IP: [<ffffffff811a8e4b>] dqput+0x16b/0x1e0
Nov 20 11:56:25 suse2 kernel: [222946.911543] PGD 1eafc067 PUD 5ae3067 PMD 0
Nov 20 11:56:25 suse2 kernel: [222946.911548] Oops: 0000 [#1] SMP
Nov 20 11:56:25 suse2 kernel: [222946.913328] CPU 0
Nov 20 11:56:25 suse2 kernel: [222946.913330] Modules linked in: osd_ldiskfs(N) lquota(N) ldiskfs(N) jbd2 lustre(N) lov(N) osc(N) mdc(N) fid(N) fld(N) ksocklnd(N) ptlrpc(N) obdclass(N) lnet(N) sha512_generic sha256_generic sha1_generic md5 crc32c libcfs(N) lp snd_pcm_oss snd_mixer_oss snd_seq_midi snd_seq_midi_event snd_seq edd mperf vmhgfs(X) vsock(X) acpiphp microcode fuse loop dm_mod btusb bluetooth rfkill ppdev crc16 parport_pc parport floppy ipv6 ipv6_lib snd_ens1371 gameport snd_rawmidi snd_seq_device acpi_memhotplug snd_ac97_codec vmw_balloon(X) ac97_bus snd_pcm pciehp pcspkr snd_timer snd ac soundcore e1000 sr_mod snd_page_alloc rtc_cmos cdrom sg container button i2c_piix4 mptctl intel_agp vmci(X) i2c_core shpchp intel_gtt pci_hotplug ext3 jbd mbcache usbhid hid sd_mod crc_t10dif processor thermal_sys hwmon uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh_alua scsi_dh vmxnet(X) vmw_pvscsi vmxnet3 ata_generic ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sc
Nov 20 11:56:25 suse2 kernel: si_mod
Nov 20 11:56:25 suse2 kernel: [222946.913391] Supported: No, Unsupported modules are loaded
Nov 20 11:56:25 suse2 kernel: [222946.913393]
Nov 20 11:56:25 suse2 kernel: [222946.913708] Pid: 37457, comm: mount.lustre Tainted: G NX 3.0.93-0.5_lustre.g8744d8b-default #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
Nov 20 11:56:25 suse2 kernel: [222946.913714] RIP: 0010:[<ffffffff811a8e4b>] [<ffffffff811a8e4b>] dqput+0x16b/0x1e0
Nov 20 11:56:25 suse2 kernel: [222946.913741] RSP: 0018:ffff880004265858 EFLAGS: 00010216
Nov 20 11:56:25 suse2 kernel: [222946.913744] RAX: 00000000000b000b RBX: ffff88003e021498 RCX: 00000000000147f8
Nov 20 11:56:25 suse2 kernel: [222946.913746] RDX: 000000000000000e RSI: 0000000000000001 RDI: ffffffff81a02900
Nov 20 11:56:25 suse2 kernel: [222946.913748] RBP: ffff88003e021400 R08: 0000000000000000 R09: 0000000000000001
Nov 20 11:56:25 suse2 kernel: [222946.913750] R10: 0000000000000004 R11: ffff8800042658c8 R12: ffff88003e021460
Nov 20 11:56:25 suse2 kernel: [222946.913752] R13: ffff88003e021430 R14: 00000000ffffffff R15: ffff8800035d2cd0
Nov 20 11:56:25 suse2 kernel: [222946.913754] FS: 00007fbf5d0a7700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
Nov 20 11:56:25 suse2 kernel: [222946.913756] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 20 11:56:25 suse2 kernel: [222946.913758] CR2: 00000000000b012b CR3: 00000000398ea000 CR4: 00000000001406f0
Nov 20 11:56:25 suse2 kernel: [222946.913779] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 20 11:56:25 suse2 kernel: [222946.913794] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 20 11:56:25 suse2 kernel: [222946.913797] Process mount.lustre (pid: 37457, threadinfo ffff880004264000, task ffff8800187cc1c0)
Nov 20 11:56:25 suse2 kernel: [222946.913799] Stack:
Nov 20 11:56:25 suse2 kernel: [222946.913800] ffff88000f116000 ffff8800035d2cb0 0000000000000010 0000000000000002
Nov 20 11:56:25 suse2 kernel: [222946.913804] 0000000000000006 ffffffff811a9c74 0000000000000000 ffff88003e0213e0
Nov 20 11:56:25 suse2 kernel: [222946.913806] ffff8800035d2cb0 0000000200000301 0000000000000000 000000003e090240
Nov 20 11:56:25 suse2 kernel: [222946.914290] Call Trace:
Nov 20 11:56:25 suse2 kernel: [222946.916006] [<ffffffff811a9c74>] __dquot_initialize+0x224/0x2a0
Nov 20 11:56:25 suse2 kernel: [222946.916020] [<ffffffffa0cc9b60>] osd_ea_fid_set+0x80/0x390 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916050] [<ffffffffa0cf6ff0>] osd_scrub_setup+0x1e0/0x600 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916070] [<ffffffffa0ccad19>] osd_device_init0+0x3d9/0x5c0 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916080] [<ffffffffa0ccb066>] osd_device_alloc+0x166/0x2c0 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916111] [<ffffffffa067e1bb>] class_setup+0x61b/0xad0 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916141] [<ffffffffa0685be5>] class_process_config+0xc95/0x18f0 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916170] [<ffffffffa068ac32>] do_lcfg+0x142/0x460 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916199] [<ffffffffa068afe4>] lustre_start_simple+0x94/0x210 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916232] [<ffffffffa06b8e3a>] osd_start+0x4fa/0x7c0 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916275] [<ffffffffa06c2bad>] server_fill_super+0xfd/0x550 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916316] [<ffffffffa06907a8>] lustre_fill_super+0x178/0x530 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916749] [<ffffffff811556e3>] mount_nodev+0x83/0xc0
Nov 20 11:56:25 suse2 kernel: [222946.917387] [<ffffffffa0688670>] lustre_mount+0x20/0x30 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.917408] [<ffffffff811551ee>] mount_fs+0x4e/0x1a0
Nov 20 11:56:25 suse2 kernel: [222946.917454] [<ffffffff811703f5>] vfs_kern_mount+0x65/0xd0
Nov 20 11:56:25 suse2 kernel: [222946.917477] [<ffffffff811704e3>] do_kern_mount+0x53/0x110
Nov 20 11:56:25 suse2 kernel: [222946.917481] [<ffffffff81171e2d>] do_mount+0x21d/0x260
Nov 20 11:56:25 suse2 kernel: [222946.917483] [<ffffffff81171f30>] sys_mount+0xc0/0xf0
Nov 20 11:56:25 suse2 kernel: [222946.918465] [<ffffffff81452192>] system_call_fastpath+0x16/0x1b
Nov 20 11:56:25 suse2 kernel: [222946.918903] [<00007fbf5ca2347a>] 0x7fbf5ca23479
Nov 20 11:56:25 suse2 kernel: [222946.918905] Code: 00 29 a0 81 e8 b7 14 2a 00 3e 0f ba 33 00 19 c0 85 c0 75 6c 66 ff 05 c5 9a 85 00 e9 e0 fe ff ff 3e ff 4d 60 48 8b 85 80 00 00 00 <8b> 90 20 01 00 00 0f bf 85 a0 00 00 00 8d 0c 40 b8 01 00 00 00
Nov 20 11:56:25 suse2 kernel: [222946.918923] RIP [<ffffffff811a8e4b>] dqput+0x16b/0x1e0
Nov 20 11:56:25 suse2 kernel: [222946.918927] RSP <ffff880004265858>
Nov 20 11:56:25 suse2 kernel: [222946.918929] CR2: 00000000000b012b
Nov 20 11:56:25 suse2 kernel: [222946.919070] ---[ end trace 64d4ae18af89a329 ]-
I think this was caused by the SLES version of the offending patch being too close a copy of the RHEL version, without accounting for slight differences in context between the relevant RHEL and SLES kernel code.
I will push a modification to the offending base kernel patch to fix it.
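To illustrate the class of bug described above, here is a minimal user-space sketch. It is not the SLES patch or the actual fix, and every name in it (dquot_stub, stub_dqput, init_quota_sketch, type_enabled) is hypothetical. It assumes the failure pattern is an array of dquot pointers that is only filled on some paths but released on all of them, which would be consistent with dqput() faulting on a garbage address in the trace.

```c
#include <assert.h>
#include <stddef.h>

#define MAXQUOTAS 2 /* user and group quota slots (assumed for this sketch) */

/* Hypothetical stand-in for struct dquot; only a refcount is modelled. */
struct dquot_stub {
    int refcount;
};

/* Stand-in for dqput(): drops one reference and ignores NULL. */
static void stub_dqput(struct dquot_stub *dq)
{
    if (dq == NULL)
        return;
    assert(dq->refcount > 0);
    dq->refcount--;
}

/*
 * Sketch of an initialize routine that fills got[] only for enabled
 * quota types and then releases every slot. type_enabled[cnt] == 0
 * models an execution path that skips a slot.
 * Returns the number of references that were dropped.
 */
static int init_quota_sketch(const int type_enabled[MAXQUOTAS],
                             struct dquot_stub pool[MAXQUOTAS])
{
    /* The important part: every slot starts out NULL. Without this
     * initializer, a skipped slot would hold stack garbage. */
    struct dquot_stub *got[MAXQUOTAS] = { NULL };
    int cnt, dropped = 0;

    for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
        if (!type_enabled[cnt])
            continue; /* slot is never assigned on this path */
        pool[cnt].refcount++;
        got[cnt] = &pool[cnt];
    }

    /* The cleanup walks all slots, so skipped ones must be NULL. */
    for (cnt = 0; cnt < MAXQUOTAS; cnt++) {
        if (got[cnt] != NULL) {
            stub_dqput(got[cnt]);
            dropped++;
        }
    }
    return dropped;
}
```

If the initializer on got[] were removed, calling init_quota_sketch() with a disabled slot would pass an indeterminate pointer to stub_dqput(), which is undefined behaviour in C.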