[LU-4285] kernel OOPs in __dquot_initialize() Created: 21/Nov/13  Updated: 16/Jan/14  Resolved: 16/Jan/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.1

Type: Bug Priority: Major
Reporter: Bob Glossman (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

SLES 11 SP2 and SP3 servers


Severity: 3
Rank (Obsolete): 11760

 Description   

One of the base kernel patches for SLES11 has a serious flaw: in some execution paths it leaves local variables in __dquot_initialize() uninitialized. This can cause a kernel oops during calls to dquot_initialize() made from very common low-level ldiskfs primitives. It is often seen during a user-level mount of Lustre devices or when running mkfs.lustre on them. Example seen during a mount:

Nov 20 11:56:25 suse2 kernel: [222946.909459] BUG: unable to handle kernel paging request at 00000000000b012b
Nov 20 11:56:25 suse2 kernel: [222946.909767] IP: [<ffffffff811a8e4b>] dqput+0x16b/0x1e0
Nov 20 11:56:25 suse2 kernel: [222946.911543] PGD 1eafc067 PUD 5ae3067 PMD 0 
Nov 20 11:56:25 suse2 kernel: [222946.911548] Oops: 0000 [#1] SMP 
Nov 20 11:56:25 suse2 kernel: [222946.913328] CPU 0 
Nov 20 11:56:25 suse2 kernel: [222946.913330] Modules linked in: osd_ldiskfs(N) lquota(N) ldiskfs(N) jbd2 lustre(N) lov(N) osc(N) mdc(N) fid(N) fld(N) ksocklnd(N) ptlrpc(N) obdclass(N) lnet(N) sha512_generic sha256_generic sha1_generic md5 crc32c libcfs(N) lp snd_pcm_oss snd_mixer_oss snd_seq_midi snd_seq_midi_event snd_seq edd mperf vmhgfs(X) vsock(X) acpiphp microcode fuse loop dm_mod btusb bluetooth rfkill ppdev crc16 parport_pc parport floppy ipv6 ipv6_lib snd_ens1371 gameport snd_rawmidi snd_seq_device acpi_memhotplug snd_ac97_codec vmw_balloon(X) ac97_bus snd_pcm pciehp pcspkr snd_timer snd ac soundcore e1000 sr_mod snd_page_alloc rtc_cmos cdrom sg container button i2c_piix4 mptctl intel_agp vmci(X) i2c_core shpchp intel_gtt pci_hotplug ext3 jbd mbcache usbhid hid sd_mod crc_t10dif processor thermal_sys hwmon uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_rdac scsi_dh_hp_sw scsi_dh_alua scsi_dh vmxnet(X) vmw_pvscsi vmxnet3 ata_generic ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sc
Nov 20 11:56:25 suse2 kernel: si_mod
Nov 20 11:56:25 suse2 kernel: [222946.913391] Supported: No, Unsupported modules are loaded
Nov 20 11:56:25 suse2 kernel: [222946.913393] 
Nov 20 11:56:25 suse2 kernel: [222946.913708] Pid: 37457, comm: mount.lustre Tainted: G           NX 3.0.93-0.5_lustre.g8744d8b-default #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
Nov 20 11:56:25 suse2 kernel: [222946.913714] RIP: 0010:[<ffffffff811a8e4b>]  [<ffffffff811a8e4b>] dqput+0x16b/0x1e0
Nov 20 11:56:25 suse2 kernel: [222946.913741] RSP: 0018:ffff880004265858  EFLAGS: 00010216
Nov 20 11:56:25 suse2 kernel: [222946.913744] RAX: 00000000000b000b RBX: ffff88003e021498 RCX: 00000000000147f8
Nov 20 11:56:25 suse2 kernel: [222946.913746] RDX: 000000000000000e RSI: 0000000000000001 RDI: ffffffff81a02900
Nov 20 11:56:25 suse2 kernel: [222946.913748] RBP: ffff88003e021400 R08: 0000000000000000 R09: 0000000000000001
Nov 20 11:56:25 suse2 kernel: [222946.913750] R10: 0000000000000004 R11: ffff8800042658c8 R12: ffff88003e021460
Nov 20 11:56:25 suse2 kernel: [222946.913752] R13: ffff88003e021430 R14: 00000000ffffffff R15: ffff8800035d2cd0
Nov 20 11:56:25 suse2 kernel: [222946.913754] FS:  00007fbf5d0a7700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
Nov 20 11:56:25 suse2 kernel: [222946.913756] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 20 11:56:25 suse2 kernel: [222946.913758] CR2: 00000000000b012b CR3: 00000000398ea000 CR4: 00000000001406f0
Nov 20 11:56:25 suse2 kernel: [222946.913779] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 20 11:56:25 suse2 kernel: [222946.913794] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 20 11:56:25 suse2 kernel: [222946.913797] Process mount.lustre (pid: 37457, threadinfo ffff880004264000, task ffff8800187cc1c0)
Nov 20 11:56:25 suse2 kernel: [222946.913799] Stack:
Nov 20 11:56:25 suse2 kernel: [222946.913800]  ffff88000f116000 ffff8800035d2cb0 0000000000000010 0000000000000002
Nov 20 11:56:25 suse2 kernel: [222946.913804]  0000000000000006 ffffffff811a9c74 0000000000000000 ffff88003e0213e0
Nov 20 11:56:25 suse2 kernel: [222946.913806]  ffff8800035d2cb0 0000000200000301 0000000000000000 000000003e090240
Nov 20 11:56:25 suse2 kernel: [222946.914290] Call Trace:
Nov 20 11:56:25 suse2 kernel: [222946.916006]  [<ffffffff811a9c74>] __dquot_initialize+0x224/0x2a0
Nov 20 11:56:25 suse2 kernel: [222946.916020]  [<ffffffffa0cc9b60>] osd_ea_fid_set+0x80/0x390 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916050]  [<ffffffffa0cf6ff0>] osd_scrub_setup+0x1e0/0x600 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916070]  [<ffffffffa0ccad19>] osd_device_init0+0x3d9/0x5c0 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916080]  [<ffffffffa0ccb066>] osd_device_alloc+0x166/0x2c0 [osd_ldiskfs]
Nov 20 11:56:25 suse2 kernel: [222946.916111]  [<ffffffffa067e1bb>] class_setup+0x61b/0xad0 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916141]  [<ffffffffa0685be5>] class_process_config+0xc95/0x18f0 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916170]  [<ffffffffa068ac32>] do_lcfg+0x142/0x460 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916199]  [<ffffffffa068afe4>] lustre_start_simple+0x94/0x210 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916232]  [<ffffffffa06b8e3a>] osd_start+0x4fa/0x7c0 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916275]  [<ffffffffa06c2bad>] server_fill_super+0xfd/0x550 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916316]  [<ffffffffa06907a8>] lustre_fill_super+0x178/0x530 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.916749]  [<ffffffff811556e3>] mount_nodev+0x83/0xc0
Nov 20 11:56:25 suse2 kernel: [222946.917387]  [<ffffffffa0688670>] lustre_mount+0x20/0x30 [obdclass]
Nov 20 11:56:25 suse2 kernel: [222946.917408]  [<ffffffff811551ee>] mount_fs+0x4e/0x1a0
Nov 20 11:56:25 suse2 kernel: [222946.917454]  [<ffffffff811703f5>] vfs_kern_mount+0x65/0xd0
Nov 20 11:56:25 suse2 kernel: [222946.917477]  [<ffffffff811704e3>] do_kern_mount+0x53/0x110
Nov 20 11:56:25 suse2 kernel: [222946.917481]  [<ffffffff81171e2d>] do_mount+0x21d/0x260
Nov 20 11:56:25 suse2 kernel: [222946.917483]  [<ffffffff81171f30>] sys_mount+0xc0/0xf0
Nov 20 11:56:25 suse2 kernel: [222946.918465]  [<ffffffff81452192>] system_call_fastpath+0x16/0x1b
Nov 20 11:56:25 suse2 kernel: [222946.918903]  [<00007fbf5ca2347a>] 0x7fbf5ca23479
Nov 20 11:56:25 suse2 kernel: [222946.918905] Code: 00 29 a0 81 e8 b7 14 2a 00 3e 0f ba 33 00 19 c0 85 c0 75 6c 66 ff 05 c5 9a 85 00 e9 e0 fe ff ff 3e ff 4d 60 48 8b 85 80 00 00 00 <8b> 90 20 01 00 00 0f bf 85 a0 00 00 00 8d 0c 40 b8 01 00 00 00
Nov 20 11:56:25 suse2 kernel: [222946.918923] RIP  [<ffffffff811a8e4b>] dqput+0x16b/0x1e0
Nov 20 11:56:25 suse2 kernel: [222946.918927]  RSP <ffff880004265858>
Nov 20 11:56:25 suse2 kernel: [222946.918929] CR2: 00000000000b012b
Nov 20 11:56:25 suse2 kernel: [222946.919070] ---[ end trace 64d4ae18af89a329 ]-

I think this happened because the SLES version of the offending patch was copied too closely from the RHEL version and did not account for slight differences in context between the relevant RHEL and SLES kernel code.

I will push a change to the offending base kernel patch to fix it.



 Comments   
Comment by Bob Glossman (Inactive) [ 21/Nov/13 ]

http://review.whamcloud.com/8352

Comment by James A Simmons [ 03/Dec/13 ]

New patch at http://review.whamcloud.com/#/c/8418

Comment by Bob Glossman (Inactive) [ 03/Jan/14 ]

In b2_5: http://review.whamcloud.com/8716

Comment by Jodi Levi (Inactive) [ 16/Jan/14 ]

Patches have landed to master and b2_5.

Generated at Sat Feb 10 01:41:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.