[LU-6922] Null pointer dereference in fld_local_lookup Created: 28/Jul/15  Updated: 28/Feb/20  Resolved: 28/Feb/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Roland Fehrenbacher
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

Kernel 3.12.44, ZFS (0.6.3) based MDT/OSTs, Lustre 2.6.0


Attachments: File 0001-LU-6922-Add-assertions-to-prevent-Oops.patch     HTML File oops    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The Oops happened while unmounting the MDT. We created a patch that adds assertions on the pointers involved.



 Comments   
Comment by Roland Fehrenbacher [ 28/Jul/15 ]

Assertion patch

Comment by Roland Fehrenbacher [ 28/Jul/15 ]

Here is the Oops:

Jul 27 13:41:16 cluster-head1 kernel: [1477915.691494] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
Jul 27 13:41:16 cluster-head1 kernel: [1477915.699551] IP: [<ffffffffa184a23d>] fld_local_lookup+0x4d/0x280 [fld]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.706286] PGD 0
Jul 27 13:41:16 cluster-head1 kernel: [1477915.708500] Oops: 0000 1 SMP
Jul 27 13:41:16 cluster-head1 kernel: [1477915.711951] Modules linked in: osp(O) mdd(O) lod(O) mdt(O) lfsck(O) mgs(O) mgc(O) nodemap(O) osd_zfs(O) fid(O) fld(O) lquota(O) ksocklnd(O) ko2iblnd(O) ptlrpc(O) obdclass(O) lnet(O) sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic libcfs(O) drbd(O) libcrc32c ipmi_devintf ipmi_si ipmi_msghandler ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables rdma_ucm(O) rdma_cm(O) iw_cm(O) ib_uverbs(O) ib_umad(O) ib_ipoib(O) ib_cm(O) bonding mlx4_ib(O) ib_sa(O) ib_mad(O) ib_core(O) ib_addr(O) zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) coretemp kvm_intel kvm mlx4_core(O) ehci_pci ehci_hcd compat(O) microcode ioapic lpc_ich i7core_edac mfd_core edac_core acpi_cpufreq processor nfsd exportfs ipv6 ext4 jbd2 dm_mod sr_mod cdrom hid_generic usbhid crc32c_intel psmouse ahci uhci_hcd libahci e1000e qla2xxx usbcore usb_common scsi_transport_fc igb i2c_algo_bit scsi_tgt aacraid
Jul 27 13:41:16 cluster-head1 kernel: [1477915.796649] CPU: 9 PID: 11740 Comm: orph_cleanup_l- Tainted: P O 3.12.44-ql-generic-58 #1
Jul 27 13:41:16 cluster-head1 kernel: [1477915.805884] Hardware name: Supermicro X8DT6/X8DT6, BIOS 2.0a 09/14/2010
Jul 27 13:41:16 cluster-head1 kernel: [1477915.812951] task: ffff8802f0c035d0 ti: ffff880128914000 task.ti: ffff880128914000
Jul 27 13:41:16 cluster-head1 kernel: [1477915.820672] RIP: 0010:[<ffffffffa184a23d>] [<ffffffffa184a23d>] fld_local_lookup+0x4d/0x280 [fld]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.829950] RSP: 0018:ffff880128915c18 EFLAGS: 00010282
Jul 27 13:41:16 cluster-head1 kernel: [1477915.835501] RAX: ffff880344560a80 RBX: ffff880344560a80 RCX: ffff880128915ca0
Jul 27 13:41:16 cluster-head1 kernel: [1477915.842931] RDX: ffff88036fe01e00 RSI: ffffffffa1851fc0 RDI: ffff8804e66f57c0
Jul 27 13:41:16 cluster-head1 kernel: [1477915.850363] RBP: ffff880128915c50 R08: ffff880128915cec R09: ffff88012fc91258
Jul 27 13:41:16 cluster-head1 kernel: [1477915.857790] R10: 0000000000007da3 R11: 0000000000000004 R12: ffff880128915ca0
Jul 27 13:41:16 cluster-head1 kernel: [1477915.865221] R13: 0000000000000000 R14: 0000000200007da3 R15: 0000000000000003
Jul 27 13:41:16 cluster-head1 kernel: [1477915.872647] FS: 0000000000000000(0000) GS:ffff880333ca0000(0000) knlGS:0000000000000000
Jul 27 13:41:16 cluster-head1 kernel: [1477915.881032] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 27 13:41:16 cluster-head1 kernel: [1477915.887019] CR2: 0000000000000018 CR3: 00000000016b1000 CR4: 00000000000007e0
Jul 27 13:41:16 cluster-head1 kernel: [1477915.894445] Stack:
Jul 27 13:41:16 cluster-head1 kernel: [1477915.896690] 0000000000000001 ffff880491e14f70 0000000000000246 0000000000000000
Jul 27 13:41:16 cluster-head1 kernel: [1477915.904468] 0000000200007da3 ffff880128915ca0 ffff8804e66f57c0 ffff880128915c88
Jul 27 13:41:16 cluster-head1 kernel: [1477915.912255] ffffffffa184b58f ffffffffa109177b ffff8801303e2210 ffff8800065a4000
Jul 27 13:41:16 cluster-head1 kernel: [1477915.920029] Call Trace:
Jul 27 13:41:16 cluster-head1 kernel: [1477915.922718] [<ffffffffa184b58f>] fld_server_lookup+0x3f/0x2f0 [fld]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.929339] [<ffffffffa109177b>] ? zap_leaf_array_match+0xcb/0x210 [zfs]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.936368] [<ffffffffa1a64a06>] lod_fld_lookup+0x276/0x3e0 [lod]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.942809] [<ffffffffa1a797e6>] lod_object_init+0x96/0x380 [lod]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.949239] [<ffffffffa14aaf07>] lu_object_alloc+0xd7/0x320 [obdclass]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.956099] [<ffffffffa14ababb>] lu_object_find_at+0x20b/0x370 [obdclass]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.963226] [<ffffffffa1094bce>] ? zap_lookup+0x2e/0x30 [zfs]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.969305] [<ffffffffa14abc5a>] lu_object_find_slice+0x1a/0x90 [obdclass]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.976538] [<ffffffffa1adabeb>] mdd_object_find+0xb/0x60 [mdd]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.982782] [<ffffffffa1ae1c86>] __mdd_orphan_cleanup+0x4b6/0x11e0 [mdd]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.989809] [<ffffffffa1ae17d0>] ? orph_declare_index_delete+0x330/0x330 [mdd]
Jul 27 13:41:16 cluster-head1 kernel: [1477915.997420] [<ffffffff810702fb>] kthread+0xbb/0xc0
Jul 27 13:41:16 cluster-head1 kernel: [1477916.002544] [<ffffffff81070240>] ? kthread_create_on_node+0x120/0x120
Jul 27 13:41:16 cluster-head1 kernel: [1477916.009308] [<ffffffff81514428>] ret_from_fork+0x58/0x90
Jul 27 13:41:16 cluster-head1 kernel: [1477916.014942] [<ffffffff81070240>] ? kthread_create_on_node+0x120/0x120
Jul 27 13:41:16 cluster-head1 kernel: [1477916.021726] Code: 74 0e 8b 35 e6 94 b5 ff 85 f6 0f 88 de 00 00 00 48 89 df 48 c7 c6 c0 1f 85 a1 e8 3f fe c5 ff 48 85 c0 48 89 c3 0f 84 1c 01 00 00 <49> 8b 7d 18 48 8d 50 18 4c 89 f6 e8 c3 eb ff ff 85 c0 75 6f 8b
Jul 27 13:41:16 cluster-head1 kernel: [1477916.042102] RIP [<ffffffffa184a23d>] fld_local_lookup+0x4d/0x280 [fld]
Jul 27 13:41:16 cluster-head1 kernel: [1477916.048978] RSP <ffff880128915c18>
Jul 27 13:41:16 cluster-head1 kernel: [1477916.052708] CR2: 0000000000000018
Jul 27 13:41:16 cluster-head1 kernel: [1477916.056642] --[ end trace 734ddc1c6c69fac3 ]--
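
Note on the faulting instruction: in the Code: line above, the bytes starting at the marked position are 49 8b 7d 18. These appear to decode to a 64-bit load from 0x18 bytes past R13 into RDI. The C sketch below is not part of the original report; it decodes only this one instruction form (REX.W prefix, opcode 0x8b, ModRM with an 8-bit displacement and no SIB byte), and the names mov_disp8 and decode_mov_disp8 are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of decoding a "mov r64, [base + disp8]" instruction. */
struct mov_disp8 {
	int dst_reg;	/* destination register number, 0..15 */
	int base_reg;	/* base register number, 0..15 */
	int8_t disp;	/* signed 8-bit displacement */
};

/*
 * Decode only the form: REX.W prefix, opcode 0x8b, ModRM with mod=01
 * (8-bit displacement) and no SIB byte. Returns 0 on success, -1 if the
 * bytes do not match this form.
 */
static int decode_mov_disp8(const uint8_t *code, size_t len,
			    struct mov_disp8 *out)
{
	uint8_t rex, opcode, modrm;

	assert(code != NULL && out != NULL);
	if (len < 4)
		return -1;

	rex = code[0];
	opcode = code[1];
	modrm = code[2];

	if ((rex & 0xf8) != 0x48)	/* REX prefix with W=1 */
		return -1;
	if (opcode != 0x8b)		/* MOV r64, r/m64 */
		return -1;
	if ((modrm >> 6) != 1)		/* mod=01: base + disp8 */
		return -1;
	if ((modrm & 7) == 4)		/* a SIB byte would follow; not handled */
		return -1;

	/* REX.R extends the reg field, REX.B extends the r/m field. */
	out->dst_reg = ((modrm >> 3) & 7) | ((rex & 0x04) ? 8 : 0);
	out->base_reg = (modrm & 7) | ((rex & 0x01) ? 8 : 0);
	out->disp = (int8_t)code[3];
	return 0;
}
```

For the bytes 49 8b 7d 18 this gives destination register 7 (RDI), base register 13 (R13) and displacement 0x18. R13 is zero in the register dump and CR2 is 0x18, which fits a load through a NULL base pointer.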

Comment by Andreas Dilger [ 06/Aug/15 ]

Roland, have you been able to trigger the LASSERT() with the patch applied?

Alternatively, you can use gdb to see which line the oops is on and which pointer is bad:

gdb fld.ko
(gdb) list *(fld_local_lookup+0x4d)

Comment by Roland Fehrenbacher [ 07/Aug/15 ]

I have not been able to reproduce it.

With gdb I get the result below. Note that this is from a different module build than the one in use during the Oops (that one has no debugging symbols). It is the same code, just recompiled. I don't know whether the 0x4d offset would change because of that.

(gdb) list *(fld_local_lookup+0x4d)
0x523d is in fld_local_lookup (fld_handler.c:219).
214             info = lu_context_key_get(&env->le_ctx, &fld_thread_key);
215             LASSERT(info != NULL);
216             erange = &info->fti_lrange;
217
218             /* Lookup it in the cache. */
219             rc = fld_cache_lookup(fld->lsf_cache, seq, erange);
220             if (rc == 0) {
221                     if (unlikely(fld_range_type(erange) != fld_range_type(range) &&
222                                  !fld_range_is_any(range))) {
223                             CERROR("%s: FLD cache range "DRANGE" does not match"
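
Line 219 in the listing above reads fld->lsf_cache. The oops reports a fault address (CR2) of 0x18 and R13 is zero in the register dump, so one reading consistent with both is that fld itself was NULL and that lsf_cache sits 0x18 bytes into the structure. This is an inference and was not confirmed in this ticket. The C sketch below illustrates the arithmetic only: demo_server_fld is a hypothetical stand-in, and the assumption that three pointer-sized members precede lsf_cache is not taken from the Lustre headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The offsets below assume 8-byte pointers (LP64, e.g. x86_64). */
static_assert(sizeof(void *) == 8, "sketch assumes 8-byte pointers");

/*
 * Hypothetical stand-in for the server FLD structure. The three members
 * before lsf_cache are placeholders; the real layout is an assumption.
 */
struct demo_server_fld {
	void *pad0;		/* offset 0x00 */
	void *pad1;		/* offset 0x08 */
	void *pad2;		/* offset 0x10 */
	void *lsf_cache;	/* offset 0x18 */
};

/* Address accessed when loading the member at member_offset through base. */
static uintptr_t member_address(const void *base, size_t member_offset)
{
	return (uintptr_t)base + member_offset;
}

/* Nonzero if fault_addr is what a NULL base pointer would produce for lsf_cache. */
static int fault_matches_null_fld(uintptr_t fault_addr)
{
	return fault_addr ==
	       member_address(NULL, offsetof(struct demo_server_fld, lsf_cache));
}
```

Under these assumptions a NULL fld would fault at exactly 0x18, while a valid fld with a NULL lsf_cache would not fault on this load at all.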

Comment by Andreas Dilger [ 28/Feb/20 ]

Close old bug that hasn't been seen in a long time.
