[LU-12707] mds-survey test 1 crashes with a NULL pointer dereference in echo_object_free() Created: 27/Aug/19 Updated: 10/Feb/20 Resolved: 20/Sep/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
mds-survey test_1 started crashing with a NULL pointer dereference in echo_object_free() on 16 August 2019. Looking at the kernel-crash log, we see [37813.223580] Lustre: DEBUG MARKER: == mds-survey test 1: Metadata survey with zero-stripe =============================================== 15:50:41 (1565970641) [37813.430124] Lustre: DEBUG MARKER: /usr/sbin/lctl dl [37814.598393] Lustre: Echo OBD driver; http://www.lustre.org/ [37820.151690] LustreError: 14608:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0000-tests: rc = -2 [37820.152651] LustreError: 14608:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2 [37820.674040] LustreError: 14674:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0000-tests2: rc = -2 [37820.675015] LustreError: 14674:0:(echo_client.c:1821:echo_md_lookup()) Skipped 1 previous similar message [37820.675915] LustreError: 14674:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0000-tests2: rc = -2 [37820.676936] LustreError: 14674:0:(echo_client.c:2055:echo_md_destroy_internal()) Skipped 1 previous similar message [37822.418041] LustreError: 14873:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0002-tests: rc = -2 [37822.419018] LustreError: 14873:0:(echo_client.c:1821:echo_md_lookup()) Skipped 1 previous similar message [37822.419931] LustreError: 14873:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0002-tests: rc = -2 [37822.421048] LustreError: 14873:0:(echo_client.c:2055:echo_md_destroy_internal()) Skipped 1 previous similar message [39010.965720] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 [39010.966793] IP: [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho] [39010.967550] PGD 8000000016975067 PUD 16976067 PMD 0 [39010.968086] Oops: 0000 [#1] SMP [39010.968876] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon parport_pc parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk ata_piix libata crct10dif_pclmul crct10dif_common [39010.976913] crc32c_intel 8139too serio_raw virtio_pci 8139cp virtio_ring virtio mii floppy [39010.977773] CPU: 0 PID: 16251 Comm: lctl Kdump: loaded Tainted: G OE ------------ 3.10.0-957.21.3.el7_lustre.x86_64 #1 [39010.978869] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [39010.979407] task: ffff88d947864100 ti: ffff88d936efc000 task.ti: ffff88d936efc000 [39010.980101] RIP: 0010:[<ffffffffc102ad18>] [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho] [39010.980993] RSP: 0018:ffff88d936effb18 EFLAGS: 00010246 [39010.981484] RAX: 0000000000000000 RBX: ffff88d93aaa4cc0 RCX: dead000000000200 [39010.982152] RDX: ffff88d936effb40 RSI: ffff88d93aaa4cc0 RDI: ffff88d917df92d8 [39010.982815] RBP: ffff88d936effb30 R08: ffff88d93aaa4cd8 R09: 0000000000000010 [39010.983481] R10: 0000000000000223 R11: 000000000007ffff R12: ffff88d93aaa4cd8 [39010.984149] R13: ffff88d917df92d8 R14: ffff88d936effb40 R15: ffffa34d86641028 [39010.984810] FS: 00007fd52a7ec740(0000) GS:ffff88d97fc00000(0000) knlGS:0000000000000000 [39010.985566] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [39010.986113] CR2: 0000000000000040 CR3: 0000000044ab4000 CR4: 00000000000606f0 [39010.986779] Call Trace: [39010.987156] [<ffffffffc0891592>] lu_object_free.isra.32+0xf2/0x170 [obdclass] [39010.987857] [<ffffffffc0895756>] lu_object_find_at+0x5f6/0xa60 [obdclass] [39010.988527] [<ffffffffa9db91de>] ? filemap_fault+0x17e/0x490 [39010.989080] [<ffffffffc1034012>] echo_md_handler.isra.43+0x362/0x2c60 [obdecho] [39010.989769] [<ffffffffc10377a8>] echo_client_iocontrol+0xe98/0x1860 [obdecho] [39010.990894] [<ffffffffc085cbd0>] ? obd_ioctl_getdata+0x200/0x11b0 [obdclass] [39010.991583] [<ffffffffc085f4aa>] class_handle_ioctl+0x192a/0x1e30 [obdclass] [39010.992268] [<ffffffffa9dec27d>] ? handle_mm_fault+0x39d/0x9b0 [39010.992836] [<ffffffffa9ef8fde>] ? security_capable+0x1e/0x20 [39010.993405] [<ffffffffc085fa25>] obd_class_ioctl+0x75/0x170 [obdclass] [39010.994038] [<ffffffffa9e569d0>] do_vfs_ioctl+0x3a0/0x5a0 [39010.994572] [<ffffffffa9e56c71>] SyS_ioctl+0xa1/0xc0 [39010.995075] [<ffffffffaa375d15>] ? system_call_after_swapgs+0xa2/0x146 [39010.995695] [<ffffffffaa375ddb>] system_call_fastpath+0x22/0x27 [39010.996265] [<ffffffffaa375d21>] ? system_call_after_swapgs+0xae/0x146 [39010.996875] Code: 00 00 00 66 66 66 66 90 55 48 85 f6 48 89 e5 41 55 41 54 53 48 89 f3 0f 85 1d 04 00 00 f6 05 c7 22 65 ff 01 48 8b 83 98 00 00 00 <4c> 8b 60 40 0f 85 8e 01 00 00 8b 83 b8 00 00 00 85 c0 0f 85 c5 [39011.000093] RIP [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho] [39011.000758] RSP <ffff88d936effb18> [39011.001102] CR2: 0000000000000040 This test is crashing for ldiskfs, ZFS, DNE and non-DNE environments. Here are links to logs for several crashes: |
| Comments |
| Comment by Andreas Dilger [ 28/Aug/19 ] |
|
The most obvious candidate is patch https://review.whamcloud.com/35700 " |
| Comment by Andreas Dilger [ 28/Aug/19 ] |
|
Alex, Alexey, any thoughts on this crash? It appears related to the |
| Comment by Alexey Lyashkov [ 28/Aug/19 ] |
|
Andreas, I think it fail don't related to the looking to the obdecho code I found strange/bug in echo_resolve_path (gdb) l *(echo_md_handler+0x71c) 0xaccc is in echo_md_handler (lustre/obdecho/echo_client.c:2202). 2197 rc = -EINVAL; 2198 break; 2199 } 2200 2201 child = echo_md_lookup(env, ed, lu2md(ld_parent), lname); 2202 lu_object_put(env, parent); 2203 if (IS_ERR(child)) { 2204 rc = (int)PTR_ERR(child); 2205 CERROR("lookup %s under parent "DFID": rc = %d\n", 2206 lname->ln_name, PFID(lu_object_fid(ld_parent)), (gdb) it looks CERROR function tries to make access to the pointer after last put which called before 'IF'. I have no idea who resolve path can't found a work dir as it run short after lu_env_refill() so it have guarantee lu_env filled correctly. can someone change a lu_object_put place and check how it helps for Mallo ? |
| Comment by Alexey Lyashkov [ 28/Aug/19 ] |
|
I able to replicate it bug. Will provide some info short later, based on my logs - lu_object_find_at start to free an object where it don't have 'init' passed and it caused a null pointer deref. 00000020:00000001:0.0:1566998598.740494:0:20147:0:(lu_object.c:735:lu_object_find_at()) Process entered 00008000:00000001:0.0:1566998598.740495:0:20147:0:(echo_client.c:580:echo_object_alloc()) Process entered 00008000:00000010:0.0:1566998598.740495:0:20147:0:(echo_client.c:583:echo_object_alloc()) slab-alloced 'eco': 264 at ffff88003c438840. 00000020:00000001:0.0:1566998598.740496:0:20147:0:(cl_object.c:72:cl_object_header_init()) Process entered 00000020:00000001:0.0:1566998598.740498:0:20147:0:(cl_object.c:79:cl_object_header_init()) Process leaving (rc=0 : 0 : 0) 00008000:00000001:0.0:1566998598.740499:0:20147:0:(echo_client.c:597:echo_object_alloc()) Process leaving (rc=18446612133325277248 : -131940384274368 : ffff88003c438840) 00008000:00000001:0.0:1566998598.740500:0:20147:0:(echo_client.c:487:echo_object_init()) Process entered 00000004:00000010:0.0:1566998598.740501:0:20147:0:(mdd_object.c:282:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at ffff88003b6e5e40. 00008000:00000001:0.0:1566998598.740502:0:20147:0:(echo_client.c:524:echo_object_init()) Process leaving (rc=0 : 0 : 0) 00000004:00000001:0.0:1566998598.740503:0:20147:0:(mdd_object.c:304:mdd_object_init()) Process entered bad case 00000020:00000001:0.0:1566998604.027293:0:20164:0:(lu_object.c:735:lu_object_find_at()) Process entered 00008000:00000001:0.0:1566998604.027294:0:20164:0:(echo_client.c:580:echo_object_alloc()) Process entered 00008000:00000010:0.0:1566998604.027295:0:20164:0:(echo_client.c:583:echo_object_alloc()) slab-alloced 'eco': 264 at ffff88003b61a108. 00000020:00000001:0.0:1566998604.027296:0:20164:0:(cl_object.c:72:cl_object_header_init()) Process entered 00000020:00000001:0.0:1566998604.027298:0:20164:0:(cl_object.c:79:cl_object_header_init()) Process leaving (rc=0 : 0 : 0) 00008000:00000001:0.0:1566998604.027299:0:20164:0:(echo_client.c:597:echo_object_alloc()) Process leaving (rc=18446612133310472456 : -131940399079160 : ffff88003b61a108) >>>>> next line is panic in lu_object_free Second question - why "echo_md_lookup()) lookup MDT0000-tests: rc = -2" hit, don't invested now. |
| Comment by Alexey Lyashkov [ 28/Aug/19 ] |
|
provided patch allow md-survey to be finished fine. diff --git a/lustre/obdecho/echo_client.c b/lustre/obdecho/echo_client.c index 11191fb055..1970d17bc5 100644 --- a/lustre/obdecho/echo_client.c +++ b/lustre/obdecho/echo_client.c @@ -527,14 +527,18 @@ static int echo_object_init(const struct lu_env *env, struct lu_object *obj, static void echo_object_free(const struct lu_env *env, struct lu_object *obj) { struct echo_object *eco = cl2echo_obj(lu2cl(obj)); - struct echo_client_obd *ec = eco->eo_dev->ed_ec; + struct echo_client_obd *ec; ENTRY; - LASSERT(atomic_read(&eco->eo_npages) == 0); + if (eco->eo_dev != NULL) { + ec = eco->eo_dev->ed_ec; - spin_lock(&ec->ec_lock); - list_del_init(&eco->eo_obj_chain); - spin_unlock(&ec->ec_lock); + LASSERT(atomic_read(&eco->eo_npages) == 0); + + spin_lock(&ec->ec_lock); + list_del_init(&eco->eo_obj_chain); + spin_unlock(&ec->ec_lock); + } lu_object_fini(obj); lu_object_header_fini(obj->lo_header); |
| Comment by Gerrit Updater [ 28/Aug/19 ] |
|
Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/35950 |
| Comment by Gerrit Updater [ 20/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35950/ |
| Comment by Peter Jones [ 20/Sep/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 29/Sep/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36317 |
| Comment by Gerrit Updater [ 21/Nov/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36317/ |