[LU-12707] mds-survey test 1 crashes with a NULL pointer dereference in echo_object_free() Created: 27/Aug/19  Updated: 10/Feb/20  Resolved: 20/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12034 env allocation in ptlrpc_set_wait() c... Resolved
is related to LU-12578 obdfilter-survey w/echo-client does n... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

mds-survey test_1 started crashing with a NULL pointer dereference in echo_object_free() on 16 August 2019.

Looking at the kernel-crash log, we see

[37813.223580] Lustre: DEBUG MARKER: == mds-survey test 1: Metadata survey with zero-stripe =============================================== 15:50:41 (1565970641)
[37813.430124] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[37814.598393] Lustre: Echo OBD driver; http://www.lustre.org/
[37820.151690] LustreError: 14608:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0000-tests: rc = -2
[37820.152651] LustreError: 14608:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2
[37820.674040] LustreError: 14674:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0000-tests2: rc = -2
[37820.675015] LustreError: 14674:0:(echo_client.c:1821:echo_md_lookup()) Skipped 1 previous similar message
[37820.675915] LustreError: 14674:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0000-tests2: rc = -2
[37820.676936] LustreError: 14674:0:(echo_client.c:2055:echo_md_destroy_internal()) Skipped 1 previous similar message
[37822.418041] LustreError: 14873:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0002-tests: rc = -2
[37822.419018] LustreError: 14873:0:(echo_client.c:1821:echo_md_lookup()) Skipped 1 previous similar message
[37822.419931] LustreError: 14873:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0002-tests: rc = -2
[37822.421048] LustreError: 14873:0:(echo_client.c:2055:echo_md_destroy_internal()) Skipped 1 previous similar message
[39010.965720] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[39010.966793] IP: [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho]
[39010.967550] PGD 8000000016975067 PUD 16976067 PMD 0 
[39010.968086] Oops: 0000 [#1] SMP 
[39010.968876] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon parport_pc parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk ata_piix libata crct10dif_pclmul crct10dif_common
[39010.976913]  crc32c_intel 8139too serio_raw virtio_pci 8139cp virtio_ring virtio mii floppy
[39010.977773] CPU: 0 PID: 16251 Comm: lctl Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.21.3.el7_lustre.x86_64 #1
[39010.978869] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[39010.979407] task: ffff88d947864100 ti: ffff88d936efc000 task.ti: ffff88d936efc000
[39010.980101] RIP: 0010:[<ffffffffc102ad18>]  [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho]
[39010.980993] RSP: 0018:ffff88d936effb18  EFLAGS: 00010246
[39010.981484] RAX: 0000000000000000 RBX: ffff88d93aaa4cc0 RCX: dead000000000200
[39010.982152] RDX: ffff88d936effb40 RSI: ffff88d93aaa4cc0 RDI: ffff88d917df92d8
[39010.982815] RBP: ffff88d936effb30 R08: ffff88d93aaa4cd8 R09: 0000000000000010
[39010.983481] R10: 0000000000000223 R11: 000000000007ffff R12: ffff88d93aaa4cd8
[39010.984149] R13: ffff88d917df92d8 R14: ffff88d936effb40 R15: ffffa34d86641028
[39010.984810] FS:  00007fd52a7ec740(0000) GS:ffff88d97fc00000(0000) knlGS:0000000000000000
[39010.985566] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39010.986113] CR2: 0000000000000040 CR3: 0000000044ab4000 CR4: 00000000000606f0
[39010.986779] Call Trace:
[39010.987156]  [<ffffffffc0891592>] lu_object_free.isra.32+0xf2/0x170 [obdclass]
[39010.987857]  [<ffffffffc0895756>] lu_object_find_at+0x5f6/0xa60 [obdclass]
[39010.988527]  [<ffffffffa9db91de>] ? filemap_fault+0x17e/0x490
[39010.989080]  [<ffffffffc1034012>] echo_md_handler.isra.43+0x362/0x2c60 [obdecho]
[39010.989769]  [<ffffffffc10377a8>] echo_client_iocontrol+0xe98/0x1860 [obdecho]
[39010.990894]  [<ffffffffc085cbd0>] ? obd_ioctl_getdata+0x200/0x11b0 [obdclass]
[39010.991583]  [<ffffffffc085f4aa>] class_handle_ioctl+0x192a/0x1e30 [obdclass]
[39010.992268]  [<ffffffffa9dec27d>] ? handle_mm_fault+0x39d/0x9b0
[39010.992836]  [<ffffffffa9ef8fde>] ? security_capable+0x1e/0x20
[39010.993405]  [<ffffffffc085fa25>] obd_class_ioctl+0x75/0x170 [obdclass]
[39010.994038]  [<ffffffffa9e569d0>] do_vfs_ioctl+0x3a0/0x5a0
[39010.994572]  [<ffffffffa9e56c71>] SyS_ioctl+0xa1/0xc0
[39010.995075]  [<ffffffffaa375d15>] ? system_call_after_swapgs+0xa2/0x146
[39010.995695]  [<ffffffffaa375ddb>] system_call_fastpath+0x22/0x27
[39010.996265]  [<ffffffffaa375d21>] ? system_call_after_swapgs+0xae/0x146
[39010.996875] Code: 00 00 00 66 66 66 66 90 55 48 85 f6 48 89 e5 41 55 41 54 53 48 89 f3 0f 85 1d 04 00 00 f6 05 c7 22 65 ff 01 48 8b 83 98 00 00 00 <4c> 8b 60 40 0f 85 8e 01 00 00 8b 83 b8 00 00 00 85 c0 0f 85 c5 
[39011.000093] RIP  [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho]
[39011.000758]  RSP <ffff88d936effb18>
[39011.001102] CR2: 0000000000000040

This test is crashing for ldiskfs, ZFS, DNE and non-DNE environments.

Here are links to logs for several crashes:
https://testing.whamcloud.com/test_sets/b53bc6f2-c07c-11e9-9fc9-52540065bddc
https://testing.whamcloud.com/test_sets/7512ef7a-bfe0-11e9-90ad-52540065bddc
https://testing.whamcloud.com/test_sets/b35271e6-c0e1-11e9-a2b6-52540065bddc
https://testing.whamcloud.com/test_sets/25c95ba8-c15b-11e9-a2b6-52540065bddc



 Comments   
Comment by Andreas Dilger [ 28/Aug/19 ]

The most obvious candidate is patch https://review.whamcloud.com/35700 "LU-12578 obdecho: reuse an cl env cache for obdecho survey", which landed on 2019-08-15.

Comment by Andreas Dilger [ 28/Aug/19 ]

Alex, Alexey, any thoughts on this crash? It appears related to the LU-12578 patch at first glance.

Comment by Alexey Lyashkov [ 28/Aug/19 ]

Andreas, I think it fail don't related to the LU-12578 patch.

looking to the obdecho code I found strange/bug in echo_resolve_path

(gdb) l *(echo_md_handler+0x71c)
0xaccc is in echo_md_handler (lustre/obdecho/echo_client.c:2202).
2197				rc = -EINVAL;
2198				break;
2199			}
2200
2201			child = echo_md_lookup(env, ed, lu2md(ld_parent), lname);
2202			lu_object_put(env, parent);
2203			if (IS_ERR(child)) {
2204				rc = (int)PTR_ERR(child);
2205				CERROR("lookup %s under parent "DFID": rc = %d\n",
2206					lname->ln_name, PFID(lu_object_fid(ld_parent)),
(gdb)

it looks CERROR function tries to make access to the pointer after last put which called before 'IF'.

I have no idea who resolve path can't found a work dir as it run short after lu_env_refill() so it have guarantee lu_env filled correctly.

can someone change a lu_object_put place and check how it helps for Mallo ?

Comment by Alexey Lyashkov [ 28/Aug/19 ]

I able to replicate it bug. Will provide some info short later, based on my logs - lu_object_find_at start to free an object where it don't have 'init' passed and it caused a null pointer deref.
good case

00000020:00000001:0.0:1566998598.740494:0:20147:0:(lu_object.c:735:lu_object_find_at()) Process entered
00008000:00000001:0.0:1566998598.740495:0:20147:0:(echo_client.c:580:echo_object_alloc()) Process entered
00008000:00000010:0.0:1566998598.740495:0:20147:0:(echo_client.c:583:echo_object_alloc()) slab-alloced 'eco': 264 at ffff88003c438840.
00000020:00000001:0.0:1566998598.740496:0:20147:0:(cl_object.c:72:cl_object_header_init()) Process entered
00000020:00000001:0.0:1566998598.740498:0:20147:0:(cl_object.c:79:cl_object_header_init()) Process leaving (rc=0 : 0 : 0)
00008000:00000001:0.0:1566998598.740499:0:20147:0:(echo_client.c:597:echo_object_alloc()) Process leaving (rc=18446612133325277248 : -131940384274368 : ffff88003c438840)
00008000:00000001:0.0:1566998598.740500:0:20147:0:(echo_client.c:487:echo_object_init()) Process entered
00000004:00000010:0.0:1566998598.740501:0:20147:0:(mdd_object.c:282:mdd_object_alloc()) slab-alloced 'mdd_obj': 96 at ffff88003b6e5e40.
00008000:00000001:0.0:1566998598.740502:0:20147:0:(echo_client.c:524:echo_object_init()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:0.0:1566998598.740503:0:20147:0:(mdd_object.c:304:mdd_object_init()) Process entered

bad case

00000020:00000001:0.0:1566998604.027293:0:20164:0:(lu_object.c:735:lu_object_find_at()) Process entered
00008000:00000001:0.0:1566998604.027294:0:20164:0:(echo_client.c:580:echo_object_alloc()) Process entered
00008000:00000010:0.0:1566998604.027295:0:20164:0:(echo_client.c:583:echo_object_alloc()) slab-alloced 'eco': 264 at ffff88003b61a108.
00000020:00000001:0.0:1566998604.027296:0:20164:0:(cl_object.c:72:cl_object_header_init()) Process entered
00000020:00000001:0.0:1566998604.027298:0:20164:0:(cl_object.c:79:cl_object_header_init()) Process leaving (rc=0 : 0 : 0)
00008000:00000001:0.0:1566998604.027299:0:20164:0:(echo_client.c:597:echo_object_alloc()) Process leaving (rc=18446612133310472456 : -131940399079160 : ffff88003b61a108)

>>>>> next line is panic in lu_object_free

Second question - why "echo_md_lookup()) lookup MDT0000-tests: rc = -2" hit, don't invested now.

Comment by Alexey Lyashkov [ 28/Aug/19 ]

provided patch allow md-survey to be finished fine.

diff --git a/lustre/obdecho/echo_client.c b/lustre/obdecho/echo_client.c
index 11191fb055..1970d17bc5 100644
--- a/lustre/obdecho/echo_client.c
+++ b/lustre/obdecho/echo_client.c
@@ -527,14 +527,18 @@ static int echo_object_init(const struct lu_env *env, struct lu_object *obj,
 static void echo_object_free(const struct lu_env *env, struct lu_object *obj)
 {
        struct echo_object *eco    = cl2echo_obj(lu2cl(obj));
-       struct echo_client_obd *ec = eco->eo_dev->ed_ec;
+       struct echo_client_obd *ec;

        ENTRY;
-       LASSERT(atomic_read(&eco->eo_npages) == 0);
+       if (eco->eo_dev != NULL) {
+               ec = eco->eo_dev->ed_ec;

-       spin_lock(&ec->ec_lock);
-       list_del_init(&eco->eo_obj_chain);
-       spin_unlock(&ec->ec_lock);
+               LASSERT(atomic_read(&eco->eo_npages) == 0);
+
+               spin_lock(&ec->ec_lock);
+               list_del_init(&eco->eo_obj_chain);
+               spin_unlock(&ec->ec_lock);
+       }

        lu_object_fini(obj);
        lu_object_header_fini(obj->lo_header);
Comment by Gerrit Updater [ 28/Aug/19 ]

Alexey Lyashkov (c17817@cray.com) uploaded a new patch: https://review.whamcloud.com/35950
Subject: LU-12707 obdecho: avoid panic with partially object init
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dedb59b36b7e243dd1a27e112838de1f67412a10

Comment by Gerrit Updater [ 20/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35950/
Subject: LU-12707 obdecho: avoid panic with partially object init
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1a9ca8417c60f04a9aa719b7254372e2d18a6b0a

Comment by Peter Jones [ 20/Sep/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 29/Sep/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36317
Subject: LU-12707 obdecho: avoid panic with partially object init
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a4284f3cd19e8b4f5f52df923869513209e32eb6

Comment by Gerrit Updater [ 21/Nov/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36317/
Subject: LU-12707 obdecho: avoid panic with partially object init
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 362e9365699abf08d17044c3c9dd60acfb332a9a

Generated at Sat Feb 10 02:54:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.