Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.13.0, Lustre 2.12.3
-
None
-
3
Description
mds-survey test_1 started crashing with a NULL pointer dereference in echo_object_free() on 16 August 2019.
The kernel crash log shows:
[37813.223580] Lustre: DEBUG MARKER: == mds-survey test 1: Metadata survey with zero-stripe =============================================== 15:50:41 (1565970641)
[37813.430124] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[37814.598393] Lustre: Echo OBD driver; http://www.lustre.org/
[37820.151690] LustreError: 14608:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0000-tests: rc = -2
[37820.152651] LustreError: 14608:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2
[37820.674040] LustreError: 14674:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0000-tests2: rc = -2
[37820.675015] LustreError: 14674:0:(echo_client.c:1821:echo_md_lookup()) Skipped 1 previous similar message
[37820.675915] LustreError: 14674:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0000-tests2: rc = -2
[37820.676936] LustreError: 14674:0:(echo_client.c:2055:echo_md_destroy_internal()) Skipped 1 previous similar message
[37822.418041] LustreError: 14873:0:(echo_client.c:1821:echo_md_lookup()) lookup MDT0002-tests: rc = -2
[37822.419018] LustreError: 14873:0:(echo_client.c:1821:echo_md_lookup()) Skipped 1 previous similar message
[37822.419931] LustreError: 14873:0:(echo_client.c:2055:echo_md_destroy_internal()) Can't find child MDT0002-tests: rc = -2
[37822.421048] LustreError: 14873:0:(echo_client.c:2055:echo_md_destroy_internal()) Skipped 1 previous similar message
[39010.965720] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[39010.966793] IP: [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho]
[39010.967550] PGD 8000000016975067 PUD 16976067 PMD 0
[39010.968086] Oops: 0000 [#1] SMP
[39010.968876] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon parport_pc parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk ata_piix libata crct10dif_pclmul crct10dif_common
[39010.976913] crc32c_intel 8139too serio_raw virtio_pci 8139cp virtio_ring virtio mii floppy
[39010.977773] CPU: 0 PID: 16251 Comm: lctl Kdump: loaded Tainted: G OE ------------ 3.10.0-957.21.3.el7_lustre.x86_64 #1
[39010.978869] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[39010.979407] task: ffff88d947864100 ti: ffff88d936efc000 task.ti: ffff88d936efc000
[39010.980101] RIP: 0010:[<ffffffffc102ad18>] [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho]
[39010.980993] RSP: 0018:ffff88d936effb18 EFLAGS: 00010246
[39010.981484] RAX: 0000000000000000 RBX: ffff88d93aaa4cc0 RCX: dead000000000200
[39010.982152] RDX: ffff88d936effb40 RSI: ffff88d93aaa4cc0 RDI: ffff88d917df92d8
[39010.982815] RBP: ffff88d936effb30 R08: ffff88d93aaa4cd8 R09: 0000000000000010
[39010.983481] R10: 0000000000000223 R11: 000000000007ffff R12: ffff88d93aaa4cd8
[39010.984149] R13: ffff88d917df92d8 R14: ffff88d936effb40 R15: ffffa34d86641028
[39010.984810] FS: 00007fd52a7ec740(0000) GS:ffff88d97fc00000(0000) knlGS:0000000000000000
[39010.985566] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39010.986113] CR2: 0000000000000040 CR3: 0000000044ab4000 CR4: 00000000000606f0
[39010.986779] Call Trace:
[39010.987156] [<ffffffffc0891592>] lu_object_free.isra.32+0xf2/0x170 [obdclass]
[39010.987857] [<ffffffffc0895756>] lu_object_find_at+0x5f6/0xa60 [obdclass]
[39010.988527] [<ffffffffa9db91de>] ? filemap_fault+0x17e/0x490
[39010.989080] [<ffffffffc1034012>] echo_md_handler.isra.43+0x362/0x2c60 [obdecho]
[39010.989769] [<ffffffffc10377a8>] echo_client_iocontrol+0xe98/0x1860 [obdecho]
[39010.990894] [<ffffffffc085cbd0>] ? obd_ioctl_getdata+0x200/0x11b0 [obdclass]
[39010.991583] [<ffffffffc085f4aa>] class_handle_ioctl+0x192a/0x1e30 [obdclass]
[39010.992268] [<ffffffffa9dec27d>] ? handle_mm_fault+0x39d/0x9b0
[39010.992836] [<ffffffffa9ef8fde>] ? security_capable+0x1e/0x20
[39010.993405] [<ffffffffc085fa25>] obd_class_ioctl+0x75/0x170 [obdclass]
[39010.994038] [<ffffffffa9e569d0>] do_vfs_ioctl+0x3a0/0x5a0
[39010.994572] [<ffffffffa9e56c71>] SyS_ioctl+0xa1/0xc0
[39010.995075] [<ffffffffaa375d15>] ? system_call_after_swapgs+0xa2/0x146
[39010.995695] [<ffffffffaa375ddb>] system_call_fastpath+0x22/0x27
[39010.996265] [<ffffffffaa375d21>] ? system_call_after_swapgs+0xae/0x146
[39010.996875] Code: 00 00 00 66 66 66 66 90 55 48 85 f6 48 89 e5 41 55 41 54 53 48 89 f3 0f 85 1d 04 00 00 f6 05 c7 22 65 ff 01 48 8b 83 98 00 00 00 <4c> 8b 60 40 0f 85 8e 01 00 00 8b 83 b8 00 00 00 85 c0 0f 85 c5
[39011.000093] RIP [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho]
[39011.000758] RSP <ffff88d936effb18>
[39011.001102] CR2: 0000000000000040
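For triage, the key facts in the oops above can be pulled out mechanically. The following is a minimal sketch, not part of the original report: `summarize_oops` and the three regular expressions are hypothetical helpers, and the embedded text is an abbreviated excerpt of the log quoted above. It extracts the faulting address, the reliable stack frames (skipping the `?` entries, which the kernel marks as uncertain), and any general-purpose registers that were zero at the time of the crash.

```python
import re

# Abbreviated excerpt of the oops quoted above, one log entry per line.
OOPS = """\
[39010.965720] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[39010.966793] IP: [<ffffffffc102ad18>] echo_object_free+0x28/0x490 [obdecho]
[39010.981484] RAX: 0000000000000000 RBX: ffff88d93aaa4cc0 RCX: dead000000000200
[39010.987156] [<ffffffffc0891592>] lu_object_free.isra.32+0xf2/0x170 [obdclass]
[39010.987857] [<ffffffffc0895756>] lu_object_find_at+0x5f6/0xa60 [obdclass]
[39010.988527] [<ffffffffa9db91de>] ? filemap_fault+0x17e/0x490
[39010.989080] [<ffffffffc1034012>] echo_md_handler.isra.43+0x362/0x2c60 [obdecho]
"""

# Hypothetical patterns written for this RHEL 7 style oops layout.
FAULT_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]+)")
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}): ([0-9a-f]{16})")


def summarize_oops(text):
    """Return the fault address, reliable frames, and zero-valued registers."""
    fault = FAULT_RE.search(text)
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m and not m.group(1):  # drop '?' frames (unreliable stack entries)
            frames.append((m.group(2), m.group(3)))
    regs = {name: int(value, 16) for name, value in REG_RE.findall(text)}
    return {
        "fault_addr": int(fault.group(1), 16) if fault else None,
        "frames": frames,
        "null_regs": sorted(name for name, value in regs.items() if value == 0),
    }


if __name__ == "__main__":
    summary = summarize_oops(OOPS)
    print(hex(summary["fault_addr"]), summary["null_regs"])
    for symbol, module in summary["frames"]:
        print(symbol, module)
```

On this excerpt the sketch reports a fault address of 0x40 with RAX equal to zero, which is consistent with echo_object_free() reading a field at offset 0x40 through a NULL pointer while lu_object_find_at() was freeing the object.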
The test crashes in ldiskfs, ZFS, DNE, and non-DNE environments.
Here are links to logs for several crashes:
https://testing.whamcloud.com/test_sets/b53bc6f2-c07c-11e9-9fc9-52540065bddc
https://testing.whamcloud.com/test_sets/7512ef7a-bfe0-11e9-90ad-52540065bddc
https://testing.whamcloud.com/test_sets/b35271e6-c0e1-11e9-a2b6-52540065bddc
https://testing.whamcloud.com/test_sets/25c95ba8-c15b-11e9-a2b6-52540065bddc