Lustre / LU-12742

Memory corruption in fiemap tests

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.13.0, Lustre 2.12.3
    • None
    • 3

    Description

      We are seeing frequent crashes in sanity test 130a, "FIEMAP (1-stripe file)":

      [ 7916.018181] Lustre: DEBUG MARKER: == sanity test 130a: FIEMAP (1-stripe file) ========================================================== 10:25:23 (1567851923)
      [ 7916.504757] general protection fault: 0000 [#1] SMP 
      [ 7916.505758] Modules linked in: lnet_selftest(OE) osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr virtio_balloon i2c_piix4 parport_pc parport ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk ata_piix 8139too crct10dif_pclmul crct10dif_common libata
      [ 7916.519533]  crc32c_intel serio_raw 8139cp virtio_pci virtio_ring mii virtio floppy
      [ 7916.520849] CPU: 1 PID: 11304 Comm: ps Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.21.3.el7_lustre.x86_64 #1
      [ 7916.522709] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 7916.523647] task: ffff99ba000b30c0 ti: ffff99ba38e80000 task.ti: ffff99ba38e80000
      [ 7916.524860] RIP: 0010:[<ffffffff9b41c384>]  [<ffffffff9b41c384>] kmem_cache_alloc+0x74/0x1f0
      [ 7916.526305] RSP: 0018:ffff99ba38e83d10  EFLAGS: 00010282
      [ 7916.527168] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000003aa1ad7
      [ 7916.528329] RDX: 0000000003aa1ad6 RSI: 00000000000080d0 RDI: ffff99ba3d001700
      [ 7916.529483] RBP: ffff99ba38e83d40 R08: 000000000001f120 R09: ffffffff9b443ebc
      [ 7916.530640] R10: 8080808080808080 R11: 0000000000000000 R12: b999ffff006a1dfb
      [ 7916.531786] R13: 00000000000080d0 R14: ffff99ba3d001700 R15: ffff99ba3d001700
      [ 7916.532942] FS:  00007f63db619880(0000) GS:ffff99ba3fd00000(0000) knlGS:0000000000000000
      [ 7916.534246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7916.535175] CR2: 00007f63dab50060 CR3: 0000000005c3c000 CR4: 00000000000606e0
      [ 7916.536343] Call Trace:
      [ 7916.536790]  [<ffffffff9b443ebc>] ? get_empty_filp+0x5c/0x1a0
      [ 7916.537734]  [<ffffffff9b443ebc>] get_empty_filp+0x5c/0x1a0
      [ 7916.538656]  [<ffffffff9b452b2d>] path_openat+0x4d/0x640
      [ 7916.539523]  [<ffffffff9b454492>] ? user_path_at_empty+0x72/0xc0
      [ 7916.540509]  [<ffffffff9b43e82a>] ? __check_object_size+0x1ca/0x250
      [ 7916.541528]  [<ffffffff9b4545bd>] do_filp_open+0x4d/0xb0
      [ 7916.542403]  [<ffffffff9b461d94>] ? __alloc_fd+0xc4/0x170
      [ 7916.543287]  [<ffffffff9b440717>] do_sys_open+0x137/0x240
      [ 7916.544190]  [<ffffffff9b975d15>] ? system_call_after_swapgs+0xa2/0x146
      [ 7916.545267]  [<ffffffff9b44083e>] SyS_open+0x1e/0x20
      [ 7916.546081]  [<ffffffff9b975ddb>] system_call_fastpath+0x22/0x27
      [ 7916.547064]  [<ffffffff9b975d21>] ? system_call_after_swapgs+0xae/0x146
      [ 7916.548135] Code: 4e bf 64 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 28 01 00 00 48 85 c0 0f 84 1f 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 ba 49 63 
      [ 7916.553321] RIP  [<ffffffff9b41c384>] kmem_cache_alloc+0x74/0x1f0
      

      and in sanityn test 71b, "check fiemap support for stripecount > 1":

      [14234.469060] ------------[ cut here ]------------
      [14234.470412] WARNING: CPU: 0 PID: 27442 at lib/list_debug.c:62 __list_del_entry+0x82/0xd0
      [14234.471883] list_del corruption. next->prev should be ffff9d11473f60d8, but was 119dffffd8603f47
      [14234.473354] Modules linked in: dm_flakey osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev parport_pc virtio_balloon pcspkr parport i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk crct10dif_pclmul crct10dif_common ata_piix crc32c_intel 8139too libata serio_raw
      [14234.488294]  virtio_pci virtio_ring virtio 8139cp mii floppy [last unloaded: dm_flakey]
      [14234.489742] CPU: 0 PID: 27442 Comm: ll_ost00_035 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.21.3.el7_lustre.x86_64 #1
      [14234.491796] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [14234.492781] Call Trace:
      [14234.493287]  [<ffffffffa2763107>] dump_stack+0x19/0x1b
      [14234.494176]  [<ffffffffa2097768>] __warn+0xd8/0x100
      [14234.495023]  [<ffffffffa20977ef>] warn_slowpath_fmt+0x5f/0x80
      [14234.496241]  [<ffffffffc09131a1>] ? keys_fini+0xb1/0x1d0 [obdclass]
      [14234.497345]  [<ffffffffa23956d2>] __list_del_entry+0x82/0xd0
      [14234.498634]  [<ffffffffc0bf4b74>] ptlrpc_at_remove_timed+0x34/0xb0 [ptlrpc]
      [14234.499857]  [<ffffffffc0bf9212>] ptlrpc_server_drop_request+0x422/0x6d0 [ptlrpc]
      [14234.501179]  [<ffffffffc0c2bf94>] ? nrs_resource_put_safe+0x94/0xe0 [ptlrpc]
      [14234.502388]  [<ffffffffc0bf9552>] ptlrpc_server_finish_active_request+0x92/0x140 [ptlrpc]
      [14234.503811]  [<ffffffffc0bfb7c1>] ptlrpc_server_handle_request+0x401/0xab0 [ptlrpc]
      [14234.505123]  [<ffffffffc0bfef6c>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
      [14234.506214]  [<ffffffffa20d09f0>] ? finish_task_switch+0x50/0x1c0
      [14234.507257]  [<ffffffffa2768a7a>] ? __schedule+0x42a/0x860
      [14234.508232]  [<ffffffffc0bfe440>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [14234.509484]  [<ffffffffa20c1da1>] kthread+0xd1/0xe0
      [14234.510341]  [<ffffffffa20c1cd0>] ? insert_kthread_work+0x40/0x40
      [14234.511384]  [<ffffffffa2775c37>] ret_from_fork_nospec_begin+0x21/0x21
      [14234.512482]  [<ffffffffa20c1cd0>] ? insert_kthread_work+0x40/0x40
      [14234.513518] ---[ end trace 6a3a8db2f1509b8b ]---
      

      Other obvious corruption was also observed in sanityn.

      Both are fairly clear memory corruption issues.
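
      One pattern in the logged values supports the corruption reading. The check below is a minimal sketch based only on the numbers in the two traces above; the helper name swap32_halves is hypothetical. It tests whether the bad next->prev value from the 71b warning is the expected pointer with each 32-bit half byte-swapped, and applies the same transform to R12 from the 130a fault. Why such a swap would occur is not established in this report, so treat any link to a byte-order conversion routine as an assumption.

```python
import struct


def swap32_halves(value: int) -> int:
    """Byte-swap each 32-bit half of a 64-bit value independently."""
    def swap32(word: int) -> int:
        # Pack as big-endian, reinterpret as little-endian.
        return struct.unpack("<I", struct.pack(">I", word))[0]

    high, low = value >> 32, value & 0xFFFFFFFF
    return (swap32(high) << 32) | swap32(low)


# Values copied from the list_del warning in sanityn 71b.
expected = 0xFFFF9D11473F60D8
observed = 0x119DFFFFD8603F47
matches = swap32_halves(expected) == observed

# R12 from the general protection fault in sanity 130a.
r12_unswapped = swap32_halves(0xB999FFFF006A1DFB)

print(matches, hex(r12_unswapped))
```

      If the check holds, the un-swapped R12 value falls in the same ffff99b9/ffff99ba range as the other kernel pointers in that register dump, which would be consistent with pointers being overwritten in place rather than with random data.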

      The first sanity 130a failure was seen at least as far back as Aug 21st, 2018 in full master testing, and the second in review testing on Aug 22nd, so this is probably not a coincidence. It only happens on maloo, not in my own testing.

      Both have githash 1ca1da7 as the common parent, so the list of possible patches is:

      1ca1da79a9 LU-10686 tests: stop running sanity-pfl test 9
      ac40000d4b LU-11200 libcfs: handle DECLARE_TIMER reduced to two arguments
      511ea5850f LU-11014 mdc: remove obsolete intent opcodes
      2103f01616 LU-8066 lod: migrate from proc to sysfs
      2506fa2a42 LU-11121 mdt: take discard lock at cleanup stage
      5874da0b67 LU-11175 osc: serialize access to idle_timeout vs cleanup
      6d472fecd5 LU-6142 obdclass: Fix style issues for acl.c
      2acbc62d97 LU-6142 osd-ldiskfs: Fix style issues for osd_iam_lfix.c
      771ae2cdd7 LU-11116 llog: error handling cleanup
      82fe90a1d0 LU-11224 obd: use correct ip_compute_csum() version
      8f37d64b6b LU-9325 ptlrpc: replace simple_strtol with kstrtol
      

      But given the frequency, it's possible that something earlier was actually the cause.

      The first sanityn 71b crash of this type was also observed on Aug 21st, 2018, at githash 1ca1da7.

      It should be noted that in all crashes, the 71b crash is always preceded by the 71a failure "sanityn test_71a: @@@@@@ FAIL: data is not flushed from client".

      It's not yet clear whether the 71a failure is what leads to the 71b crash, though.

            People

              green Oleg Drokin