  1. Lustre
  2. LU-14024

ofd_inconsistency_verification_main use after free on shutdown.


      It seems the LU-12564 patch is exposing a weakness in ofd_inconsistency_verification_main:

              thread_set_flags(thread, SVC_STOPPED);
              wake_up_all(&thread->t_ctl_waitq);
              spin_unlock(&ofd->ofd_inconsistency_lock);
              lu_env_fini(&env);
      

      The spin_unlock then proceeds to crash on unmapped memory:

      [405815.935072] BUG: unable to handle kernel paging request at ffff8802d78127f4
      [405815.937427] IP: [<ffffffff8140a0e5>] do_raw_spin_unlock+0x5/0x90
      [405815.953412] PGD 241c067 PUD 33e9f9067 PMD 33e93c067 PTE 80000002d7812063
      [405815.955679] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [405815.957829] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod pcc_cpufreq loop zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic sb_edac edac_core iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_piix4 virtio_console virtio_balloon pcspkr ip_tables rpcsec_gss_krb5 ata_generic pata_acpi drm_kms_helper ttm drm ata_piix crct10dif_pclmul drm_panel_orientation_quirks crct10dif_common virtio_blk crc32c_intel libata serio_raw i2c_core floppy [last unloaded: libcfs]
      [405816.028386] 
      [405816.030183] CPU: 4 PID: 4908 Comm: inconsistency_v Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.7-debug #1
      [405816.048472] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [405816.050687] task: ffff8802dc486d00 ti: ffff8802ca4f8000 task.ti: ffff8802ca4f8000
      [405816.139729] RIP: 0010:[<ffffffff8140a0e5>]  [<ffffffff8140a0e5>] do_raw_spin_unlock+0x5/0x90
      [405816.154191] RSP: 0018:ffff8802ca4fbd60  EFLAGS: 00010292
      [405816.156260] RAX: 0000000000000000 RBX: ffff8802d78127e0 RCX: dead000000000200
      [405816.166949] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffff8802d78127f0
      [405816.171118] RBP: ffff8802ca4fbd68 R08: ffff8800ab47bb48 R09: ffffffff8221eb80
      [405816.175495] R10: 0000000000000000 R11: 0000000000000400 R12: ffff8802d7812000
      [405816.195126] R13: ffff8802d78127f0 R14: ffff8802dc486d00 R15: ffff88032514b680
      [405816.204188] FS:  0000000000000000(0000) GS:ffff88033db00000(0000) knlGS:0000000000000000
      [405816.209047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [405816.218345] CR2: ffff8802d78127f4 CR3: 0000000001c10000 CR4: 00000000001607e0
      [405816.234013] Call Trace:
      [405816.236025]  [<ffffffff817d662e>] _raw_spin_unlock+0xe/0x20
      [405816.252333]  [<ffffffffa0fd5472>] ofd_inconsistency_verification_main+0xd52/0xde0 [ofd]
      [405816.259324]  [<ffffffff8140a129>] ? do_raw_spin_unlock+0x49/0x90
      [405816.261588]  [<ffffffff810b93f0>] ? wake_up_atomic_t+0x30/0x30
      [405816.263625]  [<ffffffffa0fd4720>] ? ofd_cb_soft_sync+0x240/0x240 [ofd]
      [405816.265897]  [<ffffffff810b8254>] kthread+0xe4/0xf0
      [405816.268022]  [<ffffffff810b8170>] ? kthread_create_on_node+0x140/0x140
      [405816.270246]  [<ffffffff817e0ddd>] ret_from_fork_nospec_begin+0x7/0x21
      [405816.272514]  [<ffffffff810b8170>] ? kthread_create_on_node+0x140/0x140
      

      I am not 100% sure how it unfolds, but at the time of the crash two other CPUs are running vfree from delayed work.

      It almost sounds like the parallel ofd_fini thread does the vfree, which is deferred to delayed work, and that delayed work for some reason gets a better chance to run than both the ofd_fini and the inconsistency threads.

      It seems we really should do that unlock before the wake up call though.
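
      A minimal userspace sketch of the reordering suggested above. It is not the actual patch: the fake_* functions, struct fake_thread, and the SVC_STOPPED value are hypothetical stand-ins for the kernel primitives, and they only record the order in which they are called. It models the call order only and says nothing about whether touching the wait queue after the unlock is itself safe; that depends on the object lifetime, which is still unclear here.

```c
#include <stdio.h>

/* Hypothetical stand-ins: these are NOT the kernel or Lustre primitives.
 * Each stub just records that it was called, so the order can be checked. */
enum event { EV_SET_FLAGS, EV_SPIN_UNLOCK, EV_WAKE_UP_ALL };

#define MAX_EVENTS 8
static enum event events[MAX_EVENTS];
static int nr_events;

static void record(enum event ev)
{
	if (nr_events < MAX_EVENTS)
		events[nr_events++] = ev;
}

#define SVC_STOPPED 1	/* placeholder value, not the Lustre definition */

struct fake_thread {
	int flags;
};

/* Stands in for thread_set_flags(). */
static void fake_thread_set_flags(struct fake_thread *t, int flags)
{
	t->flags = flags;
	record(EV_SET_FLAGS);
}

/* Stands in for spin_unlock(); the "lock" is a plain int here. */
static void fake_spin_unlock(int *lock)
{
	*lock = 0;
	record(EV_SPIN_UNLOCK);
}

/* Stands in for wake_up_all(); no real waiters exist in this model. */
static void fake_wake_up_all(struct fake_thread *t)
{
	(void)t;
	record(EV_WAKE_UP_ALL);
}

/* Tail of the verification thread with the suggested order:
 * set the flag under the lock, drop the lock, and only then wake waiters. */
static void verification_thread_exit(struct fake_thread *thread, int *lock)
{
	fake_thread_set_flags(thread, SVC_STOPPED);
	fake_spin_unlock(lock);
	fake_wake_up_all(thread);
}

/* Print the recorded call order, one event per line. */
static void print_events(void)
{
	static const char *names[] = { "set_flags", "spin_unlock", "wake_up_all" };
	int i;

	for (i = 0; i < nr_events; i++)
		printf("%s\n", names[events[i]]);
}
```

      Calling verification_thread_exit() with a fake_thread and a lock variable set to 1, then print_events(), lists the stubs in the order they ran, with the unlock ahead of the wake-up.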

            green Oleg Drokin