
mdt_reint_unlink->lu_object_put() crash

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.0

    Description

      I have been seeing these crashes in my racer testing for some time now:

      [48792.659356] BUG: unable to handle kernel paging request at ffff88008278be60
      [48792.659356] IP: [<ffffffffa034f110>] lu_object_put+0x270/0x3c0 [obdclass]
      [48792.659356] PGD 23e3067 PUD 33fa01067 PMD 33f9ed067 PTE 800000008278b060
      [48792.659356] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [48792.659356] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) dm_flakey dm_mod loop zfs(PO) zunicode(PO) zlua(PO) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) jbd2 mbcache crc_t10dif crct10dif_generic crct10dif_common ata_generic ttm pata_acpi drm_kms_helper i2c_piix4 ata_piix drm virtio_balloon pcspkr serio_raw virtio_console virtio_blk i2c_core libata floppy ip_tables rpcsec_gss_krb5 [last unloaded: libcfs]
      [48792.686829] CPU: 1 PID: 21888 Comm: mdt00_002 Kdump: loaded Tainted: P           OE  ------------   3.10.0-7.5-debug #1
      [48792.686829] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [48792.686829] task: ffff88009d644c80 ti: ffff8800b93ac000 task.ti: ffff8800b93ac000
      [48792.686829] RIP: 0010:[<ffffffffa034f110>]  [<ffffffffa034f110>] lu_object_put+0x270/0x3c0 [obdclass]
      [48792.686829] RSP: 0018:ffff8800b93afb38  EFLAGS: 00010246
      [48792.686829] RAX: 0000000000000000 RBX: ffff88030ef74160 RCX: 0000000000000002
      [48792.686829] RDX: 0000000000000002 RSI: ffffc90007768000 RDI: ffff88008278be68
      [48792.686829] RBP: ffff8800b93afb88 R08: 00000000000000cc R09: 000000000000004f
      [48792.686829] R10: 0000000000000b01 R11: 00000000003fffff R12: ffff880291d79600
      [48792.686829] R13: ffff88008278bea0 R14: ffff88008278be50 R15: ffffc900077a8028
      [48792.686829] FS:  0000000000000000(0000) GS:ffff88033da40000(0000) knlGS:0000000000000000
      [48792.686829] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [48792.686829] CR2: ffff88008278be60 CR3: 000000024c172000 CR4: 00000000000006e0
      [48792.686829] Call Trace:
      [48792.686829]  [<ffffffffa0cbbb13>] mdt_reint_unlink+0x7c3/0x1410 [mdt]
      [48792.686829]  [<ffffffffa0cbfc10>] mdt_reint_rec+0x80/0x210 [mdt]
      [48792.686829]  [<ffffffffa0c9f6ab>] mdt_reint_internal+0x5fb/0x990 [mdt]
      [48792.686829]  [<ffffffffa0caa4a7>] mdt_reint+0x67/0x140 [mdt]
      [48792.686829]  [<ffffffffa05eca55>] tgt_request_handle+0xaf5/0x1590 [ptlrpc]
      [48792.686829]  [<ffffffffa01eaf97>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [48792.686829]  [<ffffffffa0590eb6>] ptlrpc_server_handle_request+0x256/0xad0 [ptlrpc]
      [48792.686829]  [<ffffffff810b9398>] ? __wake_up_common+0x58/0x90
      [48792.686829]  [<ffffffff813ccd2b>] ? do_raw_spin_unlock+0x4b/0x90
      [48792.686829]  [<ffffffffa0594cae>] ptlrpc_main+0xabe/0x1f80 [ptlrpc]
      [48792.686829]  [<ffffffffa05941f0>] ? ptlrpc_register_service+0xeb0/0xeb0 [ptlrpc]
      [48792.686829]  [<ffffffff810ae864>] kthread+0xe4/0xf0
      [48792.686829]  [<ffffffff810ae780>] ? kthread_create_on_node+0x140/0x140
      [48792.686829]  [<ffffffff81783777>] ret_from_fork_nospec_begin+0x21/0x21
      [48792.686829]  [<ffffffff810ae780>] ? kthread_create_on_node+0x140/0x140
      [48792.686829] Code: a0 31 c0 e8 53 be e9 ff 0f 1f 00 48 8b 03 be 01 00 00 00 48 8b 7d c0 48 8b 40 20 ff 50 18 e9 5a fe ff ff 0f 1f 84 00 00 00 00 00 <49> 8b 46 10 a8 01 0f 84 46 fe ff ff 48 8b 7d b0 31 c9 31 d2 be 
      [48792.686829] RIP  [<ffffffffa034f110>] lu_object_put+0x270/0x3c0 [obdclass]
      [48792.686829]  RSP <ffff8800b93afb38>
      [48792.686829] CR2: ffff88008278be60
      

          Activity

            [LU-11204] mdt_reint_unlink->lu_object_put() crash

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36217/
            Subject: LU-11204 obdclass: remove unprotected access to lu_object
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: e548e31f3feac2831868fe01cc75bf111cf8f501

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36217
            Subject: LU-11204 obdclass: remove unprotected access to lu_object
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 7db34b95768d7f2df3aa110275ea26d345431852

            pjones Peter Jones added a comment -

            ok. Should we consider this fix for b2_12?

            That was an alternative approach; I have abandoned it.

            tappro Mikhail Pershin added a comment
            pjones Peter Jones added a comment -

            So what's the verdict from https://review.whamcloud.com/#/c/34961/ ? Is further work needed or can this ticket be marked as RESOLVED?

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34960/
            Subject: LU-11204 obdclass: remove unprotected access to lu_object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 336cf0f2f3a9ce5b11a34aeaeec062a5d5144213

            I've pushed two patches. The first is a simple fix that prevents the use-after-free access by using a local variable. The second is a fortestonly patch to check whether cl_object_put_last() is still needed. At a quick glance, the conditions described in bz22520 no longer exist in the current code, so the whole bz22520 fix might not be needed.

            tappro Mikhail Pershin added a comment

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34961
            Subject: LU-11204 obdclass: remove unprotected access to lu_object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7992d3d38e148f0f9c60a750ba5355413e8b1407

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34960
            Subject: LU-11204 obdclass: remove unprotected access to lu_object
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 98c1b95c49b79509c7d31f2cdebdc46eda54a8b4

            The cause is that top is accessed after the atomic_dec_and_lock() call. At that point the thread has dropped its own reference to top, so top is no longer protected and can be freed by another thread. The issue is seen mostly on onyx-68, with many virtual machines running on the same node.
            The solution can be as simple as reading the lu_object_is_dying() value before the loh_ref decrement. Moreover, I am not sure we need this whole block of code at all:

            		if (lu_object_is_dying(top)) {
            			/*
            			 * somebody may be waiting for this, currently only
            			 * used for cl_object, see cl_object_put_last().
            			 */
            			wake_up_all(&bkt->lsb_marche_funebre);
            		}
            

            This block comes from bz22520 (https://bugzilla.lustre.org/show_bug.cgi?id=22520). It is worth reviewing how things work now and whether that wake_up() in lu_object_put() is really needed on every put.

            tappro Mikhail Pershin added a comment
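
            The race and the local-variable fix described in the comment above can be illustrated with a minimal userspace sketch. This is not the Lustre code: struct obj_header, obj_new(), obj_get(), obj_mark_dying(), obj_put() and the wakeups counter are hypothetical stand-ins, C11 atomics replace the kernel primitives, and the hash bucket lock taken by the real decrement is omitted. The only point shown is the ordering: the "dying" state is sampled while the caller's own reference still pins the object, and nothing is read through the pointer after the decrement unless this call dropped the last reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object header; NOT the Lustre structure. */
struct obj_header {
	atomic_int  ref;    /* plays the role of loh_ref */
	atomic_bool dying;  /* plays the role of the "dying" state */
};

/* Counts simulated wake_up_all() calls so the behaviour can be observed. */
static atomic_int wakeups;

static struct obj_header *obj_new(void)
{
	struct obj_header *top = malloc(sizeof(*top));

	assert(top != NULL);
	atomic_init(&top->ref, 1);       /* the creator holds one reference */
	atomic_init(&top->dying, false);
	return top;
}

static void obj_get(struct obj_header *top)
{
	atomic_fetch_add(&top->ref, 1);
}

static void obj_mark_dying(struct obj_header *top)
{
	atomic_store(&top->dying, true);
}

/*
 * Drop one reference. Returns true if this call dropped the last
 * reference and freed the object, false otherwise.
 */
static bool obj_put(struct obj_header *top)
{
	/* Sample the flag while our own reference still pins the object. */
	bool is_dying = atomic_load(&top->dying);

	if (atomic_fetch_sub(&top->ref, 1) != 1) {
		/*
		 * Not the last reference. Another thread may free *top at
		 * any moment from here on, so only the local copy is used.
		 */
		if (is_dying)
			atomic_fetch_add(&wakeups, 1); /* stands in for wake_up_all() */
		return false;
	}

	/* Last reference: no other holder exists, so freeing is safe. */
	free(top);
	return true;
}
```

            With two references held and the object marked as dying, the first obj_put() returns false and records one wakeup without touching the object after the decrement; the second obj_put() frees it. The buggy ordering would instead read top->dying inside the "not the last reference" branch, after the decrement, which is the kind of access that faulted in the trace above.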

            People

              Assignee: tappro Mikhail Pershin
              Reporter: green Oleg Drokin