[LU-12675] LBUG "(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1" running sanity.sh on latest master Created: 21/Aug/19  Updated: 25/Jan/22  Resolved: 25/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.7

Type: Bug Priority: Major
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While working on a new patch and testing it against the latest master, I intermittently triggered this LBUG when running various scenarios followed by a umount of MDT0000:

[67989.064599] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
[67989.079942] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) LBUG
[67989.088513] Pid: 40531, comm: umount 3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1 SMP Sun Mar 17 19:01:48 UTC 2019
[67989.101749] Call Trace:
[67989.105484]  [<ffffffffc0c128bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[67989.113782]  [<ffffffffc0c1296c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[67989.121694]  [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
[67989.129918]  [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
[67989.138001]  [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
[67989.145869]  [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
[67989.154220]  [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
[67989.162763]  [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
[67989.171594]  [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
[67989.179637]  [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
[67989.187764]  [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
[67989.196858]  [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
[67989.205747]  [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
[67989.213932]  [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
[67989.221305]  [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
[67989.228800]  [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
[67989.236611]  [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
[67989.244778]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.253713]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.262537]  [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
[67989.270556]  [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
[67989.278689]  [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[67989.286440]  [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
[67989.294188]  [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
[67989.302241]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.311052]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.319823]  [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
[67989.328157]  [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
[67989.335908]  [<ffffffff87c22052>] kill_anon_super+0x12/0x20
[67989.342864]  [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
[67989.351035]  [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
[67989.358797]  [<ffffffff87c22b96>] deactivate_super+0x46/0x60
[67989.365849]  [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
[67989.372412]  [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
[67989.379051]  [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
[67989.385754]  [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
[67989.392745]  [<ffffffff88125ae7>] int_signal+0x12/0x17
[67989.399057]  [<ffffffffffffffff>] 0xffffffffffffffff
[67989.405243] Kernel panic - not syncing: LBUG
[67989.410764] CPU: 45 PID: 40531 Comm: umount Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1
[67989.426896] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0022.062820171903 06/28/2017
[67989.439356] Call Trace:
[67989.442712]  [<ffffffff88113754>] dump_stack+0x19/0x1b
[67989.448985]  [<ffffffff8810d29f>] panic+0xe8/0x21f
[67989.454886]  [<ffffffffc0c129bb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[67989.462416]  [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
[67989.470132]  [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
[67989.477775]  [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
[67989.485215]  [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
[67989.493113]  [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
[67989.501228]  [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
[67989.509668]  [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
[67989.517313]  [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
[67989.524999]  [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
[67989.533721]  [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
[67989.540631]  [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
[67989.549113]  [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
[67989.556905]  [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
[67989.563910]  [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
[67989.571008]  [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
[67989.578501]  [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
[67989.586341]  [<ffffffffc0c18e18>] ? libcfs_debug_msg+0x688/0xab0 [libcfs]
[67989.594519]  [<ffffffffc0f00406>] ? class_name2dev_nolock+0x46/0xb0 [obdclass]
[67989.603112]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.611737]  [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
[67989.618654]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.627146]  [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
[67989.634883]  [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
[67989.642730]  [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[67989.650163]  [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
[67989.657733]  [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
[67989.665624]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.674239]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.682787]  [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
[67989.690893]  [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
[67989.698486]  [<ffffffff87c22052>] kill_anon_super+0x12/0x20
[67989.705257]  [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
[67989.713288]  [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
[67989.720828]  [<ffffffff87c22b96>] deactivate_super+0x46/0x60
[67989.727724]  [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
[67989.734053]  [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
[67989.740663]  [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
[67989.747225]  [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
[67989.754037]  [<ffffffff88125ae7>] int_signal+0x12/0x17

I first thought it could be related to my own patch/code, but after investigating the crash-dump content, I think the problem (a reference leaked on a server object) is caused by a somewhat hidden bug in the Lustre master code:

.................
static int mdt_reint_unlink(struct mdt_thread_info *info,
                            struct mdt_lock_handle *lhc)
{
.................
        /* We will lock the child regardless it is local or remote. No harm. */
        mc = mdt_object_find(info->mti_env, info->mti_mdt, child_fid);
        if (IS_ERR(mc))
                GOTO(unlock_parent, rc = PTR_ERR(mc));

        if (!cos_incompat) {
                rc = mdt_object_striped(info, mc);
                if (rc < 0)
                        GOTO(unlock_parent, rc);  =====> with mc reference set

                cos_incompat = rc;
                if (cos_incompat) {
                        mdt_object_put(info->mti_env, mc);
                        mdt_object_unlock(info, mp, parent_lh, -EAGAIN);
                        goto relock;
                }
        }

        child_lh = &info->mti_lh[MDT_LH_CHILD];
        mdt_lock_reg_init(child_lh, LCK_EX);
        if (info->mti_spec.sp_rm_entry) {
                struct lu_ucred *uc  = mdt_ucred(info);

                if (!mdt_is_dne_client(req->rq_export))
                        /* Return -ENOTSUPP for old client */
                        GOTO(put_child, rc = -ENOTSUPP);
.................
put_child:
        mdt_object_put(info->mti_env, mc); <===== to release reference !!
unlock_parent:
        mdt_object_unlock(info, mp, parent_lh, rc); <====== will not release reference :-(
put_parent:
        mdt_object_put(info->mti_env, mp);
        return rc;
}
.................
(Excerpt from lustre/mdt/mdt_reint.c, near line 1073 of 2840.)

I will push a patch/fix soon.
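
The leak pattern annotated in the excerpt above can be modeled in a small standalone C sketch. This is not the Lustre code and not the merged patch: `struct obj`, `obj_get`, `obj_put`, `striped_check`, `later_step`, `unlink_buggy` and `unlink_fixed` are hypothetical stand-ins, and the assumption is only that the error path should jump to the label that drops the child reference instead of the one below it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a reference-counted server object. */
struct obj {
        int refcount;
};

static struct obj *obj_get(struct obj *o)
{
        o->refcount++;
        return o;
}

static void obj_put(struct obj *o)
{
        assert(o->refcount > 0);        /* never drop below zero */
        o->refcount--;
}

/* Hypothetical check: fails with -EIO when 'fail' is non-zero. */
static int striped_check(int fail)
{
        return fail ? -EIO : 0;
}

/* Hypothetical later step that always succeeds in this model. */
static int later_step(void)
{
        return 0;
}

/* Buggy shape: the early error path skips the label that drops the ref. */
static int unlink_buggy(struct obj *child, int fail)
{
        struct obj *mc = obj_get(child);        /* reference taken */
        int rc;

        rc = striped_check(fail);
        if (rc < 0)
                goto unlock_parent;             /* leaks the reference on mc */

        rc = later_step();
        if (rc < 0)
                goto put_child;
put_child:
        obj_put(mc);
unlock_parent:
        return rc;
}

/* Assumed fix shape: every exit taken after obj_get() goes via put_child. */
static int unlink_fixed(struct obj *child, int fail)
{
        struct obj *mc = obj_get(child);
        int rc;

        rc = striped_check(fail);
        if (rc < 0)
                goto put_child;                 /* reference is released */

        rc = later_step();
put_child:
        obj_put(mc);
        return rc;
}
```

In this model, calling `unlink_buggy()` with a failing check leaves the object's count at 1, while `unlink_fixed()` returns the same error with the count back at 0. A leftover count of this kind is what the `lu_device_fini()` assertion reports at umount time.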



 Comments   
Comment by Gerrit Updater [ 21/Aug/19 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/35845
Subject: LU-12675 mdt: release object reference upon error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: efce731e3ef9471f91c07fdbca9350eeca685923

Comment by Bruno Faccini (Inactive) [ 21/Aug/19 ]

I would like to add Lai as a watcher, but it seems I am not allowed to do this...

Comment by Patrick Farrell (Inactive) [ 21/Aug/19 ]

You aren't, bruno (sry), so I did.

Comment by Bruno Faccini (Inactive) [ 21/Aug/19 ]

Thx Patrick !

Comment by Alex Zhuravlev [ 26/Aug/19 ]

With this patch, racer is doing much, much better, but it eventually still hits:

LustreError: 28408:0:(lu_object.c:1265:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 2
LustreError: 28408:0:(lu_object.c:1265:lu_device_fini()) LBUG
Pid: 28408, comm: umount 4.18.0 #12 SMP Sun Aug 4 18:48:21 MSK 2019
Call Trace:
 libcfs_call_trace+0x71/0x90 [libcfs]
 lbug_with_loc+0x3e/0x80 [libcfs]
 lu_device_fini+0x75/0xb0 [obdclass]
 osp_device_free+0x60/0x190 [osp]
 class_free_dev+0x336/0x560 [obdclass]
 class_detach+0x249/0x290 [obdclass]
 class_process_config+0x21f6/0x31a0 [obdclass]
 class_manual_cleanup+0x1aa/0x650 [obdclass]
 osp_obd_disconnect+0x180/0x1f0 [osp]
 lod_putref+0x338/0x7e0 [lod]
 lod_fini_tgt+0xa9/0x290 [lod]
 lod_device_fini+0xf2/0x1f0 [lod]
 class_cleanup+0x3d5/0xb50 [obdclass]

Comment by Gerrit Updater [ 27/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35845/
Subject: LU-12675 mdt: release object reference upon error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4649899fbba095c7c3eb7ce1c8893040ed6e2494

Comment by Gerrit Updater [ 07/Jun/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43940
Subject: LU-12675 mdt: release object reference upon error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: c7dea2639f3568541396d9bb162f112702cbaad2

Comment by Gerrit Updater [ 15/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43940/
Subject: LU-12675 mdt: release object reference upon error
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: caa11626afaea06e110d58c2f9c5125ad5e61025
