[LU-12675] LBUG "(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1" running sanity.sh on latest master Created: 21/Aug/19 Updated: 25/Jan/22 Resolved: 25/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | Bruno Faccini (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Working on a new patch and trying to expose it to the latest master, I have intermittently triggered this LBUG when running different scenarios followed by a mount of MDT0000:
[67989.064599] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
[67989.079942] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) LBUG
[67989.088513] Pid: 40531, comm: umount 3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1 SMP Sun Mar 17 19:01:48 UTC 2019
[67989.101749] Call Trace:
[67989.105484] [<ffffffffc0c128bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[67989.113782] [<ffffffffc0c1296c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[67989.121694] [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
[67989.129918] [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
[67989.138001] [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
[67989.145869] [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
[67989.154220] [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
[67989.162763] [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
[67989.171594] [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
[67989.179637] [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
[67989.187764] [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
[67989.196858] [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
[67989.205747] [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
[67989.213932] [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
[67989.221305] [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
[67989.228800] [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
[67989.236611] [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
[67989.244778] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.253713] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.262537] [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
[67989.270556] [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
[67989.278689] [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[67989.286440] [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
[67989.294188] [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
[67989.302241] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.311052] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.319823] [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
[67989.328157] [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
[67989.335908] [<ffffffff87c22052>] kill_anon_super+0x12/0x20
[67989.342864] [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
[67989.351035] [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
[67989.358797] [<ffffffff87c22b96>] deactivate_super+0x46/0x60
[67989.365849] [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
[67989.372412] [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
[67989.379051] [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
[67989.385754] [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
[67989.392745] [<ffffffff88125ae7>] int_signal+0x12/0x17
[67989.399057] [<ffffffffffffffff>] 0xffffffffffffffff
[67989.405243] Kernel panic - not syncing: LBUG
[67989.410764] CPU: 45 PID: 40531 Comm: umount Kdump: loaded Tainted: G W OE ------------ 3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1
[67989.426896] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0022.062820171903 06/28/2017
[67989.439356] Call Trace:
[67989.442712] [<ffffffff88113754>] dump_stack+0x19/0x1b
[67989.448985] [<ffffffff8810d29f>] panic+0xe8/0x21f
[67989.454886] [<ffffffffc0c129bb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[67989.462416] [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
[67989.470132] [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
[67989.477775] [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
[67989.485215] [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
[67989.493113] [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
[67989.501228] [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
[67989.509668] [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
[67989.517313] [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
[67989.524999] [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
[67989.533721] [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
[67989.540631] [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
[67989.549113] [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
[67989.556905] [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
[67989.563910] [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
[67989.571008] [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
[67989.578501] [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
[67989.586341] [<ffffffffc0c18e18>] ? libcfs_debug_msg+0x688/0xab0 [libcfs]
[67989.594519] [<ffffffffc0f00406>] ? class_name2dev_nolock+0x46/0xb0 [obdclass]
[67989.603112] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.611737] [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
[67989.618654] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.627146] [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
[67989.634883] [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
[67989.642730] [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[67989.650163] [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
[67989.657733] [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
[67989.665624] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.674239] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.682787] [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
[67989.690893] [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
[67989.698486] [<ffffffff87c22052>] kill_anon_super+0x12/0x20
[67989.705257] [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
[67989.713288] [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
[67989.720828] [<ffffffff87c22b96>] deactivate_super+0x46/0x60
[67989.727724] [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
[67989.734053] [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
[67989.740663] [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
[67989.747225] [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
[67989.754037] [<ffffffff88125ae7>] int_signal+0x12/0x17
I first thought it could be related to my own patch/code, but after investigating the crash-dump content, I think the problem (a reference left/leaked on a server object) is caused by a somewhat hidden bug in the Lustre master code:
.................
static int mdt_reint_unlink(struct mdt_thread_info *info,
			    struct mdt_lock_handle *lhc)
{
.................
	/* We will lock the child regardless it is local or remote. No harm. */
	mc = mdt_object_find(info->mti_env, info->mti_mdt, child_fid);
	if (IS_ERR(mc))
		GOTO(unlock_parent, rc = PTR_ERR(mc));
	if (!cos_incompat) {
		rc = mdt_object_striped(info, mc);
		if (rc < 0)
			GOTO(unlock_parent, rc); =====> with mc reference set
		cos_incompat = rc;
		if (cos_incompat) {
			mdt_object_put(info->mti_env, mc);
			mdt_object_unlock(info, mp, parent_lh, -EAGAIN);
			goto relock;
		}
	}
	child_lh = &info->mti_lh[MDT_LH_CHILD];
	mdt_lock_reg_init(child_lh, LCK_EX);
	if (info->mti_spec.sp_rm_entry) {
		struct lu_ucred *uc = mdt_ucred(info);
		if (!mdt_is_dne_client(req->rq_export))
			/* Return -ENOTSUPP for old client */
			GOTO(put_child, rc = -ENOTSUPP);
.................
put_child:
	mdt_object_put(info->mti_env, mc); <===== to release reference !!
unlock_parent:
	mdt_object_unlock(info, mp, parent_lh, rc); <====== will not release reference :-(
put_parent:
	mdt_object_put(info->mti_env, mp);
	return rc;
}
.................
"lustre/mdt/mdt_reint.c" [readonly] line 1073 of 2840 --37%-- col 1
I will push a patch/fix soon. |
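The leak described above can be reproduced in miniature. The following is a minimal, self-contained C sketch, not Lustre code: `struct device`, `struct object`, `object_find()`, `object_put()`, `device_can_fini()`, `unlink_leaky()` and `unlink_fixed()` are all hypothetical stand-ins. It assumes only what the description states: looking up the child takes a reference, the `put_child` label releases it, and jumping straight to `unlock_parent` skips that release. It further assumes that each live object pins its device, which is why the leak surfaces as a non-zero device refcount. `unlink_fixed()` shows one plausible correction (jumping to `put_child` instead); the patch actually merged may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the Lustre structures. */
struct device { int ref; };
struct object { struct device *dev; };

/* Looking up an object takes a reference that pins the device. */
static struct object *object_find(struct device *dev)
{
	struct object *obj = malloc(sizeof(*obj));

	if (obj == NULL)
		return NULL;
	obj->dev = dev;
	dev->ref++;
	return obj;
}

/* Releasing the object drops the reference taken by object_find(). */
static void object_put(struct object *obj)
{
	assert(obj->dev->ref > 0);
	obj->dev->ref--;
	free(obj);
}

/* Same condition as the failed assertion: teardown needs ref == 0. */
static bool device_can_fini(const struct device *dev)
{
	return dev->ref == 0;
}

/* Buggy shape: an error after object_find() jumps past put_child. */
static int unlink_leaky(struct device *dev, int striped_rc)
{
	struct object *child = object_find(dev);
	int rc = 0;

	if (child == NULL) {
		rc = -ENOMEM;
		goto unlock_parent;	/* fine: no reference was taken */
	}
	if (striped_rc < 0) {
		rc = striped_rc;
		goto unlock_parent;	/* BUG: child reference is never put */
	}
	if (striped_rc > 0) {
		rc = -EAGAIN;
		goto put_child;		/* correct: reference is released */
	}
	/* ... the real work would happen here ... */
put_child:
	object_put(child);
unlock_parent:
	return rc;
}

/* One plausible fix: every exit taken after the lookup uses put_child. */
static int unlink_fixed(struct device *dev, int striped_rc)
{
	struct object *child = object_find(dev);
	int rc = 0;

	if (child == NULL) {
		rc = -ENOMEM;
		goto unlock_parent;
	}
	if (striped_rc < 0) {
		rc = striped_rc;
		goto put_child;
	}
	if (striped_rc > 0) {
		rc = -EAGAIN;
		goto put_child;
	}
	/* ... the real work would happen here ... */
put_child:
	object_put(child);
unlock_parent:
	return rc;
}
```

Calling `unlink_leaky(&dev, -EIO)` on a device whose `ref` starts at 0 leaves `ref` at 1, so `device_can_fini()` returns false; that is the condition the `lu_device_fini()` assertion reports later, at unmount time, far from the function that leaked. The same call through `unlink_fixed()` leaves `ref` at 0.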
| Comments |
| Comment by Gerrit Updater [ 21/Aug/19 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/35845 |
| Comment by Bruno Faccini (Inactive) [ 21/Aug/19 ] |
|
Would like to add Lai as a watcher, but seems I am not allowed to do this ... |
| Comment by Patrick Farrell (Inactive) [ 21/Aug/19 ] |
|
You aren't, bruno (sry), so I did. |
| Comment by Bruno Faccini (Inactive) [ 21/Aug/19 ] |
|
Thx Patrick ! |
| Comment by Alex Zhuravlev [ 26/Aug/19 ] |
|
with this patch racer is doing much much better, but finally:
LustreError: 28408:0:(lu_object.c:1265:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 2
LustreError: 28408:0:(lu_object.c:1265:lu_device_fini()) LBUG
Pid: 28408, comm: umount 4.18.0 #12 SMP Sun Aug 4 18:48:21 MSK 2019
Call Trace:
libcfs_call_trace+0x71/0x90 [libcfs]
lbug_with_loc+0x3e/0x80 [libcfs]
lu_device_fini+0x75/0xb0 [obdclass]
osp_device_free+0x60/0x190 [osp]
class_free_dev+0x336/0x560 [obdclass]
class_detach+0x249/0x290 [obdclass]
class_process_config+0x21f6/0x31a0 [obdclass]
class_manual_cleanup+0x1aa/0x650 [obdclass]
osp_obd_disconnect+0x180/0x1f0 [osp]
lod_putref+0x338/0x7e0 [lod]
lod_fini_tgt+0xa9/0x290 [lod]
lod_device_fini+0xf2/0x1f0 [lod]
class_cleanup+0x3d5/0xb50 [obdclass] |
| Comment by Gerrit Updater [ 27/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35845/ |
| Comment by Gerrit Updater [ 07/Jun/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43940 |
| Comment by Gerrit Updater [ 15/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43940/ |