Details
Type: Bug
Resolution: Fixed
Priority: Major
Description
While working on a new patch and testing it against the latest master, I intermittently triggered this LBUG when running different scenarios followed by a mount of MDT0000:
[67989.064599] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
[67989.079942] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) LBUG
[67989.088513] Pid: 40531, comm: umount 3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1 SMP Sun Mar 17 19:01:48 UTC 2019
[67989.101749] Call Trace:
[67989.105484] [<ffffffffc0c128bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[67989.113782] [<ffffffffc0c1296c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[67989.121694] [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
[67989.129918] [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
[67989.138001] [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
[67989.145869] [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
[67989.154220] [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
[67989.162763] [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
[67989.171594] [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
[67989.179637] [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
[67989.187764] [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
[67989.196858] [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
[67989.205747] [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
[67989.213932] [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
[67989.221305] [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
[67989.228800] [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
[67989.236611] [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
[67989.244778] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.253713] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.262537] [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
[67989.270556] [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
[67989.278689] [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[67989.286440] [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
[67989.294188] [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
[67989.302241] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.311052] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.319823] [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
[67989.328157] [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
[67989.335908] [<ffffffff87c22052>] kill_anon_super+0x12/0x20
[67989.342864] [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
[67989.351035] [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
[67989.358797] [<ffffffff87c22b96>] deactivate_super+0x46/0x60
[67989.365849] [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
[67989.372412] [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
[67989.379051] [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
[67989.385754] [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
[67989.392745] [<ffffffff88125ae7>] int_signal+0x12/0x17
[67989.399057] [<ffffffffffffffff>] 0xffffffffffffffff
[67989.405243] Kernel panic - not syncing: LBUG
[67989.410764] CPU: 45 PID: 40531 Comm: umount Kdump: loaded Tainted: G W OE ------------ 3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1
[67989.426896] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0022.062820171903 06/28/2017
[67989.439356] Call Trace:
[67989.442712] [<ffffffff88113754>] dump_stack+0x19/0x1b
[67989.448985] [<ffffffff8810d29f>] panic+0xe8/0x21f
[67989.454886] [<ffffffffc0c129bb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[67989.462416] [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
[67989.470132] [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
[67989.477775] [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
[67989.485215] [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
[67989.493113] [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
[67989.501228] [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
[67989.509668] [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
[67989.517313] [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
[67989.524999] [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
[67989.533721] [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
[67989.540631] [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
[67989.549113] [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
[67989.556905] [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
[67989.563910] [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
[67989.571008] [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
[67989.578501] [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
[67989.586341] [<ffffffffc0c18e18>] ? libcfs_debug_msg+0x688/0xab0 [libcfs]
[67989.594519] [<ffffffffc0f00406>] ? class_name2dev_nolock+0x46/0xb0 [obdclass]
[67989.603112] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.611737] [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
[67989.618654] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.627146] [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
[67989.634883] [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
[67989.642730] [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
[67989.650163] [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
[67989.657733] [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
[67989.665624] [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
[67989.674239] [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[67989.682787] [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
[67989.690893] [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
[67989.698486] [<ffffffff87c22052>] kill_anon_super+0x12/0x20
[67989.705257] [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
[67989.713288] [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
[67989.720828] [<ffffffff87c22b96>] deactivate_super+0x46/0x60
[67989.727724] [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
[67989.734053] [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
[67989.740663] [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
[67989.747225] [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
[67989.754037] [<ffffffff88125ae7>] int_signal+0x12/0x17
I first thought it could be related to my own patch/code, but after investigating the crash-dump content, I think the problem (a reference left/leaked on a server object) is caused by a somewhat hidden bug in Lustre master code:
.................
static int mdt_reint_unlink(struct mdt_thread_info *info,
                            struct mdt_lock_handle *lhc)
{
.................
        /* We will lock the child regardless it is local or remote. No harm. */
        mc = mdt_object_find(info->mti_env, info->mti_mdt, child_fid);
        if (IS_ERR(mc))
                GOTO(unlock_parent, rc = PTR_ERR(mc));

        if (!cos_incompat) {
                rc = mdt_object_striped(info, mc);
                if (rc < 0)
                        GOTO(unlock_parent, rc);    =====> with mc reference set
                cos_incompat = rc;
                if (cos_incompat) {
                        mdt_object_put(info->mti_env, mc);
                        mdt_object_unlock(info, mp, parent_lh, -EAGAIN);
                        goto relock;
                }
        }

        child_lh = &info->mti_lh[MDT_LH_CHILD];
        mdt_lock_reg_init(child_lh, LCK_EX);
        if (info->mti_spec.sp_rm_entry) {
                struct lu_ucred *uc = mdt_ucred(info);

                if (!mdt_is_dne_client(req->rq_export))
                        /* Return -ENOTSUPP for old client */
                        GOTO(put_child, rc = -ENOTSUPP);
.................
put_child:
        mdt_object_put(info->mti_env, mc);    <===== to release reference !!
unlock_parent:
        mdt_object_unlock(info, mp, parent_lh, rc);    <====== will not release reference :-(
put_parent:
        mdt_object_put(info->mti_env, mp);
        return rc;
}
.................
"lustre/mdt/mdt_reint.c" [readonly] line 1073 of 2840 --37%-- col 1
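To make the leak pattern easier to see in isolation, here is a minimal stand-alone sketch of the same cleanup-label ordering problem. It is not Lustre code: `struct obj`, `obj_find()`, `obj_put()`, `check()`, `unlink_leaky()` and `unlink_fixed()` are all hypothetical names invented for illustration, and `unlink_fixed()` shows only one plausible correction (jumping to `put_child` instead of `unlock_parent`), which may differ from the actual patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a refcounted server object. */
struct obj {
	int refs;
};

/* Take a reference, as an object lookup would. */
static struct obj *obj_find(struct obj *o)
{
	o->refs++;
	return o;
}

/* Drop a reference. */
static void obj_put(struct obj *o)
{
	assert(o->refs > 0);
	o->refs--;
}

/* Hypothetical check that can fail after the reference is held. */
static int check(int fail)
{
	return fail ? -EIO : 0;
}

/* Leaky variant: the early error path jumps past put_child. */
int unlink_leaky(struct obj *child, int fail_early, int fail_late)
{
	struct obj *mc = obj_find(child);
	int rc;

	rc = check(fail_early);
	if (rc < 0)
		goto unlock_parent;	/* bug: skips obj_put(mc) */

	rc = check(fail_late);
	if (rc < 0)
		goto put_child;

	rc = 0;
put_child:
	obj_put(mc);
unlock_parent:
	/* a parent unlock would go here; it does not drop the child ref */
	return rc;
}

/* Possible fix: every path taken after obj_find() goes through put_child. */
int unlink_fixed(struct obj *child, int fail_early, int fail_late)
{
	struct obj *mc = obj_find(child);
	int rc;

	rc = check(fail_early);
	if (rc < 0)
		goto put_child;

	rc = check(fail_late);
	if (rc < 0)
		goto put_child;

	rc = 0;
put_child:
	obj_put(mc);
	return rc;
}
```

Calling `unlink_leaky(&o, 1, 0)` on an object whose `refs` starts at 0 returns `-EIO` and leaves `refs` at 1, which is the kind of stale count that a later teardown assertion would trip over. The same call on `unlink_fixed()` leaves `refs` at 0.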
I will push a patch/fix soon.