Lustre / LU-12675

LBUG "(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1" running sanity.sh on latest master


    Description

      While testing a new patch against the latest master, I have intermittently triggered this LBUG when running different scenarios followed by an umount of MDT0000:

      [67989.064599] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      [67989.079942] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) LBUG
      [67989.088513] Pid: 40531, comm: umount 3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1 SMP Sun Mar 17 19:01:48 UTC 2019
      [67989.101749] Call Trace:
      [67989.105484]  [<ffffffffc0c128bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [67989.113782]  [<ffffffffc0c1296c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [67989.121694]  [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
      [67989.129918]  [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
      [67989.138001]  [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
      [67989.145869]  [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
      [67989.154220]  [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
      [67989.162763]  [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
      [67989.171594]  [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
      [67989.179637]  [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
      [67989.187764]  [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
      [67989.196858]  [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
      [67989.205747]  [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
      [67989.213932]  [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
      [67989.221305]  [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
      [67989.228800]  [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
      [67989.236611]  [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
      [67989.244778]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.253713]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.262537]  [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
      [67989.270556]  [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
      [67989.278689]  [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
      [67989.286440]  [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
      [67989.294188]  [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
      [67989.302241]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.311052]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.319823]  [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
      [67989.328157]  [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
      [67989.335908]  [<ffffffff87c22052>] kill_anon_super+0x12/0x20
      [67989.342864]  [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
      [67989.351035]  [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
      [67989.358797]  [<ffffffff87c22b96>] deactivate_super+0x46/0x60
      [67989.365849]  [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
      [67989.372412]  [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
      [67989.379051]  [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
      [67989.385754]  [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
      [67989.392745]  [<ffffffff88125ae7>] int_signal+0x12/0x17
      [67989.399057]  [<ffffffffffffffff>] 0xffffffffffffffff
      [67989.405243] Kernel panic - not syncing: LBUG
      [67989.410764] CPU: 45 PID: 40531 Comm: umount Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1
      [67989.426896] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0022.062820171903 06/28/2017
      [67989.439356] Call Trace:
      [67989.442712]  [<ffffffff88113754>] dump_stack+0x19/0x1b
      [67989.448985]  [<ffffffff8810d29f>] panic+0xe8/0x21f
      [67989.454886]  [<ffffffffc0c129bb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [67989.462416]  [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
      [67989.470132]  [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
      [67989.477775]  [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
      [67989.485215]  [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
      [67989.493113]  [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
      [67989.501228]  [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
      [67989.509668]  [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
      [67989.517313]  [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
      [67989.524999]  [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
      [67989.533721]  [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
      [67989.540631]  [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
      [67989.549113]  [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
      [67989.556905]  [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
      [67989.563910]  [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
      [67989.571008]  [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
      [67989.578501]  [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
      [67989.586341]  [<ffffffffc0c18e18>] ? libcfs_debug_msg+0x688/0xab0 [libcfs]
      [67989.594519]  [<ffffffffc0f00406>] ? class_name2dev_nolock+0x46/0xb0 [obdclass]
      [67989.603112]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.611737]  [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
      [67989.618654]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.627146]  [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
      [67989.634883]  [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
      [67989.642730]  [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
      [67989.650163]  [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
      [67989.657733]  [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
      [67989.665624]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.674239]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.682787]  [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
      [67989.690893]  [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
      [67989.698486]  [<ffffffff87c22052>] kill_anon_super+0x12/0x20
      [67989.705257]  [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
      [67989.713288]  [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
      [67989.720828]  [<ffffffff87c22b96>] deactivate_super+0x46/0x60
      [67989.727724]  [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
      [67989.734053]  [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
      [67989.740663]  [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
      [67989.747225]  [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
      [67989.754037]  [<ffffffff88125ae7>] int_signal+0x12/0x17
      

      I first thought it could be related to my own patch, but after investigating the crash dump I think the problem (a reference leaked on a server object) is caused by a somewhat hidden bug in the Lustre master code. In mdt_reint_unlink(), the error path taken when mdt_object_striped() fails jumps to unlock_parent and so skips the mdt_object_put() on the child object:

      .................
      static int mdt_reint_unlink(struct mdt_thread_info *info,
                                  struct mdt_lock_handle *lhc)
      {
      .................
              /* We will lock the child regardless it is local or remote. No harm. */
              mc = mdt_object_find(info->mti_env, info->mti_mdt, child_fid);
              if (IS_ERR(mc))
                      GOTO(unlock_parent, rc = PTR_ERR(mc));
      
              if (!cos_incompat) {
                      rc = mdt_object_striped(info, mc);
                      if (rc < 0)
                              GOTO(unlock_parent, rc);  /* =====> leaves with the mc reference still held */
      
                      cos_incompat = rc;
                      if (cos_incompat) {
                              mdt_object_put(info->mti_env, mc);
                              mdt_object_unlock(info, mp, parent_lh, -EAGAIN);
                              goto relock;
                      }
              }
      
              child_lh = &info->mti_lh[MDT_LH_CHILD];
              mdt_lock_reg_init(child_lh, LCK_EX);
              if (info->mti_spec.sp_rm_entry) {
                      struct lu_ucred *uc  = mdt_ucred(info);
      
                      if (!mdt_is_dne_client(req->rq_export))
                              /* Return -ENOTSUPP for old client */
                              GOTO(put_child, rc = -ENOTSUPP);
      .................
      put_child:
              mdt_object_put(info->mti_env, mc); /* <===== the only place the mc reference is released */
      unlock_parent:
              mdt_object_unlock(info, mp, parent_lh, rc); /* <====== does not release the mc reference */
      put_parent:
              mdt_object_put(info->mti_env, mp);
              return rc;
      }
      .................
      (excerpt from lustre/mdt/mdt_reint.c)
      

      I will push a patch/fix soon.
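
      To make the suspected leak easier to follow, here is a minimal user-space sketch of the same goto-cleanup shape. It is not Lustre code: struct device, struct object, object_find(), object_put(), device_fini_ok(), unlink_leaky() and unlink_fixed() are all hypothetical stand-ins, and the sketch assumes that each live object pins one reference on its device, which is what the "Refcount is 1" assertion at umount suggests. unlink_fixed() shows only one possible shape of a fix; the actual patch is not part of this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, not the Lustre structures. */
struct device { int ref; };            /* plays the role of ld_ref */
struct object { struct device *dev; };

/* Assumption: finding an object takes one reference on its device. */
struct object *object_find(struct device *d, struct object *slot)
{
	slot->dev = d;
	d->ref++;
	return slot;
}

/* Dropping the object gives that device reference back. */
void object_put(struct object *o)
{
	assert(o != NULL && o->dev->ref > 0);
	o->dev->ref--;
}

/* The condition asserted at teardown: no reference may remain. */
int device_fini_ok(const struct device *d)
{
	return d->ref == 0;
}

/* Same shape as the reported path: the early error exit skips put_child. */
int unlink_leaky(struct device *d, int striped_rc, int old_client)
{
	struct object child;
	struct object *mc = object_find(d, &child);
	int rc = striped_rc;

	if (rc < 0)
		goto unlock_parent;    /* jumps past put_child: reference leaked */

	if (old_client) {
		rc = -1;
		goto put_child;        /* this error exit releases the reference */
	}
	rc = 0;
put_child:
	object_put(mc);
unlock_parent:
	return rc;
}

/* One possible fix: every exit taken after object_find() goes via put_child. */
int unlink_fixed(struct device *d, int striped_rc, int old_client)
{
	struct object child;
	struct object *mc = object_find(d, &child);
	int rc = striped_rc;

	if (rc < 0)
		goto put_child;

	if (old_client) {
		rc = -1;
		goto put_child;
	}
	rc = 0;
put_child:
	object_put(mc);
	return rc;
}
```

      In this sketch, calling unlink_leaky() with a negative striped_rc leaves the device refcount at 1, so the teardown check fails in the same way as the LBUG above, while unlink_fixed() returns the same error with the refcount back at 0.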

          People

            bruno Bruno Faccini (Inactive)