[LU-12675] LBUG "(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1" running sanity.sh on latest master

Details

    Description

      While working on a new patch and testing it against the latest master, I intermittently triggered this LBUG when running different scenarios followed by a mount of MDT0000:

      [67989.064599] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      [67989.079942] LustreError: 40531:0:(lu_object.c:1196:lu_device_fini()) LBUG
      [67989.088513] Pid: 40531, comm: umount 3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1 SMP Sun Mar 17 19:01:48 UTC 2019
      [67989.101749] Call Trace:
      [67989.105484]  [<ffffffffc0c128bc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [67989.113782]  [<ffffffffc0c1296c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [67989.121694]  [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
      [67989.129918]  [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
      [67989.138001]  [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
      [67989.145869]  [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
      [67989.154220]  [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
      [67989.162763]  [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
      [67989.171594]  [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
      [67989.179637]  [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
      [67989.187764]  [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
      [67989.196858]  [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
      [67989.205747]  [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
      [67989.213932]  [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
      [67989.221305]  [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
      [67989.228800]  [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
      [67989.236611]  [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
      [67989.244778]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.253713]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.262537]  [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
      [67989.270556]  [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
      [67989.278689]  [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
      [67989.286440]  [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
      [67989.294188]  [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
      [67989.302241]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.311052]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.319823]  [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
      [67989.328157]  [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
      [67989.335908]  [<ffffffff87c22052>] kill_anon_super+0x12/0x20
      [67989.342864]  [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
      [67989.351035]  [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
      [67989.358797]  [<ffffffff87c22b96>] deactivate_super+0x46/0x60
      [67989.365849]  [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
      [67989.372412]  [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
      [67989.379051]  [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
      [67989.385754]  [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
      [67989.392745]  [<ffffffff88125ae7>] int_signal+0x12/0x17
      [67989.399057]  [<ffffffffffffffff>] 0xffffffffffffffff
      [67989.405243] Kernel panic - not syncing: LBUG
      [67989.410764] CPU: 45 PID: 40531 Comm: umount Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-862.14.4.el7_lustre_63ee081_Client.x86_64 #1
      [67989.426896] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0022.062820171903 06/28/2017
      [67989.439356] Call Trace:
      [67989.442712]  [<ffffffff88113754>] dump_stack+0x19/0x1b
      [67989.448985]  [<ffffffff8810d29f>] panic+0xe8/0x21f
      [67989.454886]  [<ffffffffc0c129bb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [67989.462416]  [<ffffffffc0f2979b>] lu_device_fini+0xbb/0xc0 [obdclass]
      [67989.470132]  [<ffffffffc0f2edae>] dt_device_fini+0xe/0x10 [obdclass]
      [67989.477775]  [<ffffffffc185ade2>] osp_device_free+0x42/0x1f0 [osp]
      [67989.485215]  [<ffffffffc0efdcc2>] class_free_dev+0x4c2/0x720 [obdclass]
      [67989.493113]  [<ffffffffc0efe110>] class_export_put+0x1f0/0x2c0 [obdclass]
      [67989.501228]  [<ffffffffc0effb85>] class_unlink_export+0x135/0x170 [obdclass]
      [67989.509668]  [<ffffffffc0f150a0>] class_decref+0x80/0x160 [obdclass]
      [67989.517313]  [<ffffffffc0f15503>] class_detach+0x1b3/0x2e0 [obdclass]
      [67989.524999]  [<ffffffffc0f1bb61>] class_process_config+0x1a91/0x2840 [obdclass]
      [67989.533721]  [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
      [67989.540631]  [<ffffffffc0f1caf0>] class_manual_cleanup+0x1e0/0x720 [obdclass]
      [67989.549113]  [<ffffffffc185b258>] osp_obd_disconnect+0x178/0x210 [osp]
      [67989.556905]  [<ffffffffc1774076>] lod_putref+0x276/0x990 [lod]
      [67989.563910]  [<ffffffffc177623d>] lod_fini_tgt+0xdd/0x3a0 [lod]
      [67989.571008]  [<ffffffffc1768b3c>] lod_device_fini+0x7c/0x1f0 [lod]
      [67989.578501]  [<ffffffffc0f196c1>] class_cleanup+0x861/0xc40 [obdclass]
      [67989.586341]  [<ffffffffc0c18e18>] ? libcfs_debug_msg+0x688/0xab0 [libcfs]
      [67989.594519]  [<ffffffffc0f00406>] ? class_name2dev_nolock+0x46/0xb0 [obdclass]
      [67989.603112]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.611737]  [<ffffffff87aceb76>] ? __cond_resched+0x26/0x30
      [67989.618654]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.627146]  [<ffffffffc1768993>] lod_obd_disconnect+0x93/0x1c0 [lod]
      [67989.634883]  [<ffffffffc17f3379>] mdd_process_config+0x3a9/0x5f0 [mdd]
      [67989.642730]  [<ffffffffc1699392>] mdt_stack_fini+0x2c2/0xca0 [mdt]
      [67989.650163]  [<ffffffffc169a0bb>] mdt_device_fini+0x34b/0x930 [mdt]
      [67989.657733]  [<ffffffffc0f19808>] class_cleanup+0x9a8/0xc40 [obdclass]
      [67989.665624]  [<ffffffffc0f1a747>] class_process_config+0x677/0x2840 [obdclass]
      [67989.674239]  [<ffffffffc0f1cad6>] class_manual_cleanup+0x1c6/0x720 [obdclass]
      [67989.682787]  [<ffffffffc0f4e22e>] server_put_super+0x8de/0xcd0 [obdclass]
      [67989.690893]  [<ffffffff87c21c6d>] generic_shutdown_super+0x6d/0x100
      [67989.698486]  [<ffffffff87c22052>] kill_anon_super+0x12/0x20
      [67989.705257]  [<ffffffffc0f1f6f2>] lustre_kill_super+0x32/0x50 [obdclass]
      [67989.713288]  [<ffffffff87c2240e>] deactivate_locked_super+0x4e/0x70
      [67989.720828]  [<ffffffff87c22b96>] deactivate_super+0x46/0x60
      [67989.727724]  [<ffffffff87c40aaf>] cleanup_mnt+0x3f/0x80
      [67989.734053]  [<ffffffff87c40b42>] __cleanup_mnt+0x12/0x20
      [67989.740663]  [<ffffffff87abab8b>] task_work_run+0xbb/0xe0
      [67989.747225]  [<ffffffff87a2bc55>] do_notify_resume+0xa5/0xc0
      [67989.754037]  [<ffffffff88125ae7>] int_signal+0x12/0x17
      

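      I first thought it could be related to my own patch, but after investigating the crash dump I think the problem (a reference leaked on a server object) is caused by a somewhat hidden bug in Lustre master code. In mdt_reint_unlink(), when mdt_object_striped() fails, the code jumps to the unlock_parent label, which bypasses put_child, so the reference taken on mc by mdt_object_find() is never released: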

      .................
      static int mdt_reint_unlink(struct mdt_thread_info *info,
                                  struct mdt_lock_handle *lhc)
      {
      .................
              /* We will lock the child regardless it is local or remote. No harm. */
              mc = mdt_object_find(info->mti_env, info->mti_mdt, child_fid);
              if (IS_ERR(mc))
                      GOTO(unlock_parent, rc = PTR_ERR(mc));
      
              if (!cos_incompat) {
                      rc = mdt_object_striped(info, mc);
                      if (rc < 0)
                              GOTO(unlock_parent, rc);  =====> with mc reference set
      
                      cos_incompat = rc;
                      if (cos_incompat) {
                              mdt_object_put(info->mti_env, mc);
                              mdt_object_unlock(info, mp, parent_lh, -EAGAIN);
                              goto relock;
                      }
              }
      
              child_lh = &info->mti_lh[MDT_LH_CHILD];
              mdt_lock_reg_init(child_lh, LCK_EX);
              if (info->mti_spec.sp_rm_entry) {
                      struct lu_ucred *uc  = mdt_ucred(info);
      
                      if (!mdt_is_dne_client(req->rq_export))
                              /* Return -ENOTSUPP for old client */
                              GOTO(put_child, rc = -ENOTSUPP);
      .................
      put_child:
              mdt_object_put(info->mti_env, mc); <===== to release reference !!
      unlock_parent:
              mdt_object_unlock(info, mp, parent_lh, rc); <====== will not release reference :-(
      put_parent:
              mdt_object_put(info->mti_env, mp);
              return rc;
      }
      .................
      "lustre/mdt/mdt_reint.c" [readonly] line 1073 of 2840 --37%-- col 1
      

      I will push a patch/fix soon.

        Activity

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43940/
          Subject: LU-12675 mdt: release object reference upon error
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: caa11626afaea06e110d58c2f9c5125ad5e61025

          gerrit Gerrit Updater added the comment above.

          Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43940
          Subject: LU-12675 mdt: release object reference upon error
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: c7dea2639f3568541396d9bb162f112702cbaad2

          gerrit Gerrit Updater added the comment above.

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35845/
          Subject: LU-12675 mdt: release object reference upon error
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 4649899fbba095c7c3eb7ce1c8893040ed6e2494

          gerrit Gerrit Updater added the comment above.

          With this patch racer is doing much better, but it eventually still hits:

          LustreError: 28408:0:(lu_object.c:1265:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 2
          LustreError: 28408:0:(lu_object.c:1265:lu_device_fini()) LBUG
          Pid: 28408, comm: umount 4.18.0 #12 SMP Sun Aug 4 18:48:21 MSK 2019
          Call Trace:
           libcfs_call_trace+0x71/0x90 [libcfs]
           lbug_with_loc+0x3e/0x80 [libcfs]
           lu_device_fini+0x75/0xb0 [obdclass]
           osp_device_free+0x60/0x190 [osp]
           class_free_dev+0x336/0x560 [obdclass]
           class_detach+0x249/0x290 [obdclass]
           class_process_config+0x21f6/0x31a0 [obdclass]
           class_manual_cleanup+0x1aa/0x650 [obdclass]
           osp_obd_disconnect+0x180/0x1f0 [osp]
           lod_putref+0x338/0x7e0 [lod]
           lod_fini_tgt+0xa9/0x290 [lod]
           lod_device_fini+0xf2/0x1f0 [lod]
           class_cleanup+0x3d5/0xb50 [obdclass]
          
          bzzz Alex Zhuravlev added the comment above.

          Thx Patrick !

          bruno Bruno Faccini (Inactive) added the comment above.

          You aren't, bruno (sry), so I did.

          pfarrell Patrick Farrell (Inactive) added the comment above.

          Would like to add Lai as a watcher, but seems I am not allowed to do this ...

          bruno Bruno Faccini (Inactive) added the comment above.

          Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/35845
          Subject: LU-12675 mdt: release object reference upon error
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: efce731e3ef9471f91c07fdbca9350eeca685923

          gerrit Gerrit Updater added the comment above.

          People

            bruno Bruno Faccini (Inactive)
            bruno Bruno Faccini (Inactive)