(lod_dev.c:1493:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff88010cf8a000

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Setting the security flavor as shown below causes an LBUG when the MDT is mounted again:

      /usr/lib64/lustre/tests/llmount.sh
      lctl conf_param lustre.srpc.flavor.default=skpi
      umount -a -f -t lustre
      mount -o loop -t lustre /tmp/lustre-mdt1 /mnt/mds1

      <4>Lustre: server umount lustre-MDT0000 complete
      <6>LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts:
      <4>Lustre: 4563:0:(llog_cat.c:620:llog_cat_process_or_fork()) catlog 0x2:1 crosses index zero
      <3>LustreError: 4560:0:(gss_keyring.c:805:gss_sec_lookup_ctx_kr()) failed request key: -126
      <3>LustreError: 4560:0:(gss_keyring.c:805:gss_sec_lookup_ctx_kr()) Skipped 1 previous similar message
      <3>LustreError: 4560:0:(sec.c:444:sptlrpc_req_get_ctx()) req ffff88011a183cc0: fail to get context
      <3>LustreError: 4560:0:(sec.c:444:sptlrpc_req_get_ctx()) Skipped 1 previous similar message
      <3>LustreError: 4560:0:(osp_dev.c:1437:osp_obd_connect()) lustre-OST0000-osc-MDT0000: can't connect obd: rc = -111
      <3>LustreError: 4560:0:(lod_lov.c:293:lod_add_device()) lustre-OST0000-osc-MDT0000: cannot connect to next dev lustre-OST0000_UUID (-111)
      <3>LustreError: 4560:0:(obd_config.c:1624:class_config_llog_handler()) MGC192.168.1.107@tcp: cfg command failed: rc = -111
      <4>Lustre: cmd=cf00d 0:lustre-MDT0000-mdtlov 1:lustre-OST0000_UUID 2:0 3:1
      <4>
      <3>LustreError: 15c-8: MGC192.168.1.107@tcp: The configuration from log 'lustre-MDT0000' failed (-111). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      <3>LustreError: 4513:0:(obd_mount_server.c:1306:server_start_targets()) failed to start server lustre-MDT0000: -111
      <3>LustreError: 4513:0:(obd_mount_server.c:1790:server_fill_super()) Unable to start targets: -111
      <4>Lustre: Failing over lustre-MDT0000
      <0>LustreError: 2659:0:(lod_dev.c:1493:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff88010cf8a000
      <0>LustreError: 2659:0:(lod_dev.c:1493:lod_device_free()) LBUG
      <4>Pid: 2659, comm: obd_zombid
      <4>
      <4>Call Trace:
      <4> [<ffffffffa02d6875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa02d6e77>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa0eda121>] lod_device_free+0x2c1/0x330 [lod]
      <4> [<ffffffffa03ef9bd>] class_decref+0x3ed/0x4d0 [obdclass]
      <4> [<ffffffffa03d9afc>] obd_zombie_impexp_cull+0x61c/0xac0 [obdclass]
      <4> [<ffffffffa03da005>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa03d9fa0>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 2659, comm: obd_zombid Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
      <4> [<ffffffffa02d6ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa0eda121>] ? lod_device_free+0x2c1/0x330 [lod]
      <4> [<ffffffffa03ef9bd>] ? class_decref+0x3ed/0x4d0 [obdclass]
      <4> [<ffffffffa03d9afc>] ? obd_zombie_impexp_cull+0x61c/0xac0 [obdclass]
      <4> [<ffffffffa03da005>] ? obd_zombie_impexp_thread+0x65/0x190 [obdclass]
      <4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
      <4> [<ffffffffa03d9fa0>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

          Activity

            jgmitter Joseph Gmitter (Inactive) added a comment - Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16635/
            Subject: LU-7184 lod: cleanup unused OSP devices on error
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d2d725d2e2d31899f0453c967f5707a72e796fa0

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/16635
            Subject: LU-7184 lod: cleanup unused OSP devices on error
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 415bf6d50231dc8e804b6ddfa6b0aa2c2a5c92b1

            di.wang Di Wang (Inactive) added a comment -

            It looks like osp_shutdown() is not being called in this case, since the OSP is not being added successfully. So it seems that not just osp_last_used_fini() is skipped: neither osp_sync_fini() nor osp_precreate_fini() is executed either. How about calling ldo_process_config(env, next, CLEANUP); in the lod_add_device() error handling path?
            jhammond John Hammond added a comment -

            Di, during MDT mount, if osp_init() succeeds but lod_add_device() fails before adding the OSP device to the LOD, then we hit this assertion, since the OSP device still holds references to two objects from the MDT site (opd_last_used_oid_file and opd_last_used_seq_file). Can the finding/creation of these two objects be moved out of osp_init0() and into some function later in the setup path?
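
The failure mode described in this comment can be illustrated with a small userspace model. This is only a sketch and not Lustre code: every `model_*` name is hypothetical, the reference counter is a plain `int` rather than an atomic, and the two flags merely stand in for opd_last_used_oid_file and opd_last_used_seq_file. It assumes that each of those two objects pins one reference on the MDT-side device, and that a failed connect returns before anything releases them.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the MDT-side lu_device and its ld_ref counter. */
struct model_site_dev {
	int ref;
};

/* Hypothetical stand-in for an OSP device that pins two site objects. */
struct model_osp {
	struct model_site_dev *site;
	bool holds_oid_file;	/* models opd_last_used_oid_file */
	bool holds_seq_file;	/* models opd_last_used_seq_file */
};

/* Models init succeeding: one reference is taken per "last used" object. */
void model_osp_init(struct model_osp *osp, struct model_site_dev *site)
{
	osp->site = site;
	osp->holds_oid_file = true;
	osp->holds_seq_file = true;
	site->ref += 2;
}

/* Models the cleanup step: release whatever init took. Safe to call twice. */
void model_osp_cleanup(struct model_osp *osp)
{
	if (osp->holds_oid_file) {
		osp->holds_oid_file = false;
		osp->site->ref--;
	}
	if (osp->holds_seq_file) {
		osp->holds_seq_file = false;
		osp->site->ref--;
	}
}

/*
 * Models the add-device path. When the connect step fails, the references
 * taken by init are released only if cleanup_on_error is set.
 */
int model_add_device(struct model_site_dev *site, struct model_osp *osp,
		     bool reachable, bool cleanup_on_error)
{
	int rc;

	model_osp_init(osp, site);
	rc = reachable ? 0 : -ECONNREFUSED;	/* connect refused */
	if (rc != 0 && cleanup_on_error)
		model_osp_cleanup(osp);
	return rc;
}

/* True when the check modeled on lod_device_free() would pass. */
bool model_device_can_free(const struct model_site_dev *site)
{
	return site->ref == 0;
}

/* Models the free path: aborts if references are still held. */
void model_device_free(const struct model_site_dev *site)
{
	assert(model_device_can_free(site));
}
```

In this model, calling `model_add_device(&site, &osp, false, false)` on a zeroed `site` leaves `site.ref` at 2, which is consistent with the counter reported in the crash dump further down, and `model_device_free(&site)` would then abort. With `cleanup_on_error` set to `true` the counter returns to 0 and the free check passes.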
            jhammond John Hammond added a comment -

            In progress. The references are from opd_last_used_oid_file and opd_last_used_seq_file.

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi John,
            Can you take a look at this one?
            Thanks.
            Joe
            green Oleg Drokin added a comment -

            So it looks like the error path, either for the key request or in osp_obd_connect() or somewhere along the call chain, forgets to release a reference to the lu device.

            Somebody needs to go through there, find the place, and add the decref that is missing.
            jfilizetti Jeremy Filizetti added a comment - - edited

            Forgot to mention that the counter for ld_ref is 2:

            crash> struct lu_device 0xffff88010cf8a000
            struct lu_device {
              ld_ref = {
                counter = 2
              }, 
              ld_type = 0xffffffffa0f12440, 
              ld_ops = 0xffffffffa0f08e40, 
              ld_site = 0xffff880118546098, 
              ld_proc_entry = 0x0, 
              ld_obd = 0xffff88011a186038, 
              ld_reference = {<No data fields>}, 
              ld_linkage = {
                next = 0xffff88010cf8a030, 
                prev = 0xffff88010cf8a030
              }
            }
            
            People

              Assignee: jhammond John Hammond
              Reporter: jfilizetti Jeremy Filizetti