[LU-7184] (lod_dev.c:1493:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff88010cf8a000
Created: 18/Sep/15  Updated: 01/Jun/16  Resolved: 14/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug
Priority: Major
Reporter: Jeremy Filizetti
Assignee: John Hammond
Resolution: Fixed
Votes: 0
Labels: SSK, kerberos

Issue Links:
Related
is related to LU-3289 IU Shared Secret Key authentication a... Resolved
is related to LU-7546 conf-sanity conf-sanity: lod_device_f... Resolved
Severity: 3

 Description   

Setting the security flavor as shown below causes an LBUG when the MDT is mounted again:

/usr/lib64/lustre/tests/llmount.sh
lctl conf_param lustre.srpc.flavor.default=skpi
umount -a -f -t lustre
mount -o loop -t lustre /tmp/lustre-mdt1 /mnt/mds1

<4>Lustre: server umount lustre-MDT0000 complete
<6>LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts:
<4>Lustre: 4563:0:(llog_cat.c:620:llog_cat_process_or_fork()) catlog 0x2:1 crosses index zero
<3>LustreError: 4560:0:(gss_keyring.c:805:gss_sec_lookup_ctx_kr()) failed request key: -126
<3>LustreError: 4560:0:(gss_keyring.c:805:gss_sec_lookup_ctx_kr()) Skipped 1 previous similar message
<3>LustreError: 4560:0:(sec.c:444:sptlrpc_req_get_ctx()) req ffff88011a183cc0: fail to get context
<3>LustreError: 4560:0:(sec.c:444:sptlrpc_req_get_ctx()) Skipped 1 previous similar message
<3>LustreError: 4560:0:(osp_dev.c:1437:osp_obd_connect()) lustre-OST0000-osc-MDT0000: can't connect obd: rc = -111
<3>LustreError: 4560:0:(lod_lov.c:293:lod_add_device()) lustre-OST0000-osc-MDT0000: cannot connect to next dev lustre-OST0000_UUID (-111)
<3>LustreError: 4560:0:(obd_config.c:1624:class_config_llog_handler()) MGC192.168.1.107@tcp: cfg command failed: rc = -111
<4>Lustre: cmd=cf00d 0:lustre-MDT0000-mdtlov 1:lustre-OST0000_UUID 2:0 3:1
<4>
<3>LustreError: 15c-8: MGC192.168.1.107@tcp: The configuration from log 'lustre-MDT0000' failed (-111). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
<3>LustreError: 4513:0:(obd_mount_server.c:1306:server_start_targets()) failed to start server lustre-MDT0000: -111
<3>LustreError: 4513:0:(obd_mount_server.c:1790:server_fill_super()) Unable to start targets: -111
<4>Lustre: Failing over lustre-MDT0000
<0>LustreError: 2659:0:(lod_dev.c:1493:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff88010cf8a000
<0>LustreError: 2659:0:(lod_dev.c:1493:lod_device_free()) LBUG
<4>Pid: 2659, comm: obd_zombid
<4>
<4>Call Trace:
<4> [<ffffffffa02d6875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa02d6e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0eda121>] lod_device_free+0x2c1/0x330 [lod]
<4> [<ffffffffa03ef9bd>] class_decref+0x3ed/0x4d0 [obdclass]
<4> [<ffffffffa03d9afc>] obd_zombie_impexp_cull+0x61c/0xac0 [obdclass]
<4> [<ffffffffa03da005>] obd_zombie_impexp_thread+0x65/0x190 [obdclass]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa03d9fa0>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 2659, comm: obd_zombid Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff8152896c>] ? panic+0xa7/0x16f
<4> [<ffffffffa02d6ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0eda121>] ? lod_device_free+0x2c1/0x330 [lod]
<4> [<ffffffffa03ef9bd>] ? class_decref+0x3ed/0x4d0 [obdclass]
<4> [<ffffffffa03d9afc>] ? obd_zombie_impexp_cull+0x61c/0xac0 [obdclass]
<4> [<ffffffffa03da005>] ? obd_zombie_impexp_thread+0x65/0x190 [obdclass]
<4> [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
<4> [<ffffffffa03d9fa0>] ? obd_zombie_impexp_thread+0x0/0x190 [obdclass]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20



 Comments   
Comment by Jeremy Filizetti [ 18/Sep/15 ]

Forgot to mention that the ld_ref counter is 2:

crash> struct lu_device 0xffff88010cf8a000
struct lu_device {
  ld_ref = {
    counter = 2
  }, 
  ld_type = 0xffffffffa0f12440, 
  ld_ops = 0xffffffffa0f08e40, 
  ld_site = 0xffff880118546098, 
  ld_proc_entry = 0x0, 
  ld_obd = 0xffff88011a186038, 
  ld_reference = {<No data fields>}, 
  ld_linkage = {
    next = 0xffff88010cf8a030, 
    prev = 0xffff88010cf8a030
  }
}
Comment by Oleg Drokin [ 18/Sep/15 ]

So it looks like the error path, either for the key request, in osp_obd_connect(), or somewhere along the call chain, forgets to release a reference to the lu device.

Somebody needs to go through there, find the place, and add the missing decref.

Comment by Joseph Gmitter (Inactive) [ 18/Sep/15 ]

Hi John,
Can you take a look at this one?
Thanks.
Joe

Comment by John Hammond [ 21/Sep/15 ]

In progress. The references are from opd_last_used_oid_file and opd_last_used_seq_file.

Comment by John Hammond [ 22/Sep/15 ]

Di, during MDT mount, if osp_init() succeeds but lod_add_device() fails before adding the OSP device to the LOD, then we hit this, since the OSP device still holds references to two objects from the MDT site (opd_last_used_oid_file and opd_last_used_seq_file). Can the finding/creation of these two objects be moved out of osp_init0() and into some function later in the setup path?
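
For readers following along, the failure mode can be sketched with a toy model. This is not Lustre code: every toy_* name is hypothetical, and the sketch assumes that each of the two pinned objects accounts for one count on the device, which would be consistent with the ld_ref value of 2 seen in the crash output.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the real Lustre structures. */
struct toy_lu_device {
	int ld_ref;			/* models lu_device.ld_ref */
};

struct toy_osp {
	struct toy_lu_device *site_dev;	/* device whose count we raise */
	int holds_oid_file;		/* models opd_last_used_oid_file */
	int holds_seq_file;		/* models opd_last_used_seq_file */
};

/* Models OSP init: pins two objects, each assumed to cost one reference. */
static void toy_osp_init(struct toy_osp *osp, struct toy_lu_device *dev)
{
	assert(osp != NULL && dev != NULL);
	osp->site_dev = dev;
	osp->holds_oid_file = 1;
	dev->ld_ref++;
	osp->holds_seq_file = 1;
	dev->ld_ref++;
}

/* Models lod_add_device() failing at connect time with no cleanup. */
static int toy_lod_add_device_leaky(struct toy_osp *osp,
				    struct toy_lu_device *dev, int connect_rc)
{
	toy_osp_init(osp, dev);
	if (connect_rc != 0)
		return connect_rc;	/* error path: references still held */
	return 0;
}

/* Models the check in lod_device_free(): nonzero means it would LBUG. */
static int toy_device_free_would_lbug(const struct toy_lu_device *dev)
{
	return dev->ld_ref != 0;
}
```

Calling toy_lod_add_device_leaky() with a connect result of -111 leaves the toy count at 2, so the free-time check reports that the assertion would fire.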

Comment by Di Wang [ 22/Sep/15 ]

It looks like osp_shutdown() is not being called in this case, since the OSP is not added successfully. So not only is osp_last_used_fini() skipped; osp_sync_fini() and osp_precreate_fini() are not executed either. How about calling ldo_process_config(env, next, CLEANUP) in the lod_add_device() error handling path?
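
The shape of this suggestion, running the cleanup step whenever the add fails after init, can be sketched as follows. This is a toy model and not the landed patch: the mock_* names are hypothetical, and mock_osp_cleanup() merely stands in for whatever the cleanup config step would release.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real Lustre types. */
struct mock_dev {
	int ld_ref;			/* models lu_device.ld_ref */
};

struct mock_osp {
	struct mock_dev *pinned;	/* device pinned by this OSP */
	int nr_refs;			/* references taken during init */
};

/* Models OSP init: takes two references (oid file and seq file). */
static void mock_osp_init(struct mock_osp *osp, struct mock_dev *dev)
{
	osp->pinned = dev;
	osp->nr_refs = 2;
	dev->ld_ref += 2;
}

/* Models the cleanup step: drop everything that init acquired. */
static void mock_osp_cleanup(struct mock_osp *osp)
{
	assert(osp->pinned != NULL);
	osp->pinned->ld_ref -= osp->nr_refs;
	osp->nr_refs = 0;
	osp->pinned = NULL;
}

/* Models lod_add_device() with cleanup added on the error path. */
static int mock_lod_add_device(struct mock_osp *osp, struct mock_dev *dev,
			       int connect_rc)
{
	mock_osp_init(osp, dev);
	if (connect_rc != 0) {
		mock_osp_cleanup(osp);	/* undo init before reporting failure */
		return connect_rc;
	}
	return 0;
}
```

With the cleanup in place, a failed add returns the count to zero, while a successful add keeps its two references until normal shutdown.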

Comment by Gerrit Updater [ 24/Sep/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/16635
Subject: LU-7184 lod: cleanup unused OSP devices on error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 415bf6d50231dc8e804b6ddfa6b0aa2c2a5c92b1

Comment by Gerrit Updater [ 14/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16635/
Subject: LU-7184 lod: cleanup unused OSP devices on error
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d2d725d2e2d31899f0453c967f5707a72e796fa0

Comment by Joseph Gmitter (Inactive) [ 14/Oct/15 ]

Landed for 2.8
