[LU-6744] conf-sanity test_51:(lov_object.c:945:lov_object_delete()) LBUG Created: 18/Jun/15  Updated: 02/Jul/15  Resolved: 02/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocker
is blocking LU-6501 nidstrings always true comparisons Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for wangdi <di.wang@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/94928f0e-1585-11e5-bb64-5254006e85c2.

The sub-test test_51 failed with the following error:

test failed to respond and timed out
03:59:13:LustreError: 6986:0:(lov_ea.c:186:lsm_unpackmd_common()) OST index 1 more than OST count 1
03:59:13:Lustre: 6986:0:(lov_pack.c:62:lov_dump_lmm_common()) objid 0x4:1028, magic 0x0bd10bd0, pattern 0x1
03:59:13:Lustre: 6986:0:(lov_pack.c:66:lov_dump_lmm_common()) stripe_size 1048576, stripe_count 2, layout_gen 0
03:59:13:Lustre: 6986:0:(lov_pack.c:85:lov_dump_lmm_objects()) stripe 0 idx 0 subobj 0x0:3
03:59:13:Lustre: 6986:0:(lov_pack.c:85:lov_dump_lmm_objects()) stripe 1 idx 1 subobj 0x0:2
03:59:13:LustreError: 6986:0:(lov_object.c:945:lov_object_delete()) ASSERTION( 0 <= __llt && __llt < (sizeof(lov_dispatch) / sizeof((lov_dispatch)[0]) + (sizeof(struct { int:-!!(__builtin_types_compatible_p(typeof(lov_dispatch), typeof(&lov_dispatch[0]))); }))) ) failed: 
03:59:13:LustreError: 6986:0:(lov_object.c:945:lov_object_delete()) LBUG
03:59:13:Pid: 6986, comm: touch
03:59:13:
03:59:13:Call Trace:
03:59:13: [<ffffffffa0470875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
03:59:13: [<ffffffffa0470e77>] lbug_with_loc+0x47/0xb0 [libcfs]
03:59:13: [<ffffffffa09bbbb3>] lov_object_delete+0x183/0x190 [lov]
03:59:13: [<ffffffffa05b70f1>] lu_object_free+0x81/0x1a0 [obdclass]
03:59:13: [<ffffffffa05b52cc>] ? lu_object_add+0x2c/0x30 [obdclass]
03:59:13: [<ffffffffa05b73b6>] lu_object_alloc+0x1a6/0x320 [obdclass]
03:59:13: [<ffffffffa05b876d>] lu_object_find_try+0xcd/0x260 [obdclass]
03:59:13: [<ffffffffa04817c2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
03:59:13: [<ffffffffa05b89b1>] lu_object_find_at+0xb1/0xe0 [obdclass]
03:59:13: [<ffffffffa0a64bdb>] ? cl_file_inode_init+0x7b/0x340 [lustre]
03:59:13: [<ffffffffa047cc01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:59:13: [<ffffffffa05b8a1f>] lu_object_find_slice+0x1f/0x80 [obdclass]
03:59:13: [<ffffffffa05beff2>] cl_object_find+0x42/0xb0 [obdclass]
03:59:13: [<ffffffffa0a64d8e>] cl_file_inode_init+0x22e/0x340 [lustre]
03:59:13: [<ffffffffa0a2ffed>] ll_update_inode+0x45d/0x14d0 [lustre]
03:59:13: [<ffffffffa0a49370>] ? ll_test_inode+0x0/0x80 [lustre]
03:59:13: [<ffffffff81174cda>] ? kmem_cache_alloc+0x18a/0x190
03:59:13: [<ffffffffa047cc01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:59:13: [<ffffffffa0a310cd>] ll_read_inode2+0x6d/0x410 [lustre]
03:59:13: [<ffffffffa0a4981b>] ll_iget+0x12b/0x2e0 [lustre]
03:59:13: [<ffffffffa04791f8>] ? libcfs_log_return+0x28/0x40 [libcfs]
03:59:13: [<ffffffffa0a33c5d>] ll_prep_inode+0x5bd/0xc50 [lustre]
03:59:13: [<ffffffffa0a4fdb0>] ll_create_nd+0x3b0/0xf10 [lustre]
03:59:13: [<ffffffffa0a1d277>] ? ll_inode_permission+0x167/0x3f0 [lustre]
03:59:13: [<ffffffff8119d8c6>] vfs_create+0xe6/0x110
03:59:13: [<ffffffff811a174e>] do_filp_open+0xa8e/0xd20
03:59:13: [<ffffffff81153700>] ? __vma_link_rb+0x30/0x40
03:59:13: [<ffffffff811537ab>] ? vma_link+0x9b/0xf0
03:59:13: [<ffffffff8129935a>] ? strncpy_from_user+0x4a/0x90
03:59:13: [<ffffffff811ae272>] ? alloc_fd+0x92/0x160
03:59:13: [<ffffffff8118b0b7>] do_sys_open+0x67/0x130
03:59:13: [<ffffffff8118b1c0>] sys_open+0x20/0x30
03:59:13: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
03:59:13:
03:59:13:Kernel panic - not syncing: LBUG
03:59:13:Pid: 6986, comm: touch Tainted: G        W  ---------------    2.6.32-504.16.2.el6.x86_64 #1
03:59:13:Call Trace:
03:59:13: [<ffffffff8152971c>] ? panic+0xa7/0x16f
03:59:13: [<ffffffffa0470ecb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
03:59:13: [<ffffffffa09bbbb3>] ? lov_object_delete+0x183/0x190 [lov]
03:59:13: [<ffffffffa05b70f1>] ? lu_object_free+0x81/0x1a0 [obdclass]
03:59:13: [<ffffffffa05b52cc>] ? lu_object_add+0x2c/0x30 [obdclass]
03:59:13: [<ffffffffa05b73b6>] ? lu_object_alloc+0x1a6/0x320 [obdclass]
03:59:13: [<ffffffffa05b876d>] ? lu_object_find_try+0xcd/0x260 [obdclass]
03:59:13: [<ffffffffa04817c2>] ? cfs_hash_bd_add_locked+0x62/0x90 [libcfs]
03:59:13: [<ffffffffa05b89b1>] ? lu_object_find_at+0xb1/0xe0 [obdclass]
03:59:13: [<ffffffffa0a64bdb>] ? cl_file_inode_init+0x7b/0x340 [lustre]
03:59:13: [<ffffffffa047cc01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:59:13: [<ffffffffa05b8a1f>] ? lu_object_find_slice+0x1f/0x80 [obdclass]
03:59:13: [<ffffffffa05beff2>] ? cl_object_find+0x42/0xb0 [obdclass]
03:59:13: [<ffffffffa0a64d8e>] ? cl_file_inode_init+0x22e/0x340 [lustre]
03:59:13: [<ffffffffa0a2ffed>] ? ll_update_inode+0x45d/0x14d0 [lustre]
03:59:13: [<ffffffffa0a49370>] ? ll_test_inode+0x0/0x80 [lustre]
03:59:13: [<ffffffff81174cda>] ? kmem_cache_alloc+0x18a/0x190
03:59:13: [<ffffffffa047cc01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:59:13: [<ffffffffa0a310cd>] ? ll_read_inode2+0x6d/0x410 [lustre]
03:59:13: [<ffffffffa0a4981b>] ? ll_iget+0x12b/0x2e0 [lustre]
03:59:13: [<ffffffffa04791f8>] ? libcfs_log_return+0x28/0x40 [libcfs]
03:59:13: [<ffffffffa0a33c5d>] ? ll_prep_inode+0x5bd/0xc50 [lustre]
03:59:13: [<ffffffffa0a4fdb0>] ? ll_create_nd+0x3b0/0xf10 [lustre]
03:59:13: [<ffffffffa0a1d277>] ? ll_inode_permission+0x167/0x3f0 [lustre]
03:59:13: [<ffffffff8119d8c6>] ? vfs_create+0xe6/0x110
03:59:13: [<ffffffff811a174e>] ? do_filp_open+0xa8e/0xd20
03:59:13: [<ffffffff81153700>] ? __vma_link_rb+0x30/0x40
03:59:13: [<ffffffff811537ab>] ? vma_link+0x9b/0xf0
03:59:13: [<ffffffff8129935a>] ? strncpy_from_user+0x4a/0x90
03:59:13: [<ffffffff811ae272>] ? alloc_fd+0x92/0x160
03:59:13: [<ffffffff8118b0b7>] ? do_sys_open+0x67/0x130
03:59:13: [<ffffffff8118b1c0>] ? sys_open+0x20/0x30
03:59:13: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
03:59:13:Initializing cgroup subsys cpuset
03:59:13:Initializing cgroup subsys cpu


 Comments   
Comment by Andreas Dilger [ 18/Jun/15 ]

There looks to be at least three bugs here:

  • the lov_unpackmd() shouldn't fail if OSTs are still being connected, which appears to be the case here, since test_51() is reformatting the filesystem and creating a new stripe_count=-1 (==2) file OK on the MDS but the client doesn't know about it. Either that means that mount should at least wait for config processing to finish (even if not all OSTs are connected
  • lov_unpackmd_common() should print the OBD device name at the start, per standard CERROR() formatting. It probably also shouldn't be an -EINVAL error if loi_ost_idx is NULL, since that may mean an OST was removed. At worst the client can get an IO error during read/write when accessing that stripe, but not when unpacking it. The stripe should just be skipped (continue) instead of returning an error.
  • the client shouldn't LASSERT() on data returned from the network. This looks to be a problem in the CLIO code in lu_object_alloc():
    static struct lu_object *lu_object_alloc(const struct lu_env *env,
                                             struct lu_device *dev,
                                             const struct lu_fid *f,
                                             const struct lu_object_conf *conf)
    {
            do {
                    /*
                     * Call ->loo_object_init() repeatedly, until no more new
                     * object slices are created.
                     */
                    clean = 1; 
                    init_flag = 1;
                    list_for_each_entry(scan, layers, lo_linkage) {
                            :
                            result = scan->lo_ops->loo_object_init(env, scan, conf);
                            if (result != 0) {
                                    lu_object_free(env, top);
                                    RETURN(ERR_PTR(result));
                            }
                            init_mask |= init_flag;
    next:
                            init_flag <<= 1;
                    }
            } while (!clean);
    

so if there is an error returned from >lov_object_init>lov_unpackmd():

int lov_object_init(const struct lu_env *env, struct lu_object *obj,
                    const struct lu_object_conf *conf)
{
        :
        if (cconf->u.coc_layout.lb_buf != NULL) {
                lsm = lov_unpackmd(dev->ld_lov,
                                   cconf->u.coc_layout.lb_buf,
                                   cconf->u.coc_layout.lb_len);
                if (IS_ERR(lsm))
                        RETURN(PTR_ERR(lsm));
        }

        /* no locking is necessary, as object is being created */
        lov->lo_type = lov_type(lsm);
        ops = &lov_dispatch[lov->lo_type];
        rc = ops->llo_init(env, dev, lov, lsm, cconf, set);

it doesn't set lo_type nor before the error is returned, or somehow it leaves the partially initialized slice in place and that causes the LASSERT() to trigger? It doesn't really make sense that lov_object_delete() is being called on an object that isn't properly initialized.

Comment by Jinshan Xiong (Inactive) [ 18/Jun/15 ]

We can probably assign lo_type to LLT_EMPTY before unpacking md.

Comment by nasf (Inactive) [ 19/Jun/15 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/9607faec-169b-11e5-8436-5254006e85c2

Comment by Gerrit Updater [ 20/Jun/15 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15362
Subject: LU-6744 lov: init LOV stripe type beforehand
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6df09cdc0c0a60343b122dec47e1be5228d429e3

Comment by Bob Glossman (Inactive) [ 20/Jun/15 ]

another on master
https://testing.hpdd.intel.com/test_sets/05d02c8a-16df-11e5-8436-5254006e85c2

Comment by Gregoire Pichon [ 30/Jun/15 ]

new instance on master
https://testing.hpdd.intel.com/test_sessions/75949d9e-1ed2-11e5-8a29-5254006e85c2

Comment by Frederic Saunier [ 30/Jun/15 ]

yet another one on master
https://testing.hpdd.intel.com/test_sets/111e205e-1e9d-11e5-ae50-5254006e85c2

Comment by James Nunez (Inactive) [ 01/Jul/15 ]

This bug is causing review-dne-part-1 to fail one to three times nearly everyday since June 17, 2015. Can we increase the priority of this ticket?

Failures from the past two days all on master and all in review-dne-part-1:
2015-06-29 15:53:48 - https://testing.hpdd.intel.com/test_sets/111e205e-1e9d-11e5-ae50-5254006e85c2
2015-06-29 19:39:12 - https://testing.hpdd.intel.com/test_sets/f29f685a-1ed2-11e5-8a29-5254006e85c2
2015-06-29 21:57:19 - https://testing.hpdd.intel.com/test_sets/5bf3f57a-1ee5-11e5-aba3-5254006e85c2
2015-06-30 04:16:16 - https://testing.hpdd.intel.com/test_sets/e16e9054-1f06-11e5-a4d6-5254006e85c2
2015-06-30 06:49:47 - https://testing.hpdd.intel.com/test_sets/40b8a6f2-1f1c-11e5-a4d6-5254006e85c2
2015-06-30 12:03:14 - https://testing.hpdd.intel.com/test_sets/069bc0dc-1f47-11e5-a4d6-5254006e85c2

Comment by Gerrit Updater [ 02/Jul/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15362/
Subject: LU-6744 lov: init LOV stripe type beforehand
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7f75a4d5b955072c426f1723ff86bb7d67f55f19

Comment by Peter Jones [ 02/Jul/15 ]

Landed for 2.8

Generated at Sat Feb 10 02:02:52 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.