Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Affects Version: Lustre 2.6.0
- Environment: Hyperion/LLNL
- Severity: 3
- 11723
Description
Running SWL - miranda IO test.
2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( ((((lock))->l_flags & (1ULL << 50)) != 0) ) failed:
2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
2013-11-16 17:27:14 Pid: 105183, comm: miranda_io
2013-11-16 17:27:14 Call Trace:
 [<ffffffffa056b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa056be97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa080f94d>] ldlm_lock_put+0x44d/0x560 [ptlrpc]
 [<ffffffffa0822d22>] ldlm_cli_cancel_list+0xf2/0x3e0 [ptlrpc]
 [<ffffffffa0824952>] ldlm_cli_cancel+0x132/0x360 [ptlrpc]
 [<ffffffffa09fdbce>] osc_lock_cancel+0xfe/0x1c0 [osc]
 [<ffffffffa0708525>] cl_lock_cancel0+0x75/0x160 [obdclass]
 [<ffffffffa07090db>] cl_lock_cancel+0x13b/0x140 [obdclass]
 [<ffffffffa070d7e4>] cl_lock_enqueue_wait+0xc4/0x2d0 [obdclass]
 [<ffffffffa070aa17>] ? cl_lock_mutex_put+0x77/0x90 [obdclass]
 [<ffffffffa0a9126b>] lov_lock_enqueue+0x3fb/0x850 [lov]
 [<ffffffffa070c65c>] cl_enqueue_try+0xfc/0x300 [obdclass]
 [<ffffffffa070da5f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
 [<ffffffffa070e6ae>] cl_lock_request+0x7e/0x270 [obdclass]
 [<ffffffffa071391c>] cl_io_lock+0x3cc/0x560 [obdclass]
 [<ffffffffa0713b52>] cl_io_loop+0xa2/0x1b0 [obdclass]
 [<ffffffffa0b0e290>] ll_file_io_generic+0x460/0x610 [lustre]
 [<ffffffffa0701e39>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0b0ecb2>] ll_file_aio_write+0x142/0x2c0 [lustre]
 [<ffffffffa0b0ef9c>] ll_file_write+0x16c/0x2a0 [lustre]
 [<ffffffff81181398>] vfs_write+0xb8/0x1a0
 [<ffffffff81181c91>] sys_write+0x51/0x90
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
lctl dump attached.
Attachments
- analysis.txt (334 kB)
- iwc.console.gz (29 kB)
- iwc106.sysrq.txt (375 kB)
- iwc106.sysrq.txt.gz (29 kB)
- iwc48.lctl.log.txt.gz (0.2 kB)
- log.bz2 (525 kB)
- LU-4268-140123-log.tar.gz (1.75 MB)
- vmcore-dmesg.txt (78 kB)
Activity
Ah, yeah, the correct lock order is lock->l_lock, then res->lr_lock, then ns->ns_lock. Thanks for pointing that error out.
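For reference, a sketch of what that nesting looks like at a call site (illustrative only, assuming the usual lock_res_and_lock() helper, which takes lock->l_lock and then the resource's lr_lock):

	lock_res_and_lock(lock);	/* takes lock->l_lock, then lr_lock */
	spin_lock(&ns->ns_lock);	/* ns_lock nests innermost */
	/* ... update l_flags and LRU state here ... */
	spin_unlock(&ns->ns_lock);
	unlock_res_and_lock(lock);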
We are able to reproduce the problem reliably too. If you could push a patch, we can check whether it helps. I suspect there is more than one place where lock->l_flags is changed without the protection of lock->l_lock.
Hi Li Xi,
That looks right. Somehow I misread the code. The DLM lock has its destroyed flag set before it is taken off the LRU, so the ldlm_clear_skipped() call in ldlm_lock_remove_from_lru_nolock() is indeed the root cause of this problem.
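To make the race concrete, here is a hedged illustration (my own sketch, not actual Lustre code) of how two plain read-modify-write updates on l_flags, each under a different spinlock, can lose the destroyed bit:

	/* CPU0 holds lock->l_lock; CPU1 holds ns->ns_lock.  l_flags is a
	 * plain __u64, so "|=" and "&= ~" are non-atomic read-modify-write:
	 *
	 *   CPU0 (ldlm_set_destroyed)           CPU1 (ldlm_clear_skipped)
	 *   ---------------------------------   ---------------------------------
	 *                                       tmp = lock->l_flags;
	 *   lock->l_flags |= LDLM_FL_DESTROYED;
	 *                                       tmp &= ~LDLM_FL_SKIPPED;
	 *                                       lock->l_flags = tmp;  <-- DESTROYED lost
	 *
	 * The final ldlm_lock_put() then trips the ASSERTION on bit 50
	 * (the destroyed flag) seen in the crash above.
	 */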
Your patch would cause a deadlock, because the res lock must be taken before ns_lock. My idea for fixing this is to move the ldlm_clear_skipped() call out of ldlm_lock_remove_from_lru_nolock() and into ldlm_lock_add_to_lru_nolock(); see the sketch below. I think this will fix the problem as well.
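A minimal sketch of that idea, assuming an ldlm_lock_add_to_lru_nolock() of the usual 2.x shape (the real function also updates l_last_used, and the exact body may differ in master):

	/* Clear SKIPPED when the lock (re)enters the LRU, where it cannot
	 * race with ldlm_set_destroyed(): destroyed locks are never added
	 * back to the LRU.  The ldlm_clear_skipped() call in
	 * ldlm_lock_remove_from_lru_nolock() is dropped. */
	static void ldlm_lock_add_to_lru_nolock(struct ldlm_lock *lock)
	{
		struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);

		LASSERT(cfs_list_empty(&lock->l_lru));
		LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
		ldlm_clear_skipped(lock);	/* moved here from remove_from_lru_nolock() */
		cfs_list_add_tail(&lock->l_lru, &ns->ns_unused_list);
		ns->ns_nr_unused++;
	}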
Hi Jinshan,
According to your test results, this really looks like a race. Would you please try the following patch to check whether lock->l_lock is always held when clearing/setting the flags?
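(The patch itself is not inlined in this comment; one plausible shape for such a check, purely as a guess, is to make every flag update assert that the lock's spinlock is held. spin_is_locked() only proves that someone holds the lock, so this is a debugging aid, not a proof of correct ownership:)

	#define LDLM_SET_FLAG(_l, _b)                           \
	do {                                                    \
		LASSERT(spin_is_locked(&(_l)->l_lock));         \
		(_l)->l_flags |= (_b);                          \
	} while (0)

	#define LDLM_CLEAR_FLAG(_l, _b)                         \
	do {                                                    \
		LASSERT(spin_is_locked(&(_l)->l_lock));         \
		(_l)->l_flags &= ~(_b);                         \
	} while (0)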
The flags should be protected by ldlm_lock::l_lock. When I saw this issue, my first reaction was to check whether there were unprotected writes to l_flags, and the only place I found is:
/**
 * Removes LDLM lock \a lock from LRU. Assumes LRU is already locked.
 */
int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock)
{
	int rc = 0;

	if (!cfs_list_empty(&lock->l_lru)) {
		struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);

		LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
		cfs_list_del_init(&lock->l_lru);
		ldlm_clear_skipped(lock);
		LASSERT(ns->ns_nr_unused > 0);
		ns->ns_nr_unused--;
		rc = 1;
	}
	return rc;
}
However, that path is protected by ns_lock, and if the lock is being destroyed, it must already have been taken off the LRU list.
Jinshan, according to the comment, the flags should be protected by lr_lock. Are there flag changes not protected by this lock?
Also, I have some concern that using set_bit() will not set the same bits as the Lustre network protocol on all architectures. Do you know if there is a specific bit ordering used in all cases?
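A small userspace illustration of that concern (my own sketch, not from the ticket): set_bit()-style calls number bits within an array of unsigned long, so the mapping from a bit number to the __u64 value that goes on the wire depends on word size and endianness.

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		uint64_t flags = 0;
		unsigned long words[sizeof(flags) / sizeof(unsigned long)];

		/* Emulate set_bit(50, (unsigned long *)&flags): */
		memcpy(words, &flags, sizeof(flags));
		words[50 / (8 * sizeof(unsigned long))] |=
			1UL << (50 % (8 * sizeof(unsigned long)));
		memcpy(&flags, words, sizeof(flags));

		/* On 64-bit and on 32-bit little-endian this prints the
		 * expected 1ULL << 50; on a 32-bit big-endian machine it
		 * would not, because words[1] holds the LOW half of the
		 * __u64 there, so bit 18 of the wrong half gets set. */
		printf("flags = %#llx (expected %#llx)\n",
		       (unsigned long long)flags, 1ULL << 50);
		return 0;
	}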
I confirmed that changing the DLM flag operations to bitops fixed the problem.
See the attachment for the concept patch. The real patch should be produced by modifying lustre/contrib/bit-masks/lustre_dlm_flags.tpl.
diff --git a/lustre/include/lustre_dlm.h b/lustre/include/lustre_dlm.h
index 98e03c9..bf4fb65 100644
--- a/lustre/include/lustre_dlm.h
+++ b/lustre/include/lustre_dlm.h
@@ -772,7 +772,7 @@ struct ldlm_lock {
 	 * Lock state flags. Protected by lr_lock.
 	 * \see lustre_dlm_flags.h where the bits are defined.
 	 */
-	__u64			l_flags;
+	volatile __u64		l_flags;
 
 	/**
 	 * Lock r/w usage counters.
diff --git a/lustre/include/lustre_dlm_flags.h b/lustre/include/lustre_dlm_flags.h
index 283546d..4be532f 100644
--- a/lustre/include/lustre_dlm_flags.h
+++ b/lustre/include/lustre_dlm_flags.h
@@ -57,91 +57,91 @@
 /** extent, mode, or resource changed */
 #define LDLM_FL_LOCK_CHANGED            0x0000000000000001ULL // bit 0
-#define ldlm_is_lock_changed(_l)        LDLM_TEST_FLAG((  _l), 1ULL << 0)
-#define ldlm_set_lock_changed(_l)       LDLM_SET_FLAG((   _l), 1ULL << 0)
-#define ldlm_clear_lock_changed(_l)     LDLM_CLEAR_FLAG(( _l), 1ULL << 0)
+#define ldlm_is_lock_changed(_l)        LDLM_TEST_FLAG((  _l), 0)
+#define ldlm_set_lock_changed(_l)       LDLM_SET_FLAG((   _l), 0)
+#define ldlm_clear_lock_changed(_l)     LDLM_CLEAR_FLAG((_l), 0)

......
......

 /** test for ldlm_lock flag bit set */
-#define LDLM_TEST_FLAG(_l, _b)  (((_l)->l_flags & (_b)) != 0)
+#define LDLM_TEST_FLAG(_l, _b)  test_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
 
 /** multi-bit test: are any of mask bits set? */
 #define LDLM_HAVE_MASK(_l, _m)  (((_l)->l_flags & LDLM_FL_##_m##_MASK) != 0)
 
 /** set a ldlm_lock flag bit */
-#define LDLM_SET_FLAG(_l, _b)   ((_l)->l_flags |= (_b))
+#define LDLM_SET_FLAG(_l, _b)   set_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
 
 /** clear a ldlm_lock flag bit */
-#define LDLM_CLEAR_FLAG(_l, _b) ((_l)->l_flags &= ~(_b))
+#define LDLM_CLEAR_FLAG(_l, _b) clear_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
 
 /** @} subgroup */
 /** @} group */
I think the prime suspect would be some place that updates the lock flags without taking the lock's spinlock. There are several such places.
I happened to be able to reproduce this issue reliably while I was working on another one.
After I applied the following patch:
diff --git a/lustre/ldlm/ldlm_lock.c b/lustre/ldlm/ldlm_lock.c
index 214ee8c..2e69ad0 100644
--- a/lustre/ldlm/ldlm_lock.c
+++ b/lustre/ldlm/ldlm_lock.c
@@ -372,12 +372,14 @@ int ldlm_lock_destroy_internal(struct ldlm_lock *lock)
 		LBUG();
 	}
 
+	LDLM_DEBUG(lock, "destroy");
 	if (ldlm_is_destroyed(lock)) {
 		LASSERT(cfs_list_empty(&lock->l_lru));
 		EXIT;
 		return 0;
 	}
 	ldlm_set_destroyed(lock);
+	LDLM_DEBUG(lock, "destroy");
 
 	if (lock->l_export && lock->l_export->exp_lock_hash) {
 		/* NB: it's safe to call cfs_hash_del() even lock isn't
@@ -2169,6 +2171,7 @@ void ldlm_lock_cancel(struct ldlm_lock *lock)
 
 	ldlm_resource_unlink_lock(lock);
 	ldlm_lock_destroy_nolock(lock);
+	LDLM_DEBUG(lock, "destroyed");
 
 	if (lock->l_granted_mode == lock->l_req_mode)
 		ldlm_pool_del(&ns->ns_pool, lock);
diff --git a/lustre/ldlm/ldlm_request.c b/lustre/ldlm/ldlm_request.c
index e140e0f..def3ee5 100644
--- a/lustre/ldlm/ldlm_request.c
+++ b/lustre/ldlm/ldlm_request.c
@@ -1141,6 +1141,7 @@ static __u64 ldlm_cli_cancel_local(struct ldlm_lock *lock)
 			rc = LDLM_FL_LOCAL_ONLY;
 		}
 		ldlm_lock_cancel(lock);
+		LDLM_DEBUG(lock, "client-side cancel completed, rc = %lld", rc);
 	} else {
 		if (ns_is_client(ldlm_lock_to_ns(lock))) {
 			LDLM_ERROR(lock, "Trying to cancel local lock");
diff --git a/lustre/llite/llite_internal.h b/lustre/llite/llite_internal.h
index 92b278e..e9e7053 100644
--- a/lustre/llite/llite_internal.h
+++ b/lustre/llite/llite_internal.h
@@ -1469,7 +1469,7 @@ static inline void cl_isize_unlock(struct inode *inode)
 
 static inline void cl_isize_write_nolock(struct inode *inode, loff_t kms)
 {
-	LASSERT(down_trylock(&ll_i2info(inode)->lli_size_sem) != 0);
+	//LASSERT(down_trylock(&ll_i2info(inode)->lli_size_sem) != 0);
 	i_size_write(inode, kms);
 }
 
diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index df421d6..ae53f3d 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -1700,25 +1700,29 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 
 void ll_inode_size_lock(struct inode *inode)
 {
+#if 0
 	struct ll_inode_info *lli;
 
 	LASSERT(!S_ISDIR(inode->i_mode));
 
 	lli = ll_i2info(inode);
 	LASSERT(lli->lli_size_sem_owner != current);
-	down(&lli->lli_size_sem);
+	//down(&lli->lli_size_sem);
 	LASSERT(lli->lli_size_sem_owner == NULL);
 	lli->lli_size_sem_owner = current;
+#endif
 }
 
 void ll_inode_size_unlock(struct inode *inode)
 {
+#if 0
 	struct ll_inode_info *lli;
 
 	lli = ll_i2info(inode);
 	LASSERT(lli->lli_size_sem_owner == current);
 	lli->lli_size_sem_owner = NULL;
-	up(&lli->lli_size_sem);
+	//up(&lli->lli_size_sem);
+#endif
 }
 
 void ll_update_inode(struct inode *inode, struct lustre_md *md)
diff --git a/lustre/obdclass/cl_lock.c b/lustre/obdclass/cl_lock.c
index d440da9..b8c2e04 100644
--- a/lustre/obdclass/cl_lock.c
+++ b/lustre/obdclass/cl_lock.c
@@ -493,6 +493,7 @@ static struct cl_lock *cl_lock_lookup(const struct lu_env *env,
                                       const struct cl_io *io,
                                       const struct cl_lock_descr *need)
 {
+#if 0
 	struct cl_lock          *lock;
 	struct cl_object_header *head;
 
@@ -518,6 +519,7 @@ static struct cl_lock *cl_lock_lookup(const struct lu_env *env,
 			RETURN(lock);
 		}
 	}
+#endif
 	RETURN(NULL);
 }
 
@@ -1964,6 +1966,8 @@ static struct cl_lock *cl_lock_hold_mutex(const struct lu_env *env,
 			cl_lock_hold_mod(env, lock, +1);
 			lu_ref_add(&lock->cll_holders, scope, source);
 			lu_ref_add(&lock->cll_reference, scope, source);
+
+			cl_lock_cancel(env, lock);
 			break;
 		}
 		cl_lock_mutex_put(env, lock);
and used 32 processes to read the same file at the same time, the reading processes hit the LBUG at the same place:
Call Trace:
 [<ffffffffa03fd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa03fde97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0774fb7>] ldlm_lock_put+0x547/0x630 [ptlrpc]
 [<ffffffffa0793df2>] ldlm_cli_cancel_list+0xf2/0x3e0 [ptlrpc]
 [<ffffffffa0795a22>] ldlm_cli_cancel+0x132/0x360 [ptlrpc]
 [<ffffffffa0a8e72e>] osc_lock_cancel+0x10e/0x1d0 [osc]
 [<ffffffffa059c045>] cl_lock_cancel0+0x75/0x160 [obdclass]
 [<ffffffffa059d9d6>] cl_lock_hold_release+0x1c6/0x2a0 [obdclass]
 [<ffffffffa059ecc7>] cl_lock_unhold+0x37/0x130 [obdclass]
 [<ffffffffa0af8a68>] lov_sublock_release+0x1a8/0x280 [lov]
 [<ffffffffa0afa4fe>] lov_lock_unuse+0x1be/0x290 [lov]
 [<ffffffffa059c205>] cl_unuse_try_internal+0x55/0xe0 [obdclass]
 [<ffffffffa059f0a9>] cl_unuse_try+0x199/0x320 [obdclass]
 [<ffffffffa059f267>] cl_unuse_locked+0x37/0x1a0 [obdclass]
 [<ffffffffa059f40e>] cl_unuse+0x3e/0x120 [obdclass]
 [<ffffffffa05a5e57>] cl_io_unlock+0x127/0x2b0 [obdclass]
 [<ffffffffa05a6aed>] cl_io_loop+0xcd/0x1b0 [obdclass]
 [<ffffffffa0f7c9d6>] ll_file_io_generic+0x2b6/0x710 [lustre]
 [<ffffffffa0596db9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0f7cf6f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
 [<ffffffffa0f7d42c>] ll_file_read+0x16c/0x2a0 [lustre]
 [<ffffffff81181a95>] vfs_read+0xb5/0x1a0
 [<ffffffff81181bd1>] sys_read+0x51/0x90
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
LustreError: dumping log to /tmp/lustre-log.1391207221.6283
However, from the log, I can see this:
00000020:00010000:5.0:1391207220.451167:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) unuse lock: ffff88081acf48e8@(2 ffff8807e6fd0080 1 3 0 1 1 2)(ffff88080d4a0db8/0/1) at cl_unuse_try():1363
00000020:00010000:5.0:1391207220.451172:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) enclosure lock: ffff88081acf4738@(1 (null) 0 3 0 1 1 2)(ffff88080d4a0db8/1/1) at cl_lock_enclosure():1691
00000020:00010000:5.0:1391207220.451175:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) enclosure lock: ffff88081acf48e8@(3 ffff8807e6fd0080 1 4 0 2 1 2)(ffff88080d4a0db8/0/2) at cl_lock_enclosure():1691
00000020:00010000:5.0:1391207220.451178:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) unuse lock: ffff88081acf4738@(2 ffff8807e6fd0080 1 3 0 1 1 2)(ffff88080d4a0c68/1/0) at cl_unuse_try():1363
00010000:00010000:5.0:1391207220.451185:0:6294:0:(ldlm_lock.c:852:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 4/1,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10020000000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.451191:0:6294:0:(ldlm_lock.c:920:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10020000000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00000020:00010000:5.0:1391207220.451197:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) hold release lock: ffff88081acf4738@(3 ffff8807e6fd0080 1 5 0 2 0 2)(ffff88080d4a0c68/1/0) at cl_lock_hold_release():903
00000020:00010000:5.0:1391207220.451200:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) hold release lock: ffff88081acf4738@(2 ffff8807e6fd0080 1 5 0 1 0 2)(ffff88080d4a0db8/1/2) at cl_lock_hold_release():903
00000080:00200000:5.0:1391207220.451207:0:6294:0:(vvp_io.c:1163:vvp_io_init()) [0x200000400:0x2:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0
00000080:00200000:5.0:1391207220.452460:0:6294:0:(vvp_io.c:153:vvp_io_fini()) [0x200000400:0x2:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0
00010000:00010000:5.0:1391207220.452485:0:6294:0:(ldlm_request.c:1127:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x28400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00000020:00010000:5.0:1391207220.452493:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) cancel lock: ffff88081acf4738@(3 ffff8807e6fd0080 2 5 0 0 0 1)(ffff8807e0547920/1/1) at cl_lock_cancel():1836
00000020:00010000:5.0:1391207220.452496:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) delete lock: ffff88081acf4738@(3 ffff8807e6fd0080 2 5 0 0 0 1)(ffff8807e0547920/1/1) at cl_lock_delete():1783
00010000:00010000:5.0:1391207220.452523:0:6294:0:(ldlm_lock.c:375:ldlm_lock_destroy_internal()) ### destroy ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 2/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452529:0:6294:0:(ldlm_lock.c:382:ldlm_lock_destroy_internal()) ### destroy ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 2/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452534:0:6294:0:(ldlm_lock.c:2174:ldlm_lock_cancel()) ### destroyed ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 1/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452544:0:6294:0:(ldlm_request.c:1144:ldlm_cli_cancel_local()) ### client-side cancel completed, rc = 549755813888 ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 1/0,0 mode: --/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452554:0:6294:0:(ldlm_request.c:1187:ldlm_cancel_pack()) ### packing ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 1/0,0 mode: --/PR res: [0x2:0x0:0x0].0 rrc: 14 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452559:0:6294:0:(ldlm_request.c:1191:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:5.0:1391207220.452565:0:6294:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 0/0,0 mode: --/PR res: [0x2:0x0:0x0].0 rrc: 14 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00040000:5.0:1391207220.452570:0:6294:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( ((((lock))->l_flags & (1ULL << 50)) != 0) ) failed:
00010000:00040000:5.0:1391207220.577549:0:6294:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
Both line 375 and line 382 were printed, and the source is as follows:
361 int ldlm_lock_destroy_internal(struct ldlm_lock *lock)
362 {
363         ENTRY;
364
365         if (lock->l_readers || lock->l_writers) {
366                 LDLM_ERROR(lock, "lock still has references");
367                 LBUG();
368         }
369
370         if (!cfs_list_empty(&lock->l_res_link)) {
371                 LDLM_ERROR(lock, "lock still on resource");
372                 LBUG();
373         }
374
375         LDLM_DEBUG(lock, "destroy");
376         if (ldlm_is_destroyed(lock)) {
377                 LASSERT(cfs_list_empty(&lock->l_lru));
378                 EXIT;
379                 return 0;
380         }
381         ldlm_set_destroyed(lock);
382         LDLM_DEBUG(lock, "destroy");
However, the destroyed flag failed to get set. This is beyond my understanding, and we should escalate this issue.
Hi Jinshan,
When a top lock enqueue fails, the top lock goes to cl_lock_hold_release() and then unholds a lovsub lock, which leads to osc_lock_cancel(). If at that time the osc lock is in use by another IO, the ldlm lock is not canceled:
do_cancel = (dlmlock->l_readers == 0 &&
dlmlock->l_writers == 0);
ldlm_set_cbpending(dlmlock);
unlock_res_and_lock(dlmlock);
if (do_cancel)
result = ldlm_cli_cancel(&olck->ols_handle, LCF_ASYNC);
}
and the osc lock is detached from the lovsub lock in cl_lock_hold_release():
	if (lock->cll_holds == 0) {
		CL_LOCK_ASSERT(lock->cll_state != CLS_HELD, env, lock);
		if (lock->cll_descr.cld_mode == CLM_PHANTOM ||
		    lock->cll_descr.cld_mode == CLM_GROUP ||
		    lock->cll_state != CLS_CACHED)
			/*
			 * If lock is still phantom or grouplock when user is
			 * done with it---destroy the lock.
			 */
			lock->cll_flags |= CLF_CANCELPEND|CLF_DOOMED;
		if (lock->cll_flags & CLF_CANCELPEND) {
			lock->cll_flags &= ~CLF_CANCELPEND;
			cl_lock_cancel0(env, lock);
		}
		if (lock->cll_flags & CLF_DOOMED) {
			/* no longer doomed: it's dead... Jim. */
			lock->cll_flags &= ~CLF_DOOMED;
			cl_lock_delete0(env, lock);
		}
	}
Is it possible that the ldlm lock somehow missed the cancel process somewhere, so that at its last dereference the destroyed flag is still not set?
Li, Jinshan - Are either of you able to share how you're reproducing this?