Details
- Bug
- Resolution: Fixed
- Blocker
- Lustre 2.6.0
- Hyperion/LLNL
- 3
- 11723
Description
Running SWL - miranda IO test.
2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
2013-11-16 17:27:14 LustreError: 105183:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
2013-11-16 17:27:14 Pid: 105183, comm: miranda_io
2013-11-16 17:27:14 Call Trace:
2013-11-16 17:27:14  [<ffffffffa056b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-11-16 17:27:14  [<ffffffffa056be97>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-11-16 17:27:14  [<ffffffffa080f94d>] ldlm_lock_put+0x44d/0x560 [ptlrpc]
2013-11-16 17:27:14  [<ffffffffa0822d22>] ldlm_cli_cancel_list+0xf2/0x3e0 [ptlrpc]
2013-11-16 17:27:14  [<ffffffffa0824952>] ldlm_cli_cancel+0x132/0x360 [ptlrpc]
2013-11-16 17:27:14  [<ffffffffa09fdbce>] osc_lock_cancel+0xfe/0x1c0 [osc]
2013-11-16 17:27:14  [<ffffffffa0708525>] cl_lock_cancel0+0x75/0x160 [obdclass]
2013-11-16 17:27:14  [<ffffffffa07090db>] cl_lock_cancel+0x13b/0x140 [obdclass]
2013-11-16 17:27:14  [<ffffffffa070d7e4>] cl_lock_enqueue_wait+0xc4/0x2d0 [obdclass]
2013-11-16 17:27:14  [<ffffffffa070aa17>] ? cl_lock_mutex_put+0x77/0x90 [obdclass]
2013-11-16 17:27:14  [<ffffffffa0a9126b>] lov_lock_enqueue+0x3fb/0x850 [lov]
2013-11-16 17:27:14  [<ffffffffa070c65c>] cl_enqueue_try+0xfc/0x300 [obdclass]
2013-11-16 17:27:14  [<ffffffffa070da5f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
2013-11-16 17:27:14  [<ffffffffa070e6ae>] cl_lock_request+0x7e/0x270 [obdclass]
2013-11-16 17:27:14  [<ffffffffa071391c>] cl_io_lock+0x3cc/0x560 [obdclass]
2013-11-16 17:27:14  [<ffffffffa0713b52>] cl_io_loop+0xa2/0x1b0 [obdclass]
2013-11-16 17:27:14  [<ffffffffa0b0e290>] ll_file_io_generic+0x460/0x610 [lustre]
2013-11-16 17:27:14  [<ffffffffa0701e39>] ? cl_env_get+0x29/0x350 [obdclass]
2013-11-16 17:27:14  [<ffffffffa0b0ecb2>] ll_file_aio_write+0x142/0x2c0 [lustre]
2013-11-16 17:27:14  [<ffffffffa0b0ef9c>] ll_file_write+0x16c/0x2a0 [lustre]
2013-11-16 17:27:14  [<ffffffff81181398>] vfs_write+0xb8/0x1a0
2013-11-16 17:27:14  [<ffffffff81181c91>] sys_write+0x51/0x90
2013-11-16 17:27:14  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
lctl dump attached.
Attachments
- analysis.txt (334 kB)
- iwc.console.gz (29 kB)
- iwc106.sysrq.txt (375 kB)
- iwc106.sysrq.txt.gz (29 kB)
- iwc48.lctl.log.txt.gz (0.2 kB)
- log.bz2 (525 kB)
- LU-4268-140123-log.tar.gz (1.75 MB)
- vmcore-dmesg.txt (78 kB)
Activity
Li, Jinshan - Are either of you able to share how you're reproducing this?
Ah, yeah, the right lock order is lock->l_lock, res->lr_lock, ns->ns_lock. Thanks for pointing that error out.
We are able to reproduce the problem reliably too. If you could push a patch, we can check whether it helps. I feel that there is more than one place where lock->l_flags is changed without the protection of lock->l_lock.
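For reference, a minimal user-space sketch (not Lustre code; the flag values, lock names, and thread bodies are made up for illustration) of the lost-update hazard such an unprotected read-modify-write creates when two threads update the same flags word under different locks:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define FLAG_DESTROYED (1ULL << 50)     /* illustrative bit values only */
#define FLAG_SKIPPED   (1ULL << 40)

static uint64_t flags;
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;  /* plays the role of l_lock  */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;  /* plays the role of ns_lock */

static void *set_destroyed(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock_a);
        flags |= FLAG_DESTROYED;        /* read-modify-write under lock_a only */
        pthread_mutex_unlock(&lock_a);
        return NULL;
}

static void *clear_skipped(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock_b);
        flags &= ~FLAG_SKIPPED;         /* read-modify-write under lock_b only */
        pthread_mutex_unlock(&lock_b);
        return NULL;
}

int main(void)
{
        int i, lost = 0;

        for (i = 0; i < 100000; i++) {
                pthread_t t1, t2;

                flags = FLAG_SKIPPED;
                pthread_create(&t1, NULL, set_destroyed, NULL);
                pthread_create(&t2, NULL, clear_skipped, NULL);
                pthread_join(t1, NULL);
                pthread_join(t2, NULL);
                if (!(flags & FLAG_DESTROYED))
                        lost++;         /* the DESTROYED bit was overwritten */
        }
        printf("lost DESTROYED updates: %d/100000\n", lost);
        return 0;
}

Built with -pthread, any run that reports a nonzero count has lost the DESTROYED bit in exactly the kind of race suspected here, because the two mutexes never serialize against each other.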
Hi Li Xi,
That looks true. Somehow I read the code wrong. The DLM lock has its destroyed flag set before it is taken out of the LRU, so ldlm_clear_skipped() in ldlm_lock_remove_from_lru_nolock() is indeed the root cause of this problem.
Your patch would cause a deadlock, because we should take the res lock first and then ns_lock. My idea for fixing this problem is to move ldlm_clear_skipped() out of ldlm_lock_remove_from_lru_nolock() and into ldlm_lock_add_to_lru_nolock(). I think this will fix the problem as well.
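A rough sketch of that idea (a concept only, not the actual patch; the ldlm_lock_add_to_lru_nolock() body below is paraphrased from the 2.x tree and may not match it exactly):

/* Clear the SKIPPED flag when the lock is (re)added to the LRU rather
 * than when it is removed, so a lock that is being destroyed is never
 * modified from the LRU-removal path. */
static void ldlm_lock_add_to_lru_nolock(struct ldlm_lock *lock)
{
        struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);

        LASSERT(cfs_list_empty(&lock->l_lru));
        LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
        ldlm_clear_skipped(lock);       /* moved here from the removal path */
        cfs_list_add_tail(&lock->l_lru, &ns->ns_unused_list);
        LASSERT(ns->ns_nr_unused >= 0);
        ns->ns_nr_unused++;
}

int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock)
{
        int rc = 0;

        if (!cfs_list_empty(&lock->l_lru)) {
                struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);

                LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
                cfs_list_del_init(&lock->l_lru);
                /* no ldlm_clear_skipped() here any more */
                LASSERT(ns->ns_nr_unused > 0);
                ns->ns_nr_unused--;
                rc = 1;
        }
        return rc;
}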
Hi Jinshan,
According to your test results, this problem really looks like a race. Would you please try the following patch to check whether the lock is always held when clearing/setting the flags?
The flags should be protected by ldlm_lock::l_lock. When I saw this issue, my first reaction was to check whether there are unprotected writes to l_flags, and the only place I found is:
/**
 * Removes LDLM lock \a lock from LRU. Assumes LRU is already locked.
 */
int ldlm_lock_remove_from_lru_nolock(struct ldlm_lock *lock)
{
        int rc = 0;

        if (!cfs_list_empty(&lock->l_lru)) {
                struct ldlm_namespace *ns = ldlm_lock_to_ns(lock);

                LASSERT(lock->l_resource->lr_type != LDLM_FLOCK);
                cfs_list_del_init(&lock->l_lru);
                ldlm_clear_skipped(lock);
                LASSERT(ns->ns_nr_unused > 0);
                ns->ns_nr_unused--;
                rc = 1;
        }
        return rc;
}
However, it is protected by ns_lock, and if the lock is being destroyed, it must already have been taken out of the LRU list.
Jinshan, according to the comment, the flags should be protected by lr_lock. Are there flag changes not protected by this lock?
Also, I have some concern that using set_bit() will not set the same bits as the Lustre network protocol on all architectures. Do you know if there is a specific bit ordering used in all cases?
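To make that concrete, here is a hypothetical user-space sketch (not Lustre code) that emulates set_bit() with 32-bit words: bit 50 set this way only produces the same in-memory image as the (1ULL << 50) mask when word size and endianness line up (e.g. a 64-bit or 32-bit little-endian host), but not on a 32-bit big-endian machine, which is the sort of mismatch the concern above is about:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* emulate set_bit() as a kernel with 32-bit "unsigned long" would do it */
static void set_bit32(unsigned int nr, uint32_t *words)
{
        words[nr / 32] |= (uint32_t)1 << (nr % 32);
}

int main(void)
{
        uint64_t mask_flags = 0;
        uint32_t bitop_flags[2] = { 0, 0 };

        mask_flags |= 1ULL << 50;       /* current LDLM_SET_FLAG() style      */
        set_bit32(50, bitop_flags);     /* proposed bitops style, 32-bit long */

        printf("mask  : %016llx\n", (unsigned long long)mask_flags);
        printf("bitops: words[0]=%08x words[1]=%08x\n",
               (unsigned int)bitop_flags[0], (unsigned int)bitop_flags[1]);
        /* On a little-endian host the two memory images coincide; on a
         * big-endian host with 32-bit longs they would not, because word 1
         * of the bitop array is not the high half of the __u64. */
        printf("same memory image: %s\n",
               memcmp(&mask_flags, bitop_flags, sizeof(mask_flags)) == 0 ?
               "yes" : "no");
        return 0;
}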
I confirmed that changing the DLM flag operations to bitops fixed the problem.
Check the attachment for the concept patch. The real patch should be worked out by modifying lustre/contrib/bit-masks/lustre_dlm_flags.tpl.
diff --git a/lustre/include/lustre_dlm.h b/lustre/include/lustre_dlm.h
index 98e03c9..bf4fb65 100644
--- a/lustre/include/lustre_dlm.h
+++ b/lustre/include/lustre_dlm.h
@@ -772,7 +772,7 @@ struct ldlm_lock {
 	 * Lock state flags. Protected by lr_lock.
 	 * \see lustre_dlm_flags.h where the bits are defined.
 	 */
-	__u64			l_flags;
+	volatile __u64		l_flags;
 
 	/**
 	 * Lock r/w usage counters.
diff --git a/lustre/include/lustre_dlm_flags.h b/lustre/include/lustre_dlm_flags.h
index 283546d..4be532f 100644
--- a/lustre/include/lustre_dlm_flags.h
+++ b/lustre/include/lustre_dlm_flags.h
@@ -57,91 +57,91 @@
 /** extent, mode, or resource changed */
 #define LDLM_FL_LOCK_CHANGED            0x0000000000000001ULL // bit 0
-#define ldlm_is_lock_changed(_l)        LDLM_TEST_FLAG((  _l), 1ULL <<  0)
-#define ldlm_set_lock_changed(_l)       LDLM_SET_FLAG((   _l), 1ULL <<  0)
-#define ldlm_clear_lock_changed(_l)     LDLM_CLEAR_FLAG(( _l), 1ULL <<  0)
+#define ldlm_is_lock_changed(_l)        LDLM_TEST_FLAG((  _l), 0)
+#define ldlm_set_lock_changed(_l)       LDLM_SET_FLAG((   _l), 0)
+#define ldlm_clear_lock_changed(_l)     LDLM_CLEAR_FLAG(( _l), 0)
......
......
 /** test for ldlm_lock flag bit set */
-#define LDLM_TEST_FLAG(_l, _b)  (((_l)->l_flags & (_b)) != 0)
+#define LDLM_TEST_FLAG(_l, _b)  test_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
 
 /** multi-bit test: are any of mask bits set? */
 #define LDLM_HAVE_MASK(_l, _m)  (((_l)->l_flags & LDLM_FL_##_m##_MASK) != 0)
 
 /** set a ldlm_lock flag bit */
-#define LDLM_SET_FLAG(_l, _b)   ((_l)->l_flags |= (_b))
+#define LDLM_SET_FLAG(_l, _b)   set_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
 
 /** clear a ldlm_lock flag bit */
-#define LDLM_CLEAR_FLAG(_l, _b) ((_l)->l_flags &= ~(_b))
+#define LDLM_CLEAR_FLAG(_l, _b) clear_bit(_b, (volatile unsigned long *)&(_l)->l_flags)
 
 /** @} subgroup */
 /** @} group */
I think the prime suspect would be some place that updates the lock flags without taking the lock's spinlock. There are several such places.
I happened to be able to reproduce this issue reliably while I was working on another one.
After I applied the following patch:
diff --git a/lustre/ldlm/ldlm_lock.c b/lustre/ldlm/ldlm_lock.c
index 214ee8c..2e69ad0 100644
--- a/lustre/ldlm/ldlm_lock.c
+++ b/lustre/ldlm/ldlm_lock.c
@@ -372,12 +372,14 @@ int ldlm_lock_destroy_internal(struct ldlm_lock *lock)
 		LBUG();
 	}
 
+	LDLM_DEBUG(lock, "destroy");
 	if (ldlm_is_destroyed(lock)) {
 		LASSERT(cfs_list_empty(&lock->l_lru));
 		EXIT;
 		return 0;
 	}
 	ldlm_set_destroyed(lock);
+	LDLM_DEBUG(lock, "destroy");
 
 	if (lock->l_export && lock->l_export->exp_lock_hash) {
 		/* NB: it's safe to call cfs_hash_del() even lock isn't
@@ -2169,6 +2171,7 @@ void ldlm_lock_cancel(struct ldlm_lock *lock)
 
 	ldlm_resource_unlink_lock(lock);
 	ldlm_lock_destroy_nolock(lock);
+	LDLM_DEBUG(lock, "destroyed");
 
 	if (lock->l_granted_mode == lock->l_req_mode)
 		ldlm_pool_del(&ns->ns_pool, lock);
diff --git a/lustre/ldlm/ldlm_request.c b/lustre/ldlm/ldlm_request.c
index e140e0f..def3ee5 100644
--- a/lustre/ldlm/ldlm_request.c
+++ b/lustre/ldlm/ldlm_request.c
@@ -1141,6 +1141,7 @@ static __u64 ldlm_cli_cancel_local(struct ldlm_lock *lock)
 			rc = LDLM_FL_LOCAL_ONLY;
 		}
 		ldlm_lock_cancel(lock);
+		LDLM_DEBUG(lock, "client-side cancel completed, rc = %lld", rc);
 	} else {
 		if (ns_is_client(ldlm_lock_to_ns(lock))) {
 			LDLM_ERROR(lock, "Trying to cancel local lock");
diff --git a/lustre/llite/llite_internal.h b/lustre/llite/llite_internal.h
index 92b278e..e9e7053 100644
--- a/lustre/llite/llite_internal.h
+++ b/lustre/llite/llite_internal.h
@@ -1469,7 +1469,7 @@ static inline void cl_isize_unlock(struct inode *inode)
 
 static inline void cl_isize_write_nolock(struct inode *inode, loff_t kms)
 {
-	LASSERT(down_trylock(&ll_i2info(inode)->lli_size_sem) != 0);
+	//LASSERT(down_trylock(&ll_i2info(inode)->lli_size_sem) != 0);
 	i_size_write(inode, kms);
 }
 
diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index df421d6..ae53f3d 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -1700,25 +1700,29 @@ int ll_statfs(struct dentry *de, struct kstatfs *sfs)
 
 void ll_inode_size_lock(struct inode *inode)
 {
+#if 0
 	struct ll_inode_info *lli;
 
 	LASSERT(!S_ISDIR(inode->i_mode));
 
 	lli = ll_i2info(inode);
 	LASSERT(lli->lli_size_sem_owner != current);
-	down(&lli->lli_size_sem);
+	//down(&lli->lli_size_sem);
 	LASSERT(lli->lli_size_sem_owner == NULL);
 	lli->lli_size_sem_owner = current;
+#endif
 }
 
 void ll_inode_size_unlock(struct inode *inode)
 {
+#if 0
 	struct ll_inode_info *lli;
 
 	lli = ll_i2info(inode);
 	LASSERT(lli->lli_size_sem_owner == current);
 	lli->lli_size_sem_owner = NULL;
-	up(&lli->lli_size_sem);
+	//up(&lli->lli_size_sem);
+#endif
 }
 
 void ll_update_inode(struct inode *inode, struct lustre_md *md)
diff --git a/lustre/obdclass/cl_lock.c b/lustre/obdclass/cl_lock.c
index d440da9..b8c2e04 100644
--- a/lustre/obdclass/cl_lock.c
+++ b/lustre/obdclass/cl_lock.c
@@ -493,6 +493,7 @@ static struct cl_lock *cl_lock_lookup(const struct lu_env *env,
                                       const struct cl_io *io,
                                       const struct cl_lock_descr *need)
 {
+#if 0
 	struct cl_lock          *lock;
 	struct cl_object_header *head;
 
@@ -518,6 +519,7 @@ static struct cl_lock *cl_lock_lookup(const struct lu_env *env,
 			RETURN(lock);
 		}
 	}
+#endif
 	RETURN(NULL);
 }
 
@@ -1964,6 +1966,8 @@ static struct cl_lock *cl_lock_hold_mutex(const struct lu_env *env,
 			cl_lock_hold_mod(env, lock, +1);
 			lu_ref_add(&lock->cll_holders, scope, source);
 			lu_ref_add(&lock->cll_reference, scope, source);
+
+			cl_lock_cancel(env, lock);
 			break;
 		}
 		cl_lock_mutex_put(env, lock);
and used 32 processes to read the same file at the same time, the reading processes hit the LBUG at the same place:
Call Trace:
 [<ffffffffa03fd895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa03fde97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0774fb7>] ldlm_lock_put+0x547/0x630 [ptlrpc]
 [<ffffffffa0793df2>] ldlm_cli_cancel_list+0xf2/0x3e0 [ptlrpc]
 [<ffffffffa0795a22>] ldlm_cli_cancel+0x132/0x360 [ptlrpc]
 [<ffffffffa0a8e72e>] osc_lock_cancel+0x10e/0x1d0 [osc]
 [<ffffffffa059c045>] cl_lock_cancel0+0x75/0x160 [obdclass]
 [<ffffffffa059d9d6>] cl_lock_hold_release+0x1c6/0x2a0 [obdclass]
 [<ffffffffa059ecc7>] cl_lock_unhold+0x37/0x130 [obdclass]
 [<ffffffffa0af8a68>] lov_sublock_release+0x1a8/0x280 [lov]
 [<ffffffffa0afa4fe>] lov_lock_unuse+0x1be/0x290 [lov]
 [<ffffffffa059c205>] cl_unuse_try_internal+0x55/0xe0 [obdclass]
 [<ffffffffa059f0a9>] cl_unuse_try+0x199/0x320 [obdclass]
 [<ffffffffa059f267>] cl_unuse_locked+0x37/0x1a0 [obdclass]
 [<ffffffffa059f40e>] cl_unuse+0x3e/0x120 [obdclass]
 [<ffffffffa05a5e57>] cl_io_unlock+0x127/0x2b0 [obdclass]
 [<ffffffffa05a6aed>] cl_io_loop+0xcd/0x1b0 [obdclass]
 [<ffffffffa0f7c9d6>] ll_file_io_generic+0x2b6/0x710 [lustre]
 [<ffffffffa0596db9>] ? cl_env_get+0x29/0x350 [obdclass]
 [<ffffffffa0f7cf6f>] ll_file_aio_read+0x13f/0x2c0 [lustre]
 [<ffffffffa0f7d42c>] ll_file_read+0x16c/0x2a0 [lustre]
 [<ffffffff81181a95>] vfs_read+0xb5/0x1a0
 [<ffffffff81181bd1>] sys_read+0x51/0x90
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
LustreError: dumping log to /tmp/lustre-log.1391207221.6283
However, from the log, I can see this:
00000020:00010000:5.0:1391207220.451167:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) unuse lock: ffff88081acf48e8@(2 ffff8807e6fd0080 1 3 0 1 1 2)(ffff88080d4a0db8/0/1) at cl_unuse_try():1363
00000020:00010000:5.0:1391207220.451172:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) enclosure lock: ffff88081acf4738@(1 (null) 0 3 0 1 1 2)(ffff88080d4a0db8/1/1) at cl_lock_enclosure():1691
00000020:00010000:5.0:1391207220.451175:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) enclosure lock: ffff88081acf48e8@(3 ffff8807e6fd0080 1 4 0 2 1 2)(ffff88080d4a0db8/0/2) at cl_lock_enclosure():1691
00000020:00010000:5.0:1391207220.451178:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) unuse lock: ffff88081acf4738@(2 ffff8807e6fd0080 1 3 0 1 1 2)(ffff88080d4a0c68/1/0) at cl_unuse_try():1363
00010000:00010000:5.0:1391207220.451185:0:6294:0:(ldlm_lock.c:852:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 4/1,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10020000000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.451191:0:6294:0:(ldlm_lock.c:920:ldlm_lock_decref_internal()) ### add lock into lru list ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10020000000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00000020:00010000:5.0:1391207220.451197:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) hold release lock: ffff88081acf4738@(3 ffff8807e6fd0080 1 5 0 2 0 2)(ffff88080d4a0c68/1/0) at cl_lock_hold_release():903
00000020:00010000:5.0:1391207220.451200:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) hold release lock: ffff88081acf4738@(2 ffff8807e6fd0080 1 5 0 1 0 2)(ffff88080d4a0db8/1/2) at cl_lock_hold_release():903
00000080:00200000:5.0:1391207220.451207:0:6294:0:(vvp_io.c:1163:vvp_io_init()) [0x200000400:0x2:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0
00000080:00200000:5.0:1391207220.452460:0:6294:0:(vvp_io.c:153:vvp_io_fini()) [0x200000400:0x2:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0
00010000:00010000:5.0:1391207220.452485:0:6294:0:(ldlm_request.c:1127:ldlm_cli_cancel_local()) ### client-side cancel ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 3/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x28400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00000020:00010000:5.0:1391207220.452493:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) cancel lock: ffff88081acf4738@(3 ffff8807e6fd0080 2 5 0 0 0 1)(ffff8807e0547920/1/1) at cl_lock_cancel():1836
00000020:00010000:5.0:1391207220.452496:0:6294:0:(cl_lock.c:151:cl_lock_trace0()) delete lock: ffff88081acf4738@(3 ffff8807e6fd0080 2 5 0 0 0 1)(ffff8807e0547920/1/1) at cl_lock_delete():1783
00010000:00010000:5.0:1391207220.452523:0:6294:0:(ldlm_lock.c:375:ldlm_lock_destroy_internal()) ### destroy ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 2/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452529:0:6294:0:(ldlm_lock.c:382:ldlm_lock_destroy_internal()) ### destroy ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 2/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452534:0:6294:0:(ldlm_lock.c:2174:ldlm_lock_cancel()) ### destroyed ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 1/0,0 mode: PR/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x10869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452544:0:6294:0:(ldlm_request.c:1144:ldlm_cli_cancel_local()) ### client-side cancel completed, rc = 549755813888 ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 1/0,0 mode: --/PR res: [0x2:0x0:0x0].0 rrc: 15 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452554:0:6294:0:(ldlm_request.c:1187:ldlm_cancel_pack()) ### packing ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 1/0,0 mode: --/PR res: [0x2:0x0:0x0].0 rrc: 14 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00010000:5.0:1391207220.452559:0:6294:0:(ldlm_request.c:1191:ldlm_cancel_pack()) 1 locks packed
00010000:00010000:5.0:1391207220.452565:0:6294:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: lustre-OST0000-osc-ffff8808198eb000 lock: ffff8808152de340/0x6d44b209b7d2fe79 lrc: 0/0,0 mode: --/PR res: [0x2:0x0:0x0].0 rrc: 14 type: EXT [0->18446744073709551615] (req 1146880->1150975) flags: 0x869400000000 nid: local remote: 0x6d44b209b7d2fe95 expref: -99 pid: 6294 timeout: 0 lvb_type: 1
00010000:00040000:5.0:1391207220.452570:0:6294:0:(ldlm_lock.c:222:ldlm_lock_put()) ASSERTION( (((( lock))->l_flags & (1ULL << 50)) != 0) ) failed:
00010000:00040000:5.0:1391207220.577549:0:6294:0:(ldlm_lock.c:222:ldlm_lock_put()) LBUG
Both lines 375 and 382 were printed, and the source is as follows:
361 int ldlm_lock_destroy_internal(struct ldlm_lock *lock)
362 {
363         ENTRY;
364
365         if (lock->l_readers || lock->l_writers) {
366                 LDLM_ERROR(lock, "lock still has references");
367                 LBUG();
368         }
369
370         if (!cfs_list_empty(&lock->l_res_link)) {
371                 LDLM_ERROR(lock, "lock still on resource");
372                 LBUG();
373         }
374
375         LDLM_DEBUG(lock, "destroy");
376         if (ldlm_is_destroyed(lock)) {
377                 LASSERT(cfs_list_empty(&lock->l_lru));
378                 EXIT;
379                 return 0;
380         }
381         ldlm_set_destroyed(lock);
382         LDLM_DEBUG(lock, "destroy");
However, the destroyed flag still failed to get set. This is beyond my understanding, and we should escalate this issue.
The root cause of this issue is pretty clear.
Bobijam, can you please create a patch for this?