[LU-1306] LBUG at (ldlm_lock.c:213:ldlm_lock_add_to_lru_nolock()) ASSERTION(lock->l_resource->lr_type != LDLM_FLOCK) failed Created: 11/Apr/12  Updated: 06/Nov/13  Resolved: 06/Nov/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.3.0, Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Andriy Skulysh Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4630

 Description   

The following bug occurred:
> c0-0c0s5n1 LustreError: 20262:0:(ldlm_lock.c:213:ldlm_lock_add_to_lru_nolock()) ASSERTION(lock->l_resource->lr_type != LDLM_FLOCK) failed
> c0-0c0s5n1 LustreError: 20262:0:(ldlm_lock.c:213:ldlm_lock_add_to_lru_nolock()) LBUG
> c0-0c0s5n1 Pid: 20262, comm: fcntl17
> c0-0c0s5n1 Call Trace:
> c0-0c0s5n1 [<ffffffff81007a89>] try_stack_unwind+0x149/0x190
> c0-0c0s5n1 [<ffffffff81006420>] dump_trace+0x90/0x300
> c0-0c0s5n1 [<ffffffffa0132992>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
> c0-0c0s5n1 [<ffffffffa0132f01>] lbug_with_loc+0x71/0xe0 [libcfs]
> c0-0c0s5n1 [<ffffffffa013c461>] libcfs_assertion_failed+0x61/0x70 [libcfs]
> c0-0c0s5n1 [<ffffffffa0261348>] ldlm_lock_add_to_lru_nolock+0xd8/0xe0 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa02619d9>] ldlm_lock_add_to_lru+0x49/0x100 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa0266d28>] ldlm_lock_decref_internal+0x2e8/0x860 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa027d288>] failed_lock_cleanup+0x58/0x100 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa027d4e6>] ldlm_cli_enqueue_fini+0x1b6/0xbb0 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa0282541>] ldlm_cli_enqueue+0x1a1/0x760 [ptlrpc]
> c0-0c0s5n1 [<ffffffffa04a876b>] ll_file_flock+0x47b/0x690 [lustre]
> c0-0c0s5n1 [<ffffffff81122dee>] vfs_lock_file+0x1e/0x40
> c0-0c0s5n1 [<ffffffff81123027>] fcntl_setlk+0x167/0x320
> c0-0c0s5n1 [<ffffffff810f6661>] sys_fcntl+0x321/0x540
> c0-0c0s5n1 [<ffffffff81002eab>] system_call_fastpath+0x16/0x1b
> c0-0c0s5n1 [<00002aaaadd7f702>] 0x2aaaadd7f702

It looks like the problem is the following race:

ldlm_cb thread calls ldlm_run_cp_ast_work():
    lock_res_and_lock(lock);
    list_del_init(&lock->l_cp_ast);
    LASSERT(lock->l_flags & LDLM_FL_CP_REQD);
    /* save l_completion_ast since it can be changed by
     * mds_intent_policy(), see bug 14225 */
    completion_callback = lock->l_completion_ast;
    lock->l_flags &= ~LDLM_FL_CP_REQD;
    unlock_res_and_lock(lock);

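Meanwhile, the original lock-waiting thread receives a signal:
the signal callback ldlm_flock_interrupted_wait() does
    lock->l_flags |= LDLM_FL_CBPENDING;
without taking the resource lock.
l_wait_event() then exits with an error (signal occurred), and failed_lock_cleanup() fails on the assertion because LDLM_FL_CBPENDING was cleared by ldlm_run_cp_ast_work(): its update of l_flags is a read-modify-write, so it can apparently write back a value read before the unlocked update. With the flag gone, ldlm_lock_decref_internal() tries to add the flock lock to the LRU, as the call trace shows.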
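
The lost update in the race above can be replayed by hand in a few lines of C. This is only a sketch of the suspected interleaving, not Lustre code: the function names and the flag bit values are made up for illustration, and the two "threads" are simulated sequentially so the result is deterministic.

```c
/* Illustrative bit values only; NOT taken from the Lustre headers. */
#define FL_CBPENDING 0x01u
#define FL_CP_REQD   0x02u

/*
 * Replays the suspected interleaving by hand (hypothetical helper).
 * The "cp_ast" side's read-modify-write of the flags word is split
 * around the unlocked store made by the "signal" side.
 */
unsigned racy_interleaving(unsigned flags)
{
    unsigned snapshot = flags;        /* cp_ast side: load l_flags        */
    flags |= FL_CBPENDING;            /* signal side: unlocked OR         */
    flags = snapshot & ~FL_CP_REQD;   /* cp_ast side: store stale value   */
    return flags;
}

/*
 * Same two updates, but each one completes before the other starts,
 * which is what taking the same lock around both would guarantee.
 */
unsigned serialized(unsigned flags)
{
    flags &= ~FL_CP_REQD;             /* cp_ast side, under the lock      */
    flags |= FL_CBPENDING;            /* signal side, under the same lock */
    return flags;
}
```

Starting from a flags word with only FL_CP_REQD set, racy_interleaving() returns a value with FL_CBPENDING clear, while serialized() returns a value with FL_CBPENDING set. That matches the analysis: the signal path's update needs to be made under the same lock as the one in ldlm_run_cp_ast_work().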



 Comments   
Comment by Andriy Skulysh [ 11/Apr/12 ]

CODE http://review.whamcloud.com/2511

Comment by Peter Jones [ 08/May/12 ]

Landed for 2.3

Comment by Cory Spitz [ 09/May/12 ]

Can this push to b1_8?

Comment by Andriy Skulysh [ 09/May/12 ]

The bug was originally detected on b1_8. The patch can be applied to 1.8 as well.

Comment by Andriy Skulysh [ 14/May/12 ]

patch for b1_8: http://review.whamcloud.com/2727

Comment by Iurii Golovach (Inactive) [ 26/Jul/12 ]

Since there have been no updates in the last few months, a new ticket was created to track landing into 1.8:

http://jira.whamcloud.com/browse/LU-1677

Comment by Cory Spitz [ 10/Oct/12 ]

Change #2727 has landed on b1_8.

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-420

Comment by Sarah Liu [ 14/Jan/13 ]

Hit this LBUG again in the POSIX test during interop testing between a 2.3.0 server and a 2.4 client. The client runs build lustre-master #1142.

0:11:40:Lustre: DEBUG MARKER: Run POSIX test against lustre filesystem
20:20:53:LustreError: 12733:0:(ldlm_lock.c:1570:ldlm_fill_lvb()) ### Unexpected LVB type ns: lustre-MDT0000-mdc-ffff880061663400 lock: ffff88001f87b200/0x3aa7bfeb697ea484 lrc: 5/0,1 mode: --/PW res: 8589939620/4400 rrc: 4 type: FLK pid: 386 [10->29] flags: 0x0 nid: local remote: 0x3ba20de103a0d632 expref: -99 pid: 386 timeout: 0
20:21:35:LustreError: 386:0:(ldlm_lock.c:298:ldlm_lock_add_to_lru_nolock()) ASSERTION( lock->l_resource->lr_type != LDLM_FLOCK ) failed: 
20:21:35:LustreError: 386:0:(ldlm_lock.c:298:ldlm_lock_add_to_lru_nolock()) LBUG
20:21:35:Pid: 386, comm: T.fcntl
20:21:35:
20:21:35:Call Trace:
20:21:35: [<ffffffffa0b63905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
20:21:35: [<ffffffffa0b63f17>] lbug_with_loc+0x47/0xb0 [libcfs]
20:21:35: [<ffffffffa04eec02>] ldlm_lock_add_to_lru_nolock+0x112/0x120 [ptlrpc]
20:21:35: [<ffffffffa04ef023>] ldlm_lock_add_to_lru+0x43/0x120 [ptlrpc]
20:21:35: [<ffffffffa04f4b78>] ldlm_lock_decref_internal+0x338/0xad0 [ptlrpc]
20:21:35: [<ffffffffa05008fb>] failed_lock_cleanup+0x8b/0x220 [ptlrpc]
20:21:35: [<ffffffffa0500bbf>] ldlm_cli_enqueue_fini+0x12f/0xec0 [ptlrpc]
20:21:35: [<ffffffffa0b64bae>] ? cfs_free+0xe/0x10 [libcfs]
20:21:35: [<ffffffffa0501cfd>] ldlm_cli_enqueue+0x3ad/0x790 [ptlrpc]
20:21:35: [<ffffffffa050e160>] ? ldlm_flock_completion_ast+0x0/0xb40 [ptlrpc]
20:21:35: [<ffffffffa04341b4>] mdc_enqueue+0x694/0x1510 [mdc]
20:21:35: [<ffffffffa065227c>] lmv_enqueue+0x40c/0x1a20 [lmv]
20:21:35: [<ffffffffa07d1e05>] ll_file_flock+0x635/0x9f0 [lustre]
20:21:35: [<ffffffffa050e160>] ? ldlm_flock_completion_ast+0x0/0xb40 [ptlrpc]
20:21:35: [<ffffffff811c78c3>] vfs_lock_file+0x23/0x40
20:21:35: [<ffffffff811c7b17>] fcntl_setlk+0x177/0x320
20:21:35: [<ffffffff8118dd57>] sys_fcntl+0x197/0x530
20:21:35: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
20:21:35:
20:21:35:Kernel panic - not syncing: LBUG
20:21:35:Pid: 386, comm: T.fcntl Not tainted 2.6.32-279.14.1.el6.x86_64 #1
20:21:35:Call Trace:
20:21:35: [<ffffffff814fd98a>] ? panic+0xa0/0x168
20:21:35: [<ffffffffa0b63f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
20:21:35: [<ffffffffa04eec02>] ? ldlm_lock_add_to_lru_nolock+0x112/0x120 [ptlrpc]
20:21:35: [<ffffffffa04ef023>] ? ldlm_lock_add_to_lru+0x43/0x120 [ptlrpc]
20:21:35: [<ffffffffa04f4b78>] ? ldlm_lock_decref_internal+0x338/0xad0 [ptlrpc]
20:21:35: [<ffffffffa05008fb>] ? failed_lock_cleanup+0x8b/0x220 [ptlrpc]
20:21:35: [<ffffffffa0500bbf>] ? ldlm_cli_enqueue_fini+0x12f/0xec0 [ptlrpc]
20:21:35: [<ffffffffa0b64bae>] ? cfs_free+0xe/0x10 [libcfs]
20:21:35: [<ffffffffa0501cfd>] ? ldlm_cli_enqueue+0x3ad/0x790 [ptlrpc]
20:21:35: [<ffffffffa050e160>] ? ldlm_flock_completion_ast+0x0/0xb40 [ptlrpc]
20:21:35: [<ffffffffa04341b4>] ? mdc_enqueue+0x694/0x1510 [mdc]
20:21:35: [<ffffffffa065227c>] ? lmv_enqueue+0x40c/0x1a20 [lmv]
20:21:35: [<ffffffffa07d1e05>] ? ll_file_flock+0x635/0x9f0 [lustre]
20:21:36: [<ffffffffa050e160>] ? ldlm_flock_completion_ast+0x0/0xb40 [ptlrpc]
20:21:36: [<ffffffff811c78c3>] ? vfs_lock_file+0x23/0x40
20:21:36: [<ffffffff811c7b17>] ? fcntl_setlk+0x177/0x320
20:21:36: [<ffffffff8118dd57>] ? sys_fcntl+0x197/0x530
20:21:36: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
20:21:36:Initializing cgroup subsys cpuset
20:21:36:Initializing cgroup subsys cpu

Comment by Sarah Liu [ 15/Jan/13 ]

Another instance was seen with a 2.1.4 server and a 2.4 client:
https://maloo.whamcloud.com/test_sets/826a3efe-5f42-11e2-b507-52540035b04c

Comment by Andreas Dilger [ 06/Nov/13 ]

Patches were landed for b1_8 and master.
