[LU-1217] ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed Created: 14/Mar/12 Updated: 04/Jun/12 Resolved: 04/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Alexandre Louvet | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4609 |
| Description |
|
The system crashed with the following assertion: LustreError: 35225:0:(osc_page.c:293:osc_page_putref_lock()) ASSERTION(cfs_atomic_read(&lock->cll_ref) > 1 || olock->ols_hold == 0) failed Call Trace: Unless I did something really wrong, at the time of the crash the system was highly loaded with many interactive commands and suffered disconnections from some OSTs and the MDT. |
| Comments |
| Comment by Peter Jones [ 14/Mar/12 ] |
|
Bobi, could you please look into this one? Thanks, Peter |
| Comment by Jinshan Xiong (Inactive) [ 14/Mar/12 ] |
|
It's impossible for this atomic count to be 258 if it hit this assertion. I don't know why the check is `cfs_atomic_read(&lock->cll_ref) > 1' instead of `> 0'. |
| Comment by Peter Jones [ 23/Mar/12 ] |
|
Alexandre, you mentioned on yesterday's call that it would be possible to supply a crash dump for this ticket. That would definitely help expedite our understanding of this scenario. Thanks, Peter |
| Comment by Zhenyu Xu [ 23/Mar/12 ] |
|
Would you mind uploading the crash image, kernel image, and symbol files to our FTP site? |
| Comment by Ned Bass [ 23/Mar/12 ] |
|
We hit this today as well on one of our classified 2.1 clients. We have a crash dump, but can't upload it. I'm happy to do any analysis of the dump at your direction. |
| Comment by Zhenyu Xu [ 26/Mar/12 ] |
|
First, we need to know the content of the troublesome cl_lock. |
| Comment by Ned Bass [ 26/Mar/12 ] |
I've only been able to find one address on the stack that looks like a cl_lock. But with lock->cll_ref = 99 it shouldn't have hit the assertion. Maybe something touched that memory after the atomic read?
struct cl_lock {
cll_ref = {
counter = 99
},
cll_layers = {
next = 0xffff88...
prev = 0xffff88...
},
cll_linkage = {
next = 0xffff88...
prev = 0xffff88...
},
cll_descr = {
cld_obj = 0xffff88...
cld_start = 0,
cld_end = 18446744073709551615,
cld_gid = 0,
cld_mode = CLM_READ,
cld_enq_flags = 0
},
cll_state = CLS_HELD,
cll_wq = {
lock = {
raw_lock = {
slock = 169675293
}
},
task_list = {
next = 0xffff88...
prev = 0xffff88...
}
},
cll_guard = {
count = {
counter = 1
},
wait_lock = {
raw_lock = {
slock = 1179666
}
},
wait_list = {
next = 0xffff88...
prev = 0xffff88...
},
owner = 0x0
},
cll_guarder = 0x0,
cll_depth = 0,
cll_intransit_owner = 0x0,
cll_error = 0,
cll_holds = 1,
cll_users = 1,
cll_flags = 0,
cll_inclosure = {
next = 0xffff88...
prev = 0xffff88...
},
cll_conflict = 0x0,
cll_reference = {<No data fields>},
cll_holders = {<No data fields>},
cll_obj_ref = 0x0
}
|
| Comment by Jinshan Xiong (Inactive) [ 26/Mar/12 ] |
|
Let's try to print the lock state precisely with this patch: http://review.whamcloud.com/#change,2383 |
| Comment by Ned Bass [ 26/Mar/12 ] |
|
OK, but it will be several weeks before we can update this system, and even then the bug may be hard to reproduce. |
| Comment by Jinshan Xiong (Inactive) [ 06/Apr/12 ] |
|
From the attached crash dump, the lock looks okay, so I don't know why the assertion fired.
crash> p *(struct osc_lock *)0xffff881fc381a140
$28 = {
ols_cl = {
cls_lock = 0xffff881dea5c9e78,
cls_obj = 0xffff881f30d9e4e0,
cls_ops = 0xffffffffa06537e0,
cls_linkage = {
next = 0xffff881dea5c9e80,
prev = 0xffff881e8c1e8938
}
},
ols_lock = 0xffff881e974df240,
ols_lvb = {
lvb_size = 1443889168,
lvb_mtime = 1329614555,
lvb_atime = 1329614548,
lvb_ctime = 1329614555,
lvb_blocks = 2820112
},
ols_flags = 1052673,
ols_handle = {
cookie = 8399946539139540594
},
ols_einfo = {
ei_type = 11,
ei_mode = 4,
ei_cb_bl = 0xffffffffa064ee10,
ei_cb_cp = 0xffffffffa064fcb0,
ei_cb_gl = 0xffffffffa064ea50,
ei_cb_wg = 0xffffffffa064e7a0,
ei_cbdata = 0xffff881fc381a140,
ei_async = 0
},
ols_state = OLS_GRANTED,
ols_pageref = {
counter = 256
},
ols_hold = 1,
ols_has_ref = 1,
ols_locklessable = 0,
ols_ast_wait = 0,
ols_flush = 0,
ols_glimpse = 0,
ols_owner = 0xffff881aac5471d0
}
crash> p *(struct cl_lock *)0xffff881dea5c9e78
$29 = {
cll_ref = {
counter = 258
},
cll_layers = {
next = 0xffff881e8c1e8938,
prev = 0xffff881fc381a158
},
cll_linkage = {
next = 0xffff8819d4c0df38,
prev = 0xffff8819d4c0df38
},
cll_descr = {
cld_obj = 0xffff8819d4c0df58,
cld_start = 0,
cld_end = 18446744073709551615,
cld_gid = 0,
cld_mode = CLM_READ,
cld_enq_flags = 0
},
cll_state = CLS_HELD,
cll_wq = {
lock = {
raw_lock = {
slock = 76350605
}
},
task_list = {
next = 0xffff881dea5c9ed8,
prev = 0xffff881dea5c9ed8
}
},
cll_guard = {
count = {
counter = 1
},
wait_lock = {
raw_lock = {
slock = 0
}
},
wait_list = {
next = 0xffff881dea5c9ef0,
prev = 0xffff881dea5c9ef0
},
owner = 0x0
},
cll_guarder = 0x0,
cll_depth = 0,
cll_intransit_owner = 0x0,
cll_error = 0,
cll_holds = 1,
cll_users = 1,
cll_flags = 0,
cll_inclosure = {
next = 0xffff881dea5c9f38,
prev = 0xffff881dea5c9f38
},
cll_conflict = 0x0,
cll_reference = {<No data fields>},
cll_holders = {<No data fields>},
cll_obj_ref = 0x0
}
|
| Comment by Oleg Drokin [ 06/Apr/12 ] |
|
Ned, do you also apply multi-threaded ptlrpcd patch on top of your 2.1 tree? |
| Comment by Christopher Morrone [ 06/Apr/12 ] |
|
No, we do not have multi-threaded ptlrpcd. |
| Comment by Jinshan Xiong (Inactive) [ 06/Apr/12 ] |
|
I have no idea what happened. Can you please apply the patch I mentioned so that we can see exactly what the lock state was? |
| Comment by Ned Bass [ 06/Apr/12 ] |
|
Hi Jinshan, we have applied the patch to our tree and it will be rolled out in the coming weeks. |
| Comment by Christopher Morrone [ 24/Apr/12 ] |
|
A sysadmin got a hit on a client on our secure network with your debugging patch applied. He could not cut-and-paste. What I know so far is just that your patch reported that the olock pointer is non-null and hold is 1. If there is something specific that you want from CL_LOCK_DEBUG, let me know and I'll either have the admin transcribe it or a developer can do it when we return from LUG. |
| Comment by Jinshan Xiong (Inactive) [ 24/Apr/12 ] |
|
I will need the detailed info of the lock to address the problem. |
| Comment by Christopher Morrone [ 25/Apr/12 ] |
|
Ah, yes, I see all values that cl_lock_print includes. We'll need Ned to copy those into this ticket when he gets back from LUG. |
| Comment by Jinshan Xiong (Inactive) [ 25/Apr/12 ] |
|
Yes, I gathered that from your previous comment as well. Let's do it after you come back from LUG. |
| Comment by Ned Bass [ 26/Apr/12 ] |
lock@ffff8803948bc978 [2 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100cf000:0x3e138b:0x0] {
lovsub@ffff8804340bd9a0 [0 ffff8804340db9a0: [ 0 ffff8801679d2d60 R(1):[256, 18446744073709551615]@[0x3f5c87bae:0x6a:0x0]]
osc@ffff88041a6671f8: ffff880429e7eb40 00101001 0xa29f718631a3c48d 3 ffff8803b773f2b8 size: 367001600 mtime: 1335282921 atime: 1335282807 ctime: 1335282921 blocks: 716808
} lock@ffff8803948bc978
olock: ffff88041a6671f8, hold: 1.
PID: 14280 COMMAND: "ptlrpcd-brw"
machine_kexec
crash_kexec
panic
lbug_with_loc
libcfs_assertion_failed
osc_page_putref_lock
osc_page_completion_read
cl_page_completion
osc_completion
osc_ap_completion
brw_interpret
ptlrpc_check_set
ptlrpcd_check
ptlrpcd
kernel_thread
|
| Comment by Jinshan Xiong (Inactive) [ 26/Apr/12 ] |
|
Ned, thank you for the information. I now realize this problem is a race between lock use and the assertion check. Fixing this problem is easy: just remove the assertion. |
| Comment by Jinshan Xiong (Inactive) [ 26/Apr/12 ] |
|
Please apply this patch: http://review.whamcloud.com/2604. |
| Comment by Christopher Morrone [ 27/Apr/12 ] |
|
Will do. |
| Comment by Peter Jones [ 04/Jun/12 ] |
|
Landed for 2.1.2 and 2.3 |