[LU-2720] osc_page_delete() ASSERTION(0) failed Created: 31/Jan/13 Updated: 15/Apr/16 Resolved: 23/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Vitaly Fertman | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB, patch | ||
| Severity: | 3 |
| Rank (Obsolete): | 6614 |
| Description |
|
it was already posted to
2012-11-06T14:57:10.872333-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060eedfe58@ {[23 -> 23/255], [2|0|-|cache|wi|ffff88020e18f8c8], [4096|1|+|-|ffff8801f9ed9c18|256| (null)]} trunc at 23. |
| Comments |
| Comment by Vitaly Fertman [ 31/Jan/13 ] |
|
1. The ENODATA handling code in ll_page_mkwrite0 writes the page and invalidates it, but the page could have been re-added to the cache between these two steps. At the same time, it is not clear why we wanted to just PEEK the lock rather than wait for a new one: we want a writable page, so we need a lock, and if the old one is cancelled we will have to request a new one anyway. 2. Even though mkwrite will finish much faster with no new lock request, later we still want to make the page writable and will have to request a new lock anyway, so it is not faster in general. The PEEK "optimisation" is therefore unclear and troublesome, so the patch I sent drops it. |
| Comment by Vitaly Fertman [ 31/Jan/13 ] |
| Comment by Andreas Dilger [ 05/Feb/13 ] |
|
Vitaly, can you please include some information about how this problem was initially hit (e.g. test load, frequency of being hit, etc). |
| Comment by Wally Wang (Inactive) [ 07/Feb/13 ] |
|
We hit this often when running fsx-linux from LTP. Usually it happens during a stress run in an hour or two. We haven't seen this bug after applying this patch together with |
| Comment by Vitaly Fertman [ 08/Feb/13 ] |
|
After talking to Jay, we decided not to change cl_lock_peek, because it may return REPEAT only during lock cancelling or a glimpse AST. At the same time, lock cancelling may take a long time and we do not sleep here, so this looping will consume CPU resources. Since after this patch cl_lock_peek is used for SOM only, the result may be that the only ioepoch holder does not provide an attribute update in done_writing and the MDS re-asks for it. If there is a need to minimize the number of these RPCs, a sleeping version of cl_lock_peek should be implemented. |
| Comment by Keith Mannthey (Inactive) [ 08/Feb/13 ] |
|
Are you going to drop http://review.whamcloud.com/5222 ? |
| Comment by Peter Jones [ 23/Feb/13 ] |
|
Landed for 2.4 |
| Comment by Cory Spitz [ 06/Mar/13 ] |
|
Can someone please comment about change #5222 landing considering the comments and questions from 08/Feb/13? Cray is seeing significant CPU spinning in 2.4 RC testing, but Wally would have to confirm if change #5222 is the cause. |
| Comment by Oleg Drokin [ 06/Mar/13 ] |
|
Vitaly's comment relates to patchset #2; he has since added patchset #3, which fixes the potential issue discussed. |
| Comment by Cory Spitz [ 06/Mar/13 ] |
|
Ah, thanks Oleg. Sure, we'll keep looking at our spin issue. |
| Comment by Peter Jones [ 06/Mar/13 ] |
|
Cory, it would be best to open a new ticket with details of the CPU spinning issue that you are seeing. Peter |
| Comment by Cory Spitz [ 08/May/13 ] |
|
Thanks, it is |