[LU-2720] osc_page_delete()) ASSERTION(0) failed Created: 31/Jan/13  Updated: 15/Apr/16  Resolved: 23/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Vitaly Fertman Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: LB, patch

Issue Links:
Related
is related to LU-2557 osc_page_delete()) Trying to teardown... Resolved
Sub-Tasks:
Key
Summary
Type
Status
Assignee
LU-2788 Sanity test_132 (lu_object.c:1982:lu_... Technical task Resolved Keith Mannthey  
Severity: 3
Rank (Obsolete): 6614

 Description   

This was already posted to LU-1723, but it seems to be a different issue from the original one in that ticket, so I am moving it here.

2012-11-06T14:57:10.872333-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_cache.c:2367:osc_teardown_async_page()) extent ffff88060eedfe58@{[23 -> 23/255], [2|0|-|cache|wi|ffff88020e18f8c8], [4096|1|+|-|ffff8801f9ed9c18|256| (null)]} trunc at 23.
2012-11-06T14:57:10.872389-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8bc0[2 ffff88020cefad08:23 ^ (null)_ffff88080e8c8b00 4 0 1 (null) (null) 0x0]
2012-11-06T14:57:10.902445-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) page@ffff88080e8c8b00[1 ffff88020d8c1f58:23 ^ffff88080e8c8bc0_ (null) 4 0 1 (null) (null) 0x0]
2012-11-06T14:57:10.902494-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) vvp-page@ffff88080e8cf5a0(0:0:0) vm@ffffea001c6a99b8 e00000000000063 7:0 0 23 lru
2012-11-06T14:57:10.902508-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) lov-page@ffff8808050a7888
2012-11-06T14:57:10.958005-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) osc-page@ffff88080d78ed20: 1< 0x845fed 258 0 + - > 2< 94208 0 4096 0x0 0x520 | (null) ffff880804350700 ffff88020e18f8c8 > 3< + ffff8801f8f577b0 0 0 0 > 4< 0 0 8 33824768 - | - - + - > 5< - - + - | 0 - | 1 - ->
2012-11-06T14:57:10.958049-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) end page@ffff88080e8c8bc0
2012-11-06T14:57:10.983504-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:411:osc_page_delete()) Trying to teardown failed: -16
2012-11-06T14:57:10.983536-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) ASSERTION( 0 ) failed:
2012-11-06T14:57:10.983549-06:00 c0-0c1s6n0 LustreError: 5270:0:(osc_page.c:412:osc_page_delete()) LBUG
2012-11-06T14:57:10.983570-06:00 c0-0c1s6n0 Pid: 5270, comm: fsx-linux
2012-11-06T14:57:10.983582-06:00 c0-0c1s6n0 Call Trace:
2012-11-06T14:57:10.983605-06:00 c0-0c1s6n0 [<ffffffff810063b1>] try_stack_unwind+0x161/0x1a0
2012-11-06T14:57:11.009114-06:00 c0-0c1s6n0 [<ffffffff81004bf9>] dump_trace+0x89/0x440
2012-11-06T14:57:11.009138-06:00 c0-0c1s6n0 [<ffffffffa014e887>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2012-11-06T14:57:11.009161-06:00 c0-0c1s6n0 [<ffffffffa014ede7>] lbug_with_loc+0x47/0xc0 [libcfs]
2012-11-06T14:57:11.009173-06:00 c0-0c1s6n0 [<ffffffffa0671d41>] osc_page_delete+0x2d1/0x2e0 [osc]
2012-11-06T14:57:11.034700-06:00 c0-0c1s6n0 [<ffffffffa02b0095>] cl_page_delete0+0xd5/0x4e0 [obdclass]
2012-11-06T14:57:11.034737-06:00 c0-0c1s6n0 [<ffffffffa02b04e2>] cl_page_delete+0x42/0x120 [obdclass]
2012-11-06T14:57:11.034762-06:00 c0-0c1s6n0 [<ffffffffa07f2e2d>] ll_invalidatepage+0x8d/0x170 [lustre]
2012-11-06T14:57:11.034774-06:00 c0-0c1s6n0 [<ffffffffa07ea290>] ll_page_mkwrite+0x7c0/0x840 [lustre]
2012-11-06T14:57:11.034820-06:00 c0-0c1s6n0 [<ffffffff81107cb7>] __do_fault+0xe7/0x570
2012-11-06T14:57:11.034833-06:00 c0-0c1s6n0 [<ffffffff811081e4>] handle_pte_fault+0xa4/0xcd0
2012-11-06T14:57:11.060561-06:00 c0-0c1s6n0 [<ffffffff81108fbe>] handle_mm_fault+0x1ae/0x240
2012-11-06T14:57:11.060588-06:00 c0-0c1s6n0 [<ffffffff81025471>] do_page_fault+0x191/0x410
2012-11-06T14:57:11.060600-06:00 c0-0c1s6n0 [<ffffffff81301b5f>] page_fault+0x1f/0x30
2012-11-06T14:57:11.060647-06:00 c0-0c1s6n0 [<00000000200422b3>] 0x200422b3
2012-11-06T14:57:11.060660-06:00 c0-0c1s6n0 Kernel panic - not syncing: LBUG



 Comments   
Comment by Vitaly Fertman [ 31/Jan/13 ]

1. The ENODATA handling code in ll_page_mkwrite0 writes the page and then invalidates it, but the page could have been re-added to the cache between these two steps.
2. The lock could be busy, so we get ENODATA from the trylock, whereas a racing ll_fault waits on the lock mutex and re-adds the page to the cache.

At the same time, it is not clear why we wanted to just PEEK the lock here rather than wait for a new one. We want a writable page, thus we need a lock; if the old one is cancelled, we will have to request a new one anyway. That is:
1. If the old lock is to be cancelled and a new lock requested, the new one will be granted only after the old one is cancelled on the OST; thus the page will not get under the old lock and will not be written, will it?

2. While mkwrite finishes much faster with no new lock request, we still want to make the page writable later, and we will have to request a new lock anyway; so it is not faster in general, is it?

So the PEEK "optimisation" is unclear and troublesome, and the patch I sent drops it.

Comment by Vitaly Fertman [ 31/Jan/13 ]

http://review.whamcloud.com/5222

Comment by Andreas Dilger [ 05/Feb/13 ]

Vitaly, can you please include some information about how this problem was initially hit (e.g. test load, frequency of being hit, etc).

Comment by Wally Wang (Inactive) [ 07/Feb/13 ]

We hit this often when running fsx-linux from LTP; it usually happens within an hour or two of a stress run. We haven't seen this bug after applying this patch together with the LU-2722/LU-2723 patches fixing the DIO issues.

Comment by Vitaly Fertman [ 08/Feb/13 ]

After a talk with Jay, we decided not to change cl_lock_peek, because it may return REPEAT only for lock cancelling or a glimpse AST. At the same time, lock cancelling may take a long time and we do not sleep here, so this looping will consume CPU resources. Since after this patch cl_lock_peek is used for SOM only, it may result in the only ioepoch holder not providing the attribute update in done_writing, so the MDS re-asks for it. If there is a need to minimize the number of these RPCs, a sleeping version of cl_lock_peek should be implemented.

Comment by Keith Mannthey (Inactive) [ 08/Feb/13 ]

Are you going to drop http://review.whamcloud.com/5222 ?

Comment by Peter Jones [ 23/Feb/13 ]

Landed for 2.4

Comment by Cory Spitz [ 06/Mar/13 ]

Can someone please comment on change #5222 landing, considering the comments and questions from 08/Feb/13? Cray is seeing significant CPU spinning in 2.4 RC testing, but Wally would have to confirm whether change #5222 is the cause.

Comment by Oleg Drokin [ 06/Mar/13 ]

Vitaly's comment relates to patchset #2; he has since added patchset #3, which fixes the potential issue discussed.
As for the CPU spinning you are seeing, we need more details to make any educated guess there.

Comment by Cory Spitz [ 06/Mar/13 ]

Ah, thanks Oleg. Sure, we'll keep looking at our spin issue.

Comment by Peter Jones [ 06/Mar/13 ]

Cory

It would be best to open a new ticket with details of the CPU spinning issue that you are seeing.

Peter

Comment by Cory Spitz [ 08/May/13 ]

Thanks, it is LU-3217.

Generated at Sat Feb 10 01:27:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.