[LU-3192] LBUG:(osc_request.c:1308:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off ) Created: 18/Apr/13  Updated: 07/Nov/14  Resolved: 07/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Vitaly Fertman Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-247 Lustre client slow performance on BG/... Resolved
Severity: 3
Rank (Obsolete): 7788

 Description   

This happens on Lustre 2.3 running on SLES11 SP1, but we have also seen it on SP2 and with 2.2 (or even 2.1, with a different stack trace; see LELUS-41):

LustreError: 3821:0:(osc_request.c:819:osc_announce_cached()) dirty 361 - 362 > system dirty_max 2113536
LustreError: 3818:0:(osc_request.c:1308:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 1 p_c 151 pg ffffea0004129ce8 [pri 18446612138517887360 ind 30208] off 123731968 prev_pg ffffea000dbc0e18 [pri 0
ind 263316] off 123731968
LustreError: 3818:0:(osc_request.c:1308:osc_brw_prep_request()) LBUG
Pid: 3818, comm: doio_mpi
Call Trace:
[<ffffffff81007e59>] try_stack_unwind+0x1a9/0x200
[<ffffffff81006625>] dump_trace+0x95/0x300
[<ffffffffa016b8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
[<ffffffffa016be27>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa06969c9>] osc_brw_prep_request+0x959/0xe60 [osc]
[<ffffffffa06983bc>] osc_build_rpc+0x90c/0x1180 [osc]
[<ffffffffa06ad617>] osc_send_oap_rpc+0x3c7/0xc20 [osc]
[<ffffffffa06ae24f>] osc_io_unplug+0x3df/0x730 [osc]
[<ffffffffa06a834a>] osc_io_submit+0x1da/0x520 [osc]
[<ffffffffa02e13e8>] cl_io_submit_rw+0x78/0x190 [obdclass]
[<ffffffffa07334cd>] lov_io_submit+0x27d/0xc00 [lov]
[<ffffffffa02e13e8>] cl_io_submit_rw+0x78/0x190 [obdclass]
[<ffffffffa02e35c1>] cl_io_read_page+0xd1/0x190 [obdclass]
[<ffffffffa07ee254>] ll_readpage+0x184/0x210 [lustre]
[<ffffffff810d3ff0>] generic_file_aio_read+0x230/0x640
[<ffffffffa0819846>] vvp_io_read_start+0x1f6/0x3d0 [lustre]
[<ffffffffa02e16ca>] cl_io_start+0x6a/0x130 [obdclass]
[<ffffffffa02e597c>] cl_io_loop+0xac/0x1a0 [obdclass]
[<ffffffffa07c5c23>] ll_file_io_generic+0x353/0x530 [lustre]
[<ffffffffa07c62c8>] ll_file_aio_read+0x238/0x290 [lustre]
[<ffffffffa07c69bf>] ll_file_read+0x20f/0x2b0 [lustre]
[<ffffffff81117fe8>] vfs_read+0xc8/0x1a0
[<ffffffff811181b5>] sys_read+0x55/0x90
[<ffffffff8100305b>] system_call_fastpath+0x16/0x1b
[<0000000020139f90>] 0x20139f90
Kernel panic - not syncing: LBUG

Here are some other instances, all of which seem to hit the same page offset for the RPC:

Lustre 2.3.0-trunk-1.0000.40706.58.4-abuild-trunk 27

LUS: LBUG-ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 1 p_c 256 pg ffffea000c65a268 [pri 18446612137055183040 ind 26880] off 110100480 prev_pg ffffea000e0e73f0 [pri 0 ind 263338] off 110100480
First hit: 13/02/06 01:43:15   Cname: c0-0c0s1n2   # hits: 1   Apid/Roles: 854758 (doio_mpi)

LUS: LBUG-ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 1 p_c 256 pg ffffea00179fdfa0 [pri 18446612160482967680 ind 46909] off 192139264 prev_pg ffffea000bd34598 [pri 0 ind 263425] off 192139264
First hit: 13/02/06 02:22:40   Cname: c0-0c0s4n3   # hits: 1   Apid/Roles: 855115 (doio_mpi)

LUS: LBUG-ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 32 p_c 155 pg ffffea001bcb95e0 [pri 0 ind 264145] off 14807040 prev_pg ffffea001bff2bd0 [pri 18446612166806628160 ind 3615] off 14807040
First hit: 13/02/06 00:09:44   Cname: c0-0c0s7n2   # hits: 1   Apid/Roles: 854062 (doio_mpi)



 Comments   
Comment by Keith Mannthey (Inactive) [ 18/Apr/13 ]

What is 2.3 SP1 and SP2? Do you see this with the 2.3 branch in the git tree? Or with Master?

Comment by Peter Jones [ 18/Apr/13 ]

Keith, I think that Vitaly means SLES11 SP1 and SP2.

Comment by Vitaly Fertman [ 18/Apr/13 ]

It is not correct to kick transient pages out of the CLIO cache, because we may get two sets of pages in osc_build_req(), cacheable and transient ones, for the same offsets.

http://review.whamcloud.com/6096 - revert of LU-481
http://review.whamcloud.com/6097 - fix for the original LU-481 issue

Comment by Vitaly Fertman [ 18/Apr/13 ]

Lustre 2.3 on SLES11 SP1/SP2.

Comment by Oleg Drokin [ 22/Apr/13 ]

So can you please describe the scenario to hit this and the environment?

Comment by Vitaly Fertman [ 22/Apr/13 ]

Because transient pages are kept out of the page cache, we may get two sets of pages, cacheable and transient, for the same offsets, originating from buffered IO and DIO. Later, when preparing an RPC, such pages are gathered into the same RPC, and the assertion fires because two pages with the same offset are found.
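
For illustration, here is a minimal sketch of the check that fires, reconstructed from the assertion text in the log above. The loop structure and struct layout are assumptions for the sketch; only the LASSERTF condition itself is taken from the log:

struct brw_page {
        u64 off;                        /* file offset covered by this page */
        /* ... other fields omitted ... */
};

static void check_monotonic_offsets(struct brw_page **pga, int page_count)
{
        struct brw_page *pg_prev = NULL;
        int i;

        for (i = 0; i < page_count; i++) {
                struct brw_page *pg = pga[i];

                /* Pages must arrive in strictly increasing file-offset
                 * order; a duplicate offset means two CLIO pages, one
                 * cacheable and one transient, cover the same data. */
                LASSERTF(i == 0 || pg->off > pg_prev->off,
                         "i %d pg %p off %llu prev_pg %p off %llu\n",
                         i, pg, (unsigned long long)pg->off, pg_prev,
                         pg_prev ? (unsigned long long)pg_prev->off : 0ULL);
                pg_prev = pg;
        }
}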

Comment by Niu Yawei (Inactive) [ 23/Apr/13 ]

Actually, I just started refreshing the fix for LU-247 (http://review.whamcloud.com/#change,980) yesterday, which could fix this problem (concurrent DIO read & buffered read), but it isn't a small fix, and I'm not sure it can be merged into 2.4.

Comment by Andreas Dilger [ 24/Apr/13 ]

Vitaly, can you please tell us what the workload is that is triggering this problem? Is this a synthetic workload that is generating "difficult" IO patterns with O_DIRECT and buffered IO on the same file offset, or is this a failure seen in a real application run by a user?

I would be surprised if any real application had an IO workload that triggers this, but then again, application developers are known to do all kinds of strange things...

If this is only from a synthetic workload, then I'm inclined to drop it as a blocker for the 2.4.0 release, since this is extremely unlikely to be hit under real-world usage.

Comment by Cheng Shao (Inactive) [ 24/Apr/13 ]

Andreas, this is not some real application workload. It is from an internal stress test.

Comment by Jinshan Xiong (Inactive) [ 25/Apr/13 ]

Let's mark the transient page as special in the IO engine so that it can't be merged with caching pages.
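
A hypothetical sketch of that idea (the helper name is invented, and using OBD_BRW_NOCACHE as the "special" marker is an assumption, not necessarily how a final patch would tag transient pages):

static bool oap_can_merge(const struct osc_async_page *prev,
                          const struct osc_async_page *next)
{
        /* Assumption: transient (DIO) pages carry OBD_BRW_NOCACHE.
         * A transient page and a caching page may cover the same file
         * offset, so putting both into one bulk would violate the
         * strictly-increasing-offset invariant checked at RPC prep. */
        bool prev_transient = prev->oap_brw_flags & OBD_BRW_NOCACHE;
        bool next_transient = next->oap_brw_flags & OBD_BRW_NOCACHE;

        return prev_transient == next_transient;
}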

Comment by Alexey Lyashkov [ 26/Apr/13 ]

Jay,

Transient pages are already marked as special and live in the CLIO cache until the IO finishes, and merging with buffered pages has existed in CLIO from the beginning, so we may say that it is broken for now. Per Nikita's comments, he tried to present a flat file as an offset range of CLIO pages when it is covered by a lock. That allowed easy re-striping: just change the mapping of some offset range to a new OSC object.

From another view, the OST uses a single page cache to store DIO and buffered IO, so there is no difference (buffered IO may read DIO data without accessing the disk). So we are free to use the same view on the client side and have a single CLIO page cache to store data before sending it over the network. That way is better in case we return to the idea of p2p communication, where a client may send its own data to a new client without asking the server for any data. That would be very useful at exascale, as it extends the buffer cache dramatically.

Because of this, I think we need to return to putting transient pages into the CLIO cache, but remember that these are special pages and need to be disowned after the IO finishes. Also don't forget that we need to keep waiting until the data is committed on the OST side and not remove the request from the sending queue, otherwise we will hit LU-1573. But while the request is still in the sending queue, we are free to read from/write to the same pages, reducing network usage.

Comment by Jinshan Xiong (Inactive) [ 26/Apr/13 ]

> Transient pages are already marked as special and live in the CLIO cache until the IO finishes, and merging with buffered pages has existed in CLIO from the beginning, so we may say that it is broken for now.

It was. But we found a problem in LU-481 where a transient page can be in the cache when trying to add a caching page. We could fix it by waiting for transient pages to go away, but I didn't see any benefit there. Any ideas?

> Per Nikita's comments, he tried to present a flat file as an offset range of CLIO pages when it is covered by a lock. That allowed easy re-striping: just change the mapping of some offset range to a new OSC object.

This turned out to be too optimistic. Restriping is a case of layout change, and right now we have a complete solution for layout change. The safest way to change a file's layout is to clean up the file's page cache beforehand. Don't bother talking about performance, as changing the layout is a rare operation.

> From another view, the OST uses a single page cache to store DIO and buffered IO, so there is no difference (buffered IO may read DIO data without accessing the disk). So we are free to use the same view on the client side and have a single CLIO page cache to store data before sending it over the network.

Hmm... I tend to think DIO shouldn't pollute the cache in any way; this is why it is called Direct IO. Even on the OST side, it should go directly to disk before replying to the clients. Also, it's really dangerous to cache DIO writes on the OSTs.

Comment by Niu Yawei (Inactive) [ 27/Apr/13 ]

Alexey, we've seen quite a few problems caused by putting 'transient' pages in the radix tree:

  • Race of buffered IO vs. DIO described in this ticket & LU-481;
  • Unable to support unaligned DIO. Lustre can currently only do aligned DIO (buffer address, offset & write count must be page-size aligned); the reason we have such an
    unnecessary restriction is that we put user pages (transient pages) in the radix tree. I think unaligned DIO is very important for DIO users, and it is also very helpful
    for lockless IO (see the sketch after this list);
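
For illustration, a simplified sketch of the alignment check implied by that restriction (hypothetical helper, not the verbatim client code): the user buffer address, the file offset, and the transfer size must each be a multiple of PAGE_SIZE, because whole user pages are inserted into the radix tree as transient pages.

/* Hypothetical helper illustrating the aligned-DIO restriction;
 * not the verbatim Lustre client code. */
static int dio_check_alignment(const struct iovec *iov,
                               loff_t file_offset, size_t count)
{
        /* ~PAGE_MASK == PAGE_SIZE - 1: any low bits set in the buffer
         * address, file offset, or size mean an unaligned request. */
        if (((unsigned long)iov->iov_base | file_offset | count) &
            ~PAGE_MASK)
                return -EINVAL;         /* unaligned DIO is refused */

        return 0;
}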

All in all, DIO pages are managed by applications; I don't see any reason to store them in the radix tree (adding them to the tree before IO and removing them on IO completion only increases overhead).

I've refreshed the patch for LU-247 (http://review.whamcloud.com/#change,980), but I'm not sure it can catch the 2.4 window. I think Jinshan's idea of not merging 'transient' pages with cached pages into the same RPC could be a temporary solution if the patch can't fit into the 2.4 schedule.

Comment by Alexey Lyashkov [ 28/Apr/13 ]

> It was. But we found a problem in LU-481 where a transient page can be in the cache when trying to add a caching page. We could fix it by waiting for transient pages to go away, but I didn't see any benefit there. Any ideas?

Vitaly has a fix for LU-481; it's simple. But with that fix we avoid doubling the IO stream when mixing IO types, and (as I said before) we may read from those pages without sending network requests.

> This turned out to be too optimistic. Restriping is a case of layout change, and right now we have a complete solution for layout change. The safest way to change a file's layout is to clean up the file's page cache beforehand. Don't bother talking about performance, as changing the layout is a rare operation.

Why? If you change a layout, you just change the translation from an offset in the file to a data object and an offset within that object. The main question in that case is having a stable point at which the old layout and the new layout have the same contents.
But that is outside the scope of this ticket.

> Hmm... I tend to think DIO shouldn't pollute the cache in any way; this is why it is called Direct IO. Even on the OST side, it should go directly to disk before replying to the clients. Also, it's really dangerous to cache DIO writes on the OSTs.

By the POSIX/SUS definition, DIO needs to bypass only the page cache; I don't see anything there about RAID caches, disk drive caches, or other cache types...

http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap03.html
Direct I/O

Historically, direct I/O refers to the system bypassing intermediate buffering, but may be extended to cover implementation-defined optimizations.
Comment by Alexey Lyashkov [ 28/Apr/13 ]

> Race of buffered IO vs. DIO described in this ticket & LU-481;

This ticket isn't a race; it is the result of the "fix" in LU-481. When you kill the lookup of a buffered page on DIO access, you become able to put two CLIO pages at the same offset, and OSC sends two pages with the same offset in a single bulk.

> Unable to support unaligned DIO. Lustre can currently only do aligned DIO (buffer address, offset & write count must be page-size aligned); the reason we have such an unnecessary restriction is that we put user pages (transient pages) in the radix tree.

That is a big question. You need to read the full page anyway on the OST side and do a full-page write.
But an unaligned write may occur outside the DIO case, so you would fix a particular case and not the generic one.
Why not fix it in a different way, e.g. send the first page of the buffer as part of the LOCK ENQ request?
That would need a wire protocol change, but it would solve unaligned write access for both DIO and buffered IO.
(CLIO page release should be revised so that it can only kill pages together with the locks covering them; otherwise we have many more problems with readahead, single-page reads inside a large up-to-date region, and so on.)

As for lockless IO: lockless IO is easier to implement as direct access to the osc_brw API or in other ways, similar to the obdecho client.

> All in all, DIO pages are managed by applications; I don't see any reason to store them in the radix tree (adding them to the tree before IO and removing them on IO completion only increases overhead).

If you don't plan to have them in the radix tree, why should they be covered by a CLIO page at all? Removing these pages from CLIO would reduce CLIO's complexity (we would have a single page type).

Comment by Alexander Boyko [ 01/Jul/14 ]

http://review.whamcloud.com/#/c/10930/

Comment by Cory Spitz [ 07/Nov/14 ]

http://review.whamcloud.com/#/c/10930 landed, so should this bug be closed?

Comment by Peter Jones [ 07/Nov/14 ]

Yes it should - thanks!
