
LBUG:(osc_request.c:1308:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.7.0
    • None
    • 3
    • 7788

    Description

      This happens on Lustre 2.3 running with SP1, but we have seen it on SP2 and 2.2 (and even 2.1, with a different stack trace; see LELUS-41):

      LustreError: 3821:0:(osc_request.c:819:osc_announce_cached()) dirty 361 - 362 > system dirty_max 2113536
      LustreError: 3818:0:(osc_request.c:1308:osc_brw_prep_request()) ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 1 p_c 151 pg ffffea0004129ce8 [pri 18446612138517887360 ind 30208] off 123731968 prev_pg ffffea000dbc0e18 [pri 0 ind 263316] off 123731968
      LustreError: 3818:0:(osc_request.c:1308:osc_brw_prep_request()) LBUG
      Pid: 3818, comm: doio_mpi
      Call Trace:
      [<ffffffff81007e59>] try_stack_unwind+0x1a9/0x200
      [<ffffffff81006625>] dump_trace+0x95/0x300
      [<ffffffffa016b8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      [<ffffffffa016be27>] lbug_with_loc+0x47/0xb0 [libcfs]
      [<ffffffffa06969c9>] osc_brw_prep_request+0x959/0xe60 [osc]
      [<ffffffffa06983bc>] osc_build_rpc+0x90c/0x1180 [osc]
      [<ffffffffa06ad617>] osc_send_oap_rpc+0x3c7/0xc20 [osc]
      [<ffffffffa06ae24f>] osc_io_unplug+0x3df/0x730 [osc]
      [<ffffffffa06a834a>] osc_io_submit+0x1da/0x520 [osc]
      [<ffffffffa02e13e8>] cl_io_submit_rw+0x78/0x190 [obdclass]
      [<ffffffffa07334cd>] lov_io_submit+0x27d/0xc00 [lov]
      [<ffffffffa02e13e8>] cl_io_submit_rw+0x78/0x190 [obdclass]
      [<ffffffffa02e35c1>] cl_io_read_page+0xd1/0x190 [obdclass]
      [<ffffffffa07ee254>] ll_readpage+0x184/0x210 [lustre]
      [<ffffffff810d3ff0>] generic_file_aio_read+0x230/0x640
      [<ffffffffa0819846>] vvp_io_read_start+0x1f6/0x3d0 [lustre]
      [<ffffffffa02e16ca>] cl_io_start+0x6a/0x130 [obdclass]
      [<ffffffffa02e597c>] cl_io_loop+0xac/0x1a0 [obdclass]
      [<ffffffffa07c5c23>] ll_file_io_generic+0x353/0x530 [lustre]
      [<ffffffffa07c62c8>] ll_file_aio_read+0x238/0x290 [lustre]
      [<ffffffffa07c69bf>] ll_file_read+0x20f/0x2b0 [lustre]
      [<ffffffff81117fe8>] vfs_read+0xc8/0x1a0
      [<ffffffff811181b5>] sys_read+0x55/0x90
      [<ffffffff8100305b>] system_call_fastpath+0x16/0x1b
      [<0000000020139f90>] 0x20139f90
      Kernel panic - not syncing: LBUG
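      The assertion fires on the invariant that page offsets within one BRW RPC must be strictly increasing. A minimal sketch of that check (simplified; `struct brw_page` here is a stand-in with only the `off` field, not the full Lustre definition) shows why two pages at the same offset, as in the crash above (both at off 123731968), trip it:

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Simplified stand-in for one page of a bulk RPC. */
      struct brw_page {
              unsigned long long off;   /* byte offset of this page in the object */
      };

      /* Sketch of the ordering invariant checked in osc_brw_prep_request():
       * every page after the first must start at a strictly higher offset
       * than its predecessor.  A duplicate offset fails pg->off > pg_prev->off
       * and triggers the LBUG. */
      static void check_page_ordering(struct brw_page **pga, int page_count)
      {
              const struct brw_page *pg_prev = NULL;

              for (int i = 0; i < page_count; i++) {
                      const struct brw_page *pg = pga[i];

                      assert(i == 0 || pg->off > pg_prev->off);
                      pg_prev = pg;
              }
      }
      ```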

      Here are some other instances as well; all of them seem to hit two pages at the same offset in the RPC:

      Lustre 2.3.0-trunk-1.0000.40706.58.4-abuild-trunk 27

      First hit Cname # hits Apid/Roles
      LUS: LBUG-ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 1 p_c 256 pg ffffea000c65a268 [pri 18446612137055183040 ind 26880] off 110100480 prev_pg ffffea000e0e73f0 [pri 0 ind 263338] off 110100480
      13/02/06 01:43:15 c0-0c0s1n2 1 854758

      854758 doio_mpi

      LUS: LBUG-ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 1 p_c 256 pg ffffea00179fdfa0 [pri 18446612160482967680 ind 46909] off 192139264 prev_pg ffffea000bd34598 [pri 0 ind 263425] off 192139264
      13/02/06 02:22:40 c0-0c0s4n3 1 855115

      855115 doio_mpi

      LUS: LBUG-ASSERTION( i == 0 || pg->off > pg_prev->off ) failed: i 32 p_c 155 pg ffffea001bcb95e0 [pri 0 ind 264145] off 14807040 prev_pg ffffea001bff2bd0 [pri 18446612166806628160 ind 3615] off 14807040
      13/02/06 00:09:44 c0-0c0s7n2 1 854062

      854062 doio_mpi

      Attachments

      Issue Links

      Activity

            pjones Peter Jones added a comment -

            Yes it should - thanks!

            spitzcor Cory Spitz added a comment -

            http://review.whamcloud.com/#/c/10930 landed, so should this bug be closed?

            aboyko Alexander Boyko added a comment - http://review.whamcloud.com/#/c/10930/

            shadow Alexey Lyashkov added a comment -

            > Race of buffered io vs. dio described in this ticket & LU-481;

            That ticket isn't a race; it is the result of the "fix" in LU-481. Once you remove the lookup of a buffered page for DIO access, it becomes possible to put two CLIO pages at the same offset, and OSC then sends two pages with the same offset in a single bulk.

            > Unable to support unaligned DIO. Lustre now can only do aligned DIO (buffer address, offset & write count must be page size aligned), the reason we have such unnecessary

            That is a big question. You need to read the full page on the OST side anyway and do a full-page write, but an unaligned write may occur outside the DIO case, so you would be fixing a particular case rather than the generic one. Why not fix it a different way, e.g. send the first page of the buffer as part of the LOCK ENQ request? That would require a wire protocol change, but it would solve both DIO and buffered unaligned write access. (CLIO page release should also be revised so that only pages covered by a lock can be killed; otherwise we have many more problems with readahead, single-page reads inside a large up-to-date region, and so on.)

            As for lockless IO: lockless IO is easier to implement as direct access to the osc_brw API (or similar), much like the obdecho client does.

            > All in all, DIO pages are managed by applications, I don't see any reason to store them in the radix tree. (add them in the tree before io then remove on io completion looks only increase overhead).

            If you don't plan to keep them in the radix tree, why should they be covered by a CLIO page at all? Removing these pages from CLIO would reduce CLIO's complexity (we would have a single page type).


            shadow Alexey Lyashkov added a comment -

            > It was. But we found a problem in LU-481 where a transient page can be in cache when trying to add a caching page. We could fix it by awaiting transient pages to go away but I didn't see any benefits there. Any idea?

            Vitaly has a fix for LU-481; it's simple. With that fix we avoid doubling the IO stream when mixing IO types and, as I said before, we may read from those pages without sending network requests.

            > This turns out too optimistic. Restriping is a case of layout change and right now we have a complete solution for layout change. The safest way to change a file's layout is to clean up file's page cache in prior. Don't bother talking about performance as changing layout is a rare operation

            Why? If you change a layout, you just change the translation from a file offset to a data object and an offset within that object. The main question in that case is having a stable point at which the old and new layouts have the same contents. But that is outside the scope of this ticket.

            > hmm.. I tend to think DIO shouldn't pollute cache in anyways - this is why it is called Direct IO. Even on the OST side, it should go directly into disks before replying the clients. Also it's really dangerous to cache write of DIO on the OSTs.

            By the POSIX/SUS definition, DIO only needs to bypass the page cache; it says nothing about RAID cache, disk drive cache, or other cache types:

            http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap03.html
            Direct I/O

            Historically, direct I/O refers to the system bypassing intermediate buffering, but may be extended to cover implementation-defined optimizations.
            

            niu Niu Yawei (Inactive) added a comment -

            Alexey, we've seen quite a few problems caused by putting 'transient' pages in the radix tree:

            • Race of buffered io vs. dio described in this ticket & LU-481;
            • Unable to support unaligned DIO. Lustre can currently only do aligned DIO (buffer address, offset & write count must be page-size aligned); the reason we have such an unnecessary restriction is that we put user pages (transient pages) in the radix tree. I think unaligned DIO is very important for DIO users, and it is also very helpful for lockless IO.

            All in all, DIO pages are managed by applications; I don't see any reason to store them in the radix tree (adding them to the tree before IO and then removing them on IO completion looks like it only increases overhead).

            I've refreshed the patch of LU-247 (http://review.whamcloud.com/#change,980), but I'm not sure it can catch the 2.4 window. I think Jinshan's idea of not merging a 'transient' page with a cached page into the same RPC could be a temporary solution if the patch can't fit the 2.4 schedule.
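            The aligned-DIO restriction mentioned above can be illustrated with a small check (a hypothetical helper, not Lustre code; `dio_is_aligned` and the fixed 4 KiB `PAGE_SIZE` are assumptions for illustration): buffer address, file offset, and byte count must all be page-size multiples for the DIO path to proceed.

            ```c
            #include <stdbool.h>
            #include <stdint.h>

            #define PAGE_SIZE 4096ULL

            /* Hypothetical helper illustrating the aligned-DIO restriction:
             * user buffer address, file offset, and transfer size must all be
             * multiples of the page size, because each transient page in the
             * radix tree must map exactly one whole page of the file. */
            static bool dio_is_aligned(const void *buf, uint64_t offset, uint64_t count)
            {
                    return ((uintptr_t)buf % PAGE_SIZE == 0) &&
                           (offset % PAGE_SIZE == 0) &&
                           (count % PAGE_SIZE == 0);
            }
            ```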


            jay Jinshan Xiong (Inactive) added a comment -

            > transient pages already marked as special and live in CLIO cache until we have IO finished, and merging with buffered pages exist in CLIO from beginning, so we may say - that is broken for now.

            It was. But we found a problem in LU-481 where a transient page can be in the cache when trying to add a caching page. We could fix it by waiting for transient pages to go away, but I didn't see any benefit there. Any idea?

            > Per Nikita`s comments - he tried to present a flat file as offset range with clio pages if it's covered by lock. That allowed to re-stripe it easy just a change mapping a some offset range to new osc object.

            This turns out to be too optimistic. Restriping is a case of layout change, and right now we have a complete solution for layout change. The safest way to change a file's layout is to clean up the file's page cache beforehand. Don't bother talking about performance, as changing the layout is a rare operation.

            > from other view - OST uses a single page cache to store DIO and buffered IO so we have no differences (buffered IO may read DIO data without access to disk). so we have free to use same view on client side, have a single CLIO page cache to store a data before send over network.

            Hmm... I tend to think DIO shouldn't pollute the cache in any way; this is why it is called Direct IO. Even on the OST side, it should go directly to disk before replying to the clients. It's also really dangerous to cache DIO writes on the OSTs.


            shadow Alexey Lyashkov added a comment -

            Jay,

            Transient pages are already marked as special and live in the CLIO cache until the IO is finished, and merging with buffered pages has existed in CLIO from the beginning, so we may say that is broken for now. Per Nikita's comments, he tried to present a flat file as an offset range with CLIO pages when it is covered by a lock. That would allow easy re-striping: just change the mapping of some offset range to a new OSC object.

            From another view, the OST uses a single page cache to store DIO and buffered IO, so there is no difference (buffered IO may read DIO data without going to disk). So we are free to use the same view on the client side: a single CLIO page cache to store data before sending it over the network. That way is better if we return to the idea of p2p communication, where a client may send its own data to another client without asking the server for it. That would be very useful at exascale, as it extends the buffer cache dramatically.

            Because of this, I think we need to go back to putting transient pages into the CLIO cache, but remember that these are special pages that need to be disowned after the IO finishes. And don't forget that we need to keep waiting until that data is committed on the OST side and not remove the request from the sending queue prematurely; otherwise we will hit LU-1573. But while the request is in the sending queue, we are free to read/write the same pages, reducing network usage.


            jay Jinshan Xiong (Inactive) added a comment -

            Let's mark the transient page as special in the IO engine so that it can't be merged with caching pages.
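            The proposed rule can be sketched as a merge predicate for the RPC being built (a simplified illustration, not Lustre's actual RPC engine; `struct osc_page` here is a stand-in carrying only a type and an offset, though `CPT_CACHEABLE`/`CPT_TRANSIENT` echo the CLIO page types discussed above):

            ```c
            #include <stdbool.h>
            #include <stddef.h>

            enum cl_page_type { CPT_CACHEABLE, CPT_TRANSIENT };

            struct osc_page {
                    enum cl_page_type type;
                    unsigned long long off;
            };

            /* Sketch of the proposed rule: a page may be appended to the RPC
             * under construction only if it keeps offsets strictly increasing
             * AND matches the type of the pages already queued, so transient
             * (DIO) pages are never merged with cached pages into one bulk. */
            static bool can_merge_into_rpc(const struct osc_page *queued_last,
                                           const struct osc_page *next)
            {
                    if (queued_last == NULL)
                            return true;               /* empty RPC accepts anything */
                    if (next->off <= queued_last->off)
                            return false;              /* preserve ordering invariant */
                    return next->type == queued_last->type;
            }
            ```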


            cheng_shao Cheng Shao (Inactive) added a comment -

            Andreas, this is not a real application workload; it is from an internal stress test.


            People

              wc-triage WC Triage
              vitaly_fertman Vitaly Fertman
              Votes: 0
              Watchers: 12
