[LU-17364] osc_page_delete LBUG - trying to delete a page under write Created: 14/Dec/23 Updated: 26/Jan/24 Resolved: 23/Jan/24 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Zhenyu Xu | Assignee: | Alexey Lyashkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Client crashed on osc_page_delete, and the page is waiting for write [2281253.531369] LustreError: 81367:0:(osc_cache.c:2558:osc_teardown_async_page()) extent ffff883c74c7b810@
{[28680 -> 28680/32767], [2|0|-|cache|wi|ffff883ba0bbde00], [28672|1|+|-|ffff884a03601b00|4096| (null)]}
trunc at 28680.
[2281253.553678] LustreError: 81367:0:(osc_cache.c:2558:osc_teardown_async_page()) ### extent: ffff883c74c7b810 ns: euscrat-OST0004-osc-ffff887a7da55000 lock: ffff884a03601b00/0xb3e57269e5f70d90 lrc: 12/0,1 mode: PW/PW res: [0x480000402:0x1beede4b:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x800020000020000 nid: local remote: 0x93336ee709e18f0b expref: -99 pid: 81367 timeout: 0 lvb_type: 1 l_ast_data: 0000000000000000
[2281253.597632] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) page@ffff8841ea116400[3 ffff883e49200d70 4 1 (null)]
[2281253.613495] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) vvp-page@ffff8841ea116458(1:0) vm@fffff967dd5ace00 2fffff00000835 2:0 ffff8841ea116400 28680 lru
[2281253.613499] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) lov-page@ffff8841ea116498, gen: 0
[2281253.613511] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) osc-page@ffff8841ea1164d0 28680: 1< 0x845fed 2 0 + - > 2< 117473280 0 4096 0x0 0x420 | (null) ffff88622eda49b0 ffff883ba0bbde00 > 3< 0 0 0 > 4< 0 0 16 242233344 - | - - + - > 5< - - + - | 0 - | 3648 - ->
[2281253.613512] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) end page@ffff8841ea116400
[2281253.613514] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) Trying to teardown failed: -16
[2281253.613515] LustreError: 81367:0:(osc_page.c:192:osc_page_delete()) ASSERTION( 0 ) failed:
[2281253.613516] LustreError: 81367:0:(osc_page.c:192:osc_page_delete()) LBUG
[2281253.613518] Pid: 81367, comm: julia 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023
[2281253.613518] Call Trace:
[2281253.613549] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[2281253.613560] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[2281253.613568] [<0>] osc_page_delete+0x47e/0x4b0 [osc]
[2281253.613592] [<0>] cl_page_delete0+0x80/0x220 [obdclass]
[2281253.613602] [<0>] cl_page_delete+0x33/0x110 [obdclass]
[2281253.613618] [<0>] ll_invalidatepage+0x87/0x180 [lustre]
[2281253.613634] [<0>] do_invalidatepage_range+0x7d/0x90
[2281253.613642] [<0>] truncate_inode_page+0x7f/0x90
[2281253.613643] [<0>] generic_error_remove_page+0x2a/0x40
[2281253.613652] [<0>] vvp_page_discard+0x5e/0xd0 [lustre]
[2281253.613663] [<0>] cl_page_discard+0x4b/0x70 [obdclass]
[2281253.613675] [<0>] cl_page_list_discard+0x56/0x160 [obdclass]
[2281253.613682] [<0>] ll_io_read_page+0x3f5/0x890 [lustre]
[2281253.613688] [<0>] ll_readpage+0xe6/0x820 [lustre]
[2281253.613693] [<0>] filemap_fault+0x1f8/0x420
[2281253.613699] [<0>] ll_filemap_fault+0x39/0x70 [lustre]
[2281253.613706] [<0>] vvp_io_fault_start+0x5fa/0xe50 [lustre]
[2281253.613718] [<0>] cl_io_start+0x70/0x140 [obdclass]
[2281253.613729] [<0>] cl_io_loop+0x9f/0x200 [obdclass]
[2281253.613735] [<0>] ll_fault+0x52d/0x8a0 [lustre]
[2281253.613746] [<0>] __do_fault.isra.61+0x8a/0x100
[2281253.613754] [<0>] do_shared_fault.isra.64+0x4c/0x280
[2281253.613758] [<0>] handle_mm_fault+0x459/0x1190
[2281253.613765] [<0>] __do_page_fault+0x213/0x510
[2281253.613766] [<0>] do_page_fault+0x35/0x90
|
| Comments |
| Comment by Ian Costello [ 18/Dec/23 ] |
|
This looks identical to |
| Comment by Gerrit Updater [ 22/Dec/23 ] |
|
"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53533 |
| Comment by Alexey Lyashkov [ 22/Dec/23 ] |
|
In fact, this is not a bug in read. And page might don't have dirty flag once it processed by osc_make_ready and vvp_make_ready have clear this flag, but oap/extent still in rpc state which caused a bug. 00000080:00800000:1.0:1702501537.785719:0:24553:0:(llite_mmap.c:235:ll_page_mkwrite0()) fsx mkwrite with 0 00000080:00800008:1.0:1702501537.785731:0:24553:0:(llite_mmap.c:406:ll_fault()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12833 00000080:00000001:1.0:1702501537.785734:0:24553:0:(llite_mmap.c:304:ll_fault0()) Process entered 00000080:00200000:1.0:1702501537.785743:0:24553:0:(rw.c:1885:ll_readpage()) fast read pgno: 12833 00000080:00000001:1.0:1702501537.785757:0:24553:0:(llite_mmap.c:337:ll_fault0()) Process leaving via out (rc=0 : 0 : 0x0) 00000080:00800000:1.0:1702501537.785760:0:24553:0:(llite_mmap.c:380:ll_fault0()) fsx fault 512/0 00000080:00000001:1.0:1702501537.785761:0:24553:0:(llite_mmap.c:381:ll_fault0()) Process leaving (rc=512 : 512 : 200) Page installed into page cache, mmap write started 00000080:00800008:1.0:1702501537.785764:0:24553:0:(llite_mmap.c:486:ll_page_mkwrite()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12833 00000080:00000001:1.0:1702501537.785769:0:24553:0:(llite_mmap.c:170:ll_page_mkwrite0()) Process entered 00000080:00000001:1.0:1702501537.785774:0:24553:0:(llite_mmap.c:107:ll_fault_io_init()) Process entered 00000080:00800000:1.0:1702501537.785775:0:24553:0:(llite_mmap.c:129:ll_fault_io_init()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12833 00000080:00000001:1.0:1702501537.785805:0:24553:0:(llite_mmap.c:155:ll_fault_io_init()) Process leaving (rc=18446612688480406592 : -131385229145024 : ffff88817e168c40) 00000080:00000001:3.0:1702501537.785972:0:24541:0:(rw26.c:170:do_release_page()) Process entered 00000008:00000040:3.0:1702501537.785985:0:24541:0:(osc_cache.c:2492:osc_teardown_async_page()) teardown oap 000000004bf2105f page 000000008d8d09d3 at index 12833. 00000080:00000001:3.0:1702501537.786004:0:24541:0:(rw26.c:213:do_release_page()) Process leaving (rc=1 : 1 : 1) page release by drop cache, but vmpage had an extra references and don't freed. 00000080:00000001:1.0:1702501537.786026:0:24553:0:(vvp_io.c:1483:vvp_io_fault_start()) Process entered 00000020:00008000:1.0:1702501537.786048:0:24553:0:(cl_page.c:359:cl_page_find()) 12833@[0x200000401:0x10:0x0] 000000000a4501fb 0 1 00000008:00000040:1.0:1702501537.786074:0:24553:0:(osc_cache.c:2289:osc_prep_async_page()) oap 00000000ca649d88 vmpage 000000000a4501fb obj off 52563968 Ops. vmpage old, but no cl page attached - so allocate new, vmpage locked, mkwrite starts io 00000008:00000020:1.0:1702501537.786117:0:24553:0:(osc_cache.c:2370:osc_queue_async_io()) obj 00000000d9a2a1d1 ready 0|-|- wr 9|-|- rd 0|- oap 00000000ca649d88 page 000000000a4501fb added for cmd 2 00000008:00000020:1.0:1702501537.786128:0:24553:0:(osc_cache.c:1388:osc_consume_write_grant()) using 4096 grant credits for brw 00000000513e54ab page 000000000a4501fb dirty flag cleared, vmpage lock released. 00000080:00000001:1.0:1702501537.786248:0:24553:0:(vvp_io.c:1637:vvp_io_fault_start()) Process leaving 00000080:00000001:1.0:1702501537.786489:0:24553:0:(llite_mmap.c:229:ll_page_mkwrite0()) Process leaving 00000080:00800000:1.0:1702501537.786513:0:24553:0:(llite_mmap.c:235:ll_page_mkwrite0()) fsx mkwrite with 0 00000080:00800008:1.0:1702501537.786527:0:24553:0:(llite_mmap.c:406:ll_fault()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12832 00000080:00000001:1.0:1702501537.786530:0:24553:0:(llite_mmap.c:304:ll_fault0()) Process entered 00000080:00000001:1.0:1702501537.786549:0:24553:0:(llite_mmap.c:107:ll_fault_io_init()) Process entered 00000080:00800000:1.0:1702501537.786550:0:24553:0:(llite_mmap.c:129:ll_fault_io_init()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12832 00000080:00000001:1.0:1702501537.786581:0:24553:0:(llite_mmap.c:155:ll_fault_io_init()) Process leaving (rc=18446612688480406592 : -131385229145024 : ffff88817e168c40) once page uptodate flag cleared by cl_page_delete early, mmap read start to read this page from network... OOOPS 00000080:00000001:1.0:1702501537.786787:0:24553:0:(vvp_io.c:1483:vvp_io_fault_start()) Process entered 00000020:00008000:1.0:1702501537.787727:0:24553:0:(cl_page.c:359:cl_page_find()) 12833@[0x200000401:0x10:0x0] 000000000a4501fb ffff88817fcefa00 1 00000008:00000020:1.0:1702501537.788082:0:24553:0:(osc_io.c:178:osc_io_submit()) Busy oap 00000000ca649d88 page 000000007e1a659d for submit. Busy OAP -> return -EAGAIN to submit IO and page under delete in the ll_io_read_page. That's ALL. Panic. |
| Comment by Gerrit Updater [ 25/Dec/23 ] |
|
"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53550 |
| Comment by Dominika Wanat [ 09/Jan/24 ] |
|
Hi, We are stuck with the same problem with Lustre based on the b2_15 branch. Are you planning to backport the abovementioned patches in the near future? Best, Dominika Wanat |
| Comment by Patrick Farrell [ 12/Jan/24 ] |
|
Dominika, Do you have any details on your setup? Was there any particular change you made that seemed linked to the problem occurring, for example, a kernel update? I ask because the underlying bug has been present for a while and we're trying to track if there's a reason it's now happening more often. |
| Comment by Dominika Wanat [ 12/Jan/24 ] |
|
@Patrick, next week, we will prepare a complete history of updates, patches, and bug occurrences from the client and server sides. This bug is also nothing new for us - we have been experiencing it on the client nodes since the beginning of April 2022 (and kernel 4.18.0-348.20.1.el8_5.x86_64). Now, we are using clients with kernel 4.18.0-477.27.1.el8_8.x86_64, and the issues seem to occur more often despite the latest application of |
| Comment by Alexey Lyashkov [ 12/Jan/24 ] |
|
@Patrik, this issue exist for any lustre where up2date bit cleared in vvp_page_delete and this is related to the page reclaim. |
| Comment by Gerrit Updater [ 23/Jan/24 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53550/ |
| Comment by Peter Jones [ 23/Jan/24 ] |
|
Landed for 2.16 |