[LU-17364] osc_page_delete LBUG - trying to delete a page under write Created: 14/Dec/23  Updated: 26/Jan/24  Resolved: 23/Jan/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Major
Reporter: Zhenyu Xu Assignee: Alexey Lyashkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Client crashed on osc_page_delete, and the page is waiting for write

[2281253.531369] LustreError: 81367:0:(osc_cache.c:2558:osc_teardown_async_page()) extent ffff883c74c7b810@{[28680 -> 28680/32767], [2|0|-|cache|wi|ffff883ba0bbde00], [28672|1|+|-|ffff884a03601b00|4096| (null)]} trunc at 28680.
[2281253.553678] LustreError: 81367:0:(osc_cache.c:2558:osc_teardown_async_page()) ### extent: ffff883c74c7b810 ns: euscrat-OST0004-osc-ffff887a7da55000 lock: ffff884a03601b00/0xb3e57269e5f70d90 lrc: 12/0,1 mode: PW/PW res: [0x480000402:0x1beede4b:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x800020000020000 nid: local remote: 0x93336ee709e18f0b expref: -99 pid: 81367 timeout: 0 lvb_type: 1 l_ast_data: 0000000000000000
[2281253.597632] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) page@ffff8841ea116400[3 ffff883e49200d70 4 1 (null)]

[2281253.613495] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) vvp-page@ffff8841ea116458(1:0) vm@fffff967dd5ace00 2fffff00000835 2:0 ffff8841ea116400 28680 lru

[2281253.613499] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) lov-page@ffff8841ea116498, gen: 0

[2281253.613511] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) osc-page@ffff8841ea1164d0 28680: 1< 0x845fed 2 0 + - > 2< 117473280 0 4096 0x0 0x420 | (null) ffff88622eda49b0 ffff883ba0bbde00 > 3< 0 0 0 > 4< 0 0 16 242233344 - | - - + - > 5< - - + - | 0 - | 3648 - ->

[2281253.613512] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) end page@ffff8841ea116400

[2281253.613514] LustreError: 81367:0:(osc_page.c:191:osc_page_delete()) Trying to teardown failed: -16
[2281253.613515] LustreError: 81367:0:(osc_page.c:192:osc_page_delete()) ASSERTION( 0 ) failed:
[2281253.613516] LustreError: 81367:0:(osc_page.c:192:osc_page_delete()) LBUG
[2281253.613518] Pid: 81367, comm: julia 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023
[2281253.613518] Call Trace:
[2281253.613549] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[2281253.613560] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[2281253.613568] [<0>] osc_page_delete+0x47e/0x4b0 [osc]
[2281253.613592] [<0>] cl_page_delete0+0x80/0x220 [obdclass]
[2281253.613602] [<0>] cl_page_delete+0x33/0x110 [obdclass]
[2281253.613618] [<0>] ll_invalidatepage+0x87/0x180 [lustre]
[2281253.613634] [<0>] do_invalidatepage_range+0x7d/0x90
[2281253.613642] [<0>] truncate_inode_page+0x7f/0x90
[2281253.613643] [<0>] generic_error_remove_page+0x2a/0x40
[2281253.613652] [<0>] vvp_page_discard+0x5e/0xd0 [lustre]
[2281253.613663] [<0>] cl_page_discard+0x4b/0x70 [obdclass]
[2281253.613675] [<0>] cl_page_list_discard+0x56/0x160 [obdclass]
[2281253.613682] [<0>] ll_io_read_page+0x3f5/0x890 [lustre]
[2281253.613688] [<0>] ll_readpage+0xe6/0x820 [lustre]
[2281253.613693] [<0>] filemap_fault+0x1f8/0x420
[2281253.613699] [<0>] ll_filemap_fault+0x39/0x70 [lustre]
[2281253.613706] [<0>] vvp_io_fault_start+0x5fa/0xe50 [lustre]
[2281253.613718] [<0>] cl_io_start+0x70/0x140 [obdclass]
[2281253.613729] [<0>] cl_io_loop+0x9f/0x200 [obdclass]
[2281253.613735] [<0>] ll_fault+0x52d/0x8a0 [lustre]
[2281253.613746] [<0>] __do_fault.isra.61+0x8a/0x100
[2281253.613754] [<0>] do_shared_fault.isra.64+0x4c/0x280
[2281253.613758] [<0>] handle_mm_fault+0x459/0x1190
[2281253.613765] [<0>] __do_page_fault+0x213/0x510
[2281253.613766] [<0>] do_page_fault+0x35/0x90


 Comments   
Comment by Ian Costello [ 18/Dec/23 ]

This looks identical to LU-12752?

Comment by Gerrit Updater [ 22/Dec/23 ]

"Zhenyu Xu <bobijam@hotmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53533
Subject: LU-17364 llite: do not discard dirty pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1b7657791d03f8d548eef53ef335cfad21652b3a

Comment by Alexey Lyashkov [ 22/Dec/23 ]

In fact, this is not a bug in the read path. A page may no longer carry the dirty flag once it has been processed by osc_make_ready, because vvp_make_ready clears that flag, yet the oap/extent is still in RPC state; that mismatch is what causes the bug.
The bug reproduces easily with an fsx run plus sysctl -w vm.drop_caches=3 in parallel (a minimal helper for the drop_caches side is sketched below): it takes less than half an hour, typically about 10 minutes.
This is a race between page reclaim, mkwrite, and any read (read page / readahead / etc.).
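
A sketch only, not part of the original report: a minimal userspace helper for the drop_caches side of the reproducer (the fsx invocation is elided, and the one-second interval is an assumption).

#include <stdio.h>
#include <unistd.h>

/* Run as root alongside fsx on a Lustre-backed file: repeatedly drops
 * pagecache, dentries and inodes, equivalent to issuing
 * "sysctl -w vm.drop_caches=3" in a loop. */
int main(void)
{
	for (;;) {
		FILE *f = fopen("/proc/sys/vm/drop_caches", "w");

		if (f == NULL) {
			perror("drop_caches");
			return 1;
		}
		fputs("3\n", f);
		fclose(f);
		sleep(1);	/* assumed interval */
	}
}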

00000080:00800000:1.0:1702501537.785719:0:24553:0:(llite_mmap.c:235:ll_page_mkwrite0()) fsx mkwrite with 0
00000080:00800008:1.0:1702501537.785731:0:24553:0:(llite_mmap.c:406:ll_fault()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12833
00000080:00000001:1.0:1702501537.785734:0:24553:0:(llite_mmap.c:304:ll_fault0()) Process entered
00000080:00200000:1.0:1702501537.785743:0:24553:0:(rw.c:1885:ll_readpage()) fast read pgno: 12833
00000080:00000001:1.0:1702501537.785757:0:24553:0:(llite_mmap.c:337:ll_fault0()) Process leaving via out (rc=0 : 0 : 0x0)
00000080:00800000:1.0:1702501537.785760:0:24553:0:(llite_mmap.c:380:ll_fault0()) fsx fault 512/0
00000080:00000001:1.0:1702501537.785761:0:24553:0:(llite_mmap.c:381:ll_fault0()) Process leaving (rc=512 : 512 : 200)
The page is installed into the page cache and the mmap write starts.

00000080:00800008:1.0:1702501537.785764:0:24553:0:(llite_mmap.c:486:ll_page_mkwrite()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12833
00000080:00000001:1.0:1702501537.785769:0:24553:0:(llite_mmap.c:170:ll_page_mkwrite0()) Process entered
00000080:00000001:1.0:1702501537.785774:0:24553:0:(llite_mmap.c:107:ll_fault_io_init()) Process entered
00000080:00800000:1.0:1702501537.785775:0:24553:0:(llite_mmap.c:129:ll_fault_io_init()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12833
00000080:00000001:1.0:1702501537.785805:0:24553:0:(llite_mmap.c:155:ll_fault_io_init()) Process leaving (rc=18446612688480406592 : -131385229145024 : ffff88817e168c40)


00000080:00000001:3.0:1702501537.785972:0:24541:0:(rw26.c:170:do_release_page()) Process entered
00000008:00000040:3.0:1702501537.785985:0:24541:0:(osc_cache.c:2492:osc_teardown_async_page()) teardown oap 000000004bf2105f page 000000008d8d09d3 at index 12833.
00000080:00000001:3.0:1702501537.786004:0:24541:0:(rw26.c:213:do_release_page()) Process leaving (rc=1 : 1 : 1)

The page is released by drop_caches, but the vmpage still has extra references and is not freed.
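
A simplified sketch of this window, using generic kernel page-cache helpers rather than the real drop_caches call chain (which goes through invalidate_mapping_pages()); the point is only that detaching the filesystem's private state and removing the page from the mapping does not free a page whose reference count is still elevated.

#include <linux/mm.h>
#include <linux/pagemap.h>

static void reclaim_side(struct page *page)
{
	lock_page(page);
	/* Ask the filesystem to detach its private state; on Lustre
	 * this ends up in do_release_page() tearing down the oap. */
	if (!page_has_private(page) || try_to_release_page(page, GFP_KERNEL))
		delete_from_page_cache(page);
	unlock_page(page);
	/* A concurrent fault may still hold a reference, so
	 * page_count(page) stays above zero: the vmpage is now stale
	 * but not freed. */
}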


00000080:00000001:1.0:1702501537.786026:0:24553:0:(vvp_io.c:1483:vvp_io_fault_start()) Process entered
00000020:00008000:1.0:1702501537.786048:0:24553:0:(cl_page.c:359:cl_page_find()) 12833@[0x200000401:0x10:0x0] 000000000a4501fb 0 1
00000008:00000040:1.0:1702501537.786074:0:24553:0:(osc_cache.c:2289:osc_prep_async_page()) oap 00000000ca649d88 vmpage 000000000a4501fb obj off 52563968
Oops: the vmpage is the old one, but no cl_page is attached to it any more, so a new one is allocated; the vmpage is locked and mkwrite starts I/O.
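
In pseudo-C, the lookup behaviour at this step; both helpers below are hypothetical stand-ins for the cl_page_find() internals, not real Lustre functions.

#include <linux/mm.h>	/* struct page, pgoff_t */

struct cl_object;	/* Lustre clio types, declarations only */
struct cl_page;

struct cl_page *vmpage_to_cl_page(struct page *vmpage);	/* hypothetical */
struct cl_page *alloc_and_attach_cl_page(struct cl_object *obj,
					 struct page *vmpage,
					 pgoff_t idx);		/* hypothetical */

static struct cl_page *lookup_sketch(struct cl_object *obj,
				     struct page *vmpage, pgoff_t idx)
{
	/* The stale vmpage is still in the page cache, but its cl_page
	 * was detached by do_release_page() above. */
	struct cl_page *cp = vmpage_to_cl_page(vmpage);

	if (cp == NULL)
		cp = alloc_and_attach_cl_page(obj, vmpage, idx);
	return cp;
}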

00000008:00000020:1.0:1702501537.786117:0:24553:0:(osc_cache.c:2370:osc_queue_async_io()) obj 00000000d9a2a1d1 ready 0|-|- wr 9|-|- rd 0|- oap 00000000ca649d88 page 000000000a4501fb added for cmd 2
00000008:00000020:1.0:1702501537.786128:0:24553:0:(osc_cache.c:1388:osc_consume_write_grant()) using 4096 grant credits for brw 00000000513e54ab page 000000000a4501fb

The dirty flag is cleared and the vmpage lock is released.
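
The effect of this step, sketched with the standard kernel helper that make-ready paths use (that vvp_make_ready uses exactly this call is an assumption; the Lustre-side state transition is elided).

#include <linux/mm.h>
#include <linux/pagemap.h>

/* After this, the vmpage no longer looks dirty to the MM, but the
 * oap/extent is still in RPC ("write in flight") state, so page flags
 * alone no longer show that the page is busy. */
static void make_ready_sketch(struct page *vmpage)
{
	lock_page(vmpage);
	clear_page_dirty_for_io(vmpage);	/* requires the page lock */
	/* oap/extent moves to RPC state here (Lustre-internal, elided) */
	unlock_page(vmpage);
}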

00000080:00000001:1.0:1702501537.786248:0:24553:0:(vvp_io.c:1637:vvp_io_fault_start()) Process leaving
00000080:00000001:1.0:1702501537.786489:0:24553:0:(llite_mmap.c:229:ll_page_mkwrite0()) Process leaving
00000080:00800000:1.0:1702501537.786513:0:24553:0:(llite_mmap.c:235:ll_page_mkwrite0()) fsx mkwrite with 0
00000080:00800008:1.0:1702501537.786527:0:24553:0:(llite_mmap.c:406:ll_fault()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12832
00000080:00000001:1.0:1702501537.786530:0:24553:0:(llite_mmap.c:304:ll_fault0()) Process entered
00000080:00000001:1.0:1702501537.786549:0:24553:0:(llite_mmap.c:107:ll_fault_io_init()) Process entered
00000080:00800000:1.0:1702501537.786550:0:24553:0:(llite_mmap.c:129:ll_fault_io_init()) [0x200000401:0x10:0x0]: vma=000000005cc36916 start=0x7f24a1f83000 end=0x7f24a1f8b000 vm_flags=0x80000fb idx=12832
00000080:00000001:1.0:1702501537.786581:0:24553:0:(llite_mmap.c:155:ll_fault_io_init()) Process leaving (rc=18446612688480406592 : -131385229145024 : ffff88817e168c40)

Once the page's uptodate flag has been cleared by the earlier cl_page_delete, the mmap read starts fetching this page from the network... oops.
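
The hazard in miniature (a sketch, not the actual vvp code; start_network_read() is a hypothetical stand-in for the readpage path).

#include <linux/mm.h>

int start_network_read(struct page *vmpage);	/* hypothetical */

/* Delete side: clearing PG_uptodate is the problematic step. */
static void delete_side(struct page *vmpage)
{
	ClearPageUptodate(vmpage);
}

/* Fault/read side: with PG_uptodate gone, the cached contents are no
 * longer trusted, so a network read is issued for a page whose write
 * RPC is still in flight. */
static int fault_side(struct page *vmpage)
{
	if (!PageUptodate(vmpage))
		return start_network_read(vmpage);
	return 0;
}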

00000080:00000001:1.0:1702501537.786787:0:24553:0:(vvp_io.c:1483:vvp_io_fault_start()) Process entered
00000020:00008000:1.0:1702501537.787727:0:24553:0:(cl_page.c:359:cl_page_find()) 12833@[0x200000401:0x10:0x0] 000000000a4501fb ffff88817fcefa00 1
00000008:00000020:1.0:1702501537.788082:0:24553:0:(osc_io.c:178:osc_io_submit()) Busy oap 00000000ca649d88 page 000000007e1a659d for submit.

The busy oap makes the I/O submit return -EAGAIN, and the page is then deleted in ll_io_read_page. That's all it takes: panic.
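
The final step condensed into a sketch; discard_and_delete() is a hypothetical stand-in for the cl_page_discard()/cl_page_delete() sequence performed in ll_io_read_page().

#include <linux/errno.h>

struct cl_page;
void discard_and_delete(struct cl_page *page);	/* hypothetical */

/* osc_io_submit() finds the oap still queued for the in-flight write
 * and fails the read submit with -EAGAIN; the read path then discards
 * the page. The discard reaches osc_page_delete(), where
 * osc_teardown_async_page() returns -EBUSY (-16) and ASSERTION(0)
 * fires: the LBUG shown in the description. */
static void read_page_sketch(struct cl_page *page, int submit_rc)
{
	if (submit_rc == -EAGAIN)
		discard_and_delete(page);	/* pre-fix behaviour */
}

Per its subject, the merged patch 53550 ("don't use stale page") stops the fault path from reusing the stale vmpage, so this discard no longer races with the write still in flight.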
Comment by Gerrit Updater [ 25/Dec/23 ]

"Alexey Lyashkov <alexey.lyashkov@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53550
Subject: LU-17364 llite: don't use stale page.
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c056724ef52cf68b7680a20e8e9a6da0848b3ef5

Comment by Dominika Wanat [ 09/Jan/24 ]

Hi, 

We are hitting the same problem with Lustre based on the b2_15 branch. Are you planning to backport the above-mentioned patches in the near future?

Best, 

Dominika Wanat

Comment by Patrick Farrell [ 12/Jan/24 ]

Dominika,

Do you have any details on your setup?  Was there any particular change you made that seemed linked to the problem occurring, for example, a kernel update?  I ask because the underlying bug has been present for a while and we're trying to track if there's a reason it's now happening more often.

Comment by Dominika Wanat [ 12/Jan/24 ]

@Patrick, next week we will prepare a complete history of updates, patches, and bug occurrences from the client and server sides. This bug is also nothing new for us: we have been experiencing it on client nodes since the beginning of April 2022 (then on kernel 4.18.0-348.20.1.el8_5.x86_64). Now we are using clients with kernel 4.18.0-477.27.1.el8_8.x86_64, and the issue seems to occur more often despite our recent application of LU-16043. Interestingly, the presented LBUG is always caused by ams.exe, a part of ADF, but not every job using ADF triggers the LBUG...

Comment by Alexey Lyashkov [ 12/Jan/24 ]

@Patrick,

This issue exists in any Lustre version where the uptodate bit is cleared in vvp_page_delete; it is related to page reclaim.

Comment by Gerrit Updater [ 23/Jan/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53550/
Subject: LU-17364 llite: don't use stale page.
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dad3bed7617fba895db169facde91856e89c2b08

Comment by Peter Jones [ 23/Jan/24 ]

Landed for 2.16
