[LU-16156] stale read during IOR test due to LU-14541 Created: 14/Sep/22  Updated: 01/Dec/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Alexey Lyashkov Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14541 Memory reclaim caused a stale data read Resolved
is related to LU-16160 take ldlm lock when queue sync pages Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It looks like LU-14541 has returned and the last patch isn't enough.
Reproduced on cray-2.15-int with the latest LU-14541 patch applied.

00000080:00010000:3.0:1661448429.922359:0:17969:0:(file.c:5884:ll_layout_lock_set()) ### file [0x200044d2e:0x22f:0x0](000000008945c220) being reconfigured ns: kjcf04-MDT0000-mdc-ffff88879388d000 lock: 00000000265351e4/0x2ea4f3791074a3c7 lrc: 3/1,0 mode: CR/CR res: [0x200044d2e:0x22f:0x0].0x0 bits 0x8/0x0 rrc: 2 type: IBT gid 0 flags: 0x0 nid: local remote: 0x76fc642919c90d07 expref: -99 pid: 17969 timeout: 0 lvb_type: 3
00000080:00200000:3.0:1661448429.922913:0:17969:0:(file.c:5779:ll_layout_conf()) [0x200044d2e:0x22f:0x0]: layout version change: 4294967294 -> 15
DoM (Data-on-MDT) lock:
00010000:00010000:3.0:1661448429.926955:0:17969:0:(ldlm_lock.c:902:ldlm_lock_decref_internal()) ### add lock into lru list ns: kjcf04-MDT0000-mdc-ffff88879388d000 lock: 00000000ca0cb036/0x2ea4f3791074a3e3 lrc: 3/0,0 mode: PR/PR res: [0x200044d2e:0x22f:0x0].0x0 bits 0x40/0x0 rrc: 3 type: IBT gid 0 flags: 0x800020000000000 nid: local remote: 0x76fc642919c90d3f expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.926982:0:17969:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 3/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [0->0] (req 0->0) gid 0 flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.926985:0:17969:0:(ldlm_request.c:1028:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x0 ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 3/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [1048576->4194303] (req 1048576->4194303) gid 0 flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.926994:0:17969:0:(ldlm_request.c:1115:ldlm_cli_enqueue()) ### sending request ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 3/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [1048576->4194303] (req 1048576->4194303) gid 0 flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00010000:00010000:3.0:1661448429.927377:0:17969:0:(ldlm_resource.c:1686:ldlm_resource_add_lock()) ### About to add this lock ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 4/1,0 mode: --/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [1048576->4194303] (req 1048576->4194303) gid 0 flags: 0x10000000020000 nid: local remote: 0x10ff82d82b217a69 expref: -99 pid: 17969 timeout: 0 lvb_type: 1
00000080:00200000:3.0:1661448430.113932:0:17969:0:(vvp_io.c:312:vvp_io_fini()) [0x200044d2e:0x22f:0x0] ignore/verify layout 0/0, layout version 15 need write layout 0, restore needed 0
00000008:00100000:17.0:1661448430.236630:0:10311:0:(osc_request.c:1804:osc_brw_prep_request()) brw rpc 00000000ff43cadf - object 0x0:633777469 offset 104857600<>109056000
00000100:00100000:23.0:1661448430.292928:0:10312:0:(client.c:2220:ptlrpc_check_set()) Completed RPC req@00000000ff43cadf pname:cluuid:pid:xid:nid:opc:job:rc ptlrpcd_00_04:2f7f0639-3f88-4f3f-9faf-e0b16a082f10:10312:1741723660248256:10.17.100.60@o2ib:3::0/4194304
00000080:00200000:3.0:1661448430.310692:0:17969:0:(vvp_io.c:847:vvp_io_read_start()) IORfile_4m.00000011: read [100663296, 104857600)
00000080:00200008:3.0:1661448430.321303:0:17969:0:(file.c:1997:ll_file_read_iter()) file IORfile_4m.00000011:[0x200044d2e:0x22f:0x0], ppos: 104857600, count: 4194304
00010000:00010000:3.0:1661448430.314974:0:17969:0:(ldlm_lock.c:909:ldlm_lock_decref_internal()) ### do not add lock into lru list ns: kjcf04-OST0005-osc-ffff88879388d000 lock: 00000000a8693f35/0x2ea4f3791074a3ff lrc: 5/1,0 mode: PR/PR res: [0x25c6ad3d:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 1048576->4194303) gid 0 flags: 0x810020000020000 nid: local remote: 0x10ff82d82b217a69 expref: -99 pid: 17969 timeout: 0 lvb_type: 1

A new LDLM lock was requested and granted, but the BRW RPC was created with a single-page hole inside it.
That caused a stale read; the first patch, which clears the page's uptodate flag, helps, and stale data is no longer reported.
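
To illustrate that mitigation, here is a minimal userspace model of the idea behind the first patch: clear the uptodate flag so any stale page that survives in the pagecache is re-fetched instead of served. All names below are illustrative stand-ins, not the actual Lustre code.

#include <stdbool.h>
#include <stdio.h>

struct page {
    bool uptodate;    /* models the kernel PG_uptodate flag */
};

/* Model of the first patch: when the cl_page is torn down but the vmpage
 * may survive in the pagecache, stop trusting its contents. */
static void drop_stale_page(struct page *pg)
{
    pg->uptodate = false;
}

static void read_page(const struct page *pg)
{
    if (pg->uptodate)
        printf("served from pagecache (stale data if the hole was real)\n");
    else
        printf("not uptodate: re-read from the OST\n");
}

int main(void)
{
    struct page pg = { .uptodate = true };

    drop_stale_page(&pg);   /* what the first patch effectively does */
    read_page(&pg);         /* now forces a re-read instead of stale data */
    return 0;
}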



 Comments   
Comment by Alexey Lyashkov [ 01/Dec/22 ]

In fact, the problem is simple while the fix is not.

CLIO breaks the kernel MM protocol for memory reclaim, and the ll_releasepage checks are racy and buggy. This bug has existed since the beginning of CLIO.
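
For reference, the reclaim contract as I read it; a kernel-style sketch only (pre-folio ->releasepage() naming), not the actual Lustre code and not standalone-buildable.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * shrink_page_list() invokes the filesystem's ->releasepage() hook via
 * try_to_release_page(); returning 1 promises that the private state is
 * gone and no hidden reference remains.  Only then does __remove_mapping()
 * try to freeze the refcount at the expected value (2 for an order-0 page:
 * the isolation ref plus the pagecache ref).  Returning from the hook with
 * the cl_page still pinning the vmpage breaks that promise.
 */
static int example_releasepage(struct page *page, gfp_t gfp)
{
	if (!page_has_private(page))
		return 1;	/* nothing of ours attached, safe to release */

	/* A correct hook must drop the private state *and* the reference it
	 * pins, or refuse with 0 so reclaim leaves the page alone. */
	return 0;
}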

CLIO holds a vmpage reference until the cl_page is released (i.e., until vvp_page_fini). This creates a situation where memory reclaim or a layout change needs to flush pages while a blocking AST arrives at the same time: osc_page_gang_lookup holds a cl_page reference while memory reclaim starts to delete the page, and the existing check is racy, so ll_releasepage exits after cl_page_delete with an extra vmpage reference still held.
As a result, __remove_mapping fails to remove the page from the mapping.
On the other side, the blocking AST handler is unable to run cl_page_own because the page was moved into CPS_FREEING by cl_page_delete.
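
To make the interleaving concrete, a deterministic userspace model of the race follows; the names are simplified stand-ins for the Lustre structures, not the real API.

#include <stdbool.h>
#include <stdio.h>

struct cl_page {
    int  refcount;
    bool freeing;               /* models CPS_FREEING */
};

struct vmpage {
    int  refcount;              /* __remove_mapping wants exactly 2 */
    bool in_mapping;
    struct cl_page *clpage;     /* a live cl_page pins one vmpage ref */
};

/* BL AST thread: osc_page_gang_lookup() takes a cl_page reference. */
static void bl_ast_gang_lookup(struct cl_page *cp)
{
    cp->refcount++;
}

/* Reclaim thread: ll_releasepage() -> cl_page_delete().  The page enters
 * CPS_FREEING, but the BL AST reference keeps the cl_page (and the vmpage
 * ref it pins) alive, and the racy check misses that. */
static void reclaim_releasepage(struct vmpage *vp)
{
    vp->clpage->freeing = true;
}

/* Reclaim thread: __remove_mapping() refuses the page because the
 * refcount is higher than expected, so it stays in the pagecache. */
static bool remove_mapping(struct vmpage *vp)
{
    if (vp->refcount != 2)
        return false;
    vp->in_mapping = false;
    return true;
}

int main(void)
{
    struct cl_page cp = { .refcount = 1, .freeing = false };
    struct vmpage  vp = { .refcount = 3, .in_mapping = true, .clpage = &cp };

    bl_ast_gang_lookup(&cp);       /* BL AST holds an extra cl_page ref */
    reclaim_releasepage(&vp);      /* ll_releasepage "succeeds" anyway */
    printf("__remove_mapping: %s\n",
           remove_mapping(&vp) ? "removed" : "failed, page stays cached");

    /* The BL AST side cannot cl_page_own() a CPS_FREEING page either,
     * so the flush is skipped and the uptodate vmpage lingers. */
    printf("BL AST flush: %s\n", cp.freeing ? "skipped (CPS_FREEING)" : "done");
    return 0;
}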

That's all: the VM page is still in the page cache, and any read may pick up this stale data.

This is easy to check by adding an assert at the end of ll_releasepage that verifies the refcount, and then running the dual-mount fsx test from sanityN.
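
Roughly what that debug check could look like, as a userspace model; in the real client it would be an LASSERT on the vmpage refcount at the end of ll_releasepage() in rw26.c, and everything below is illustrative.

#include <assert.h>
#include <stdio.h>

struct vmpage { int refcount; };

/* Model of ll_releasepage(): after the cl_page is detached, only the
 * caller's ref and the pagecache ref should remain on the vmpage. */
static int release_page(struct vmpage *vp, int cl_page_pin_dropped)
{
    vp->refcount -= cl_page_pin_dropped;
    /* The suggested assertion: catches a BL AST thread still pinning the
     * page through its cl_page reference. */
    assert(vp->refcount <= 2 && "extra vmpage ref survived releasepage");
    return 1;
}

int main(void)
{
    struct vmpage vp = { .refcount = 3 };

    /* Healthy path: the cl_page pin is dropped inside releasepage. */
    release_page(&vp, 1);
    printf("ok: refcount=%d\n", vp.refcount);

    /* Racy path: the BL AST still holds the cl_page, the pin is not
     * dropped, and the assertion fires (as it would under dual-mount fsx). */
    struct vmpage racy = { .refcount = 3 };
    release_page(&racy, 0);
    return 0;
}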

00000008:00000001:9.0:1666910059.632700:0:5016:0:(osc_cache.c:3088:osc_page_gang_lookup()) Process entered
but the cl_page ref was held here: the BL AST entered and was then
interrupted by ll_releasepage, i.e., the cache flush.
00000020:00000001:8.0:1666910059.632703:0:11668:0:(cl_page.c:545:cl_vmpage_page()) Process leaving (rc=18446624413482391544 : -119660227160072 : ffff932b6eaa97f8)
00000020:00000001:8.0:1666910059.632708:0:11668:0:(cl_page.c:444:cl_page_state_set0()) page@ffff932b6eaa97f8[3 ffff932b5cd3f2b0 1 1 0000000000000000]
00000020:00000001:8.0:1666910059.632709:0:11668:0:(cl_page.c:445:cl_page_state_set0()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0002015 count 3 priv ffff932b6eaa97f8:
00000020:00000001:8.0:1666910059.633545:0:11668:0:(cl_page.c:489:cl_pagevec_put()) page@ffff932b6eaa97f8[2 ffff932b5cd3f2b0 5 1 0000000000000000]
00000020:00000001:8.0:1666910059.633546:0:11668:0:(cl_page.c:490:cl_pagevec_put()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0000015 count 3 priv 0:
00000080:00008000:8.0:1666910059.633548:0:11668:0:(rw26.c:175:ll_releasepage()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0000015 count 3 priv 0: clpage ffff932b6eaa97f8 : 1
ll_releasepage exits and expects the cl_page to be freed, but a reference is still held by the BL AST thread,
so the vmpage still has 3 refs while __remove_mapping wants 2.
00000020:00000001:9.0:1666910059.642999:0:5016:0:(cl_page.c:489:cl_pagevec_put()) page@ffff932b6eaa97f8[1 ffff932b5cd3f2b0 5 1 0000000000000000]
00000020:00000001:9.0:1666910059.643000:0:5016:0:(cl_page.c:490:cl_pagevec_put()) page fffff2cc04e941c0 map ffff932c62810218 index 82632 flags 17ffffc0000014 count 2 priv 0:
00000020:00000010:9.0:1666910059.643003:0:5016:0:(cl_page.c:178:__cl_page_free()) slab-freed 'cl_page': 472 at ffff932b6eaa97f8.
cl_page freed -> vmpage ref released; the vmpage is now at 2 refs and could be removed from the pagecache, but nothing attempts to do it, and the uptodate page remains in the pagecache.