Details

Type: Bug
Resolution: Unresolved
Priority: Medium
Fix Version/s: None
Affects Version/s: None
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

cl_batch_put() ASSERTION(page->cp_state == CPS_FREEING) fails in the
write completion path (brw_interpret -> osc_extent_finish ->
osc_completion -> cl_page_put -> cl_page_batch_put).

Root cause: osc_completion() cleared ops_transfer_pinned directly
(bypassing osc_page_transfer_put()) before cl_page_complete() but
deferred the actual cl_page_put() until after cl_page_complete()
returned. This split the flag-clear from the ref-drop, creating a
race window:

1. On CPU A, osc_completion() clears ops_transfer_pinned directly
2. cl_page_complete() sets CPS_CACHED, calls end_page_writeback()
3. Page is now reclaimable. CPU B enters do_release_page() ->
cl_page_delete() -> osc_page_delete() -> osc_page_transfer_put()
4. osc_page_transfer_put() sees flag already 0, skips cl_page_put()
5. vvp_page_delete() drops cache ref
6. Back on CPU A, cl_page_put() drops the transfer pin, which is now
the last ref
7. cl_batch_put() fires on CPU A. On weakly-ordered architectures
(aarch64), CPU B's CPS_FREEING store from step 3 may not yet be
visible to CPU A -> LBUG

In short, the bug came from manipulating ops_transfer_pinned directly
instead of going through the osc_page_transfer_put() accessor, which
keeps the flag and the cl_page reference in sync.
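
The interleaving above can be replayed in a single-threaded toy model
to see which path ends up dropping the last reference. This is only a
sketch: the toy_* names, the ACTOR_* values and the two-reference
starting state are assumptions made for illustration, not Lustre
code, and memory-ordering effects are not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors: who performed a given reference drop. */
enum toy_actor { ACTOR_NONE, ACTOR_COMPLETION, ACTOR_DELETE };

/* Toy stand-in for a cl_page; NOT the real structure. */
struct toy_page {
	int refs;                   /* models the page refcount */
	bool transfer_pinned;       /* models ops_transfer_pinned */
	enum toy_actor last_put_by; /* who dropped the final ref */
};

/* Drop one reference and record who dropped the last one. */
static void toy_put(struct toy_page *p, enum toy_actor who)
{
	assert(p->refs > 0);
	if (--p->refs == 0)
		p->last_put_by = who;
}

/* Models the accessor: clear the flag and drop the ref together. */
static bool toy_transfer_put(struct toy_page *p, enum toy_actor who)
{
	if (!p->transfer_pinned)
		return false;       /* flag already clear: nothing to drop */
	p->transfer_pinned = false;
	toy_put(p, who);
	return true;
}

/* Models the delete path on the other CPU (steps 3 to 5). */
static void toy_delete(struct toy_page *p)
{
	toy_transfer_put(p, ACTOR_DELETE); /* no-op if flag is clear */
	toy_put(p, ACTOR_DELETE);          /* cache ref */
}

/* Buggy ordering: flag cleared early, ref dropped late. */
static enum toy_actor replay_buggy(void)
{
	struct toy_page p = { 2, true, ACTOR_NONE }; /* cache + transfer */

	p.transfer_pinned = false;     /* step 1: direct flag clear */
	toy_delete(&p);                /* steps 3-5: accessor skips put */
	toy_put(&p, ACTOR_COMPLETION); /* step 6: becomes the last ref */
	return p.last_put_by;
}

/* Fixed ordering: flag and ref change together via the accessor. */
static enum toy_actor replay_fixed(void)
{
	struct toy_page p = { 2, true, ACTOR_NONE };

	toy_transfer_put(&p, ACTOR_COMPLETION);
	toy_delete(&p);
	return p.last_put_by;
}
```

In this model replay_buggy() reports the completion path as the one
dropping the last reference, while replay_fixed() reports the delete
path, i.e. the same path that moved the page to CPS_FREEING.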

Fix: Do not clear ops_transfer_pinned directly. Call
osc_page_transfer_put() after cl_page_complete(), which clears the
flag and drops the ref together. The transfer pin reference also
keeps cl_page_in_use() returning true, which prevents concurrent
reclaim until the ref is dropped.

Also add documentation on ops_transfer_pinned warning that it must
only be managed through osc_page_transfer_get/put accessors, since
the flag is paired with a cl_page reference.
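
As a rough illustration of the accessor pairing and of the kind of
field comment described above, a minimal sketch follows. The
sketch_* names are hypothetical, and the use of C11 atomic exchange
to make the flag transition exclusive is an assumption of this
sketch; the real osc_page_transfer_get/put may rely on different
serialization.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page; not the Lustre definition. */
struct sketch_page {
	atomic_int refs;
	/*
	 * Set while a transfer holds one reference in @refs.
	 * Must only be changed through sketch_transfer_get() and
	 * sketch_transfer_put(): the flag and the reference are a
	 * pair, and clearing the flag alone leaves the reference
	 * without an owner that knows to drop it.
	 */
	atomic_bool transfer_pinned;
};

/* Take the transfer reference and set the flag. */
static void sketch_transfer_get(struct sketch_page *p)
{
	atomic_fetch_add(&p->refs, 1);
	bool was_pinned = atomic_exchange(&p->transfer_pinned, true);

	assert(!was_pinned); /* one transfer pin at a time */
	(void)was_pinned;
}

/*
 * Clear the flag and drop the reference as one step. Returns true
 * only for the caller that actually observed the flag set, so at
 * most one caller drops the paired reference.
 */
static bool sketch_transfer_put(struct sketch_page *p)
{
	if (!atomic_exchange(&p->transfer_pinned, false))
		return false;
	atomic_fetch_sub(&p->refs, 1);
	return true;
}
```

With this shape a second sketch_transfer_put() on the same page is a
harmless no-op, and no caller can observe the flag clear while the
paired reference is still waiting to be dropped by someone else.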

Confirmed via vmcore analysis from an aarch64 system (256 CPUs):
cl_page had cp_state=CPS_FREEING at dump time (set by a concurrent
thread AFTER the assertion fired), cp_ref=0, and vmpage PG_private
already clear (vvp_page_delete completed on another CPU).

Component: osc

People

Assignee: WC Triage
Reporter: Patrick Farrell
Votes: 0
Watchers: 2

Dates

Created: 09/Mar/26 8:33 PM
Updated: 10/Mar/26 3:06 PM