- Cray doesn't have full logs describing this problem, but the big picture looks clear.
The client node starts memory reclaim and enters ll_releasepage, which sees that the page is not busy and has 3 vmpage references. This triggers a cl_page_delete call, which removes the page from its own page tree and moves it to the CPS_FREEING state. That is fine for kernels < 2.6.37.
But 2.6.37 introduced a different way to free a page: the ->freepage callback.
>>
commit 6072d13c429373c5d63b69dadbbef40a9b035552
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed Dec 1 13:35:19 2010 -0500
Call the filesystem back whenever a page is removed from the page cache
>>
It was introduced because remove_mapping() can refuse to remove a page from the page cache when the page refcount != 2 or the page is dirty. Since the page is already in the CPS_FREEING state, cl_page_own fails to own the page in the blocking AST, and the code expects someone else to free the page, but no one does. Oops: a stale page with the uptodate flag set stays in the page cache, where it can be read through the fast read code path.
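The sequence above can be sketched as a small userspace model. This is only an illustration under simplifying assumptions: every `model_*` name is hypothetical and stands in for the kernel or Lustre function mentioned in the text, `refcount` means the count as seen at the moment remove_mapping() runs, and locking and the real data structures are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; none of these are the real kernel or Lustre types. */
enum model_cl_state { MODEL_CPS_CACHED, MODEL_CPS_FREEING };

struct model_page {
    int  refcount;       /* reference count when remove_mapping() runs */
    bool dirty;          /* models PageDirty */
    bool uptodate;       /* models the uptodate flag */
    bool in_page_cache;  /* still reachable through the page cache */
    enum model_cl_state cl_state;
};

/* Models cl_page_delete(): the cl_page leaves its tree and becomes FREEING. */
static void model_cl_page_delete(struct model_page *pg)
{
    assert(pg != NULL);
    pg->cl_state = MODEL_CPS_FREEING;
}

/* Models the releasepage step: tear down the cl_page if it is not busy. */
static bool model_releasepage(struct model_page *pg, bool busy)
{
    if (busy)
        return false;
    model_cl_page_delete(pg);
    return true;
}

/* Models remove_mapping(): refuses unless refcount == 2 and the page is clean. */
static bool model_remove_mapping(struct model_page *pg)
{
    if (pg->refcount != 2 || pg->dirty)
        return false;
    pg->in_page_cache = false;
    return true;
}

/* Models cl_page_own() in the blocking AST: a FREEING page cannot be owned. */
static bool model_cl_page_own(const struct model_page *pg)
{
    return pg->cl_state != MODEL_CPS_FREEING;
}

/*
 * Runs reclaim followed by lock cancel.
 * Returns true if a stale uptodate page is left in the page cache.
 */
static bool model_reclaim_then_cancel(struct model_page *pg)
{
    /* Reclaim: release the page first, then try to drop it from the cache. */
    if (model_releasepage(pg, false))
        model_remove_mapping(pg);

    /* Lock cancel: can only discard the page if it can still be owned. */
    if (pg->in_page_cache && model_cl_page_own(pg)) {
        pg->uptodate = false;
        pg->in_page_cache = false;
    }
    return pg->in_page_cache && pg->uptodate;
}
```

With `refcount == 2` the model removes the page and nothing stale remains. With one extra reference, `model_remove_mapping()` refuses, the cl_page is already FREEING, the lock cancel path cannot own it, and the uptodate page stays in the cache.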
Some existing logs:
>>>
00000008:00100000:10.0:1615300198.692889:0:4147:0:(osc_cache.c:3288:osc_page_gang_lookup()) vvp-page@ffff8800310524e0(1:1) vm@ffffea000119bdd0 10000000000002c 4:0 0 82094 lru
bad
00000008:00100000:10.0:1615300198.692873:0:4147:0:(osc_cache.c:3279:osc_page_gang_lookup()) vvp-page@ffff8800310520e0(1:1) vm@ffffea000119be08 10000000000002c 3:0 0 82095 lru
good
>>>
Other logs show that this is a race between lock cancel (osc_page_gang_lookup) and kswapd,
so one extra vmpage reference very likely caused the failure.
According to the crash dump taken a second later, the page has two references,
so we most likely have a race with page access.
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50599/
Subject: Revert "LU-14541 llite: Check vmpage in releasepage"
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 72b5be5ccc1c58ae6edc968fa9106d53578aeccb