Details
- Bug
- Resolution: Fixed
- Critical
- None
- None
- 3
- 9223372036854775807
Description
Cray does not have full logs describing this problem, but the big picture looks clear.
A client node starts memory reclaim and enters ll_releasepage(), which sees that the page is not busy and has 3 vmpage references. This triggers a cl_page_delete() call, which removes the page from its own page tree and moves it to the CPS_FREEING state. This is fine for kernels < 2.6.37.
But kernel 2.6.37 introduced a different way to free pages: the ->freepage callback.
>>
commit 6072d13c429373c5d63b69dadbbef40a9b035552
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed Dec 1 13:35:19 2010 -0500
Call the filesystem back whenever a page is removed from the page cache
>>
It was introduced because remove_mapping() can refuse to remove a page from the page cache when the page refcount != 2 or the page is dirty (PageDirty). Because the page is already in the CPS_FREEING state, cl_page_own() fails to own the page in the blocking AST, and the code expects someone else to free the page, but no one does. The result is a stale page with the uptodate flag set left in the page cache, where it can be read through the fast read code path.
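The sequence above can be sketched as a small toy model. This is not Lustre or kernel code: the class and function names only mirror the identifiers mentioned in the description, and the reference counting is a deliberate simplification (in particular, the assumption that cl_page_delete drops one vmpage reference is mine, not taken from the logs).

```python
from dataclasses import dataclass
from enum import Enum, auto


class ClPageState(Enum):
    CACHED = auto()
    FREEING = auto()  # stands in for CPS_FREEING


@dataclass
class Page:
    refcount: int = 3            # 3 vmpage references, as seen in ll_releasepage
    dirty: bool = False
    uptodate: bool = True
    in_page_cache: bool = True
    state: ClPageState = ClPageState.CACHED


def release_page(page: Page) -> bool:
    """Toy ll_releasepage: a page with at most 3 references is treated as not busy."""
    if page.refcount > 3:
        return False
    page.state = ClPageState.FREEING  # effect of cl_page_delete
    page.refcount -= 1                # assumption: the cl_page drops its vmpage reference
    return True


def remove_mapping(page: Page) -> bool:
    """Toy remove_mapping: refuses unless refcount == 2 and the page is clean."""
    if page.refcount != 2 or page.dirty:
        return False
    page.in_page_cache = False
    return True


def cl_page_own(page: Page) -> bool:
    """Toy cl_page_own: a page already in the freeing state cannot be owned."""
    return page.state is not ClPageState.FREEING


def simulate(racing_reference: bool) -> Page:
    """Run reclaim followed by lock cancel, optionally with a racing page reference."""
    page = Page()
    if release_page(page):
        if racing_reference:
            page.refcount += 1        # concurrent page access takes a reference
        remove_mapping(page)
    # Lock cancel (blocking AST) discards the page only if it can own it.
    if page.in_page_cache and cl_page_own(page):
        page.in_page_cache = False
    return page


def is_stale(page: Page) -> bool:
    """A stale page: still cached and uptodate, but its cl_page is being freed."""
    return page.in_page_cache and page.uptodate and page.state is ClPageState.FREEING
```

In this model, `simulate(racing_reference=False)` removes the page from the cache during reclaim, while `simulate(racing_reference=True)` leaves an uptodate page behind that neither reclaim nor lock cancel frees.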
Some existing logs:
>>>
00000008:00100000:10.0:1615300198.692889:0:4147:0:(osc_cache.c:3288:osc_page_gang_lookup()) vvp-page@ffff8800310524e0(1:1) vm@ffffea000119bdd0 10000000000002c 4:0 0 82094 lru
bad
00000008:00100000:10.0:1615300198.692873:0:4147:0:(osc_cache.c:3279:osc_page_gang_lookup()) vvp-page@ffff8800310520e0(1:1) vm@ffffea000119be08 10000000000002c 3:0 0 82095 lru
good
>>>
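To compare the two entries above, the vmpage fields can be pulled out with a short script. This is a sketch that assumes the fields after `vm@<address>` are, in order, the page flags, `refcount:mapcount`, the private word, and the page index; that layout is inferred from the two sample lines and should be checked against the Lustre version in use. The `parse_vmpage` helper is hypothetical.

```python
import re

# Assumed layout: vm@<addr> <flags> <refcount>:<mapcount> <private> <index>
VMPAGE_RE = re.compile(
    r"vm@(?P<vmpage>[0-9a-f]+)\s+(?P<flags>[0-9a-f]+)\s+"
    r"(?P<refcount>\d+):(?P<mapcount>\d+)\s+"
    r"(?P<private>[0-9a-f]+)\s+(?P<index>\d+)"
)

BAD = ("00000008:00100000:10.0:1615300198.692889:0:4147:0:"
       "(osc_cache.c:3288:osc_page_gang_lookup()) vvp-page@ffff8800310524e0(1:1) "
       "vm@ffffea000119bdd0 10000000000002c 4:0 0 82094 lru")
GOOD = ("00000008:00100000:10.0:1615300198.692873:0:4147:0:"
        "(osc_cache.c:3279:osc_page_gang_lookup()) vvp-page@ffff8800310520e0(1:1) "
        "vm@ffffea000119be08 10000000000002c 3:0 0 82095 lru")


def parse_vmpage(line):
    """Return the vmpage fields of a debug log line as a dict, or None if absent."""
    m = VMPAGE_RE.search(line)
    if m is None:
        return None
    return {
        "vmpage": m.group("vmpage"),
        "flags": m.group("flags"),
        "refcount": int(m.group("refcount")),
        "mapcount": int(m.group("mapcount")),
        "index": int(m.group("index")),
    }
```

Under that assumption, the bad page (index 82094) carries one more reference than the good page (index 82095), with identical flags.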
Other logs show this is a race between lock cancel (osc_gang_lookup) and kswapd, so the one extra vmpage reference most likely caused the failure.
Based on a crash dump taken a second later, the page has two references, so we likely have a race with page access.
Issue Links
- is related to
  - LU-16156 stale read during IOR test due LU-14541 (Open)
  - LU-12587 DIO fallback to Buffer IO unexpectedly (Reopened)
  - LU-15815 fast_read/stale data/reclaim workround causes SIGBUS (Resolved)
  - LU-16160 take ldlm lock when queue sync pages (Resolved)
  - LU-15819 Executables run from Lustre may receive spurious SIGBUS signals (Closed)
- is related to
  - LU-8633 SIGBUS under memory pressure with fast_read enabled (Resolved)