[LU-14541] Memory reclaim caused a stale data read Created: 22/Mar/21  Updated: 20/May/23  Resolved: 20/May/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0, Lustre 2.15.3

Type: Bug Priority: Critical
Reporter: Alexey Lyashkov Assignee: Patrick Farrell
Resolution: Fixed Votes: 0
Labels: None

Attachments: File LU-14541-performance-results.tar.gz     Microsoft Word LU-14541-performance-results.xlsx     File rw_seq_cst_vs_drop_caches.c    
Issue Links:
Duplicate
Related
is related to LU-8633 SIGBUS under memory pressure with fas... Resolved
is related to LU-16156 stale read during IOR test due LU-14541 Open
is related to LU-12587 DIO fallback to Buffer IO unexpectedly Reopened
is related to LU-15815 fast_read/stale data/reclaim workroun... Resolved
is related to LU-16160 take ldlm lock when queue sync pages Resolved
is related to LU-15819 Executables run from Lustre may recei... Closed
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
  1. Cray doesn't have full logs describing this problem, but the big picture looks clear.
    A client node starts memory reclaim and enters ll_releasepage(), where the page is seen as not busy but holding 3 vmpage references. That causes a cl_page_delete() call, which removes the page from its own page tree and moves it to the CPS_FREEING state. This is fine for kernels < 2.6.37.
    But 2.6.37 introduced a different way to free a page: the ->freepage callback.
    >>
    commit 6072d13c429373c5d63b69dadbbef40a9b035552
    Author: Linus Torvalds <torvalds@linux-foundation.org>
    Date: Wed Dec 1 13:35:19 2010 -0500

Call the filesystem back whenever a page is removed from the page cache
>>
It was introduced because remove_mapping() can refuse to remove a page from the page cache when the page refcount != 2, or because the page is dirty. Since the page is in the CPS_FREEING state, cl_page_own() fails to own the page in the blocking AST, and the code expects someone else to free it, but nobody does. Oops: a stale page with the uptodate flag set is left in the page cache, where it can be read via the fast read code path.
Some existing logs:
>>>
00000008:00100000:10.0:1615300198.692889:0:4147:0:(osc_cache.c:3288:osc_page_gang_lookup()) vvp-page@ffff8800310524e0(1:1) vm@ffffea000119bdd0 10000000000002c 4:0 0 82094 lru
bad
00000008:00100000:10.0:1615300198.692873:0:4147:0:(osc_cache.c:3279:osc_page_gang_lookup()) vvp-page@ffff8800310520e0(1:1) vm@ffffea000119be08 10000000000002c 3:0 0 82095 lru
good
>>>
Other logs show it's a race between lock cancel (osc_gang_lookup) and kswapd.

So the extra vmpage reference most likely caused the failure. Based on a crash dump taken a second after the event, the page holds two references, so we likely have a race with page access.
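
For illustration, the sequence above can be condensed into a small userspace simulation (everything below is a stand-in invented for this sketch, not kernel or Lustre code): releasepage deletes the cl_page, remove_mapping() refuses to drop the vmpage because its refcount is not 2, and an uptodate page is left behind that the fast read path will still serve.

/*
 * Minimal userspace simulation of the reclaim sequence described above.
 * All types and helpers are stand-ins, not real kernel or Lustre code;
 * only the ordering of events is modelled.
 */
#include <stdbool.h>
#include <stdio.h>

enum cl_page_state { CPS_CACHED, CPS_FREEING };

struct sim_page {
        int refcount;                   /* vmpage reference count */
        bool uptodate;                  /* PG_uptodate */
        bool in_mapping;                /* still hashed in the page cache */
        enum cl_page_state cps;         /* state of the attached cl_page */
        int data;                       /* page contents */
};

/* ll_releasepage(): the page looks unused, so the cl_page is deleted */
static int sim_releasepage(struct sim_page *p)
{
        p->cps = CPS_FREEING;           /* cl_page_delete() */
        return 1;                       /* tell reclaim the private data is gone */
}

/* remove_mapping(): only drops the page if nobody else holds a reference */
static bool sim_remove_mapping(struct sim_page *p)
{
        if (p->refcount != 2)           /* extra reference -> refuse */
                return false;
        p->in_mapping = false;
        p->uptodate = false;            /* ->freepage() would run here */
        return true;
}

/* fast read path: trusts any uptodate page found in the mapping */
static bool sim_fast_read(struct sim_page *p, int *out)
{
        if (!p->in_mapping || !p->uptodate)
                return false;
        *out = p->data;
        return true;
}

int main(void)
{
        struct sim_page p = {
                .refcount = 3,          /* one extra reference, as in the logs */
                .uptodate = true,
                .in_mapping = true,
                .cps = CPS_CACHED,
                .data = 1,              /* soon-to-be-stale contents */
        };
        int v;

        sim_releasepage(&p);            /* kswapd: cl_page is now CPS_FREEING */
        if (!sim_remove_mapping(&p))    /* refused: refcount is 3, not 2 */
                printf("vmpage left in the page cache\n");

        /* another client rewrites the file; this cached copy is now stale */
        if (sim_fast_read(&p, &v))
                printf("fast read returns stale value %d\n", v);
        return 0;
}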



 Comments   
Comment by Wang Shilong (Inactive) [ 28/Apr/21 ]

I think the problem exists, but it might be a bit different. In ll_releasepage() we have:

        if (!cl_page_in_use(page)) {
                result = 1;
                cl_page_delete(env, page);
        }

At this point the cl_page reference count is 2; cl_page_delete() is called to move the page to CPS_FREEING, and the cl_page reference count drops to 1.

At the end we have cl_page_put(env, page); with this, the cl_page and osc page will be dropped from memory.

But as the comments say, it is still possible that the vmpage cannot be released because of (page refcount != 2); the vmpage refcount is a bit tricky here.

I think we could potentially fix this issue by clearing the page uptodate flag in vvp_page_delete():

diff --git a/lustre/llite/vvp_page.c b/lustre/llite/vvp_page.c
index d0e274c..87ec2a8 100644
--- a/lustre/llite/vvp_page.c
+++ b/lustre/llite/vvp_page.c
@@ -178,6 +178,12 @@ static void vvp_page_delete(const struct lu_env *env,
 
        ClearPagePrivate(vmpage);
        vmpage->private = 0;
+
+       /**
+        * Vmpage might not be released due page refcount != 2,
+        * clear Page uptodate here to avoid stale data.
+        */
+       ClearPageUptodate(vmpage);
        /*
         * Reference from vmpage to cl_page is removed, but the reference back

Comment by Wang Shilong (Inactive) [ 28/Apr/21 ]

Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43476
Subject: LU-14541 llite: avoid stale data reading
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f9795606d8d5d05d9c4cca7d0221f16dfe7db7d1

Comment by Alexey Lyashkov [ 28/Apr/21 ]

This patch has already been tested - it solves the problem, but some issues remain.
1) The page isn't freed and stays in memory for a long time, until the page cache LRU flushes it.
2) A page without the uptodate flag may cause an EIO in some cases, especially with splice. Not sure - but possible.

I have a different patch that changes the cl_page state transitions to avoid owning CPS_FREED pages, but no resources to verify it.
The cluster where the problem is reproduced is busy now.

In our case, it was reproduced with overstriping with 5000 stripes and sysctl -w vm.drop_caches=3 on the client nodes, in parallel with IOR.

Comment by Andreas Dilger [ 28/Apr/21 ]

Shadow, can you please clarify your comment. It sounds like you made a similar patch, but it isn't clear if you are currently using that patch or if it had so many problems that it wasn't applied? It sounds like there are some theoretical problems but they may not actually be seen in real use?

Do you have any idea why there are extra references on the page? Is that because of a bug in llite or is some other kernel process actually holding a reference on the page at the time we are trying to drop it (e.g. kswapd) and it never gets cleaned up afterward?

Comment by Alexey Lyashkov [ 28/Apr/21 ]

Andreas,

I cooked up a similar patch while debugging this issue. But during internal discussion, Panda pointed to the place in the kernel where EIO can be returned when a page is in the mapping but not uptodate. That was the reason to reject this approach, in addition to the extra memory usage after this patch is applied.

As for the extra reference, it is most likely a clean kswapd vs lock cancel race, and it looks like osc_gang_lookup takes an extra reference while doing its own work.

Comment by Andreas Dilger [ 28/Apr/21 ]

I guess the main question is how often the EIO case can happen (e.g. only in rare cases, or commonly, and only if the file is being modified, or even for files that are not changed), and whether it is better to return EIO (safe, but annoying) or stale data (unsafe, but usually correct?) to the application?

Comment by Andreas Dilger [ 03/Jun/21 ]

As for the extra reference, it is most likely a clean kswapd vs lock cancel race, and it looks like osc_gang_lookup takes an extra reference while doing its own work.

It looks like cl_page_delete() will prevent new references from finding a page, but doesn't handle old references. What about adding a refcount check in osc_gang_lookup() that detects if it is the last refcount holder and dropping the page from the mapping there?
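
To make the idea concrete, a stub-level sketch of that check (all names below are invented for this illustration, not Lustre functions) might look like:

/*
 * Stub-level sketch of the suggestion above: when the gang lookup drops the
 * extra reference it took on a page whose cl_page is already being freed,
 * and it was the last extra holder, remove the vmpage from the mapping so
 * no stale uptodate page is left behind. Stand-in types only.
 */
#include <stdbool.h>
#include <stdio.h>

enum cl_page_state { CPS_CACHED, CPS_FREEING };

struct sim_page {
        int refcount;                   /* vmpage reference count */
        bool in_mapping;
        bool uptodate;
        enum cl_page_state cps;
};

static void sim_drop_from_mapping(struct sim_page *p)
{
        p->in_mapping = false;
        p->uptodate = false;
}

/* where the gang lookup releases the reference it held during its scan */
static void sim_gang_lookup_put(struct sim_page *p)
{
        p->refcount--;
        /* lookup was the last extra holder; refcount is back to the baseline of 2 */
        if (p->cps == CPS_FREEING && p->refcount == 2 && p->in_mapping)
                sim_drop_from_mapping(p);
}

int main(void)
{
        /* reclaim already moved the cl_page to CPS_FREEING, as in the description */
        struct sim_page p = { .refcount = 3, .in_mapping = true,
                              .uptodate = true, .cps = CPS_FREEING };

        sim_gang_lookup_put(&p);
        printf("in_mapping=%d uptodate=%d\n", p.in_mapping, p.uptodate);
        return 0;
}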

Comment by Gerrit Updater [ 22/Jul/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43476/
Subject: LU-14541 llite: avoid stale data reading
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f2a16793fa4316fc9ccdc46bcfe54f6b8d1e442b

Comment by Peter Jones [ 22/Jul/21 ]

Landed for 2.15

Comment by Cory Spitz [ 23/Jul/21 ]

Are there any performance test results with this patch?

Comment by Patrick Farrell [ 23/Jul/21 ]

No, but it would be very surprising if it had any performance implications - the direct effect is a single bit operation on page flags, and then making a page which is being destroyed inaccessible slightly earlier.  (Which is correct because the data in it was not up to date.)

Comment by Cory Spitz [ 23/Jul/21 ]

Thanks, Patrick. It is funny because we have been dragging our feet since March on this because of the potential perf impact. We're in the process of gathering numbers now and will post them when available.

Comment by Patrick Farrell [ 23/Jul/21 ]

Why do folks think there would be a perf impact?  This is a page which is being deleted.

Comment by Andreas Dilger [ 25/Jul/21 ]

Cory, could you please clarify your comment about potential performance impact? Shadow had some concerns about the provided patch not fixing all of the corner cases, but no patch was forthcoming from him. The landed patch fixes a regular crash in the regression tests, and if there is an additional patch that addresses other issues it can be applied on top of the one that was landed.

Comment by Cory Spitz [ 02/Aug/21 ]

adilger, sorry, I was out on PTO and wasn't able to clarify sooner. Yes, we had some performance concerns because clearly there could be a buffer cache impact with a change such as this. Fortunately, we've done some performance testing of the fix and the long and short of it is that the performance is fine. We can share more if necessary.

As for the EIO potential, that investigation and testing is still on-going. So far, so good. We can report back on that topic shortly.

Comment by Petros Koutoupis [ 02/Aug/21 ]

I uploaded the performance results (LU-14541-performance-results.tar.gz and LU-14541-performance-results.xlsx) against an internal cray-2.12.4.3 baseline (7699fab). What is being tested is the cray-2.12.4.3 baseline + the LU-14541 patch and we see no regressions.

Comment by Patrick Farrell [ 02/Aug/21 ]

Still really don't understand why it was thought this might have a performance impact; perhaps it was another approach to fixing this?  This one is just making it so pages which are currently being deleted cannot be read.

Comment by Alexey Lyashkov [ 03/Aug/21 ]

@Patrick, any change should be tested for performance impact.

As an example of a non-obvious issue: https://review.whamcloud.com/#/c/35610/ - it causes random reads where sequential reads happened before. That is because pages covered by a lock can be aged and removed in any order, while lock cancel guarantees that a contiguous area is freed.

For this particular case a different approach existed, but it looks like it has issues, and it costs about -5% in IOPS. So this patch should be tested for correctness - LTP, IOR, etc. - and checked for performance impact.

Comment by Cory Spitz [ 05/Aug/21 ]

Additional context:
In May we had empirical data showing a slowdown. We had a reproducer job that we could run at a rate of 175/day. When we inserted the fix the rate decreased to 162/day. This started our performance concern. Then we got these results on an 'aged' filesystem:

Perf_Data mdtest 1 Dir_create 29329.673 ops/sec
Perf_Data mdtest 1 Dir_create 29554.075 ops/sec
Perf_Data mdtest Dir_create PERCENT_DIFF: + 0%
Perf_Data mdtest 1 Dir_stat 63331.142 ops/sec
Perf_Data mdtest 1 Dir_stat 62922.968 ops/sec
Perf_Data mdtest Dir_stat PERCENT_DIFF: + 0%
Perf_Data mdtest 1 Dir_rm 35983.039 ops/sec
Perf_Data mdtest 1 Dir_rm 35867.840 ops/sec
Perf_Data mdtest Dir_rm PERCENT_DIFF: + 0%
Perf_Data mdtest 1 File_create 35296.924 ops/sec
Perf_Data mdtest 1 File_create 34660.408 ops/sec
Perf_Data mdtest File_create PERCENT_DIFF: - 1%
Perf_Data mdtest 1 File_stat 57846.705 ops/sec
Perf_Data mdtest 1 File_stat 57867.343 ops/sec
Perf_Data mdtest File_stat PERCENT_DIFF: + 0%
Perf_Data mdtest 1 File_read 55573.686 ops/sec
Perf_Data mdtest 1 File_read 53646.697 ops/sec
Perf_Data mdtest File_read PERCENT_DIFF: - 3%
Perf_Data mdtest 1 File_rm 34249.630 ops/sec
Perf_Data mdtest 1 File_rm 34135.695 ops/sec
Perf_Data mdtest File_rm PERCENT_DIFF: + 0%

aprun -n 256 -N 16 /cray/css/ostest/binaries/xt/rel.52.gemini.cray/xtcnl/apps/ROOT.latest/perf/IOR-2.10.3/RUN/IOR -vv -w -F -b 163840m -t 64m -i 5 -k -m -D 180 -B -o /lus/snx11205/talbers/baseline/pools/ssu1/ior/out.write aprun -n 256 -N 16 /cray/css/ostest/binaries/xt/rel.52.gemini.cray/xtcnl/apps/ROOT.latest/perf/IOR-2.10.3/RUN/IOR -vv -w -F -b 163840m -t 64m -i 5 -k -m -D 180 -B -o /lus/snx11205/talbers/baseline/pools/ssu1/ior/out.write
/cray/css/ostest/binaries/xt/rel.52.gemini.cray/xtcnl/apps/ROOT.latest/perf/IOR-2.10.3/RUN/IOR -vv -w -F -b 163840m -t 64m -i 5 -k -m -D 180 -B -o /lus/snx11205/talbers/baseline/pools/ssu1/ior/out.write
Perf_Data IOR 67108864 write 23628.770 MiB/sec
Perf_Data IOR 67108864 write 22613.789 MiB/sec
Perf_Data IOR write PERCENT_DIFF: - 4%
Starting Pre-fill for Read Phase
aprun -n 256 -N 16 /cray/css/ostest/binaries/xt/rel.52.gemini.cray/xtcnl/apps/ROOT.latest/perf/IOR-2.10.3/RUN/IOR -vv -w -F -b 163840m -t 64m -i 1 -k -D 1800 -B -o /lus/snx11205/talbers/baseline/pools/ssu1/ior/out.read aprun -n 256 -N 16 /cray/css/ostest/binaries/xt/rel.52.gemini.cray/xtcnl/apps/ROOT.latest/perf/IOR-2.10.3/RUN/IOR -vv -w -F -b 163840m -t 64m -i 1 -k -D 1800 -B -o /lus/snx11205/talbers/baseline/pools/ssu1/ior/out.read
/cray/css/ostest/binaries/xt/rel.52.gemini.cray/xtcnl/apps/ROOT.latest/perf/IOR-2.10.3/RUN/IOR -vv -w -F -b 163840m -t 64m -i 1 -k -D 1800 -B -o /lus/snx11205/talbers/baseline/pools/ssu1/ior/out.read
Perf_Data IOR 67108864 write 23204.926 MiB/sec
Perf_Data IOR 67108864 write 22281.541 MiB/sec
Perf_Data IOR write PERCENT_DIFF: - 3%

But as koutoupis wrote, when we ran with the fix on the benchmarking platform we saw no regressions.

Comment by Cory Spitz [ 05/Aug/21 ]

> 2) A page without the uptodate flag may cause an EIO in some cases, especially with splice. Not sure - but possible.

> I guess the main question is how often the EIO case can happen (e.g. only in rare cases, or commonly, and only if the file is being modified, or even for files that are not changed), and whether it is better to return EIO (safe, but annoying) or stale data (unsafe, but usually correct?) to the application?

We've completed runs of an LTP-heavy mix and found no issues with EIO.

Comment by Gerrit Updater [ 14/Dec/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45848
Subject: LU-14541 llite: avoid stale data reading
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 217a0de0e1787fa9632bde5f792e7a90ee79656a

Comment by John Hammond [ 04/May/22 ]

Attached a reproducer, rw_seq_cst_vs_drop_caches.c. Running with the fix reverted:

# lctl set_param llite.*.fast_read=1
llite.lustre-ffff9d20ca304088.fast_read=1
llite.lustre-ffff9d21cc919028.fast_read=1
# ./rw_seq_cst_vs_drop_caches /mnt/lustre/f0 /mnt/lustre2/f0
u = 2, v = { 2, 1 }
Aborted (core dumped)
# lctl set_param llite.*.fast_read=0
llite.lustre-ffff9d20ca304088.fast_read=0
llite.lustre-ffff9d21cc919028.fast_read=0
# ./rw_seq_cst_vs_drop_caches /mnt/lustre/f0 /mnt/lustre2/f0
u = 2, v = { 2, 1 }
Aborted (core dumped)
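
For context, below is a minimal sketch of the kind of sequential-consistency check such a reproducer performs. This is not the attached file: the single-threaded structure, the use of two mount points, the drop_caches trigger and the exact abort condition are assumptions for illustration only.

/*
 * NOT the attached reproducer - a simplified sketch only. Assumed idea:
 * bump a counter in a file through one mount point, read it back through
 * a second mount point, and abort if a read ever returns a value older
 * than what was just written (a stale page). Periodically dropping caches
 * forces reclaim, which is what exposes the race described in this ticket.
 */
#include <assert.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        uint64_t u = 0, v[2];
        int wfd, rfd, dfd;

        if (argc != 3) {
                fprintf(stderr, "usage: %s /mnt/lustre/f0 /mnt/lustre2/f0\n",
                        argv[0]);
                return 1;
        }

        wfd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        rfd = open(argv[2], O_RDONLY);
        if (wfd < 0 || rfd < 0) {
                perror("open");
                return 1;
        }

        for (;;) {
                u++;
                assert(pwrite(wfd, &u, sizeof(u), 0) == (ssize_t)sizeof(u));

                /* read the counter back twice through the second mount */
                assert(pread(rfd, &v[0], sizeof(v[0]), 0) == (ssize_t)sizeof(v[0]));
                assert(pread(rfd, &v[1], sizeof(v[1]), 0) == (ssize_t)sizeof(v[1]));

                /* both reads must observe the value that was just written */
                if (v[0] != u || v[1] != u) {
                        fprintf(stderr, "u = %"PRIu64", v = { %"PRIu64", %"PRIu64" }\n",
                                u, v[0], v[1]);
                        abort();
                }

                /* every so often, force client-side reclaim */
                if ((u & 1023) == 0) {
                        dfd = open("/proc/sys/vm/drop_caches", O_WRONLY);
                        if (dfd >= 0) {
                                if (write(dfd, "3", 1) < 0)
                                        perror("drop_caches");
                                close(dfd);
                        }
                }
        }
}
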
Comment by Gerrit Updater [ 06/May/22 ]

"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47243
Subject: LU-14541 llite: add rw_seq_cst_vs_drop_caches
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2c16e1062eec944b0700a74b5242f87922122958

Comment by Patrick Farrell [ 09/May/22 ]

Looking at the readpage handling in filemap_fault(), we can see clearly that it's not OK to clear PageUptodate at an arbitrary time. The page comes back unlocked from ->readpage, and the code requires it to be uptodate or it will give an error (resulting in the SIGBUS we see with the LU-14541 patch in place).

But if we leave the page uptodate, we hit this issue where we can see stale data.
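
For reference, here is a rough, self-contained paraphrase of that fault-path behaviour (stub types only; the real filemap_fault() differs in names and detail across kernel versions):

/*
 * Rough userspace paraphrase of the page_not_uptodate handling in
 * filemap_fault() that is referred to above. Structure only; the stub
 * types and helpers are not real kernel code.
 */
#include <stdio.h>

#define VM_FAULT_SIGBUS 1

struct stub_page { int locked; int uptodate; };

/* ->readpage() is expected to start I/O and unlock the page when done */
static int stub_readpage(struct stub_page *page)
{
        page->locked = 0;
        /* if the cl_page was already deleted, nothing marks this uptodate */
        return 0;
}

static int stub_filemap_fault(struct stub_page *page)
{
        if (page->uptodate)
                return 0;               /* map the page into the process */

        /* page_not_uptodate: retry the read once, synchronously */
        if (stub_readpage(page) == 0) {
                /* wait for the page lock to drop, then re-check the flag */
                if (page->uptodate)
                        return 0;
        }
        return VM_FAULT_SIGBUS;         /* delivered to the app as SIGBUS */
}

int main(void)
{
        struct stub_page page = { .locked = 1, .uptodate = 0 };

        if (stub_filemap_fault(&page) == VM_FAULT_SIGBUS)
                printf("fault fails: page never became uptodate -> SIGBUS\n");
        return 0;
}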

Comment by Gerrit Updater [ 09/May/22 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47262
Subject: LU-14541 llite: Check vmpage in releasepage
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8370da3d1d88f0d06ffd616141fa6913e7142150

Comment by Gerrit Updater [ 09/May/22 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47266
Subject: LU-14541 llite: Disable releasepage for Lustre
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6070e6b11e83dafe70d72e1aaec4081fe6b0a67

Comment by Gerrit Updater [ 11/May/22 ]

"John L. Hammond <jhammond@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47298
Subject: LU-14541 llite: reenable fast_read by default
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3c105dbc6a5e5e2fd4f2dc4af93d002823d08887

Comment by Gerrit Updater [ 19/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47298/
Subject: LU-14541 llite: reenable fast_read by default
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a94e28fda44077f77058aeb4ff496da721daa1d8

Comment by Gerrit Updater [ 19/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47243/
Subject: LU-14541 llite: add rw_seq_cst_vs_drop_caches
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 99e6bc29a437cac43fee499d6f9a2c17854468ee

Comment by Gerrit Updater [ 20/May/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47262/
Subject: LU-14541 llite: Check vmpage in releasepage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c524079f4f59a39b99467d9868ee4aafdcf033e9

Comment by Peter Jones [ 20/May/22 ]

Looks like the latest patches have landed

Comment by Gerrit Updater [ 23/Aug/22 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48311
Subject: LU-14541 llite: Check vmpage in releasepage
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 4060c9747dc3a6e392959255f6e0f854f50e2255

Comment by Andreas Dilger [ 21/Oct/22 ]

sanity test_277 is being skipped on master because of this ticket, but it should really be referencing LU-12587.

Comment by Gerrit Updater [ 17/Jan/23 ]

"Patrick Farrell <farr0186@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49653
Subject: LU-14541 llite: Check for page deletion after fault
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cee73d934c44a1bea934c9de40efcc56d2634e1c

Comment by Gerrit Updater [ 17/Jan/23 ]

"Patrick Farrell <farr0186@gmail.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49654
Subject: Revert "LU-14541 llite: Check vmpage in releasepage"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 95fb01cb020295531839d8c758320262db6b7bef

Comment by Gerrit Updater [ 11/Apr/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50598
Subject: LU-14541 llite: Check for page deletion after fault
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 9cf02b117d1e51d8ad367bae88f8fda4c0a95f49

Comment by Gerrit Updater [ 11/Apr/23 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50599
Subject: LU-14541 revert: "llite: Check vmpage in releasepage"
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 0926e0b3b61c8180024df55223f654080f789a6d

Comment by Gerrit Updater [ 26/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49653/
Subject: LU-14541 llite: Check for page deletion after fault
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b3d2114e538cf95a7e036f8313e9095fe821da79

Comment by Gerrit Updater [ 09/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49654/
Subject: Revert "LU-14541 llite: Check vmpage in releasepage"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e3cfb688ed7116a57b2c7f89a3e4f28291a0b69f

Comment by Gerrit Updater [ 20/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50598/
Subject: LU-14541 llite: Check for page deletion after fault
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 4a134425bd9d03f5e40e489fd7e9acf4788e9da1

Comment by Gerrit Updater [ 20/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50599/
Subject: Revert "LU-14541 llite: Check vmpage in releasepage"
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 72b5be5ccc1c58ae6edc968fa9106d53578aeccb
