[LU-13759] sanity-dom sanityn_test_20 fails with '1 page left in cache after lock cancel' Created: 07/Jul/20  Updated: 04/Mar/21  Resolved: 20/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.7

Type: Bug Priority: Critical
Reporter: James Nunez (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: DNE
Environment: DNE


Issue Links:
Related
is related to LU-13645 Various data corruptions possible in ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-dom sanityn test_20 fails with '1 page left in cache after lock cancel'. This test started failing on 28 June 2020 and fails only in DNE testing, that is, in review-dne-part-4 and review-dne-zfs-part-4.

sanity-dom runs several sanityn.sh tests with DOM enabled

test_sanityn()
{
        # XXX: to fix 60
        ONLY="1 2 4 5 6 7 8 9 10 11 12 14 17 19 20 23 27 39 51a 51c 51d" \
                OSC="mdc" DOM="yes" bash sanityn.sh

        return 0
}
run_test sanityn "Run sanityn with Data-on-MDT files"

and it is actually sanityn test 20 that we see fail here.

There are a couple of problems:
1. sanityn test 20 fails when DOM="yes" is set.
2. When this test fails, sanity-dom is not marked as failed, or at least not in a way that Maloo recognizes, so this is a silent failure.

This ticket deals with sanity-dom’s sanityn test 20 failure. I’ll open a different ticket for the sanity-dom failures not getting recognized as failures.
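The ticket does not say why Maloo misses the failure. One detail visible in the quoted test_sanityn() is that it ends with `return 0`, and in shell a trailing `return 0` discards the exit status of whatever ran before it. The sketch below only demonstrates that shell behavior. Whether `bash sanityn.sh` exits non-zero on a failed subtest, and whether this is what hides the failure from Maloo, are assumptions; failing_suite, masked and propagated are hypothetical names.

```shell
# Stand-in for `bash sanityn.sh` exiting non-zero (assumed, not confirmed by the ticket).
failing_suite() { return 1; }

# Same shape as the quoted test_sanityn(): the trailing `return 0` wins.
masked() {
    failing_suite
    return 0
}

# Hypothetical variant that hands the suite's status back to the caller.
propagated() {
    failing_suite
    return $?
}

masked_rc=0
masked || masked_rc=$?
propagated_rc=0
propagated || propagated_rc=$?

echo "masked=$masked_rc propagated=$propagated_rc"
```

With the stand-in above, the first function reports success even though the suite failed, while the second reports the failure.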

For a recent failure, with logs at https://testing.whamcloud.com/test_sets/5230daaa-9cb6-4bdf-98ad-330a658a197a, the suite_log reveals nothing about the cause of the failure:

== sanityn test 20: test extra readahead page left in cache ========================================== 09:32:02 (1594114322)
striped dir -i0 -c2 -H fnv_1a_64 /mnt/lustre/d20
 sanityn test_20: @@@@@@ FAIL: 1 page left in cache after lock cancel 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6167:error()
  = sanityn.sh:600:test_20()

Since the failure is not recognized as a failure by Maloo, there are no logs other than console logs to look at. The console logs do not provide any information on why the test is failing.
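Because the failure is silent, one way to notice it is to scan the console or suite log for the failure marker seen in the excerpt above. The following is a minimal sketch that runs on a copy of that excerpt rather than on a real log file; treating the `@@@@@@ FAIL:` string as the marker to search for is an assumption based only on this excerpt.

```shell
# Sample text copied from the suite_log excerpt in this ticket.
suite_log=$(cat <<'EOF'
== sanityn test 20: test extra readahead page left in cache ========================================== 09:32:02 (1594114322)
striped dir -i0 -c2 -H fnv_1a_64 /mnt/lustre/d20
 sanityn test_20: @@@@@@ FAIL: 1 page left in cache after lock cancel
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6167:error()
  = sanityn.sh:600:test_20()
EOF
)

# Keep only the lines carrying the failure marker, and count them.
failures=$(printf '%s\n' "$suite_log" | grep -F '@@@@@@ FAIL:')
fail_count=$(printf '%s\n' "$suite_log" | grep -c -F '@@@@@@ FAIL:')

echo "$fail_count failure line(s) found:"
echo "$failures"
```

On a real run, the here-document would be replaced by the contents of the downloaded log.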

Recent failures of this test are at:
https://testing.whamcloud.com/test_sets/61841ecb-57f6-4c0f-b563-01eae76405f2
https://testing.whamcloud.com/test_sets/88646434-24d8-41fc-81cc-43d19e862c07



 Comments   
Comment by Peter Jones [ 09/Jul/20 ]

Mike

This failure is happening regularly - could you please investigate?

Peter

Comment by Mikhail Pershin [ 13/Jul/20 ]

I think it is a result of the LU-13645 issue.

Comment by Gerrit Updater [ 16/Jul/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39401
Subject: LU-13759 dom: lock cancel to drop pages
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c9b7e98520226fa4c8db7df7afb2fff562309624

Comment by Mikhail Pershin [ 16/Jul/20 ]

The problem is part of LU-13645 and is related to the read-on-open feature. When llite gets pages along with the open, these pages are not connected with CLIO, but lock cancel now goes to the MDC level and uses CLIO methods to flush/discard data. Those pages are therefore not visible to it and are kept in the VM even after the LDLM lock is gone. The test illustrates the problem.
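The mechanism described here can be pictured with a toy model. The sketch below does not call Lustre at all: page numbers stand in for cached pages, and the two lists, the page values and the cancel_via_clio name are all hypothetical. It only shows how a discard that walks the CLIO-tracked pages would miss a page that arrived with the open reply.

```shell
# Toy model only; nothing here touches a real page cache.
vm_pages="0 1 2"     # pages present in the VM page cache
clio_pages="1 2"     # pages that also have a cl_page, so CLIO can see them
# Page 0 is assumed to have come with the open reply, so it has no cl_page.

# Discard every page CLIO knows about, as a CLIO-based lock cancel would.
cancel_via_clio() {
    remaining=""
    for p in $vm_pages; do
        case " $clio_pages " in
            *" $p "*) ;;                       # tracked by CLIO: dropped
            *) remaining="$remaining $p" ;;    # untracked: survives the cancel
        esac
    done
    vm_pages="${remaining# }"
    clio_pages=""
}

cancel_via_clio

# Count what is still cached after the cancel.
set -- $vm_pages
left=$#
echo "$left page left in cache after lock cancel"
```

In this model, connecting every VM page to a cl_page would correspond to listing page 0 in clio_pages as well, after which the cancel leaves nothing behind.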

I see two possible fixes here. The first one, implemented in the patch, is to revert commit 02e766f5ed from LU-11427 so that VM pages are connected with cl_pages in CLIO. This looks simple and should work in all cases.

The second way is to bring back the dropped code that calls truncate_inode_pages_range() in llite when the lock is being canceled. This needs more work to get the DOM layout data from LOV first, and it still requires a call to MDC to handle the LVB problems described in LU-12296.

The only possible drawback of the first approach is a performance drop. LU-11427 doesn't have any numbers to compare against, so I will try to collect performance numbers before and after the patch.

Comment by Gerrit Updater [ 17/Jul/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39412
Subject: LU-13759 dom: drop pages on lock cancel correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f7f8bffee607eb888a9ecdb24f4a5c2dca9b9ac0

Comment by Mikhail Pershin [ 17/Jul/20 ]

Alternative patch that keeps the performance improvement from LU-11427.

Comment by Alexey Lyashkov [ 20/Jul/20 ]

I tested the patch submitted by Mike, and I don't see any performance regression with open.
The test is simple.
A set of directories with 40k files was created. Each file is 16k in size and has just a DoM layout. L300 MDT with IB FDR network.

createmany has a simple modification to open files as O_RDONLY, and the number of open descriptors is adjusted so that all files can be opened without a close.
The run was started several times in different directories to avoid disk effects.
No performance regression found.

Comment by Andreas Dilger [ 29/Jul/20 ]

LU-13645 is an issue that has existed for several years already, and it wouldn't explain why this test suddenly started failing so frequently.

There were a number of patches that landed on June 28th that are likely the cause of this problem:

https://review.whamcloud.com/39134 "LU-12678 socklnd: don't fall-back to tcp_sendpage."  [lnet] [trivial]
https://review.whamcloud.com/39127 "LU-12678 lnet: Fix some out-of-date comments."  [lnet] [trivial]
https://review.whamcloud.com/39122 "LU-12678 o2iblnd: allocate init_qp_attr on stack."  [lnet] [trivial]
https://review.whamcloud.com/39117 "LU-9859 libcfs: fold cfs_tracefile_*_arch into their only callers."  [libcfs] [trivial]
https://review.whamcloud.com/38985 "LU-930 misc: update URLs in README"  [trivial]
https://review.whamcloud.com/38981 "LU-9859 libcfs: move tgt_descs to standard Linux bitmaps."
https://review.whamcloud.com/38941 "LU-13595 scripts: Add a debug option to lustre_rmmod"  [trivial]
https://review.whamcloud.com/38743 "LU-13566 socklnd: fix local interface binding"  [lnet]
https://review.whamcloud.com/38181 "LU-13437 mdt: rename misses remote LOOKUP lock revoke" **
https://review.whamcloud.com/37756 "LU-930 doc: update James Simmons contact info"  [trivial]
https://review.whamcloud.com/37567 "LU-13180 osc: disable ext merging for rdma only pages and non-rdma" ****
https://review.whamcloud.com/36707 "LU-8130 lu_object: convert lu_object cache to rhashtable" **
https://review.whamcloud.com/33616 "LU-8130 ptlrpc: convert conn_hash to rhashtable"
https://review.whamcloud.com/39135 "LU-9679 obdclass: remove init to 0 from lustre_init_lsi()" ** [trivial]
https://review.whamcloud.com/38580 "LU-13525 sec: better struct sepol_downcall_data"
https://review.whamcloud.com/37969 "LU-13365 ldlm: check slv and limit before updating"  ****
https://review.whamcloud.com/37607 "LU-9679 osc: simplify osc_extent_find()"  ****

Most of these seem unrelated to this failure, based on the description, but since several of them were marked trivial it is possible that they introduced an error in the background. The most probable sources of this regression are marked with "****", a few less likely ones are "**". Pushing a patch to revert those changes, with "Test-Parameters: trivial testlist=sanity-dom env=ONLY=20,ONLY_REPEAT=50" would see if they are still failing, or if the revert has resolved the problem.
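As a sketch of how such a revert patch could carry the suggested line, the snippet below assembles a commit message whose last line is the Test-Parameters string quoted above. The subject and body text are hypothetical placeholders, not taken from any real patch.

```shell
# The Test-Parameters line is quoted from the comment above.
test_params='Test-Parameters: trivial testlist=sanity-dom env=ONLY=20,ONLY_REPEAT=50'

# Hypothetical subject and body, for illustration only.
subject='LU-13759 tests: revert suspect patches (hypothetical subject)'
body='Debug-only revert to check whether sanityn test_20 still fails under sanity-dom.'

# Subject, blank line, body, blank line, then the Test-Parameters line.
commit_msg=$(printf '%s\n\n%s\n\n%s\n' "$subject" "$body" "$test_params")
printf '%s\n' "$commit_msg"
```

The resulting text could then be fed to `git commit -F -` when creating the revert commit.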

Comment by James Nunez (Inactive) [ 29/Jul/20 ]

The following patch is not a fix for this issue, but should be used if we can't find a solution for this issue in a timely manner.

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39534
Subject: LU-13759 tests: stop running sanity-dom test 20
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 53182625fce4a0936008b21274048c13c2c5da45

Comment by Mikhail Pershin [ 29/Jul/20 ]

James, I expect that https://review.whamcloud.com/39401 is the solution.

Comment by Andreas Dilger [ 30/Jul/20 ]

Mike, if https://review.whamcloud.com/39401 is the solution to this problem, it would be good to add a Test-Parameters: line as I put in my previous comment to verify it fixes this issue.

What is still confusing to me is why this test started failing on June 28th, when the code being fixed by 39401 is much older than that?

Comment by Gerrit Updater [ 30/Jul/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39540
Subject: LU-13759 test: test sanityn 20
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0dcb8836a1ec5c8ecab97ad47afdb63ce4856ef2

Comment by Mikhail Pershin [ 30/Jul/20 ]

Andreas, I've just pushed a separate patch just for testing. Meanwhile, as far as I can see, the test was also failing before June 28th. And if I am right about the reason, then the problem was introduced by https://review.whamcloud.com/#/c/34858/, which started using MDC code to flush DoM data.

A Maloo search shows the first failures on 2019-11-21, then a few in February-March, and then a growing number of failures until now. I am not sure why the frequency is increasing, and I agree that there could be another trigger for this, or even another root cause.

Comment by Gerrit Updater [ 30/Jul/20 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39549
Subject: LU-13759 tests: debug patch not for review
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 99685a7c88a5be791bdf452f1c679808d8394502

Comment by Gerrit Updater [ 13/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39401/
Subject: LU-13759 dom: lock cancel to drop pages
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e95eca236471cf23083ef281ef204a5920e4db9b

Comment by Gerrit Updater [ 13/Aug/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39540/
Subject: LU-13759 test: make sanityn test_20 repeatable
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 910ed44d1f3844ae3f76a3594dbd1a09b5892643

Comment by Gerrit Updater [ 19/Oct/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40302
Subject: LU-13759 dom: lock cancel to drop pages
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 59daace04573950e436385020c565399cae08c9e

Comment by Peter Jones [ 20/Oct/20 ]

Seems to be fixed

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40302/
Subject: LU-13759 dom: lock cancel to drop pages
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 63b0c8f28dbd8513774219b8802370a638668811
