[LU-13759] sanity-dom sanityn_test_20 fails with '1 page left in cache after lock cancel' Created: 07/Jul/20 Updated: 04/Mar/21 Resolved: 20/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | DNE | ||
| Environment: |
DNE |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Description |
|
sanity-dom sanityn test_20 fails with '1 page left in cache after lock cancel'. This test started failing on 28 June 2020 and is failing only in DNE testing, that is, in review-dne-part-4 and review-dne-zfs-part-4. sanity-dom runs several sanityn.sh tests with DOM enabled
test_sanityn()
{
	# XXX: to fix 60
	ONLY="1 2 4 5 6 7 8 9 10 11 12 14 17 19 20 23 27 39 51a 51c 51d" \
	OSC="mdc" DOM="yes" bash sanityn.sh

	return 0
}
run_test sanityn "Run sanityn with Data-on-MDT files"
and it is actually sanityn test 20 that we see fail here. There are a couple of problems: This ticket deals with sanity-dom’s sanityn test 20 failure. I’ll open a different ticket for the sanity-dom failures not getting recognized as failures.

For a recent failure, logs at https://testing.whamcloud.com/test_sets/5230daaa-9cb6-4bdf-98ad-330a658a197a, the suite_log doesn’t reveal anything about the cause of the failure

== sanityn test 20: test extra readahead page left in cache ========================================== 09:32:02 (1594114322)
striped dir -i0 -c2 -H fnv_1a_64 /mnt/lustre/d20
 sanityn test_20: @@@@@@ FAIL: 1 page left in cache after lock cancel
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6167:error()
  = sanityn.sh:600:test_20()

Since the failure is not recognized as a failure by Maloo, there are no logs other than console logs to look at. The console logs do not provide any information on why the test is failing.

Recent failures of this test are at: |
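To look at the failure outside of autotest, the DoM settings from test_sanityn() can be applied to sanityn test 20 alone. The following is a minimal sketch, not a verified procedure: the function name run_dom_sanityn_20 and the RUN switch are hypothetical, the OSC and DOM values are copied from test_sanityn() above, and ONLY_REPEAT is assumed to behave as it does in the Test-Parameters line quoted later in this ticket. It assumes a configured Lustre client and the Lustre tests directory as the working directory.

```shell
# Sketch only: rerun sanityn test 20 with the DoM settings from test_sanityn().
# run_dom_sanityn_20 and RUN are hypothetical names used for illustration.
run_dom_sanityn_20() {
	repeat=${1:-50}	# number of repetitions; 50 is an arbitrary default

	if [ "${RUN:-no}" = yes ]; then
		# Needs a mounted Lustre client and sanityn.sh in the current directory
		ONLY=20 ONLY_REPEAT="$repeat" OSC=mdc DOM=yes bash sanityn.sh
	else
		# Dry run: only show the command that would be executed
		echo "ONLY=20 ONLY_REPEAT=$repeat OSC=mdc DOM=yes bash sanityn.sh"
	fi
}

run_dom_sanityn_20 50
```

By default the sketch only prints the command; setting RUN=yes in the environment executes it.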
| Comments |
| Comment by Peter Jones [ 09/Jul/20 ] |
|
Mike,

This failure is happening regularly - could you please investigate?

Peter |
| Comment by Mikhail Pershin [ 13/Jul/20 ] |
|
I think it is a result of |
| Comment by Gerrit Updater [ 16/Jul/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39401 |
| Comment by Mikhail Pershin [ 16/Jul/20 ] |
|
The problem is part of
I see two possible fixes here. The first one is implemented in the patch: revert commit 02e766f5ed from
The second way is to restore the dropped code with truncate_inode_pages_range() in llite when the lock is being canceled. This needs more work to get the layout DOM data from LOV first, and it still requires a call to MDC to handle the LVB problems described in
The only possible drawback with the first approach is a performance drop, the |
| Comment by Gerrit Updater [ 17/Jul/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39412 |
| Comment by Mikhail Pershin [ 17/Jul/20 ] |
|
Alternative patch that keeps the performance improvement from |
| Comment by Alexey Lyashkov [ 20/Jul/20 ] |
|
I tested a patch submitted by Mike, and I don't see any performance regressions with open. createmany has a simple modification to open files as O_RDONLY, and the number of open descriptors is adjusted so that all files can be opened without closing any. |
| Comment by Andreas Dilger [ 29/Jul/20 ] |
|
There were a number of patches that landed on June 28th that are likely the cause of this problem:

https://review.whamcloud.com/39134 "LU-12678 socklnd: don't fall-back to tcp_sendpage." [lnet] [trivial]
https://review.whamcloud.com/39127 "LU-12678 lnet: Fix some out-of-date comments." [lnet] [trivial]
https://review.whamcloud.com/39122 "LU-12678 o2iblnd: allocate init_qp_attr on stack." [lnet] [trivial]
https://review.whamcloud.com/39117 "LU-9859 libcfs: fold cfs_tracefile_*_arch into their only callers." [libcfs] [trivial]
https://review.whamcloud.com/38985 "LU-930 misc: update URLs in README" [trivial]
https://review.whamcloud.com/38981 "LU-9859 libcfs: move tgt_descs to standard Linux bitmaps."
https://review.whamcloud.com/38941 "LU-13595 scripts: Add a debug option to lustre_rmmod" [trivial]
https://review.whamcloud.com/38743 "LU-13566 socklnd: fix local interface binding" [lnet]
https://review.whamcloud.com/38181 "LU-13437 mdt: rename misses remote LOOKUP lock revoke" **
https://review.whamcloud.com/37756 "LU-930 doc: update James Simmons contact info" [trivial]
https://review.whamcloud.com/37567 "LU-13180 osc: disable ext merging for rdma only pages and non-rdma" ****
https://review.whamcloud.com/36707 "LU-8130 lu_object: convert lu_object cache to rhashtable" **
https://review.whamcloud.com/33616 "LU-8130 ptlrpc: convert conn_hash to rhashtable"
https://review.whamcloud.com/39135 "LU-9679 obdclass: remove init to 0 from lustre_init_lsi()" ** [trivial]
https://review.whamcloud.com/38580 "LU-13525 sec: better struct sepol_downcall_data"
https://review.whamcloud.com/37969 "LU-13365 ldlm: check slv and limit before updating" ****
https://review.whamcloud.com/37607 "LU-9679 osc: simplify osc_extent_find()" ****

Most of these seem unrelated to this failure, based on the description, but since several of them were marked trivial it is possible that they introduced an error in the background.
The most probable sources of this regression are marked with "****"; a few less likely ones are marked with "**". Pushing a patch to revert those changes, with "Test-Parameters: trivial testlist=sanity-dom env=ONLY=20,ONLY_REPEAT=50", would show whether the test is still failing or whether the revert has resolved the problem. |
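As an illustration of where the Test-Parameters line would go, the sketch below prints a commit message for a test-only revert patch. Only the Test-Parameters line is taken from the comment above; the function name revert_test_commit_msg, the subject line, and the body wording are hypothetical.

```shell
# Sketch only: commit message for a test-only revert patch.
# The subject and body text are made up for illustration; the
# Test-Parameters line is copied from the comment above.
revert_test_commit_msg() {
	cat <<'EOF'
LU-13759 tests: revert suspect patch for testing

Test-only patch, not intended for landing.

Test-Parameters: trivial testlist=sanity-dom env=ONLY=20,ONLY_REPEAT=50
EOF
}

revert_test_commit_msg
```

One possible way to use it, assuming a git checkout with the suspect commit already reverted but not yet committed (for example with git revert --no-commit), is to pipe the output into git commit -s -F - and then add whatever trailers the review system requires.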
| Comment by James Nunez (Inactive) [ 29/Jul/20 ] |
|
The following patch is not a fix for this issue, but should be used if we can't find a solution for this issue in a timely manner. James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39534 |
| Comment by Mikhail Pershin [ 29/Jul/20 ] |
|
James, I expect that https://review.whamcloud.com/39401 is the solution |
| Comment by Andreas Dilger [ 30/Jul/20 ] |
|
Mike, if https://review.whamcloud.com/39401 is the solution to this problem, it would be good to add a Test-Parameters: line as I put in my previous comment to verify it fixes this issue. What is still confusing to me is why this test started failing on June 28th, when the code being fixed by 39401 is much older than that? |
| Comment by Gerrit Updater [ 30/Jul/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39540 |
| Comment by Mikhail Pershin [ 30/Jul/20 ] |
|
Andreas, I've just pushed a separate patch just for testing. Meanwhile, the test was also failing before June 28th, as far as I can see. If I am right about the reason, then the problem was introduced by https://review.whamcloud.com/#/c/34858/, which started using MDC code to flush DoM data. A Maloo search shows the first failures on 2019-11-21, then a few in February-March, and then a growing number of failures until now. I am not sure why the frequency is increasing, and I agree that there could be another trigger for this, or even another root cause |
| Comment by Gerrit Updater [ 30/Jul/20 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39549 |
| Comment by Gerrit Updater [ 13/Aug/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39401/ |
| Comment by Gerrit Updater [ 13/Aug/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39540/ |
| Comment by Gerrit Updater [ 19/Oct/20 ] |
|
Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40302 |
| Comment by Peter Jones [ 20/Oct/20 ] |
|
Seems to be fixed |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40302/ |