sanity-dom sanityn_test_20 fails with '1 page left in cache after lock cancel'

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.14.0, Lustre 2.12.7
    • Lustre 2.14.0
    • DNE
    • 3

    Description

      sanity-dom sanityn test_20 fails with '1 page left in cache after lock cancel'. This test started failing on 28 June 2020 and fails only in DNE testing, that is, in review-dne-part-4 and review-dne-zfs-part-4.

      sanity-dom runs several sanityn.sh tests with DOM enabled:

       test_sanityn()
       {
               # XXX: to fix 60
               ONLY="1 2 4 5 6 7 8 9 10 11 12 14 17 19 20 23 27 39 51a 51c 51d" \
                       OSC="mdc" DOM="yes" bash sanityn.sh

               return 0
       }
       run_test sanityn "Run sanityn with Data-on-MDT files"
      

      and it is actually sanityn test 20 that we see fail here.

      There are a couple of problems:
      1. sanityn test 20 fails when DOM="yes" is set.
      2. When this test fails, sanity-dom is not marked as failed, or at least not in a way that Maloo recognizes, so the failure is silent.

      This ticket deals with sanity-dom’s sanityn test 20 failure. I’ll open a different ticket for the sanity-dom failures not getting recognized as failures.
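
      The silent failure described in item 2 can be illustrated with a minimal sketch. It assumes (this ticket does not confirm it) that the unconditional "return 0" in test_sanityn() is what hides the sub-suite status. The names run_subsuite, wrapper_masking, and wrapper_propagating are hypothetical, and run_subsuite only stands in for the "bash sanityn.sh" invocation.

```shell
#!/bin/bash
# Hypothetical stand-in for "bash sanityn.sh": reports one failed test.
run_subsuite() {
        echo "sanityn test_20: @@@@@@ FAIL: 1 page left in cache after lock cancel"
        return 1
}

# Same shape as test_sanityn() in this ticket: the sub-suite status is discarded.
wrapper_masking() {
        run_subsuite
        return 0
}

# Assumed alternative (not the actual fix): hand the status back to the caller.
wrapper_propagating() {
        local rc=0
        run_subsuite || rc=$?
        return $rc
}

masking_rc=0
wrapper_masking >/dev/null || masking_rc=$?
propagating_rc=0
wrapper_propagating >/dev/null || propagating_rc=$?

echo "masking wrapper rc=$masking_rc"          # prints: masking wrapper rc=0
echo "propagating wrapper rc=$propagating_rc"  # prints: propagating wrapper rc=1
```

      With the masking wrapper the caller sees success even though the sub-suite reported a failure, which matches the symptom described above. Whether the real cause is the same is left to the separate ticket.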

      For a recent failure, with logs at https://testing.whamcloud.com/test_sets/5230daaa-9cb6-4bdf-98ad-330a658a197a, the suite_log reveals nothing about the cause of the failure:

      == sanityn test 20: test extra readahead page left in cache ========================================== 09:32:02 (1594114322)
      striped dir -i0 -c2 -H fnv_1a_64 /mnt/lustre/d20
       sanityn test_20: @@@@@@ FAIL: 1 page left in cache after lock cancel 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6167:error()
        = sanityn.sh:600:test_20()
      

      Since the failure is not recognized as a failure by Maloo, there are no logs other than console logs to look at. The console logs do not provide any information on why the test is failing.
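
      Because Maloo does not flag these runs, one way to spot them is to search a saved suite or console log for the "@@@@@@ FAIL:" marker that appears in the excerpt above. The following is a minimal sketch; list_failures is a hypothetical helper name, and the embedded sample only repeats part of the log excerpt from this ticket.

```shell
#!/bin/bash
# Hypothetical helper: print every test-framework failure line read from stdin.
# It relies only on the "@@@@@@ FAIL:" marker visible in the log excerpt above.
list_failures() {
        grep -F '@@@@@@ FAIL:' | sed -e 's/^[[:space:]]*//'
}

# Sample input copied from the suite_log excerpt in this ticket.
sample_log='== sanityn test 20: test extra readahead page left in cache ==
striped dir -i0 -c2 -H fnv_1a_64 /mnt/lustre/d20
 sanityn test_20: @@@@@@ FAIL: 1 page left in cache after lock cancel
  Trace dump:
  = sanityn.sh:600:test_20()'

failures=$(printf '%s\n' "$sample_log" | list_failures)
echo "$failures"
```

      For a locally saved log, the helper could be used as "list_failures < suite_log", where suite_log is whatever file name the log was saved under.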

      Recent failures of this test are at:
      https://testing.whamcloud.com/test_sets/61841ecb-57f6-4c0f-b563-01eae76405f2
      https://testing.whamcloud.com/test_sets/88646434-24d8-41fc-81cc-43d19e862c07

          Activity

            [LU-13759] sanity-dom sanityn_test_20 fails with '1 page left in cache after lock cancel'

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40302/
            Subject: LU-13759 dom: lock cancel to drop pages
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 63b0c8f28dbd8513774219b8802370a638668811

            pjones Peter Jones added a comment -

            Seems to be fixed

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40302
            Subject: LU-13759 dom: lock cancel to drop pages
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 59daace04573950e436385020c565399cae08c9e

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39540/
            Subject: LU-13759 test: make sanityn test_20 repeatable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 910ed44d1f3844ae3f76a3594dbd1a09b5892643

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39401/
            Subject: LU-13759 dom: lock cancel to drop pages
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e95eca236471cf23083ef281ef204a5920e4db9b

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39549
            Subject: LU-13759 tests: debug patch not for review
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 99685a7c88a5be791bdf452f1c679808d8394502

            tappro Mikhail Pershin added a comment - edited

            Andreas, I've just pushed a separate patch just for testing. Meanwhile, as far as I can see, the test was also failing before June 28th. If I am right about the reason, the problem was introduced by https://review.whamcloud.com/#/c/34858/, which started using MDC code to flush DoM data.

            A Maloo search shows the first failures on 2019-11-21, then a few in February-March, and then a growing number of failures until now. I am not sure why the frequency is increasing, and I agree that there may be another trigger for this, or even another root cause.

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39540
            Subject: LU-13759 test: test sanityn 20
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0dcb8836a1ec5c8ecab97ad47afdb63ce4856ef2

            adilger Andreas Dilger added a comment -

            Mike, if https://review.whamcloud.com/39401 is the solution to this problem, it would be good to add a Test-Parameters: line as I put in my previous comment to verify it fixes this issue.

            What is still confusing to me is why this test started failing on June 28th, when the code being fixed by 39401 is much older than that?

            tappro Mikhail Pershin added a comment -

            James, I expect that https://review.whamcloud.com/39401 is the solution.

            People

              tappro Mikhail Pershin
              jamesanunez James Nunez (Inactive)