
recovery-small test_141: mgc lost locks

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.17.0, Lustre 2.16.0
    • Severity: 3

Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c7030573-01fa-4889-ae5f-64824986f779

      test_141 failed with the following error:

      mgc lost locks (12 != 4)
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_lustre.x86_64

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_141 - mgc lost locks (12 != 4)

Activity

adilger Andreas Dilger added a comment -

The majority of failure cases are (confusingly) either "oldc 0 != newc 23" or "oldc 23 != newc 0". This makes me think that the delay between "cancel_lru_locks MGC" and "do_facet ost1 ... lock_count" is racy: each read randomly returns "23" or "0", so sometimes the two counts match (both "0" or both "23") and sometimes they don't.

It isn't clear to me why the DLM locks are cancelled before being counted on the OST before it is restarted. If that is to check that the client is refreshing its locks, then it would make sense to leave a gap (e.g. 3-5s) so that the client refreshes the MGC locks before they are counted. Afterward, the test should use wait_update() until the lock counts match before declaring a failure.
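A minimal sketch of that suggested restructuring, assuming the standard test-framework helpers (cancel_lru_locks, do_facet, wait_update, facet_active_host, error). The mgc_lock_count helper, the 5s grace period, the 90s wait limit, and the assumption of a single MGC namespace on ost1 are all illustrative, not taken from the actual fix in change 55591:

    # Hypothetical helper: read the MGC DLM lock count on ost1; assumes
    # the glob matches exactly one MGC namespace on that node.
    mgc_lock_count() {
        do_facet ost1 "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'"
    }

    cancel_lru_locks MGC
    sleep 5    # grace period so the MGC can re-enqueue its config locks

    oldc=$(mgc_lock_count)

    # ... restart the MGS here, as test_141 does ...

    # Poll instead of sampling once: wait until the post-restart count
    # settles back to the pre-restart value before declaring failure.
    wait_update $(facet_active_host ost1) \
        "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'" \
        "$oldc" 90 ||
        error "mgc lost locks ($(mgc_lock_count) != $oldc)"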
yujian Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/631b2da4-fd5d-41b6-ab19-f1838b9c4ac5
yujian Jian Yu added a comment -

The failure occurred consistently in failover-part-1 test sessions.

sebastien Sebastien Buisson added a comment -

From the recent test failures, it looks like the MGC lock count is quite erratic. Sometimes the lock count before the MGS restart is greater than after the restart, and sometimes it is the other way around. Sometimes the MGC lock count before the MGS restart is even 0, which is surprising for a running Lustre file system.
See the error messages on this test results page:
https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=be891259-db0f-49c1-bd23-eafc220a3fc8&start_date=2024-09-27&end_date=2024-10-03&source=sub_tests#redirect

I initially introduced recovery-small test_141, but it does not seem that the problem is in the test itself. Unfortunately, I am not an ldlm lock specialist.
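For reference, the counter the test compares can be read by hand on the OSS; a minimal example, where the lock_count value and the NID embedded in the namespace name are illustrative only:

    # Read the DLM lock count held by this node's MGC; the namespace name
    # embeds the MGS NID, hence the glob.
    lctl get_param 'ldlm.namespaces.MGC*.lock_count'
    # e.g. ldlm.namespaces.MGC192.168.1.10@tcp.lock_count=23

    # cancel_lru_locks in the test framework boils down to this: drop all
    # unused locks from the LRU. The count then reads 0 until the MGC
    # re-enqueues its config locks, which is the window where the
    # 0-vs-23 race described above can appear.
    lctl set_param 'ldlm.namespaces.MGC*.lru_size=clear'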
yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/91343880-9495-40dd-b77f-895ac8f90176

adilger Andreas Dilger added a comment - This is still failing regularly: https://testing.whamcloud.com/search?horizon=2332800&status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=be891259-db0f-49c1-bd23-eafc220a3fc8&source=sub_tests#redirect
pjones Peter Jones added a comment -

Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55591/
            Subject: LU-17165 tests: fix recovery-small test_141
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 755c9c0f78a777b342a42a74aa8fb93d04e7cad8

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55591/ Subject: LU-17165 tests: fix recovery-small test_141 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 755c9c0f78a777b342a42a74aa8fb93d04e7cad8

            "Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55591
            Subject: LU-17165 tests: fix recovery-small test_141
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c7b45f61bda6d8efed4ec7ccda79e2d11aad823d

            gerrit Gerrit Updater added a comment - "Sebastien Buisson <sbuisson@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55591 Subject: LU-17165 tests: fix recovery-small test_141 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: c7b45f61bda6d8efed4ec7ccda79e2d11aad823d

adilger Andreas Dilger added a comment -

This test was added in patch https://review.whamcloud.com/37344 "LU-13116 mgc: do not lose sptlrpc config lock". It isn't clear whether there is a real problem, or whether some unrelated MGC lock is being canceled during the test.

People

    Assignee: Core Lustre Triage
    Reporter: Maloo
    Votes: 0
    Watchers: 6

Dates

    Created:
    Updated: