Lustre / LU-17165

recovery-small test_141: mgc lost locks

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Versions: Lustre 2.17.0, Lustre 2.16.0

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/c7030573-01fa-4889-ae5f-64824986f779

      test_141 failed with the following error:

      mgc lost locks (12 != 4)
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/99081 - 4.18.0-477.21.1.el8_lustre.x86_64

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_141 - mgc lost locks (12 != 4)

Attachments

Issue Links

Activity


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56945/
            Subject: LU-17165 tests: stable count in recovery-small/141
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 17a3151337308979eef57c4b422cc58142e003f7

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56945/ Subject: LU-17165 tests: stable count in recovery-small/141 Project: fs/lustre-release Branch: master Current Patch Set: Commit: 17a3151337308979eef57c4b422cc58142e003f7

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56945
            Subject: LU-17165 tests: stable count in recovery-small/141
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f3e61a6eb23a749f30f7a1b16c7af72402918516

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56945 Subject: LU-17165 tests: stable count in recovery-small/141 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: f3e61a6eb23a749f30f7a1b16c7af72402918516

Andreas Dilger added a comment -

The majority of failure cases are (confusingly) either "oldc 0 != newc 23" or "oldc 23 != newc 0". This makes me think that the delay between "cancel_lru_locks MGC" and "do_facet ost1 ... lock_count" is racy: each sample randomly reads "23" or "0", so sometimes the two counts match (both "0" or both "23") and sometimes they don't.

It isn't clear to me why the DLM locks are cancelled before being counted on the OST, before it is restarted. If that is meant to check that the client refreshes its locks, then it would make sense to leave a gap (e.g. 3-5s) so that the client re-enqueues the MGC locks before they are counted, and afterward to use wait_update() until the lock counts match before considering the test a failure.
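
A minimal sketch of the wait_update()-style check described above, using the test-framework helpers do_facet, cancel_lru_locks, wait_update_facet and error; the parameter path ldlm.namespaces.MGC*.lock_count, the single-MGC-namespace assumption, the 5s gap and the 30s timeout are illustrative assumptions, not the actual change in patch 56945:

    # Sketch only: retry until the MGC lock count settles instead of
    # comparing one racy post-cancel sample (paths/timeouts are assumed).
    oldc=$(do_facet ost1 "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'")
    cancel_lru_locks MGC
    sleep 5    # assumed gap so the MGC can re-enqueue its config locks

    # wait up to 30s for the count to return to its pre-cancel value
    wait_update_facet ost1 \
        "$LCTL get_param -n 'ldlm.namespaces.MGC*.lock_count'" \
        "$oldc" 30 ||
        error "mgc lost locks: count did not return to $oldc"
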
Jian Yu added a comment - Lustre 2.16.0 RC5: https://testing.whamcloud.com/test_sets/631b2da4-fd5d-41b6-ab19-f1838b9c4ac5

Jian Yu added a comment -

The failure occurred consistently in failover-part-1 test sessions.

Sebastien Buisson added a comment -

From the recent test failures, it looks like the MGC lock count is quite erratic. Sometimes the lock count before the MGS restart is greater than the count after it, sometimes it is the other way around. And sometimes the MGC lock count before the MGS restart is even 0, which is surprising for a running Lustre file system.
See the error messages printed on this test results page:
https://testing.whamcloud.com/search?status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=be891259-db0f-49c1-bd23-eafc220a3fc8&start_date=2024-09-27&end_date=2024-10-03&source=sub_tests#redirect

I initially introduced recovery-small test_141, but the problem does not seem to be in the test itself. Unfortunately, I am not an ldlm lock specialist.
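
For context, the MGC lock count being compared here can be read on a node with lctl; the namespace name and sample value below are illustrative only (MGC namespaces are named after the MGS NID):

    # show MGC DLM lock counts (namespace name varies with the MGS NID)
    lctl get_param ldlm.namespaces.MGC*.lock_count
    # e.g. ldlm.namespaces.MGC192.168.0.1@tcp.lock_count=23
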
Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/91343880-9495-40dd-b77f-895ac8f90176

Andreas Dilger added a comment - This is still failing regularly: https://testing.whamcloud.com/search?horizon=2332800&status%5B%5D=FAIL&test_set_script_id=f36cabd0-32c3-11e0-a61c-52540025f9ae&sub_test_script_id=be891259-db0f-49c1-bd23-eafc220a3fc8&source=sub_tests#redirect
Peter Jones added a comment -

Merged for 2.16

People

    Assignee: Core Lustre Triage
    Reporter: Maloo
    Votes: 0
    Watchers: 6

Dates

    Created:
    Updated: