
[LU-10734] sanity test_160g: User cl8 still found in changelog_users

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.0
    • Affects Version/s: Lustre 2.11.0
    • Severity: 3

    Description

      sanity test_160g - User cl8 still found in changelog_users
      ^^^^^^^^^^^^^ DO NOT REMOVE LINE ABOVE ^^^^^^^^^^^^^

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run:
      https://testing.hpdd.intel.com/test_sets/5a8495f4-1bfa-11e8-a6ad-52540065bddc
      https://testing.hpdd.intel.com/test_sets/34e243bc-1be3-11e8-a7cd-52540065bddc

      test_160g failed with the following error:

      User cl8 still found in changelog_users
      

      This may be a dup of LU-9624. I can't tell if it is, so I am raising a fresh ticket;
      somebody else can decide whether it's a dup or not.
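
For context, the failing check is essentially a lookup of the deregistered user in the MDT's changelog_users list. A minimal sketch, with illustrative variable names (cl_user would be "cl8" in this run) and the usual test-framework helpers assumed, not copied from sanity.sh:

    # Hedged sketch: fail if the idle changelog user is still registered after
    # garbage collection was expected to remove it.
    do_facet mds1 $LCTL get_param -n mdd.$FSNAME-MDT0000.changelog_users |
            grep -q "$cl_user" && error "User $cl_user still found in changelog_users"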


Activity

Gerrit Updater added a comment -

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/31604/
Subject: LU-10734 tests: ensure current GC interval is over
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 821087e65882a9885964ed07d6f2a630dfb599d5

Bob Glossman (Inactive) added a comment -

This failure is blocked for now: test_160g was added to ALWAYS_EXCEPT in a patch landed to master for LU-10680. We may need to look for similar failures if and when test_160g is taken back out of ALWAYS_EXCEPT.
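
For reference, skipping a subtest in sanity.sh is done by listing its number in ALWAYS_EXCEPT near the top of the script. A rough illustration only; the real list in master contains many other entries:

    # Hedged sketch: exclude test_160g until changelog GC behaviour is settled.
    ALWAYS_EXCEPT="$SANITY_EXCEPT 160g"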

Gerrit Updater added a comment -

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/31604
Subject: LU-10734 tests: ensure current GC interval is over
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 72032b016ea8ab62cc681e72b5565ba207a6c316
Bruno Faccini (Inactive) added a comment - edited

Eh eh, after taking some time to think about it, I suspect the only regression/side effect of patch https://review.whamcloud.com/27535 ("a37134d LU-9624 tests: fix pre-DNE test exceptions/llog usage"), which we strongly suspect is the cause of these failures, is that it slightly shortened the prologue at the beginning of sanity/test_160g. The prologue may now take less than the 2-second delay configured between two garbage-collection thread runs ("changelog_min_gc_interval=2"), and the GC thread has just run in sanity/test_160f when sanity.sh is executed in full during auto-tests. This seems to be confirmed by my reproducer testing.

So a simple "sleep 2" at the beginning of sanity/test_160g should fix this problem.
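
A minimal sketch of that workaround, placed at the top of test_160g(); note that the landed patch, per its subject, waits for the current GC interval to be over, and the parameter path below is an assumption:

    # Hedged sketch: wait out the configured minimum GC interval so the GC pass
    # triggered by this test is not skipped because test_160f's pass was too recent.
    sleep 2
    # A more robust variant could read the configured interval instead of
    # hard-coding 2 seconds:
    # sleep $(do_facet mds1 $LCTL get_param -n mdd.*.changelog_min_gc_interval | head -n1)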
Peter Jones added a comment -

> It didn't fail during normal testing, but I guess SLES is not part of regular testing.

Well, it is tested regularly, but due to the round-robin system used for pre-landing review test runs, it is not guaranteed to run before a given patch lands unless people proactively request it with test parameters.

Andreas Dilger added a comment -

Note also that with patch https://review.whamcloud.com/31552 "LU-10680 mdd: disable changelog garbage collection by default", test_160f and test_160g need to be modified to set changelog_gc=1 at the start of each test, and removed from ALWAYS_EXCEPT, so that the tests run properly.
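
If test_160f/test_160g are re-enabled, turning GC back on per-test might look roughly like the following; do_facet and $LCTL come from the test framework, and the parameter path (mdd.*.changelog_gc) is an assumption to be verified:

    # Hedged sketch, intended for the start of test_160f()/test_160g(): enable
    # changelog garbage collection on the MDS, then restore the old value at the end.
    gc_old=$(do_facet mds1 $LCTL get_param -n mdd.*.changelog_gc | head -n1)
    do_facet mds1 $LCTL set_param mdd.*.changelog_gc=1
    # ... run the changelog GC checks here ...
    do_facet mds1 $LCTL set_param mdd.*.changelog_gc=$gc_old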

Andreas Dilger added a comment -

It looks like this failure relates to the landing of patch https://review.whamcloud.com/27535 "LU-9624 tests: fix pre-DNE test exceptions/llog usage". It didn't fail during normal testing, but I guess SLES is not part of regular testing.

Bruno Faccini (Inactive) added a comment -

Having a better look at the recent changes that may have introduced this regression, I think "a37134d LU-9624 tests: fix pre-DNE test exceptions/llog usage" is the more likely cause.

Hope to have more on this soon.

Mikhail Pershin added a comment -

+1 on master, all with DNE:
testing.hpdd.intel.com/test_sessions/7a5adbc7-2d4b-425a-9e71-a4674823a0df
testing.hpdd.intel.com/test_sessions/a9ae8e29-d45d-49b6-a639-a6fba84f5dfc

Bob Glossman (Inactive) added a comment -

Here is a similar failure seen on el7, not on SLES at all; proof that this problem isn't SLES-only.

https://testing.hpdd.intel.com/test_sets/b3cb95da-20dd-11e8-a4b1-52540065bddc

Bob Glossman (Inactive) added a comment -

After many similar failures, here is a SLES test run that did NOT hit the failure:
https://testing.hpdd.intel.com/test_sessions/ba2847b6-e445-448d-882d-356fca02b96e

I don't know what the difference is between runs that fail and those that don't; noting this instance in the hope that it may be of some use.

People

Assignee: Bruno Faccini (Inactive)
Reporter: Maloo
Votes: 0
Watchers: 9
