sanity test_160g fails for DNE with ''mds2: User cl9 still registered''

    Description

      test_160g failed with the following error:

      'mds2: User cl9 still registered'
      

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d2eee862-8ad3-11e8-9e83-52540065bddc

      This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

      The patch for LU-10734, https://review.whamcloud.com/#/c/31604/, recently landed on master. It modifies sanity test 160g and removes that test from the ALWAYS_EXCEPT list. There seems to be an issue with the test, since it fails in DNE testing when there are more than two MDSs.

      The MDS console logs for the test session mentioned above show the following. On the first MDS, with MDT0 and MDT2, we can clearly see that the changelog user cl9 is deregistered:

      [ 6260.597934] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
      [ 6260.922657] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0002.changelog_users
      [ 6264.392672] Lustre: 8849:0:(mdd_trans.c:187:mdd_chlg_garbage_collect()) lustre-MDD0000: Force deregister of ChangeLog user cl9 idle with more than 4 unprocessed records
      [ 6264.600644] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
      [ 6264.928866] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
      [ 6265.932195] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
      [ 6266.259457] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
      [ 6266.586909] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.changelog_users
      

      On the other MDS, with MDT1 and MDT3, we don't see the same user deregistered. In fact, we don't see any users deregistered:

      [ 6261.587345] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
      [ 6261.913956] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
      [ 6262.237009] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0003.changelog_users
      [ 6265.271455] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
      [ 6265.602819] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread
      [ 6266.927027] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
      [ 6267.246822] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0001.changelog_users
      [ 6267.685450] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_160g: @@@@@@ FAIL: mds2: User cl9 still registered 
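
The comparison between the two console logs above can be repeated mechanically. The following is a minimal sketch, assuming the console log is available as plain text and that forced deregistrations always use the message wording quoted in this ticket. The function name `forced_deregistrations` is hypothetical and not part of the Lustre test framework.

```python
import re

# Pattern based only on the console message quoted in this ticket:
#   "<device>: Force deregister of ChangeLog user <id> idle with ..."
FORCE_DEREG = re.compile(
    r"(?P<device>\S+): Force deregister of ChangeLog user (?P<user>cl\d+)"
)


def forced_deregistrations(console_log):
    """Return (device, user) pairs for each forced deregistration found."""
    return [
        (m.group("device"), m.group("user"))
        for m in FORCE_DEREG.finditer(console_log)
    ]


# Lines copied from the two MDS console logs in the description.
first_mds = (
    "[ 6264.392672] Lustre: 8849:0:(mdd_trans.c:187:mdd_chlg_garbage_collect()) "
    "lustre-MDD0000: Force deregister of ChangeLog user cl9 idle with more "
    "than 4 unprocessed records\n"
    "[ 6264.600644] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread\n"
)
second_mds = (
    "[ 6265.271455] Lustre: DEBUG MARKER: ps -e -o comm= | grep chlg_gc_thread\n"
    "[ 6266.927027] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n "
    "mdd.lustre-MDT0001.changelog_users\n"
)

print(forced_deregistrations(first_mds))
print(forced_deregistrations(second_mds))
```

An empty result for an MDS means no forced deregistration was logged there, which is the symptom reported here for the second MDS.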
      
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_160g - 'mds2: User cl9 still registered'

          Activity

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34297/
            Subject: LU-11161 tests: start running sanity 160g again
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 5eb52e556975def830fbe0a8c323bff09690b16a

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34297
            Subject: LU-11161 tests: start running sanity 160g again
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 6455dce5440c587aa495acf27226c1e16144ff74

            James Nunez (Inactive) added a comment:

            Landed for 2.13.0

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33994/
            Subject: LU-11161 tests: start running sanity 160g again
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 22676740969314b1b08a31c24e5ebc4c403e08f2

            James Nunez (Inactive) added a comment:

            I think the issue with this test is that we are not exceeding changelog_max_idle_indexes on each MDT; we were only meeting the threshold.

            In my testing, if I reduce changelog_max_idle_indexes by 1 or write one more file per MDT, the test passes in a DNE environment. I'll update patch https://review.whamcloud.com/33994 to reflect this.
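
The difference between meeting and exceeding the threshold can be shown with a small arithmetic model. This is only a sketch: the strict "greater than" comparison is inferred from the console message ("idle with more than 4 unprocessed records") and from the comment above, not taken from the mdd source, and the function and parameter names are hypothetical.

```python
def exceeds_idle_threshold(current_index, user_index, max_idle_indexes):
    """Model of the eviction check: the backlog must be strictly greater
    than the threshold (assumption based on the log wording "more than")."""
    backlog = current_index - user_index  # records the user has not consumed
    return backlog > max_idle_indexes


# A user that has consumed nothing on an MDT holding exactly 4 records.
meets_only = exceeds_idle_threshold(current_index=4, user_index=0,
                                    max_idle_indexes=4)

# The two workarounds described in the comment above:
lower_threshold = exceeds_idle_threshold(current_index=4, user_index=0,
                                         max_idle_indexes=3)
one_more_record = exceeds_idle_threshold(current_index=5, user_index=0,
                                         max_idle_indexes=4)

print(meets_only, lower_threshold, one_more_record)
```

In this model, a backlog equal to the threshold leaves the user registered, while either lowering the threshold by 1 or adding one more record per MDT makes the check succeed.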

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33994
            Subject: LU-11161 tests: start running sanity 160g again
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9ea747d65d84bb21c5d1f75155cf308529ca7f8a

            Andreas Dilger added a comment:

            I think this is a problem with the test script: it is not causing the user to be evicted on the other MDTs when the test expects this.

            What should be happening is that some changelog records are created on all of the MDTs (by creating some files or anything else that generates a changelog entry), the gc timeout is reduced to some short interval, one user consumes the pending records, the test sleeps longer than the interval, some new records are created, and then the idle user that did not consume the records is evicted.

            This is happening correctly on one MDT, but not on the others. It may be that no new records are created on those MDTs, or that we didn't wait long enough there.
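
The expected sequence described above can be sketched as a toy per-MDT model. It assumes, as the comment suggests, that an idle user is only evicted when a new record arrives after the idle interval has passed. The class `MdtChangelogModel`, its methods, the user names, and the time values are all hypothetical and do not reflect the actual mdd implementation.

```python
class MdtChangelogModel:
    """Toy model of one MDT's changelog users (not the real mdd code)."""

    def __init__(self, name, max_idle_time):
        self.name = name
        self.max_idle_time = max_idle_time
        self.users = {}  # user id -> time the user last consumed records

    def register(self, user, now):
        self.users[user] = now

    def consume(self, user, now):
        self.users[user] = now

    def log_record(self, now):
        """A new record arrives; assumed to be the only point where the
        garbage-collection check runs. Returns the evicted users."""
        evicted = [u for u, last in self.users.items()
                   if now - last > self.max_idle_time]
        for u in evicted:
            del self.users[u]
        return evicted


# Two MDTs with a short idle interval of 2 time units.
mdt_a = MdtChangelogModel("mdt_a", max_idle_time=2)
mdt_b = MdtChangelogModel("mdt_b", max_idle_time=2)
for mdt in (mdt_a, mdt_b):
    mdt.register("active_user", now=0)
    mdt.register("idle_user", now=0)
    mdt.consume("active_user", now=5)  # only one user consumes records

# After waiting longer than the interval, only mdt_a gets a new record.
evicted_a = mdt_a.log_record(now=6)

print(evicted_a, sorted(mdt_a.users), sorted(mdt_b.users))
```

In this model the idle user stays registered on `mdt_b` simply because no new record ever triggers the check there, which matches one of the two possible explanations given in the comment.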

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32844/
            Subject: LU-11161 tests: stop running sanity test 160g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7955e2c62e7c97c2e56e1bfc8d7598f2e80a4e52

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32844
            Subject: LU-11161 tests: stop running sanity test 160g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3d3cd77e61a0906cf71cdb2a94867e27a70d4be2

            People

              jamesanunez James Nunez (Inactive)
              maloo Maloo