LU-9306

sanity-hsm test 24d is failing with 'request on 0x200000405:0x24:0x0 is not SUCCEED on mds1'

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.10.0
    • Lustre 2.10.0
    • None
    • 3

    Description

      sanity-hsm test_24d is failing. The test log shows that we poll for a status update for 200 seconds and never see it:

      CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
       sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200000405:0x24:0x0 is not SUCCEED on mds1 
      

      There is nothing obviously wrong in the console logs for any of the nodes.
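The polling visible in the test log above can be sketched roughly as follows. This is only an illustration, not the code sanity-hsm actually uses: `wait_for_value` and `fake_status` are hypothetical names, and `fake_status` is a stand-in for the real `lctl get_param ... | awk ... | cut` pipeline. The failure message format is copied from the log.

```bash
#!/bin/bash
# Hypothetical helper (not taken from sanity-hsm): run a command once per
# second until its output equals the wanted value or the timeout expires.
wait_for_value() {
    local wanted="$1" timeout="$2"
    shift 2
    local got="" waited=0
    while :; do
        got=$("$@")
        if [ "$got" = "$wanted" ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Same wording as the message in the test log.
    echo "Update not seen after ${timeout}s: wanted '$wanted' got '$got'" >&2
    return 1
}

# Stand-in for the real status query: reports STARTED once, then SUCCEED.
# A file holds the call count because the command runs in a subshell.
state_file=$(mktemp)
echo 0 > "$state_file"
fake_status() {
    local n
    n=$(<"$state_file")
    echo $((n + 1)) > "$state_file"
    if [ "$n" -lt 1 ]; then echo STARTED; else echo SUCCEED; fi
}

if wait_for_value SUCCEED 10 fake_status; then
    echo "request completed"
fi
rm -f "$state_file"
```

In the failing runs the status stays at STARTED for the whole timeout, so a loop of this shape falls through to the error message.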

      The copytool_log for this test is nearly empty and doesn’t provide any information on what is causing this problem. The full copytool_log for this test is:

      1491012689.288932 lhsmtool_posix[24069]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
      1491012689.334255 lhsmtool_posix[24070]: waiting for message from kernel
      exiting: Terminated
      

      This test failure could be causing a cascade of failures. After test 24d fails, the following tests also fail: 24e, 24f, 25b, 26, 27b, 28, 29b, 29c, 30b, 30c, 31b, and many more. I don’t know if all the failures are related, but we should fix the first failing test first.

      So far, I’ve only seen this test fail in review-dne-part-2, so the issue may be DNE-related.

      This test started to fail on the master branch on 2017-03-25 and has failed about 19 times since then. The patch for LU-8911, https://review.whamcloud.com/#/c/24185/, is the last patch that made modifications to this test and sanity-hsm.

      Here are links to some of the failed test logs:
      2017-04-06 - https://testing.hpdd.intel.com/test_sets/81096390-1ae7-11e7-9073-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/ad0ce212-1a3f-11e7-9de9-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/28ab074e-19ed-11e7-b742-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/2bd0287a-19cd-11e7-8920-5254006e85c2
      2017-04-04 - https://testing.hpdd.intel.com/test_sets/550c4e1a-1952-11e7-9de9-5254006e85c2
      2017-04-03 - https://testing.hpdd.intel.com/test_sets/d986e31e-18c9-11e7-8920-5254006e85c2

        Activity

          sguminsx Steve Guminski (Inactive) added a comment - Another on master: https://testing.hpdd.intel.com/test_sessions/991eb176-e246-4d8f-bcca-2a62fc2e179f

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26770
          Subject: LU-9306 tests: more debug info for hsm test_24d
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8c7c03ff53e5c5f63da6e99f42c3124c4f5c2d29

          yong.fan nasf (Inactive) added a comment -

          Honestly, I cannot establish a relationship between the current sanity-hsm test_24d failure and the snapshot patches. The only possible connection is that test_24d checks HSM actions with the client mounted read-only, which might be affected by the patch "0001-LU-8900-snapshot-simulate-readonly-device.patch" (https://review.whamcloud.com/24267). But that patch has almost nothing to do with HSM, so I will have to make a debug patch to collect more information.

          gerrit Gerrit Updater added a comment -

          Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/26734
          Subject: LU-9306 tests: sanity-hsm, register traps in a better order
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: abfc0251f892862145c7576912640cce841fd7ba

          pjones Peter Jones added a comment -

          Fan Yong

          Could this have been caused by any of the ZFS Snapshots patches?

          Peter


          adilger Andreas Dilger added a comment -

          The first recent failure was on 2017-03-25 (https://testing.hpdd.intel.com/test_sets/82d3a2da-115f-11e7-9073-5254006e85c2), right after the ZFS Snapshot feature landed on 2017-03-23. It makes sense to look at the patches that landed on that day to see if any of them could have caused this.

          adilger Andreas Dilger added a comment -

          This test has failed 25 times in the past week, as often as 7 times in a single day.

          bougetq Quentin Bouget (Inactive) added a comment -

          Thank you!

          jamesanunez James Nunez (Inactive) added a comment -

          Quentin,

          Both sanity-hsm test 9a and test 29d are skipped because they require three or more clients. I reviewed the results for these tests over the past two years. Test 9a was last run, rather than skipped, on September 9, 2015; it passed 13 times between February and September 2015. Test 29d was skipped for the whole period.

          bougetq Quentin Bouget (Inactive) added a comment -

          James,

          Could you please tell me when test_9a and test_29d were last run?

          bougetq Quentin Bouget (Inactive) added a comment -

          The cleanup is not done correctly because test_24d calls

          trap cleanup_test_24d EXIT

          before

          copytool_setup $SINGLEAGT "$MOUNT3"

          which sets its own trap on EXIT, replacing the one registered by test_24d.

          I have a function to stack traps; I am working on integrating it into sanity-hsm.

          As to why the test itself fails, I must say I still have no idea.
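The trap problem described in this comment comes from bash keeping a single handler per signal: a second `trap ... EXIT` replaces the first instead of adding to it. A minimal sketch of what a trap-stacking helper might look like follows. It is not the function from the patch: `push_exit_trap`, `run_exit_handlers`, and `EXIT_HANDLERS` are hypothetical names, and the echoed strings are placeholders for the real cleanup calls. It also assumes every caller registers through the helper, since a direct `trap ... EXIT` elsewhere would still overwrite the dispatcher.

```bash
#!/bin/bash
# Hypothetical sketch: keep EXIT handlers in an array and install one
# dispatcher as the only real EXIT trap.
EXIT_HANDLERS=()

run_exit_handlers() {
    local i
    # Run the most recently registered handler first, so that setup
    # steps are undone in reverse order.
    for ((i = ${#EXIT_HANDLERS[@]} - 1; i >= 0; i--)); do
        eval "${EXIT_HANDLERS[i]}"
    done
}

push_exit_trap() {
    EXIT_HANDLERS+=("$1")
    trap run_exit_handlers EXIT
}

# Demo in a subshell, so the handlers fire when the subshell exits.
# The echo commands stand in for the real cleanup functions.
(
    push_exit_trap 'echo "cleanup_test_24d"'
    push_exit_trap 'echo "undo copytool_setup"'
    echo "test body"
)
```

With this arrangement both handlers run when the subshell exits, the later one first, regardless of the order in which the test and its setup helper registered them.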

          People

            jhammond John Hammond
            jamesanunez James Nunez (Inactive)