[LU-9306] sanity-hsm test 24d is failing with 'request on 0x200000405:0x24:0x0 is not SUCCEED on mds1'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sanity-hsm test_24d is failing. From the test log, the test polls the status of the ARCHIVE request for 200 seconds, but the status never changes from 'STARTED' to 'SUCCEED':

      CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
       sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200000405:0x24:0x0 is not SUCCEED on mds1 
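
The wait that times out here is a poll-until-match loop. As a minimal sketch of that pattern (not the sanity-hsm code itself), the hypothetical `wait_for_status` helper below re-runs a status command until it prints the wanted value or a timeout expires. The helper name, its arguments, and `fake_status` are assumptions made for illustration; the "Update not seen" message mirrors the log line above.

```shell
#!/bin/bash
# Hypothetical helper, not the sanity-hsm implementation: poll a command
# until it prints the wanted value or the timeout (in seconds) expires.
wait_for_status() {
	local cmd="$1" wanted="$2" timeout="$3" interval="${4:-1}"
	local elapsed=0 got=""

	while :; do
		got=$(eval "$cmd")
		if [ "$got" = "$wanted" ]; then
			echo "Updated after ${elapsed}s: wanted '$wanted' got '$got'"
			return 0
		fi
		# Give up once the full timeout has been spent waiting.
		if [ "$elapsed" -ge "$timeout" ]; then
			break
		fi
		sleep "$interval"
		elapsed=$((elapsed + interval))
	done
	echo "Update not seen after ${timeout}s: wanted '$wanted' got '$got'"
	return 1
}

# Stand-in for the real status query, so the sketch runs without Lustre.
# On an MDS, the pipeline from the log above could be wrapped the same way:
#   lctl get_param -n mdt.lustre-MDT0000.hsm.actions |
#       awk '/0x200000405:0x24:0x0.*action=ARCHIVE/ {print $13}' | cut -f2 -d=
fake_status() { echo STARTED; }

wait_for_status fake_status SUCCEED 2 1 || echo "request did not complete"
```

Because `fake_status` always reports STARTED, the call gives up after the 2-second timeout, which is the same outcome the test log shows after 200 seconds.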
      

      There is nothing obviously wrong in the console logs for any of the nodes.

      The copytool_log for this test is nearly empty and doesn’t provide any information on what is causing this problem. The full copytool_log for this test is:

      1491012689.288932 lhsmtool_posix[24069]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
      1491012689.334255 lhsmtool_posix[24070]: waiting for message from kernel
      exiting: Terminated
      

      This test failure could be causing a cascade of failures. After test 24d fails, the following tests also fail: 24e, 24f, 25b, 26, 27b, 28, 29b, 29c, 30b, 30c, 31b, and many more. I don’t know if all the failures are related, but we should clean up the first test that’s failing.

      So far, I’ve only seen this test fail for review-dne-part-2, so the issue may be DNE related.

      This test started to fail on the master branch on 2017-03-25 and has failed about 19 times since then. The patch for LU-8911, https://review.whamcloud.com/#/c/24185/, is the last patch that made modifications to this test and sanity-hsm.

      Here are links to some of the failed test logs:
      2017-04-06 - https://testing.hpdd.intel.com/test_sets/81096390-1ae7-11e7-9073-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/ad0ce212-1a3f-11e7-9de9-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/28ab074e-19ed-11e7-b742-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/2bd0287a-19cd-11e7-8920-5254006e85c2
      2017-04-04 - https://testing.hpdd.intel.com/test_sets/550c4e1a-1952-11e7-9de9-5254006e85c2
      2017-04-03 - https://testing.hpdd.intel.com/test_sets/d986e31e-18c9-11e7-8920-5254006e85c2

        Activity

          gerrit Gerrit Updater added a comment -

          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26850
          Subject: LU-9306 tests: more debug info for hsm test_24d
          Project: fs/lustre-release
          Branch: pfl
          Current Patch Set: 1
          Commit: 2a7f9003fcb2002e3d2163e8b8e8f628491b1cec

          jamesanunez James Nunez (Inactive) added a comment -

          I'm reopening this ticket because it's not clear to me that the root cause of sanity-hsm test 24 failures is known/fixed and I'd still like to see Quentin's patch https://review.whamcloud.com/#/c/26734/ land so this test cleans up properly when it encounters an error.

          pjones Peter Jones added a comment -

          Landed for 2.10

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26770/
          Subject: LU-9306 tests: more debug info for hsm test_24d
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: fc7c513b4cbcc8775076f6490f2df03b52cf4051

          sguminsx Steve Guminski (Inactive) added a comment - Another on master: https://testing.hpdd.intel.com/test_sessions/991eb176-e246-4d8f-bcca-2a62fc2e179f

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26770
          Subject: LU-9306 tests: more debug info for hsm test_24d
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 8c7c03ff53e5c5f63da6e99f42c3124c4f5c2d29

          yong.fan nasf (Inactive) added a comment -

          Honestly, I cannot establish a relationship between the current sanity-hsm test_24d failure and the snapshot patches. The only possible connection is that test_24d checks HSM actions with a client mounted read-only, which might be affected by the patch "0001-LU-8900-snapshot-simulate-readonly-device.patch" (https://review.whamcloud.com/24267). But in fact, that patch has almost nothing to do with HSM. So I have to make a debug patch to collect more information.

          gerrit Gerrit Updater added a comment -

          Quentin Bouget (quentin.bouget@cea.fr) uploaded a new patch: https://review.whamcloud.com/26734
          Subject: LU-9306 tests: sanity-hsm, register traps in a better order
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: abfc0251f892862145c7576912640cce841fd7ba
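
For readers unfamiliar with why trap order matters for cleanup, here is a minimal sketch of the general shell idiom. It is not the content of the patch above, and `run_step`, the marker file, and the `ok`/`fail` argument are hypothetical names used only for illustration. The idea is to register the cleanup trap before acquiring the resource, so that a failure at any later point still triggers the cleanup.

```shell
#!/bin/bash
# Hypothetical sketch of the idiom; names are illustrative, not from sanity-hsm.
# The function body is a subshell, so the EXIT trap fires when the step ends,
# whether it passes or fails.
run_step() (
	marker="$1"
	# Register the cleanup first, before anything that can fail.
	trap 'rm -f "$marker"; echo "cleanup done"' EXIT
	# Acquire the resource (stand-in for, e.g., an extra client mount).
	: > "$marker"
	# Simulated test body: fails unless the second argument is "ok".
	[ "$2" = ok ] || { echo "step failed"; exit 1; }
	echo "step passed"
)

marker="${TMPDIR:-/tmp}/trap_demo.$$"
run_step "$marker" fail || echo "marker left behind: $([ -e "$marker" ] && echo yes || echo no)"
```

If the trap were registered only after the failing step, the marker file would be left behind and could interfere with later steps, which is the kind of knock-on effect a missing cleanup can have on subsequent tests.
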
          pjones Peter Jones added a comment -

          Fan Yong

          Could this have been caused by any of the ZFS Snapshots patches?

          Peter

          adilger Andreas Dilger added a comment -

          The first recent failure was 2017-03-25 with test https://testing.hpdd.intel.com/test_sets/82d3a2da-115f-11e7-9073-5254006e85c2 which was right after the ZFS Snapshot feature was landed on 2017-03-23. It makes sense to look at the patches that landed on that day to see if any of them could have caused this.

          adilger Andreas Dilger added a comment -

          This test has failed 25x in the past week, as often as 7x in a single day.

          People

            jhammond John Hammond
            jamesanunez James Nunez (Inactive)
            Votes: 0
            Watchers: 8
