  1. Lustre
  2. LU-9306

sanity-hsm test 24d is failing with 'request on 0x200000405:0x24:0x0 is not SUCCEED on mds1'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.10.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sanity-hsm test_24d is failing. According to the test log, the test waits more than 200 seconds for a status update that never arrives:

      CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      CMD: onyx-39vm7 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200000405:0x24:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
      Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
       sanity-hsm test_24d: @@@@@@ FAIL: request on 0x200000405:0x24:0x0 is not SUCCEED on mds1 
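
The wait that times out in the log above is a poll-until-match loop. A minimal C sketch of that pattern follows. It is not the sanity-hsm helper itself (the test framework is shell), and `status_fn`, `demo_wait_update`, and `demo_fake_status` are hypothetical names introduced only for illustration. The fake status source stands in for the `lctl get_param ... hsm.actions` pipeline.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical callback type: returns the current status string. */
typedef const char *(*status_fn)(void *ctx);

/*
 * Poll get_status() up to max_polls times, sleeping interval_s seconds
 * between polls. Returns 0 as soon as the status equals `wanted`, or
 * -ETIMEDOUT if it never does.
 */
static int demo_wait_update(status_fn get_status, void *ctx,
			    const char *wanted, unsigned int max_polls,
			    unsigned int interval_s)
{
	unsigned int i;

	for (i = 0; i < max_polls; i++) {
		if (strcmp(get_status(ctx), wanted) == 0)
			return 0;
		/* Do not sleep after the final, failed poll. */
		if (i + 1 < max_polls)
			sleep(interval_s);
	}
	return -ETIMEDOUT;
}

/*
 * Fake status source for illustration only: reports STARTED while the
 * counter behind ctx is positive (decrementing it), then SUCCEED.
 */
static const char *demo_fake_status(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return "STARTED";
	}
	return "SUCCEED";
}
```

A request that stays in STARTED for longer than the polling budget, as in this ticket, makes the loop return `-ETIMEDOUT`, which corresponds to the "Update not seen after 200s" message.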
      

      There is nothing obviously wrong in the console logs for any of the nodes.

      The copytool_log for this test is nearly empty and doesn’t provide any information on what is causing this problem. The full copytool_log for this test is:

      1491012689.288932 lhsmtool_posix[24069]: action=0 src=(null) dst=(null) mount_point=/mnt/lustre3
      1491012689.334255 lhsmtool_posix[24070]: waiting for message from kernel
      exiting: Terminated
      

      This test failure could be leading to a cascade of failures. After test 24d fails, the following tests also fail: 24e, 24f, 25b, 26, 27b, 28, 29b, 29c, 30b, 30c, 31b, and many more. I don’t know whether all of these failures are related, but we should fix the first failing test before looking at the rest.

      So far, I’ve only seen this test fail for review-dne-part-2, so the issue may be DNE-related.

      This test started to fail on the master branch on 2017-03-25 and has failed about 19 times since then. The patch for LU-8911, https://review.whamcloud.com/#/c/24185/, is the last patch that made modifications to this test and sanity-hsm.

      Here are links to some of the failed test logs:
      2017-04-06 - https://testing.hpdd.intel.com/test_sets/81096390-1ae7-11e7-9073-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/ad0ce212-1a3f-11e7-9de9-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/28ab074e-19ed-11e7-b742-5254006e85c2
      2017-04-05 - https://testing.hpdd.intel.com/test_sets/2bd0287a-19cd-11e7-8920-5254006e85c2
      2017-04-04 - https://testing.hpdd.intel.com/test_sets/550c4e1a-1952-11e7-9de9-5254006e85c2
      2017-04-03 - https://testing.hpdd.intel.com/test_sets/d986e31e-18c9-11e7-8920-5254006e85c2

        Activity

          jhammond John Hammond added a comment -

          Peter,

          > Am I right in thinking that the patch just landed to master will address the blocking aspect of this ticket and that any other work discussed will be tracked elsewhere?

          Correct.

          bougetq Quentin Bouget (Inactive) added a comment - - edited

          I will open a new ticket to track the work on trap registration in sanity-hsm.

          Just from the commit messages of the other patches (disable sanity-hsm test 24d and more debug info for hsm test_24d), I don't think they are needed anymore.

          On the other hand, the patch that adds debug info (and was merged) should probably be reverted once we are confident enough the issue was solved. I don't know how this revert should be tracked though.

          pjones Peter Jones added a comment -

          Am I right in thinking that the patch just landed to master will address the blocking aspect of this ticket and that any other work discussed will be tracked elsewhere?


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26883/
          Subject: LU-9306 kuc: initialize kkuc_groups at module init time
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 37bbe9a1bf33b77d3f239ea69ce052e99f09308d

          gerrit Gerrit Updater added a comment -

          John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/26883
          Subject: LU-9306 kuc: initialize kkuc_groups and module init time
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 3547e2146e14f7c112dd002694b3df3bc9d95a35
          jhammond John Hammond added a comment - - edited

          This is another instance of LU-9038. The copytool (CT) had not yet registered when the request was sent, so the request is not delivered until it is reissued, which takes some time. The fix from LU-9038 (https://review.whamcloud.com/25050) should also check for an empty list, since otherwise the message is dropped:

          diff --git a/lustre/obdclass/kernelcomm.c b/lustre/obdclass/kernelcomm.c
          index 0787f46..cb52660 100644
          --- a/lustre/obdclass/kernelcomm.c
          +++ b/lustre/obdclass/kernelcomm.c
          @@ -194,6 +194,14 @@ int libcfs_kkuc_group_put(int group, void *payload)
                  ENTRY;
           
                  down_write(&kg_sem);
          +
          +       if (unlikely(kkuc_groups[group].next == NULL) ||
          +           unlikely(OBD_FAIL_CHECK(OBD_FAIL_MDS_HSM_CT_REGISTER_NET))) {
          +               /* no agent have fully registered, CDT will retry */
          +               up_write(&kg_sem);
          +               RETURN(-EAGAIN);
          +       }
          +
                  list_for_each_entry(reg, &kkuc_groups[group], kr_chain) {
                          if (reg->kr_fp != NULL) {
                                  rc = libcfs_kkuc_msg_put(reg->kr_fp, payload);
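
To illustrate the check in the diff above, here is a minimal user-space sketch, assuming that a group's list head stays zero-filled until something initializes it. `struct list_head` is re-declared in simplified form, and `DEMO_NGROUPS`, `demo_groups`, `demo_list_init`, `demo_list_add_tail`, and `demo_group_put` are hypothetical stand-ins, not Lustre functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

#define DEMO_NGROUPS 4	/* hypothetical group count */

/* Static storage is zero-filled: every head starts with next == NULL. */
static struct list_head demo_groups[DEMO_NGROUPS];

/* Turn the head into an empty circular list (it points to itself). */
static void demo_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link a new entry in just before the head, i.e. at the tail. */
static void demo_list_add_tail(struct list_head *entry,
			       struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/*
 * Return the number of receivers a message would reach, or -EAGAIN
 * when the head was never initialized so that the caller can retry.
 */
static int demo_group_put(int group)
{
	struct list_head *pos;
	int delivered = 0;

	assert(group >= 0 && group < DEMO_NGROUPS);

	/* Walking a zero-filled head would dereference NULL. */
	if (demo_groups[group].next == NULL)
		return -EAGAIN;

	for (pos = demo_groups[group].next; pos != &demo_groups[group];
	     pos = pos->next)
		delivered++;

	return delivered;
}
```

In this sketch the `next == NULL` test only distinguishes a never-initialized head from an initialized one. A head that has been initialized but has no entries passes the check, and the walk simply reaches no receivers.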
          

          gerrit Gerrit Updater added a comment -

          James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/26877
          Subject: LU-9306 test: disable sanity-hsm test 24d
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 135809281665710c5a6c32db45434c0dd8e9ba05

          gerrit Gerrit Updater added a comment -

          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26850
          Subject: LU-9306 tests: more debug info for hsm test_24d
          Project: fs/lustre-release
          Branch: pfl
          Current Patch Set: 1
          Commit: 2a7f9003fcb2002e3d2163e8b8e8f628491b1cec

          jamesanunez James Nunez (Inactive) added a comment -

          I'm reopening this ticket because it's not clear to me that the root cause of the sanity-hsm test 24 failures is known or fixed. I'd also still like to see Quentin's patch https://review.whamcloud.com/#/c/26734/ land so that this test cleans up properly when it encounters an error.
          pjones Peter Jones added a comment -

          Landed for 2.10


          People

            Assignee: jhammond John Hammond
            Reporter: jamesanunez James Nunez (Inactive)