
conf-sanity test_57a: @@@@@@ FAIL: OST registration from failnode should fail

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.7.0
    • Lustre 2.7.0, Lustre 2.5.4
    • 3
    • 15995

    Description

      This issue was created by maloo for Bruno Faccini <bfaccini62@gmail.com>

      Please provide additional information about the failure here.

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/637ff760-4ac1-11e4-a839-5254006e85c2.

      It seems that recent master review builds hit this failure quite frequently. A recent master change may have introduced a regression.

        Activity

          [LU-5706] conf-sanity test_57a: @@@@@@ FAIL: OST registration from failnode should fail

          utopiabound Nathaniel Clark added a comment -

          Patch landed to master.

          liwei Li Wei (Inactive) added a comment -

          The "56" in the exception list was indeed added by Autotest. Minh helped me remove that exception last Saturday. Now it looks like this failure is gone (the three most recent review-zfs runs):

          https://testing.hpdd.intel.com/test_sets/aa8077d8-5224-11e4-a79f-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/3d7c7ca8-5211-11e4-a79f-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/2a42a2de-51d0-11e4-88de-5254006e85c2

          This buys us more time to cook http://review.whamcloud.com/12236.

          utopiabound Nathaniel Clark added a comment - http://review.whamcloud.com/12236

          adilger Andreas Dilger added a comment -

          This is causing a large number of review-zfs test failures at this point (20 yesterday), so either the original patch should be reverted or this test should be disabled, since we are wasting our time. This isn't the only problem with review-zfs, but only about 15% of these test sessions are passing right now.

          liwei Li Wei (Inactive) added a comment -

          Oleg, I wasn't implying 54a to be the problem, but one has to know which was the last not-skipped test before 57a. As to 56, I've no idea how it got into the exception list too. (We should really change exception lists only with Git commits.)

          green Oleg Drokin added a comment -

          Andreas: 12196 has not landed on b2_5 yet. I was under the impression that there were no b2_5 failures outside of select patches that include it, but apparently there are still some failures, so it must be some other recently landed change that causes this bug.

          LiWei, test 54a has been skipped for zfs since Oct 2012, so it cannot be the trigger.
          Test 56 is not in the always-exclude list in b2_5 at least, so I wonder if it is also excluded manually in the testing system via a TEI ticket?

          liwei Li Wei (Inactive) added a comment - - edited

          As to why "nis" did not exist, could it be that test_57a() has been (incorrectly) depending on prior tests to leave modules loaded? Consider these:

              • This failure did not happen on ldiskfs where 54a to 56 were not skipped.
              • This failure happened on ZFS with 54a to 56 skipped (see https://testing.hpdd.intel.com/sub_tests/8000c0b0-4e0c-11e4-8fdd-5254006e85c2).
              • This failure did not happen on ZFS with only 54a to 55 (but not 56) skipped (see https://testing.hpdd.intel.com/test_sets/5f54ae08-4e07-11e4-ae94-5254006e85c2). test_56()'s reformat() call loaded the modules for test_57a() in this case.

          Now the question is if "excepting tests: 32newtarball 56 export 76a 59 64 69" started to appear around the time Oleg suspected.

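One way to remove this kind of ordering dependency is for a test to establish its own prerequisites instead of assuming an earlier test did so. The following is only a sketch of that idea, not the fix that landed: `ensure_ready` is a hypothetical helper, and the probe and setup commands passed to it are placeholders for whatever the test framework actually provides for checking and loading modules.

```shell
#!/bin/bash
# Hypothetical helper (not part of the Lustre test framework):
# run a setup step when a probe fails, so that a test does not rely
# on an earlier test having already done the setup.
ensure_ready() {
    local probe="$1"   # command that succeeds when the prerequisite is present
    local setup="$2"   # command that establishes the prerequisite

    # Only run the setup when the probe reports the prerequisite missing.
    if ! $probe >/dev/null 2>&1; then
        $setup || return 1
    fi

    # Re-check so the caller gets a failure if setup did not help.
    $probe >/dev/null 2>&1
}
```

A test would call something like `ensure_ready <probe> <setup> || return 1` before querying any parameters, so that skipping earlier tests cannot change its outcome.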
          liwei Li Wei (Inactive) added a comment - - edited

          I don't think this was introduced by my LU-4749 patch. As Yu Jian suggested early on, errors started to occur from the very beginning (taken from a recent Maloo report):

          == conf-sanity test 57a: initial registration from failnode should fail (should return errs) == 09:56:58 (1412675818)
          CMD: shadow-42vm4 /usr/sbin/lctl get_param nis
          shadow-42vm4: error: get_param: /proc/{fs,sys}/{lnet,lustre}/nis: Found no match
          [...]
          

          This resulted in an empty string in "NID", which was then used to generate the following command line:

          [...]
          CMD: shadow-42vm4 tunefs.lustre --failnode= lustre-ost1/ost1
          checking for existing Lustre data: found
          
             Read previous values:
          Target:     lustre-OST0000
          Index:      0
          Lustre FS:  lustre
          Mount type: zfs
          Flags:      0x2
                        (OST )
          Persistent mount opts: 
          Parameters: mgsnode=10.1.5.248@tcp sys.timeout=20
          
          
             Permanent disk data:
          Target:     lustre-OST0000
          Index:      0
          Lustre FS:  lustre
          Mount type: zfs
          Flags:      0x42
                        (OST update )
          Persistent mount opts: 
          Parameters: mgsnode=10.1.5.248@tcp sys.timeout=20 failover.node=Yi:
          
          Writing lustre-ost1/ost1 properties
            lustre:version=1
            lustre:flags=66
            lustre:index=0
            lustre:fsname=lustre
            lustre:svname=lustre-OST0000
            lustre:mgsnode=10.1.5.248@tcp
            lustre:sys.timeout=20
            lustre:failover.node=Yi:
          [...]
          

          Notice that the empty "failnode" argument wasn't caught by tunefs.lustre. (Instead, convert_hostnames() returned garbage when given an empty string, a problem that needs to be fixed as well.)
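The empty-NID case described above could be caught before any command line is generated. The sketch below assumes a hypothetical helper, `build_failnode_cmd`, which is not part of the test framework or of tunefs.lustre; it only illustrates refusing to build a `--failnode=` argument from an empty string. The NID in the usage note is an illustrative value.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration only):
# print a tunefs.lustre command line for a failover NID, but refuse
# to do so when the NID is empty.
build_failnode_cmd() {
    local nid="$1"      # failover NID, e.g. as obtained from the node
    local device="$2"   # target device or dataset name

    if [ -z "$nid" ]; then
        echo "refusing to build command: empty NID" >&2
        return 1
    fi

    echo "tunefs.lustre --failnode=$nid $device"
}
```

For example, `build_failnode_cmd "10.1.5.248@tcp" lustre-ost1/ost1` prints a complete command line, while an empty first argument makes the helper return non-zero so the caller can fail the test early with a clear message.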


          adilger Andreas Dilger added a comment -

          Oleg, have you reverted http://review.whamcloud.com/11956 from master and http://review.whamcloud.com/12196 from b2_5 yet? This is causing frequent test failures on review-zfs tests.

          green Oleg Drokin added a comment -

          I think this definitely was introduced by https://jira.hpdd.intel.com/browse/LU-4749, as there is a huge uptick in these failures in master since the end of September:

          https://testing.hpdd.intel.com/sub_tests/query?commit=Update+results&page=1&sub_test[query_bugs]=&sub_test[status]=FAIL&sub_test[sub_test_script_id]=dc46d338-6c5a-11e0-b32b-52540025f9af&test_node[architecture_type_id]=&test_node[distribution_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=&test_node[os_type_id]=&test_node_network[network_type_id]=&test_session[query_date]=&test_session[query_recent_period]=&test_session[test_group]=&test_session[test_host]=&test_session[user_id]=&test_set[test_set_script_id]=7f66aa20-3db2-11e0-80c0-52540025f9af&utf8=%E2%9C%93

          Page 2 does not show, but page 3 starts in 2013...

          People

            utopiabound Nathaniel Clark
            maloo Maloo