Lustre / LU-3056

conf-sanity test_66 - replace nids failed

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.0
    • 3
    • 7450

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      https://maloo.whamcloud.com/test_sets/95fcddea-97b0-11e2-a652-52540035b04c
      https://maloo.whamcloud.com/test_sets/810da798-9760-11e2-9ec7-52540035b04c

      The sub-test test_66 failed with the following error:

      replace nids failed

      Info required for matching: conf-sanity 66

      All subsequent ZFS test suites (recovery-small, etc) fail with the following error:

      Starting mds1: -o user_xattr,acl  lustre-mdt1/mdt1 /mnt/mds1
      CMD: wtm-16vm3 mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   lustre-mdt1/mdt1 /mnt/mds1
      wtm-16vm3: mount.lustre: according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/mds1
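
The mount.lustre message above says the device already has an entry in /etc/mtab for that mount point. Below is a minimal Python sketch of such a lookup, assuming the usual whitespace-separated mtab layout (device, mount point, filesystem type, options, and so on). The helper `is_listed_as_mounted` and the sample mtab text are hypothetical illustrations, not mount.lustre's actual implementation.

```python
def is_listed_as_mounted(mtab_text: str, device: str, mount_point: str) -> bool:
    """Return True if mtab-style text has an entry for device on mount_point.

    Hypothetical helper: assumes each line starts with the device followed
    by the mount point, separated by whitespace.
    """
    for line in mtab_text.splitlines():
        fields = line.split()
        # Skip blank, malformed, or comment lines.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        if fields[0] == device and fields[1] == mount_point:
            return True
    return False


# Invented sample contents, for illustration only.
SAMPLE_MTAB = """\
/dev/sda1 / ext4 rw 0 0
lustre-mdt1/mdt1 /mnt/mds1 lustre rw,user_xattr,acl 0 0
"""

if __name__ == "__main__":
    print(is_listed_as_mounted(SAMPLE_MTAB, "lustre-mdt1/mdt1", "/mnt/mds1"))
```

A stale entry of this kind, left behind after test_66 fails, would be consistent with the later suites being unable to mount mds1, though the logs quoted here do not confirm the exact cause.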
      

          Activity

            keith Keith Mannthey (Inactive) added a comment - We no longer see this issue. Please reopen if this starts to trigger again. There is no sense in landing a debug patch for a problem that does not happen.

            keith Keith Mannthey (Inactive) added a comment - Still no sign of the "replace nids failed" errors.

            keith Keith Mannthey (Inactive) added a comment - Quick update: conf_sanity test_66 has not failed in a few weeks. The "replace nids failed" error really dropped off after 2013-04-29. Perhaps some code path has changed.

            keith Keith Mannthey (Inactive) added a comment - http://review.whamcloud.com/5940 has been resubmitted for testing in an effort to land it after the 2.4 split. We still see the issue on master a few times a week, and it would be good to know more about what is causing it.

            adilger Andreas Dilger added a comment - I definitely don't think this needs to be a 2.4.0 blocker, since replace_nids is a very rarely used code path. The only potential reason for increased priority might be the frequency of other patches failing due to this bug, but I don't see very many failures due to this specific bug (several other conf-sanity failures are increasing the test failure rates).

            keith Keith Mannthey (Inactive) added a comment - http://review.whamcloud.com/6005 is a conf-sanity run with the debug patch. http://review.whamcloud.com/5940 has been rebased for possible inclusion.

            keith Keith Mannthey (Inactive) added a comment - There is no debug output yet as the problem was thought fixed. I will revisit the patch.

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment -

            > The untested debug patch can be found here: http://review.whamcloud.com/5940
            > If we don't want to land the patch I will have it run conf_sanity a lot.

            Keith, do you have output from a run with the patch applied in which the test failed?

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment -

            This message appears in the log:

            20000000:00020000:0.0:1365501872.365629:0:20267:0:(mgs_llog.c:1286:mgs_replace_nids()) Only MGS is allowed to be started

            Does anybody know what changed in the code tree so that this condition became wrong?

            /*(wc -l /proc/fs/lustre/devices <= 3) && (num_exports <= 2) */
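
The condition quoted in the comment above can be read as a simple guard: replace_nids proceeds only when the devices listing has at most three lines and there are at most two exports. Below is a minimal Python sketch of that reading. The function names, parameters, and the sample listing are hypothetical illustrations and do not reproduce the code in mgs_llog.c.

```python
def count_devices(devices_listing: str) -> int:
    """Count non-empty lines, roughly what `wc -l` reports for the listing."""
    return sum(1 for line in devices_listing.splitlines() if line.strip())


def only_mgs_running(devices_listing: str, num_exports: int,
                     max_devices: int = 3, max_exports: int = 2) -> bool:
    """Evaluate the quoted condition:
    (wc -l /proc/fs/lustre/devices <= 3) && (num_exports <= 2).

    The thresholds come from the quoted comment; everything else is assumed.
    """
    return (count_devices(devices_listing) <= max_devices
            and num_exports <= max_exports)


# Invented placeholder listing, one line per device; not real output.
SAMPLE_LISTING = "device-a\ndevice-b\ndevice-c\n"

if __name__ == "__main__":
    if only_mgs_running(SAMPLE_LISTING, num_exports=2):
        print("guard passes: replace_nids would be allowed")
    else:
        print("Only MGS is allowed to be started")
```

Under this reading, any extra device or export left running when the test reaches replace_nids would trip the guard and produce the logged error.
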
            utopiabound Nathaniel Clark added a comment - Failure on current master with fix for LU-2988: https://maloo.whamcloud.com/test_sets/ae3e5ec6-a104-11e2-b1c3-52540035b04c

            People

              keith Keith Mannthey (Inactive)
              maloo Maloo