[LU-3056] conf-sanity test_66 - replace nids failed

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version: Lustre 2.4.0

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      https://maloo.whamcloud.com/test_sets/95fcddea-97b0-11e2-a652-52540035b04c
      https://maloo.whamcloud.com/test_sets/810da798-9760-11e2-9ec7-52540035b04c

      The sub-test test_66 failed with the following error:

      replace nids failed

      Info required for matching: conf-sanity 66

      All subsequent ZFS test suites (recovery-small, etc.) fail with the following error:

      Starting mds1: -o user_xattr,acl  lustre-mdt1/mdt1 /mnt/mds1
      CMD: wtm-16vm3 mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl lustre-mdt1/mdt1 /mnt/mds1
      wtm-16vm3: mount.lustre: according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/mds1
      

      Attachments

      Issue Links

      Activity


            http://review.whamcloud.com/6005 is a conf-sanity run with the debug patch.

            http://review.whamcloud.com/5940 has been rebased for possible inclusion.

            keith Keith Mannthey (Inactive) added a comment

            There is no debug output yet, as the problem was thought to be fixed. I will revisit the patch.

            keith Keith Mannthey (Inactive) added a comment

            >The untested debug patch can be found here: http://review.whamcloud.com/5940

            >If we don't want to land the patch I will have it run conf_sanity a lot.

            Keith, do you have output with the patch applied and the test failing?

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment

            This message appears in the log:

            20000000:00020000:0.0:1365501872.365629:0:20267:0:(mgs_llog.c:1286:mgs_replace_nids()) Only MGS is allowed to be started

            Does anybody know what changed in the code tree that made this condition wrong?

            /*(wc -l /proc/fs/lustre/devices <= 3) && (num_exports <= 2) */
            
            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment
            utopiabound Nathaniel Clark added a comment - Failure on current master with fix for LU-2988 : https://maloo.whamcloud.com/test_sets/ae3e5ec6-a104-11e2-b1c3-52540035b04c

            Dup of LU-2988.

            This issue is believed to be fixed. A patch was landed April 1st or so and no errors have been seen since.

            Please reopen if the errors continue.

            keith Keith Mannthey (Inactive) added a comment

            Ahh, thanks. I didn't have LU-2988 on my radar.

            I checked as well and I don't see any errors in the past two days.

            keith Keith Mannthey (Inactive) added a comment

            I haven't seen this conf-sanity/66 fail for any patch based past LU-2988. Granted, the sample set is pretty small at this point.

            utopiabound Nathaniel Clark added a comment
            keith Keith Mannthey (Inactive) added a comment - - edited

            test_66
            Error: 'replace nids failed'
            Failure Rate: 23.00% of last 100 executions [all branches]

            23% makes me think this may need to be a blocker.

            In general there are no parallel tests; a test can safely assume it is the only thing running at that point in time.

            Just a quick breakdown for casual observers:

            From conf-sanity test_66:

                    echo "replace MDS nid"
                    do_facet mgs $LCTL replace_nids $FSNAME-MDT0000 $MDS_NID ||
                            error "replace nids failed"
            

            And from that do_facet call we get:

            CMD: wtm-18vm3 /usr/sbin/lctl replace_nids lustre-MDT0000 10.10.16.188@tcp
            wtm-18vm3: error: replace_nids: Operation now in progress
             conf-sanity test_66: @@@@@@ FAIL: replace nids failed 
            

            The "error: replace_nids" message comes from jt_replace_nids in utils/obd.c:

                    rc = l2_ioctl(OBD_DEV_ID, OBD_IOC_REPLACE_NIDS, buf);
                    if (rc < 0) {
                            fprintf(stderr, "error: %s: %s\n", jt_cmdname(argv[0]),
                                    strerror(rc = errno));
                    }
            

            This is seen on the MDS:

            LustreError: 28590:0:(mgs_llog.c:1286:mgs_replace_nids()) Only MGS is allowed to be started
            

            And that leads to only_mgs_is_running() and mgs_replace_nids() in lustre/mgs/mgs_llog.c:

                    /* We can not change nids if not only MGS is started */
                    if (!only_mgs_is_running(mgs_obd)) {
                            CERROR("Only MGS is allowed to be started\n");
                            GOTO(out, rc = -EINPROGRESS);
                    }
            

            Also, from only_mgs_is_running:

                    /* osd, MGS and MGC + self_export
                       (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
                    return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
            

            Should num_devices be 2 or 3?

            The untested debug patch can be found here: http://review.whamcloud.com/5940
            It is a rare path so it should not hurt.

            If we don't want to land the patch I will have it run conf_sanity a lot.


            I can't reproduce this bug locally. Statistics show "Subtest passes: 99/100". Is it possible that something was launched in parallel with the tests in that one failed execution?

            LU-2988 is useful for correct module unloading.

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment

            The error happens because this function does not return TRUE on the last call:

            static int only_mgs_is_running(struct obd_device *mgs_obd)
            {
                    /* TDB: Is global variable with devices count exists? */
                    int num_devices = get_devices_count();
                    /* osd, MGS and MGC + self_export
                       (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
                    return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
            }
            

            I am trying to figure out why this happens.

            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment

            People

              Assignee: keith Keith Mannthey (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8
