[LU-3056] conf-sanity test_66 - replace nids failed Created: 28/Mar/13  Updated: 22/Sep/23  Resolved: 21/Jun/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Keith Mannthey (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: revzfs

Issue Links:
Related
is related to LU-11990 conf-sanity test_66: replace nids fai... Reopened
is related to LU-3793 conf-sanity, subtest test_66 fails du... Resolved
is related to LU-5137 Test failure conf-sanity test_66: rep... Resolved
Severity: 3
Rank (Obsolete): 7450

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:
https://maloo.whamcloud.com/test_sets/95fcddea-97b0-11e2-a652-52540035b04c
https://maloo.whamcloud.com/test_sets/810da798-9760-11e2-9ec7-52540035b04c

The sub-test test_66 failed with the following error:

replace nids failed

Info required for matching: conf-sanity 66

All subsequent ZFS test suites (recovery-small, etc) fail with the following error:

Starting mds1: -o user_xattr,acl  lustre-mdt1/mdt1 /mnt/mds1
CMD: wtm-16vm3 mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   lustre-mdt1/mdt1 /mnt/mds1
wtm-16vm3: mount.lustre: according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/mds1


 Comments   
Comment by Oleg Drokin [ 28/Mar/13 ]

I asked Artem to look at this and he thinks LU-2988 might be the culprit, but is still confirming this theory.

Comment by Artem Blagodarenko (Inactive) [ 01/Apr/13 ]

The error occurs because this check does not return TRUE on the final call:

static int only_mgs_is_running(struct obd_device *mgs_obd)
{
        /* TDB: Is global variable with devices count exists? */
        int num_devices = get_devices_count();
        /* osd, MGS and MGC + self_export
           (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
        return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
}

I am trying to figure out why this happens.
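
For what it's worth, the device-count half of that check corresponds to the "wc -l /proc/fs/lustre/devices" mentioned in its comment. A small userspace sketch (illustrative only, not part of any patch) that prints that count on a node:

/* Illustrative only: count the lines in /proc/fs/lustre/devices, which is
 * roughly the num_devices value that only_mgs_is_running() compares
 * against 3 (osd + MGS + MGC). */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/fs/lustre/devices", "r");
        int c, lines = 0;

        if (f == NULL) {
                perror("/proc/fs/lustre/devices");
                return 1;
        }
        while ((c = fgetc(f)) != EOF)
                if (c == '\n')
                        lines++;
        fclose(f);
        printf("%d devices listed (check expects <= 3)\n", lines);
        return 0;
}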

Comment by Artem Blagodarenko (Inactive) [ 01/Apr/13 ]

I can't reproduce this bug locally. Statistics show "Subtest passes: 99/100". Is it possible that something was launched in parallel with the tests in that one failed execution?

LU-2988 is useful for correct module unloading.

Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ]

test_66
Error: 'replace nids failed'
Failure Rate: 23.00% of last 100 executions [all branches]

23% makes me think this may need to be a blocker.

In general there are no parallel tests; it is safe to assume a test is the only thing running at that point in time.

Just a quick breakdown for casual observers:

From conf_sanity test_66

        echo "replace MDS nid"
        do_facet mgs $LCTL replace_nids $FSNAME-MDT0000 $MDS_NID ||
                error "replace nids failed"

And from that do_facet call we get:

CMD: wtm-18vm3 /usr/sbin/lctl replace_nids lustre-MDT0000 10.10.16.188@tcp
wtm-18vm3: error: replace_nids: Operation now in progress
 conf-sanity test_66: @@@@@@ FAIL: replace nids failed 

The "error: replace_nids: ..." line comes from jt_replace_nids() in utils/obd.c:

        rc = l2_ioctl(OBD_DEV_ID, OBD_IOC_REPLACE_NIDS, buf);
        if (rc < 0) {
                fprintf(stderr, "error: %s: %s\n", jt_cmdname(argv[0]),
                        strerror(rc = errno));
        }

This is seen on the MDS:

LustreError: 28590:0:(mgs_llog.c:1286:mgs_replace_nids()) Only MGS is allowed to be started

And that leads to the only_mgs_is_running() check quoted above, called from mgs_replace_nids() in lustre/mgs/mgs_llog.c:

        /* We can not change nids if not only MGS is started */
        if (!only_mgs_is_running(mgs_obd)) {
                CERROR("Only MGS is allowed to be started\n");
                GOTO(out, rc = -EINPROGRESS);
        }
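
The -EINPROGRESS returned here is the errno that jt_replace_nids() passes to strerror() above, which is why the userspace side reports "Operation now in progress". A trivial standalone illustration (not Lustre code, just the libc mapping):

/* Print the libc message for EINPROGRESS, the errno behind the
 * "error: replace_nids: Operation now in progress" line in the log. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        printf("EINPROGRESS -> \"%s\"\n", strerror(EINPROGRESS));
        return 0;
}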

Also note the mismatch between the comment and the actual check:

           (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
        return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);

Should num_devices be 2 or 3?

An untested debug patch can be found here: http://review.whamcloud.com/5940
It only touches a rare path, so it should not hurt.

If we don't want to land the patch, I will have it run conf-sanity a lot instead.
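
For illustration, the kind of instrumentation such a debug patch could add is to log both counts when the precondition fails (this is a sketch, not necessarily what change 5940 actually does):

/* Sketch of a diagnostic variant of only_mgs_is_running(): report the
 * actual device and export counts when the check fails, so the Maloo
 * log shows what was still running when replace_nids was called. */
static int only_mgs_is_running(struct obd_device *mgs_obd)
{
        int num_devices = get_devices_count();

        if (num_devices > 3 || mgs_obd->obd_num_exports > 2)
                CERROR("%s: %d devices, %d exports (want <= 3 and <= 2)\n",
                       mgs_obd->obd_name, num_devices,
                       mgs_obd->obd_num_exports);

        return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
}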

Comment by Nathaniel Clark [ 04/Apr/13 ]

I haven't seen this conf-sanity/66 failure for any patch based past LU-2988. Granted, the sample set is pretty small at this point.

Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ]

Ahh, thanks, I didn't have LU-2988 on my radar.

I checked as well and I don't see any errors in the past two days.

Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ]

Dup of LU-2988.

This issue is believed to be fixed. A patch was landed April 1st or so and no errors have been seen since.

Please reopen if the errors continue.

Comment by Nathaniel Clark [ 09/Apr/13 ]

Failure on current master with fix for LU-2988:
https://maloo.whamcloud.com/test_sets/ae3e5ec6-a104-11e2-b1c3-52540035b04c

Comment by Artem Blagodarenko (Inactive) [ 09/Apr/13 ]

This message is in the log:

20000000:00020000:0.0:1365501872.365629:0:20267:0:(mgs_llog.c:1286:mgs_replace_nids()) Only MGS is allowed to be started

Does anybody know what changed in the code tree so that this became wrong?

/*(wc -l /proc/fs/lustre/devices <= 3) && (num_exports <= 2) */
Comment by Artem Blagodarenko (Inactive) [ 09/Apr/13 ]

>The untested debug patch can be found here: http://review.whamcloud.com/5940

>If we don't want to land the patch I will have it run conf_sanity alot.

Keith, do you have output from a run with the patch applied where the test failed?

Comment by Keith Mannthey (Inactive) [ 09/Apr/13 ]

There is no debug output yet, as the problem was thought to be fixed. I will revisit the patch.

Comment by Keith Mannthey (Inactive) [ 10/Apr/13 ]

http://review.whamcloud.com/6005 is a conf-sanity run with the debug patch.

http://review.whamcloud.com/5940 has been rebased for possible inclusion.

Comment by Andreas Dilger [ 26/Apr/13 ]

I definitely don't think this needs to be a 2.4.0 blocker, since replace_nids is a very rarely used code path. The only potential reason for increased priority might be the frequency of other patches failing due to this bug, but I don't see very many failures from this specific bug (several other conf-sanity failures are increasing the test failure rates).

Comment by Keith Mannthey (Inactive) [ 02/May/13 ]

http://review.whamcloud.com/5940 has been resubmitted for testing in an effort to land it after the 2.4 split. We still see the issue on master a few times a week, and it would be good to know more about what is causing it.

Comment by Keith Mannthey (Inactive) [ 04/Jun/13 ]

Quick update:

conf-sanity test_66 has not failed in a few weeks. The "replace nids failed" errors really dropped off after 2013-04-29. Perhaps some code path has changed.

Comment by Keith Mannthey (Inactive) [ 21/Jun/13 ]

Still no sign of the "replace nids failed" errors.

Comment by Keith Mannthey (Inactive) [ 21/Jun/13 ]

We no longer see this issue. Please reopen if it starts to trigger again. There is no sense in landing a debug patch for a problem that does not happen.
