[LU-3056] conf-sanity test_66 - replace nids failed Created: 28/Mar/13 Updated: 22/Sep/23 Resolved: 21/Jun/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | revzfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7450 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:

The sub-test test_66 failed with the following error:

Info required for matching: conf-sanity 66

All subsequent ZFS test suites (recovery-small, etc.) fail with the following error:

    Starting mds1: -o user_xattr,acl lustre-mdt1/mdt1 /mnt/mds1
    CMD: wtm-16vm3 mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl lustre-mdt1/mdt1 /mnt/mds1
    wtm-16vm3: mount.lustre: according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/mds1
|
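A quick way to confirm that the cascading suite failures come from the stale MDT mount left behind by test_66 is to look for leftover entries of filesystem type "lustre" in /proc/mounts. The helper below is a hypothetical illustration only (it is not part of the test framework); a plain grep of /proc/mounts gives the same answer.

    /* Hypothetical helper, illustrative only: list any leftover mounts of
     * filesystem type "lustre" (e.g. lustre-mdt1/mdt1 on /mnt/mds1) that make
     * later test suites fail with "already mounted". */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            FILE *f = fopen("/proc/mounts", "r");
            char line[512];
            int found = 0;

            if (f == NULL) {
                    perror("open /proc/mounts");
                    return 2;
            }
            while (fgets(line, sizeof(line), f))
                    if (strstr(line, " lustre ") != NULL) {  /* fstype field */
                            fputs(line, stdout);
                            found = 1;
                    }
            fclose(f);
            return found ? 1 : 0;    /* 1: a stale lustre mount is present */
    }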
| Comments |
| Comment by Oleg Drokin [ 28/Mar/13 ] |
|
I asked Artem to look at this and he thinks |
| Comment by Artem Blagodarenko (Inactive) [ 01/Apr/13 ] |
|
The error occurs because this statement does not return TRUE on the last call:

    static int only_mgs_is_running(struct obd_device *mgs_obd)
    {
            /* TDB: Is global variable with devices count exists? */
            int num_devices = get_devices_count();

            /* osd, MGS and MGC + self_export
               (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
            return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
    }

I am trying to figure out why this happens. |
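For readers without a Lustre tree handy, the sketch below is a rough userspace approximation of that check (it is not the in-kernel code): it counts the entries in /proc/fs/lustre/devices, which only_mgs_is_running() effectively expects to be at most three (osd, MGS and MGC) before replace_nids is allowed to proceed.

    /* Userspace approximation, illustrative only: count configured Lustre
     * devices by counting the lines in /proc/fs/lustre/devices.  With only
     * the osd, MGS and MGC set up, the count should not exceed 3, mirroring
     * the (num_devices <= 3) half of only_mgs_is_running(). */
    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/fs/lustre/devices", "r");
            int c, lines = 0;

            if (f == NULL) {
                    perror("open /proc/fs/lustre/devices");
                    return 2;
            }
            while ((c = fgetc(f)) != EOF)
                    if (c == '\n')
                            lines++;
            fclose(f);

            printf("%d Lustre device(s) configured\n", lines);
            return lines <= 3 ? 0 : 1;   /* 0: replace_nids should be allowed */
    }

(The obd_num_exports <= 2 half of the check has no simple userspace equivalent, so it is omitted here.)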
| Comment by Artem Blagodarenko (Inactive) [ 01/Apr/13 ] |
|
I can't reproduce this bug locally. The statistics show "Subtest passes: 99/100". Is it possible that something was launched in parallel with the tests during that one failed execution?
|
| Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ] |
|
A 23% failure rate for test_66 makes me think this may need to be a blocker. In general there are no parallel tests; a test can safely assume it is the only thing running at this point in time.

Just a quick breakdown for casual observers. From conf_sanity test_66:

    echo "replace MDS nid"
    do_facet mgs $LCTL replace_nids $FSNAME-MDT0000 $MDS_NID ||
        error "replace nids failed"

And from that do_facet call we get:

    CMD: wtm-18vm3 /usr/sbin/lctl replace_nids lustre-MDT0000 10.10.16.188@tcp
    wtm-18vm3: error: replace_nids: Operation now in progress
    conf-sanity test_66: @@@@@@ FAIL: replace nids failed

The "error: replace_nids: ..." message comes from here:

    rc = l2_ioctl(OBD_DEV_ID, OBD_IOC_REPLACE_NIDS, buf);
    if (rc < 0) {
            fprintf(stderr, "error: %s: %s\n", jt_cmdname(argv[0]),
                    strerror(rc = errno));
    }

This is seen on the MDS:

    LustreError: 28590:0:(mgs_llog.c:1286:mgs_replace_nids()) Only MGS is allowed to be started

And that leads to the above only_mgs_is_running() and to mgs_replace_nids() in lustre/mgs/mgs_llog.c:

    /* We can not change nids if not only MGS is started */
    if (!only_mgs_is_running(mgs_obd)) {
            CERROR("Only MGS is allowed to be started\n");
            GOTO(out, rc = -EINPROGRESS);
    }

Also:

    /* (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
    return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);

Should num_devices be 2 or 3?

The untested debug patch can be found here: http://review.whamcloud.com/5940

If we don't want to land the patch I will have it run conf_sanity a lot. |
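For casual readers, the "Operation now in progress" text above is simply the C library's message for EINPROGRESS: mgs_replace_nids() fails with -EINPROGRESS, the ioctl in lctl then fails with errno set to EINPROGRESS, and lctl prints strerror(errno). A minimal standalone sketch (not Lustre code) showing that mapping:

    /* Illustrative only: demonstrates that strerror(EINPROGRESS) is the
     * "Operation now in progress" text seen in the lctl output, which is how
     * the kernel-side GOTO(out, rc = -EINPROGRESS) surfaces to the user. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            printf("error: replace_nids: %s\n", strerror(EINPROGRESS));
            return 0;
    }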
| Comment by Nathaniel Clark [ 04/Apr/13 ] |
|
I haven't seen this conf-sanity/66 fail for any patch based past |
| Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ] |
|
Ahh, thanks, I didn't have LU-2988 on my radar. I checked as well and I don't see any errors in the past two days. |
| Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ] |
|
Dup of

This issue is believed to be fixed. A patch was landed around April 1st and no errors have been seen since. Please reopen if the errors continue. |
| Comment by Nathaniel Clark [ 09/Apr/13 ] |
|
Failure on current master with fix for |
| Comment by Artem Blagodarenko (Inactive) [ 09/Apr/13 ] |
|
This message is in the log:

    20000000:00020000:0.0:1365501872.365629:0:20267:0:(mgs_llog.c:1286:mgs_replace_nids()) Only MGS is allowed to be started

Does anybody know what changed in the code tree so that this became wrong?

    /* (wc -l /proc/fs/lustre/devices <= 3) && (num_exports <= 2) */
|
| Comment by Artem Blagodarenko (Inactive) [ 09/Apr/13 ] |
|
> The untested debug patch can be found here: http://review.whamcloud.com/5940
> If we don't want to land the patch I will have it run conf_sanity a lot.

Keith, do you have output from a run with the patch applied where the test failed? |
| Comment by Keith Mannthey (Inactive) [ 09/Apr/13 ] |
|
There is no debug output yet as the problem was thought fixed. I will revisit the patch. |
| Comment by Keith Mannthey (Inactive) [ 10/Apr/13 ] |
|
http://review.whamcloud.com/6005 is a conf-sanity run with the debug patch. http://review.whamcloud.com/5940 has been rebased for possible inclusion. |
| Comment by Andreas Dilger [ 26/Apr/13 ] |
|
I definitely don't think this needs to be a 2.4.0 blocker, since replace_nids is a very rarely used code path. The only potential reason for increased priority might be the frequency of other patches failing due to this bug, but I don't see very many failures due to this specific bug (several other conf-sanity failures are increasing the test failure rates). |
| Comment by Keith Mannthey (Inactive) [ 02/May/13 ] |
|
http://review.whamcloud.com/5940 has been resubmitted for testing in an effort to land it after the 2.4 split. We still see the issue on master a few times a week, and it will be good to know more about what is causing it. |
| Comment by Keith Mannthey (Inactive) [ 04/Jun/13 ] |
|
Quick update: conf_sanity test_66 has not failed in a few weeks. The "replace nids failed" error really dropped off after 2013-04-29. Perhaps some code path has changed. |
| Comment by Keith Mannthey (Inactive) [ 21/Jun/13 ] |
|
Still no sign of the "replace nids failed" errors. |
| Comment by Keith Mannthey (Inactive) [ 21/Jun/13 ] |
|
We no longer see this issue. Please reopen if it starts to trigger again. There is no sense in landing a debug patch for a problem that no longer happens. |