[LU-4200] Test failure on test suite conf-sanity test_66: replace nids failed Created: 04/Nov/13 Updated: 22/Sep/23 Resolved: 24/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 11419 | ||||||||
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>
This issue relates to the following test suite run:
The sub-test test_66 failed with the following error:
Info required for matching: conf-sanity 66 |
| Comments |
| Comment by Bob Glossman (Inactive) [ 05/Nov/13 ] |
|
another |
| Comment by Bob Glossman (Inactive) [ 20/Nov/13 ] |
|
again: |
| Comment by Andreas Dilger [ 27/Nov/13 ] |
|
It looks like this has been failing at least twice a day in recent weeks. |
| Comment by Nathaniel Clark [ 10/Dec/13 ] |
|
MDT/MGS debug log:
20000000:00000001:1.0:1386348576.067947:0:21057:0:(mgs_llog.c:1214:mgs_replace_nids()) Process entered
20000000:00020000:1.0:1386348576.068074:0:21057:0:(mgs_llog.c:1224:mgs_replace_nids()) Only MGS is allowed to be started
20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1225:mgs_replace_nids()) Process leaving via out (rc=18446744073709551501 : -115 : 0xffffffffffffff8d)
20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1265:mgs_replace_nids()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d) |
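The three rc values in the trace lines above are the same number printed three ways: as an unsigned 64-bit integer, as a signed integer, and in hexadecimal. The sketch below reproduces that rendering for -115, which on Linux is -EINPROGRESS. The helper name `format_rc` is hypothetical and is not part of the Lustre debug code.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Render a return code as "unsigned : signed : hex", mirroring the
 * layout seen in the trace. format_rc is an illustrative name only.
 */
static int format_rc(char *buf, size_t len, int64_t rc)
{
	/* Casting to uint64_t reinterprets -115 as 2^64 - 115. */
	return snprintf(buf, len, "%" PRIu64 " : %" PRId64 " : 0x%" PRIx64,
			(uint64_t)rc, rc, (uint64_t)rc);
}
```

Calling `format_rc(buf, sizeof(buf), -115)` fills `buf` with `18446744073709551501 : -115 : 0xffffffffffffff8d`, which matches the value logged by mgs_replace_nids().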
| Comment by nasf (Inactive) [ 06/Feb/14 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/5f4a3ae6-8e8f-11e3-8d06-52540035b04c |
| Comment by Nathaniel Clark [ 06/Feb/14 ] |
|
Debugging patch: |
| Comment by Andreas Dilger [ 07/Mar/14 ] |
|
I think this is caused by only_mgs_is_running() being confused by what it thinks the right value for obd_num_exports is. I think that obd_num_exports is being incremented because of LWP connections from the OSTs.

static int only_mgs_is_running(struct obd_device *mgs_obd)
{
	/* TDB: Is global variable with devices count exists? */
	int num_devices = get_devices_count();

	/* osd, MGS and MGC + self_export
	   (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
	return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
}

One option is to change only_mgs_is_running() to iterate over the MGS exports and return an error if any of them are not self or LWP connections. |
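The option suggested above (walk the exports instead of trusting a raw count) can be sketched on a simplified model. Everything here is an assumption made for illustration: `struct model_export`, `MODEL_CONNECT_LWP` and `model_only_mgs_is_running()` are hypothetical stand-ins, not Lustre types or flags, and this is not the patch that landed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag marking an LWP connection in this model. */
#define MODEL_CONNECT_LWP 0x1ULL

/* Hypothetical, heavily reduced stand-in for an MGS export. */
struct model_export {
	uint64_t connect_flags;	/* connection flags of the peer */
	bool is_self;		/* true for the device's own self export */
};

/*
 * Return true only if every export is either the self export or an
 * LWP connection; any other export means something besides the MGS
 * is still connected.
 */
static bool model_only_mgs_is_running(const struct model_export *exps,
				      size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (exps[i].is_self)
			continue;	/* skip self export */
		if (exps[i].connect_flags & MODEL_CONNECT_LWP)
			continue;	/* skip LWP connections */
		return false;
	}
	return true;
}
```

Compared with the `obd_num_exports <= 2` test, this check does not depend on how many LWP connections happen to exist, only on whether any export of another kind is present.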
| Comment by Nathaniel Clark [ 13/Mar/14 ] |
| Comment by Nathaniel Clark [ 24/Mar/14 ] |
|
Patch landed to master |
| Comment by Artem Blagodarenko (Inactive) [ 17/Dec/15 ] |
|
After this patch landed, the test fails in environments where the MGS is on a dedicated, separate node. Connections from OSTs and MDSs are ignored because they have the OBD_CONNECT_MDS_MDS flag set. I am surprised that connections to the MGS have this flag set; it is probably because OBD_CONNECT_MNE_SWAB is what is actually set, and it has the same value.

#define OBD_CONNECT_MNE_SWAB OBD_CONNECT_MDS_MDS

We could probably remove OBD_CONNECT_MDS_MDS from the exceptions, because if LWP connections exist, then we can't assume "MGS is alone".

diff --git a/lustre/mgs/mgs_llog.c b/lustre/mgs/mgs_llog.c
index e48f888..ef42e34 100644
--- a/lustre/mgs/mgs_llog.c
+++ b/lustre/mgs/mgs_llog.c
@@ -1165,8 +1165,6 @@ static int only_mgs_is_running(struct obd_device *mgs_obd)
 		/* skip self export */
 		if (exp == mgs_obd->obd_self_export)
 			continue;
-		if (exp_connect_flags(exp) & OBD_CONNECT_MDS_MDS)
-			continue;
 		++num_exports;

utopiabound, why did you ignore exports with the OBD_CONNECT_MDS_MDS flag set? Thanks. |
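The aliasing described above can be shown in isolation: when one macro is defined as another, a bit test cannot tell which of the two meanings the peer intended. The `DEMO_` names and the bit position below are placeholders chosen for this sketch; the real values are in the Lustre headers and are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit value, assumed for illustration only. */
#define DEMO_CONNECT_MDS_MDS	(1ULL << 5)
/* Same aliasing pattern as the #define quoted in the comment. */
#define DEMO_CONNECT_MNE_SWAB	DEMO_CONNECT_MDS_MDS

/*
 * Mirrors the shape of the check removed in the diff: an export is
 * skipped whenever the shared bit is set, regardless of which name
 * the connecting side used for it.
 */
static bool demo_skipped_as_mds_mds(uint64_t connect_flags)
{
	return (connect_flags & DEMO_CONNECT_MDS_MDS) != 0;
}
```

In this model, a connection that sets only `DEMO_CONNECT_MNE_SWAB` is skipped exactly as if it had set `DEMO_CONNECT_MDS_MDS`, which is the behavior reported for the dedicated-MGS setup.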