[LU-4200] Test failure on test suite conf-sanity test_66: replace nids failed Created: 04/Nov/13  Updated: 22/Sep/23  Resolved: 24/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11990 conf-sanity test_66: replace nids fai... Reopened
Severity: 3
Rank (Obsolete): 11419

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
http://maloo.whamcloud.com/test_sets/d95b0f48-264f-11e3-8f1c-52540035b04c
http://maloo.whamcloud.com/test_sets/2d26cba2-266f-11e3-b741-52540035b04c

The sub-test test_66 failed with the following error:

replace nids failed

Info required for matching: conf-sanity 66



 Comments   
Comment by Bob Glossman (Inactive) [ 05/Nov/13 ]

another
https://maloo.whamcloud.com/test_sets/9612221a-462c-11e3-b5e8-52540035b04c

Comment by Bob Glossman (Inactive) [ 20/Nov/13 ]

again:
https://maloo.whamcloud.com/test_sets/6d46bfb4-518d-11e3-9ca9-52540035b04c

Comment by Andreas Dilger [ 27/Nov/13 ]

It looks like this is failing at least twice a day in recent weeks.

Comment by Nathaniel Clark [ 10/Dec/13 ]

MDT/MGS debug log:

20000000:00000001:1.0:1386348576.067947:0:21057:0:(mgs_llog.c:1214:mgs_replace_nids()) Process entered
20000000:00020000:1.0:1386348576.068074:0:21057:0:(mgs_llog.c:1224:mgs_replace_nids()) Only MGS is allowed to be started
20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1225:mgs_replace_nids()) Process leaving via out (rc=18446744073709551501 : -115 : 0xffffffffffffff8d)
20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1265:mgs_replace_nids()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
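
The three values printed after rc= in the trace are the same 64-bit quantity shown as unsigned decimal, signed decimal and hex. A minimal sketch of that conversion follows; the helper names rc_as_signed and format_rc are hypothetical and not part of Lustre. Under Linux errno numbering, 115 is EINPROGRESS.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Reinterpret the unsigned 64-bit value from the trace as a signed value
 * without relying on implementation-defined narrowing conversions. */
static int64_t rc_as_signed(uint64_t raw)
{
        if (raw <= (uint64_t)INT64_MAX)
                return (int64_t)raw;
        /* Two's complement: the value is -(2^64 - raw). */
        return -(int64_t)(UINT64_MAX - raw) - 1;
}

/* Write the "unsigned : signed : hex" views of a return code into buf.
 * Returns the number of characters snprintf would have written. */
static int format_rc(uint64_t raw, char *buf, size_t len)
{
        assert(buf != NULL && len > 0);
        return snprintf(buf, len, "%" PRIu64 " : %" PRId64 " : 0x%" PRIx64,
                        raw, rc_as_signed(raw), raw);
}
```

Calling rc_as_signed(18446744073709551501ULL) yields -115, matching the middle field of the log lines above.
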
Comment by nasf (Inactive) [ 06/Feb/14 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/5f4a3ae6-8e8f-11e3-8d06-52540035b04c

Comment by Nathaniel Clark [ 06/Feb/14 ]

Debugging patch:
http://review.whamcloud.com/9160

Comment by Andreas Dilger [ 07/Mar/14 ]

I think this is caused by only_mgs_is_running() assuming the wrong upper bound for obd_num_exports. I think that obd_num_exports is being incremented because of LWP connections from the OSTs.

static int only_mgs_is_running(struct obd_device *mgs_obd)
{               
        /* TDB: Is global variable with devices count exists? */
        int num_devices = get_devices_count();
        /* osd, MGS and MGC + self_export
           (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
        return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
}

One option is to change only_mgs_is_running() to iterate over the MGS exports and return an error if any of them are not self or LWP connections.
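
The option described above could look roughly like the following user-space sketch. It is only an illustration: struct mock_obd, struct mock_export and MOCK_CONNECT_LWP are hypothetical stand-ins rather than the Lustre kernel definitions, and how an LWP connection is actually identified is an assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit marking an LWP connection (assumed, not a Lustre flag). */
#define MOCK_CONNECT_LWP (UINT64_C(1) << 0)

/* Hypothetical stand-ins for the kernel structures. */
struct mock_export {
        uint64_t connect_flags;
};

struct mock_obd {
        const struct mock_export *self_export;
        const struct mock_export *exports;  /* all exports, self included */
        size_t num_exports;
};

/* Return true only if every export is the self export or an LWP connection. */
static bool mock_only_mgs_is_running(const struct mock_obd *obd)
{
        assert(obd != NULL);
        for (size_t i = 0; i < obd->num_exports; i++) {
                const struct mock_export *exp = &obd->exports[i];

                if (exp == obd->self_export)
                        continue;       /* the MGS's own export never counts */
                if (exp->connect_flags & MOCK_CONNECT_LWP)
                        continue;       /* LWP connections are tolerated */
                return false;           /* any other export: MGS is not alone */
        }
        return true;
}
```

Compared with a fixed threshold on obd_num_exports, this form does not depend on how many LWP connections happen to exist.
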

Comment by Nathaniel Clark [ 13/Mar/14 ]

http://review.whamcloud.com/9650

Comment by Nathaniel Clark [ 24/Mar/14 ]

Patch landed to master

Comment by Artem Blagodarenko (Inactive) [ 17/Dec/15 ]

After this patch landed, the test fails in environments where the MGS is on a dedicated, separate node. Connections from OSTs and MDSs are ignored because they have the OBD_CONNECT_MDS_MDS flag set. I am surprised that a connection to the MGS has this flag set; it is probably because OBD_CONNECT_MNE_SWAB is actually what is set, and it has the same value.

 #define OBD_CONNECT_MNE_SWAB OBD_CONNECT_MDS_MDS

We could probably remove OBD_CONNECT_MDS_MDS from the exceptions, because if LWP connections exist, then we can't assume "MGS is alone".

diff --git a/lustre/mgs/mgs_llog.c b/lustre/mgs/mgs_llog.c
index e48f888..ef42e34 100644
--- a/lustre/mgs/mgs_llog.c
+++ b/lustre/mgs/mgs_llog.c
@@ -1165,8 +1165,6 @@ static int only_mgs_is_running(struct obd_device *mgs_obd)
                /* skip self export */
                if (exp == mgs_obd->obd_self_export)
                        continue;
-               if (exp_connect_flags(exp) & OBD_CONNECT_MDS_MDS)
-                       continue;
 
                ++num_exports;
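
The effect of the two removed lines can be sketched in isolation. Because one macro is defined as the other, a bit test cannot tell them apart, so every export that negotiated the aliased flag is skipped. The DEMO_ names and the bit position below are placeholders chosen for illustration, not the values from the Lustre headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit value; only the aliasing between the two names matters. */
#define DEMO_CONNECT_MDS_MDS  (UINT64_C(1) << 5)
#define DEMO_CONNECT_MNE_SWAB DEMO_CONNECT_MDS_MDS

/* The kind of test used to skip an export: true if the shared bit is set. */
static bool demo_is_skipped(uint64_t connect_flags)
{
        return (connect_flags & DEMO_CONNECT_MDS_MDS) != 0;
}

/* Count exports, optionally skipping those with the shared bit set,
 * mirroring the shape of the loop in the diff. */
static int demo_count_exports(const uint64_t *flags, int n, bool skip_mds_mds)
{
        int num_exports = 0;

        assert(flags != NULL || n == 0);
        for (int i = 0; i < n; i++) {
                if (skip_mds_mds && demo_is_skipped(flags[i]))
                        continue;
                ++num_exports;
        }
        return num_exports;
}
```

With two exports that both carry DEMO_CONNECT_MNE_SWAB, the count is 0 when the skip is enabled and 2 when it is removed.
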

utopiabound, why did you ignore exports with the OBD_CONNECT_MDS_MDS flag set?

Thanks.
