[LU-4200] Test failure on test suite conf-sanity test_66: replace nids failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.6.0
    • None
    • None
    • 3
    • 11419

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run:
      http://maloo.whamcloud.com/test_sets/d95b0f48-264f-11e3-8f1c-52540035b04c
      http://maloo.whamcloud.com/test_sets/2d26cba2-266f-11e3-b741-52540035b04c

      The sub-test test_66 failed with the following error:

      replace nids failed

      Info required for matching: conf-sanity 66

          Activity


            After this patch landed, the test fails in environments where the MGS is on a dedicated, separate node. Connections from OSTs and MDSs are ignored because they have the OBD_CONNECT_MDS_MDS flag set. I am surprised that a connection to the MGS has this flag set; it is probably because OBD_CONNECT_MNE_SWAB is what is actually set, and it has the same value.

             #define OBD_CONNECT_MNE_SWAB OBD_CONNECT_MDS_MDS
            

            We could probably remove OBD_CONNECT_MDS_MDS from the exceptions, because if LWP connections exist, then we can't assume "MGS is alone".

            diff --git a/lustre/mgs/mgs_llog.c b/lustre/mgs/mgs_llog.c
            index e48f888..ef42e34 100644
            --- a/lustre/mgs/mgs_llog.c
            +++ b/lustre/mgs/mgs_llog.c
            @@ -1165,8 +1165,6 @@ static int only_mgs_is_running(struct obd_device *mgs_obd)
                            /* skip self export */
                            if (exp == mgs_obd->obd_self_export)
                                    continue;
            -               if (exp_connect_flags(exp) & OBD_CONNECT_MDS_MDS)
            -                       continue;
             
                            ++num_exports;
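
To illustrate the aliasing described in this comment, here is a minimal sketch that is not taken from the Lustre tree. All `demo_`/`DEMO_` names are hypothetical, and the flag value is a placeholder rather than a quote of the real header. Because the two macros expand to the same bit, a test for one cannot tell it apart from the other, so a loop that skips flagged exports counts nothing when every remote export carries the aliased bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder value for illustration; the real constant is in the Lustre headers. */
#define DEMO_CONNECT_MDS_MDS  0x200000000ULL
/* Alias, mirroring the #define quoted in the comment. */
#define DEMO_CONNECT_MNE_SWAB DEMO_CONNECT_MDS_MDS

/* Hypothetical stand-in for an export and its connect flags. */
struct demo_export {
	uint64_t connect_flags;
	int is_self;
};

/* Counting with the extra skip (the two lines the diff removes). */
static int demo_count_skipping_flagged(const struct demo_export *exps, size_t n)
{
	int num_exports = 0;

	assert(exps != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (exps[i].is_self)
			continue;
		if (exps[i].connect_flags & DEMO_CONNECT_MDS_MDS)
			continue;
		++num_exports;
	}
	return num_exports;
}

/* Counting as proposed in the diff: only the self export is skipped. */
static int demo_count_all_remote(const struct demo_export *exps, size_t n)
{
	int num_exports = 0;

	assert(exps != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (exps[i].is_self)
			continue;
		++num_exports;
	}
	return num_exports;
}

/* Usage: one self export plus two remote exports that set only MNE_SWAB. */
static void demo_selfcheck(void)
{
	const struct demo_export exps[] = {
		{ .connect_flags = 0, .is_self = 1 },
		{ .connect_flags = DEMO_CONNECT_MNE_SWAB, .is_self = 0 },
		{ .connect_flags = DEMO_CONNECT_MNE_SWAB, .is_self = 0 },
	};

	assert(demo_count_skipping_flagged(exps, 3) == 0); /* remotes hidden */
	assert(demo_count_all_remote(exps, 3) == 2);       /* remotes counted */
}
```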
            

            utopiabound, why did you ignore exports with the OBD_CONNECT_MDS_MDS flag set?

            Thanks.

            artem_blagodarenko Artem Blagodarenko (Inactive) added the comment above.

            utopiabound Nathaniel Clark added a comment - Patch landed to master
            utopiabound Nathaniel Clark added a comment - http://review.whamcloud.com/9650

            I think this is caused by only_mgs_is_running() being confused about the right value for obd_num_exports. I think that obd_num_exports is being incremented because of LWP connections from the OSTs.

            static int only_mgs_is_running(struct obd_device *mgs_obd)
            {               
                    /* TDB: Is global variable with devices count exists? */
                    int num_devices = get_devices_count();
                    /* osd, MGS and MGC + self_export
                       (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
                    return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
            }
            

            One option is to change only_mgs_is_running() to iterate over the MGS exports and return an error if any of them are not self or LWP connections.
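
The option described above could look roughly like the following sketch. It is not the patch that landed. `demo_export`, `demo_export_kind`, and `demo_only_mgs_is_running()` are hypothetical stand-ins, and how a real export is recognised as an LWP connection is left as an assumption behind the `kind` field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of an export; real code would derive this
 * from the export itself (an assumption, not shown here). */
enum demo_export_kind { DEMO_EXP_SELF, DEMO_EXP_LWP, DEMO_EXP_OTHER };

struct demo_export {
	enum demo_export_kind kind;
};

/* Returns true only if every export is the self export or an LWP connection. */
static bool demo_only_mgs_is_running(const struct demo_export *exps, size_t n)
{
	assert(exps != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (exps[i].kind == DEMO_EXP_SELF || exps[i].kind == DEMO_EXP_LWP)
			continue;
		return false; /* some other client or target is connected */
	}
	return true;
}

/* Usage: LWP connections alone do not block, any other export does. */
static void demo_selfcheck(void)
{
	const struct demo_export quiet[] = { { DEMO_EXP_SELF }, { DEMO_EXP_LWP } };
	const struct demo_export busy[] = { { DEMO_EXP_SELF }, { DEMO_EXP_OTHER } };

	assert(demo_only_mgs_is_running(quiet, 2));
	assert(!demo_only_mgs_is_running(busy, 2));
}
```

Compared with testing obd_num_exports against a fixed threshold, a per-export check of this kind does not depend on how many LWP connections happen to exist.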

            adilger Andreas Dilger added the comment above.
            utopiabound Nathaniel Clark added a comment - Debugging patch: http://review.whamcloud.com/9160
            yong.fan nasf (Inactive) added a comment - Another failure instance: https://maloo.whamcloud.com/test_sets/5f4a3ae6-8e8f-11e3-8d06-52540035b04c
            utopiabound Nathaniel Clark added a comment (edited)

            MDT/MGS debug log:

            20000000:00000001:1.0:1386348576.067947:0:21057:0:(mgs_llog.c:1214:mgs_replace_nids()) Process entered
            20000000:00020000:1.0:1386348576.068074:0:21057:0:(mgs_llog.c:1224:mgs_replace_nids()) Only MGS is allowed to be started
            20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1225:mgs_replace_nids()) Process leaving via out (rc=18446744073709551501 : -115 : 0xffffffffffffff8d)
            20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1265:mgs_replace_nids()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
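
The rc in the log is printed three ways: as an unsigned 64-bit number, as a signed number, and in hex. The sketch below shows how the unsigned form maps back to -115, using a hypothetical helper, `demo_rc_to_signed()`, that is not part of Lustre. On Linux, errno 115 is EINPROGRESS, though the numeric value can differ on other platforms.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterprets an unsigned 64-bit return code as a two's-complement signed value. */
static int64_t demo_rc_to_signed(uint64_t raw)
{
	if (raw <= (uint64_t)INT64_MAX)
		return (int64_t)raw;
	/* For values with the top bit set: value = -(~raw) - 1. */
	return -(int64_t)(~raw) - 1;
}

/* Usage: the decimal and hex forms from the log decode to the same rc. */
static void demo_check_logged_rc(void)
{
	assert(demo_rc_to_signed(18446744073709551501ULL) == -115);
	assert(demo_rc_to_signed(0xffffffffffffff8dULL) == -115);
}
```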
            

            adilger Andreas Dilger added a comment - It looks like this is failing at least twice a day in recent weeks.
            bogl Bob Glossman (Inactive) added a comment - again: https://maloo.whamcloud.com/test_sets/6d46bfb4-518d-11e3-9ca9-52540035b04c
            bogl Bob Glossman (Inactive) added a comment - another https://maloo.whamcloud.com/test_sets/9612221a-462c-11e3-b5e8-52540035b04c

            People

              utopiabound Nathaniel Clark
              maloo Maloo