
Test failure on test suite conf-sanity test_66: replace nids failed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.6.0
    • None
    • None
    • 3
    • 11419

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite runs:
      http://maloo.whamcloud.com/test_sets/d95b0f48-264f-11e3-8f1c-52540035b04c
      http://maloo.whamcloud.com/test_sets/2d26cba2-266f-11e3-b741-52540035b04c

      The sub-test test_66 failed with the following error:

      replace nids failed

      Info required for matching: conf-sanity 66

          Activity

            [LU-4200] Test failure on test suite conf-sanity test_66: replace nids failed

            utopiabound Nathaniel Clark added a comment - Patch landed to master
            utopiabound Nathaniel Clark added a comment - http://review.whamcloud.com/9650

            adilger Andreas Dilger added a comment -

            I think this is caused by only_mgs_is_running() being confused by what it thinks the right value for obd_num_exports is. I think that obd_num_exports is being incremented because of LWP connections from the OSTs.

            static int only_mgs_is_running(struct obd_device *mgs_obd)
            {
                    /* TDB: Is global variable with devices count exists? */
                    int num_devices = get_devices_count();
                    /* osd, MGS and MGC + self_export
                       (wc -l /proc/fs/lustre/devices <= 2) && (num_exports <= 2) */
                    return (num_devices <= 3) && (mgs_obd->obd_num_exports <= 2);
            }

            One option is to change only_mgs_is_running() to iterate over the MGS exports and return an error if any of them are not self or LWP connections.
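
            A rough sketch of the export-walking idea suggested above, written as standalone C so it compiles outside the kernel. Everything here is a hypothetical stand-in rather than Lustre code: the enum exp_kind classification, struct sketch_export, struct sketch_obd, and only_self_or_lwp_exports() are invented for illustration. A real change would walk the MGS obd_device's actual export list under the appropriate lock and use whatever the tree provides to recognise self and LWP exports.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of an export; not a Lustre type. */
enum exp_kind { EXP_KIND_SELF, EXP_KIND_LWP, EXP_KIND_OTHER };

/* Hypothetical, simplified stand-in for an export held by the MGS. */
struct sketch_export {
        enum exp_kind         kind;
        struct sketch_export *next;
};

/* Hypothetical, simplified stand-in for struct obd_device. */
struct sketch_obd {
        struct sketch_export *exports;  /* singly linked list of exports */
};

/*
 * Return true only if every export is either the self export or an LWP
 * connection. Any other export means something besides the MGS is active,
 * regardless of how many exports there are in total.
 */
static bool only_self_or_lwp_exports(const struct sketch_obd *mgs)
{
        const struct sketch_export *exp;

        for (exp = mgs->exports; exp != NULL; exp = exp->next) {
                if (exp->kind != EXP_KIND_SELF && exp->kind != EXP_KIND_LWP)
                        return false;
        }
        return true;
}
```

            The point of checking the kind of each export instead of comparing obd_num_exports against a fixed threshold is that extra LWP connections from the OSTs no longer change the result.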
            utopiabound Nathaniel Clark added a comment - Debugging patch: http://review.whamcloud.com/9160
            yong.fan nasf (Inactive) added a comment - Another failure instance: https://maloo.whamcloud.com/test_sets/5f4a3ae6-8e8f-11e3-8d06-52540035b04c
            utopiabound Nathaniel Clark added a comment - - edited

            MDT/MGS debug log:

            20000000:00000001:1.0:1386348576.067947:0:21057:0:(mgs_llog.c:1214:mgs_replace_nids()) Process entered
            20000000:00020000:1.0:1386348576.068074:0:21057:0:(mgs_llog.c:1224:mgs_replace_nids()) Only MGS is allowed to be started
            20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1225:mgs_replace_nids()) Process leaving via out (rc=18446744073709551501 : -115 : 0xffffffffffffff8d)
            20000000:00000001:1.0:1386348576.068947:0:21057:0:(mgs_llog.c:1265:mgs_replace_nids()) Process leaving (rc=18446744073709551501 : -115 : ffffffffffffff8d)
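
            For reading the rc field in the log above: the three printed values are the same 64-bit quantity shown as unsigned decimal, signed decimal, and hex. A minimal sketch of that reinterpretation follows; rc_as_signed() is a hypothetical helper for illustration, not a Lustre function. On Linux, errno 115 is EINPROGRESS.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: reinterpret the unsigned rc value printed in the
 * debug log as a signed 64-bit integer without changing its bit pattern.
 */
static int64_t rc_as_signed(uint64_t raw)
{
        int64_t rc;

        memcpy(&rc, &raw, sizeof(rc));
        return rc;
}
```

            Passing either 18446744073709551501 or 0xffffffffffffff8d yields -115, matching the middle value in the log lines.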
            

            adilger Andreas Dilger added a comment - It looks like this has been failing at least twice a day in recent weeks.
            bogl Bob Glossman (Inactive) added a comment - again: https://maloo.whamcloud.com/test_sets/6d46bfb4-518d-11e3-9ca9-52540035b04c
            bogl Bob Glossman (Inactive) added a comment - another https://maloo.whamcloud.com/test_sets/9612221a-462c-11e3-b5e8-52540035b04c

            People

              utopiabound Nathaniel Clark
              maloo Maloo