Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.10.0
    • Fix Version/s: Lustre 2.10.0
    • Labels: None
    • Environment: trevis-45, failover
        EL7, master branch, v2.9.57, b3575, ldiskfs
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      https://testing.hpdd.intel.com/test_sessions/adcc6618-5344-4cda-b96f-04bee182c49b

      New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17).

      The same failure message is seen in 7 subsequent subtests.

      From test_log:

      CMD: trevis-45vm3 mkdir -p /mnt/lustre-mds1failover; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
      trevis-45vm3 (MDS1-1): mount.lustre: according to /etc/mtab /dev/mapper/lvm--Role_MDS-P1 is already mounted on /mnt/lustre-mds1failover
      Start of /dev/lvm-Role_MDS/P1 on mds1failover failed 17
      trevis-45vm7 (MDS1-2): mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/lustre-mds1 failed: Invalid argument
      trevis-45vm7: This may have multiple causes.
      trevis-45vm7: Are the mount options correct?
      trevis-45vm7: Check the syslog for more info.
      Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
       mmp test_2: @@@@@@ FAIL: mount failure on failover pair mds1,mds1failover 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4927:error_noexit()
        = /usr/lib64/lustre/tests/mmp.sh:284:mount_after_interval_sub()
        = /usr/lib64/lustre/tests/mmp.sh:296:mount_after_interval()
        = /usr/lib64/lustre/tests/mmp.sh:446:test_2()
        = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
        = /usr/lib64/lustre/tests/mmp.sh:449:main()
      Dumping lctl log to /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.*.1494260419.log
      CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 /usr/sbin/lctl dk > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.debug_log.\$(hostname -s).1494260419.log;
               dmesg > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.dmesg.\$(hostname -s).1494260419.log
      Resetting fail_loc on all nodes...CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
      done.
       mmp test_2: @@@@@@ FAIL: test_2 failed with 22 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4939:error()
        = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
        = /usr/lib64/lustre/tests/mmp.sh:449:main()
      

    Attachments

    Issue Links

    Activity

            [LU-9487] mmp test_2: test_2 failed with 22
            yujian Jian Yu added a comment -

            Yes, Alex. In mmp.sh test_1(), the failover service is likely started before the primary one. I just pushed a patch to check for and stop the failover service.
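
            A minimal sketch of the idea behind that change, assuming the test-framework stop() helper quoted later in this ticket is harmless to call on a facet that is not currently mounted; the helper name cleanup_mmp_facets is hypothetical and the actual fix is in https://review.whamcloud.com/27253:

                # Hypothetical sketch only -- not the code from change 27253.
                # test_1 cannot know which facet won the MMP race, so stop both
                # the primary and the failover facet before the next test runs.
                # Facet names are taken from the test_1 log quoted below.
                cleanup_mmp_facets() {
                    local facet
                    for facet in mds1 mds1failover ost1 ost1failover; do
                        stop $facet || return ${PIPESTATUS[0]}
                    done
                }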

            gerrit Gerrit Updater added a comment -

            Jian Yu (jian.yu@intel.com) uploaded a new patch: https://review.whamcloud.com/27253
            Subject: LU-9487 tests: stop primary and failover services in mmp test 1
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0f5520c5413e32caa55275c4dc7310e1a03a60a5

            bzzz Alex Zhuravlev added a comment -

            I think the root cause is that test #2 expects the primary MDS to be mounted:
            stop_services primary || return ${PIPESTATUS[0]}
            but the mount sequence in mount_after_interval_sub() seems to be racy by design:
            start $facet $device $opts &
            ...
            start $failover_facet $device $opts

            The test was developed by Jian Yu; it'd be helpful to get any hints on this.
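
            To make the race concrete, here is an illustrative sketch (not the actual mount_after_interval_sub() body); it only uses the start() helper quoted above, and shows why the device can end up mounted on the failover facet, so that a later "stop_services primary" has nothing to unmount:

                # Illustration only. The first mount is backgrounded and the
                # second runs concurrently on the failover node; MMP guarantees
                # that only one of the two succeeds, but not which one.
                mmp_race_demo() {
                    local facet=$1 failover_facet=$2 device=$3 opts=$4

                    start $facet $device $opts &
                    local pid=$!

                    start $failover_facet $device $opts
                    local failover_rc=$?

                    wait $pid
                    local primary_rc=$?

                    # If the failover facet won, the primary mount fails (as in
                    # the "failed 22" lines above) and the device stays mounted
                    # on the failover node after the test.
                    echo "primary rc=$primary_rc, failover rc=$failover_rc"
                }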

            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27192
            Subject: LU-9487: debugging patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7881652b82308ef5193efa2cb85292ea540e1723

            bzzz Alex Zhuravlev added a comment -

            == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)
            Starting mds1failover: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
            Starting mds1: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
            Started lustre-MDT0000
            Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
            Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
            Starting ost1failover: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1failover
            Started lustre-OST0000
            Start of /dev/lvm-Role_OSS/P1 on ost1failover failed 22
            Stopping /mnt/lustre-ost1 (opts on trevis-45vm8
            PASS 1 (52s)

            For whatever reason the MDS wasn't stopped by test 1, though I see this in the scripts:
            stop $mds_facet $opts || return ${PIPESTATUS[0]}
            stop $oss_facet $opts || return ${PIPESTATUS[0]}

            Going to add some debugging.
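
            One reading consistent with the log above and with the mtab message in test_2 is that mds1failover won the MMP race and still holds lustre-MDT0000, so stopping the primary $mds_facet finds nothing mounted and the MDT survives into test_2. Below is a hedged sketch of the sort of debugging that could confirm this; do_node() and facet_active_host() are assumed to be the usual test-framework helpers, and the actual debugging patch is https://review.whamcloud.com/27192:

                # Hedged sketch only -- not the code from change 27192.
                # After test_1's cleanup, dump what is still mounted on both the
                # primary and the failover MDS node to see which facet kept
                # /dev/lvm-Role_MDS/P1.
                dump_mmp_mds_mounts() {
                    local facet node
                    for facet in mds1 mds1failover; do
                        node=$(facet_active_host $facet)
                        do_node $node "mount -t lustre; grep ' lustre ' /proc/mounts" || true
                    done
                }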

            bzzz Alex Zhuravlev added a comment -

            The log looks a bit strange...
            [ 307.590154] Lustre: DEBUG MARKER: ----============= acceptance-small: mmp ============---- Mon May 8 16:18:34 UTC 2017
            [ 307.956938] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
            [ 308.111498] Lustre: DEBUG MARKER: == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
            [ 336.896363] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)

            pjones Peter Jones added a comment -

            Alex is looking into this one

            jcasper James Casper (Inactive) added a comment -

            A clarification on this statement: "New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17)."

            This has not been seen in tag testing since November 10, 2016 (master branch, tags 51-56). There are 55 occurrences outside of tag testing (all in failover group testing).

            adilger Andreas Dilger added a comment -

            It looks like the MDS didn't clean up properly at the end of the previous test? This needs to be investigated more closely.

            jcasper James Casper (Inactive) added a comment -

            The test set that follows mmp (replay-single) also fails with the same failure message. replay-single is the last test set in the failover group.

    People

      Assignee: bzzz Alex Zhuravlev
      Reporter: jcasper James Casper (Inactive)
      Votes: 0
      Watchers: 6

    Dates

      Created:
      Updated:
      Resolved: