Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.10.0
    • Lustre 2.10.0
    • None
    • trevis-45, failover
        EL7, master branch, v2.9.57, b3575, ldiskfs
    • 3
    • 9223372036854775807

    Description

      https://testing.hpdd.intel.com/test_sessions/adcc6618-5344-4cda-b96f-04bee182c49b

      New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17).

      The same failure message is seen with 7 subsequent subtests.

      From test_log:

      CMD: trevis-45vm3 mkdir -p /mnt/lustre-mds1failover; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
      trevis-45vm3 (MDS1-1): mount.lustre: according to /etc/mtab /dev/mapper/lvm--Role_MDS-P1 is already mounted on /mnt/lustre-mds1failover
      Start of /dev/lvm-Role_MDS/P1 on mds1failover failed 17
      trevis-45vm7 (MDS1-2): mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/lustre-mds1 failed: Invalid argument
      trevis-45vm7: This may have multiple causes.
      trevis-45vm7: Are the mount options correct?
      trevis-45vm7: Check the syslog for more info.
      Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
       mmp test_2: @@@@@@ FAIL: mount failure on failover pair mds1,mds1failover 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4927:error_noexit()
        = /usr/lib64/lustre/tests/mmp.sh:284:mount_after_interval_sub()
        = /usr/lib64/lustre/tests/mmp.sh:296:mount_after_interval()
        = /usr/lib64/lustre/tests/mmp.sh:446:test_2()
        = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
        = /usr/lib64/lustre/tests/mmp.sh:449:main()
      Dumping lctl log to /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.*.1494260419.log
      CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 /usr/sbin/lctl dk > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.debug_log.\$(hostname -s).1494260419.log;
               dmesg > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.dmesg.\$(hostname -s).1494260419.log
      Resetting fail_loc on all nodes...CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      done.
       mmp test_2: @@@@@@ FAIL: test_2 failed with 22 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4939:error()
        = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
        = /usr/lib64/lustre/tests/mmp.sh:449:main()
      

      Attachments

        Issue Links

          Activity

            [LU-9487] mmp test_2: test_2 failed with 22
            pjones Peter Jones added a comment -

            Landed for 2.10

            pjones Peter Jones added a comment - Landed for 2.10

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27253/
            Subject: LU-9487 tests: stop primary and failover services in mmp test 1
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a23bb09a3bc447e4286218416d1dd5d096724bfc

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27253/ Subject: LU-9487 tests: stop primary and failover services in mmp test 1 Project: fs/lustre-release Branch: master Current Patch Set: Commit: a23bb09a3bc447e4286218416d1dd5d096724bfc
            yujian Jian Yu added a comment -

            Yes, Alex. In mmp.sh test_1(), it's likely the failover service is started before the primary one, I just pushed a patch to check and stop the failover service.

            yujian Jian Yu added a comment - Yes, Alex. In mmp.sh test_1(), it's likely the failover service is started before the primary one, I just pushed a patch to check and stop the failover service.

            Jian Yu (jian.yu@intel.com) uploaded a new patch: https://review.whamcloud.com/27253
            Subject: LU-9487 tests: stop primary and failover services in mmp test 1
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0f5520c5413e32caa55275c4dc7310e1a03a60a5

            gerrit Gerrit Updater added a comment - Jian Yu (jian.yu@intel.com) uploaded a new patch: https://review.whamcloud.com/27253 Subject: LU-9487 tests: stop primary and failover services in mmp test 1 Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 0f5520c5413e32caa55275c4dc7310e1a03a60a5

            I think the root cause is that test #2 expects the primary MDS to be mounted:
            stop_services primary || return ${PIPESTATUS[0]}
            but actually the mount process in mount_after_interval_sub() seem to be racy by desing:
            start $facet $device $opts &
            ...
            start $failover_facet $device $opts

            the test was developed by Jian Yu, it'd helpful to get any hints on this.

            bzzz Alex Zhuravlev added a comment - I think the root cause is that test #2 expects the primary MDS to be mounted: stop_services primary || return ${PIPESTATUS [0] } but actually the mount process in mount_after_interval_sub() seem to be racy by desing: start $facet $device $opts & ... start $failover_facet $device $opts the test was developed by Jian Yu, it'd helpful to get any hints on this.

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27192
            Subject: LU-9487: debugging patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7881652b82308ef5193efa2cb85292ea540e1723

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27192 Subject: LU-9487 : debugging patch Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 7881652b82308ef5193efa2cb85292ea540e1723

            == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)
            Starting mds1failover: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
            Starting mds1: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
            Started lustre-MDT0000
            Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
            Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
            Starting ost1failover: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1failover
            Started lustre-OST0000
            Start of /dev/lvm-Role_OSS/P1 on ost1failover failed 22
            Stopping /mnt/lustre-ost1 (opts on trevis-45vm8
            PASS 1 (52s)

            for whatever reason MDS wasn't stopped by test 1, though I see in the scripts this:
            stop $mds_facet $opts || return ${PIPESTATUS[0]}
            stop $oss_facet $opts || return ${PIPESTATUS[0]}

            going to add a debugging

            bzzz Alex Zhuravlev added a comment - == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343) Starting mds1failover: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover Starting mds1: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1 Started lustre-MDT0000 Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22 Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1 Starting ost1failover: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1failover Started lustre-OST0000 Start of /dev/lvm-Role_OSS/P1 on ost1failover failed 22 Stopping /mnt/lustre-ost1 (opts on trevis-45vm8 PASS 1 (52s) for whatever reason MDS wasn't stopped by test 1, though I see in the scripts this: stop $mds_facet $opts || return ${PIPESTATUS [0] } stop $oss_facet $opts || return ${PIPESTATUS [0] } going to add a debugging

            the log looks a bit strange..
            [ 307.590154] Lustre: DEBUG MARKER: ----============= acceptance-small: mmp ============---- Mon May 8 16:18:34 UTC 2017
            [ 307.956938] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
            [ 308.111498] Lustre: DEBUG MARKER: == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
            [ 336.896363] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)

            bzzz Alex Zhuravlev added a comment - the log looks a bit strange.. [ 307.590154] Lustre: DEBUG MARKER: ---- ============= acceptance-small: mmp ============ ---- Mon May 8 16:18:34 UTC 2017 [ 307.956938] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314) [ 308.111498] Lustre: DEBUG MARKER: == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314) [ 336.896363] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)
            pjones Peter Jones added a comment -

            Alex is looking into this one

            pjones Peter Jones added a comment - Alex is looking into this one

            A clarification on this statement: "New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17)."

            This has not been seen in tag testing since November 10, 2016 (master branch, tags 51-56). There are 55 occurrences outside of tag testing (all in failover group testing).

            jcasper James Casper (Inactive) added a comment - A clarification on this statement: "New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17)." This has not been seen in tag testing since November 10, 2016 (master branch, tags 51-56). There are 55 occurrences outside of tag testing (all in failover group testing).

            People

              bzzz Alex Zhuravlev
              jcasper James Casper (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: