Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.10.0
    • Fix Version/s: Lustre 2.10.0
    • Labels: None
    • Environment: trevis-45, failover
        EL7, master branch, v2.9.57, b3575, ldiskfs
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      https://testing.hpdd.intel.com/test_sessions/adcc6618-5344-4cda-b96f-04bee182c49b

      New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17).

      The same failure message is seen in 7 subsequent subtests.

      From test_log:

      CMD: trevis-45vm3 mkdir -p /mnt/lustre-mds1failover; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
      trevis-45vm3 (MDS1-1): mount.lustre: according to /etc/mtab /dev/mapper/lvm--Role_MDS-P1 is already mounted on /mnt/lustre-mds1failover
      Start of /dev/lvm-Role_MDS/P1 on mds1failover failed 17
      trevis-45vm7 (MDS1-2): mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/lustre-mds1 failed: Invalid argument
      trevis-45vm7: This may have multiple causes.
      trevis-45vm7: Are the mount options correct?
      trevis-45vm7: Check the syslog for more info.
      Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
       mmp test_2: @@@@@@ FAIL: mount failure on failover pair mds1,mds1failover 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4927:error_noexit()
        = /usr/lib64/lustre/tests/mmp.sh:284:mount_after_interval_sub()
        = /usr/lib64/lustre/tests/mmp.sh:296:mount_after_interval()
        = /usr/lib64/lustre/tests/mmp.sh:446:test_2()
        = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
        = /usr/lib64/lustre/tests/mmp.sh:449:main()
      Dumping lctl log to /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.*.1494260419.log
      CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 /usr/sbin/lctl dk > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.debug_log.\$(hostname -s).1494260419.log;
               dmesg > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.dmesg.\$(hostname -s).1494260419.log
      Resetting fail_loc on all nodes...CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
      done.
       mmp test_2: @@@@@@ FAIL: test_2 failed with 22 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:4939:error()
        = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
        = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
        = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
        = /usr/lib64/lustre/tests/mmp.sh:449:main()
      

    Attachments

    Issue Links

    Activity

            [LU-9487] mmp test_2: test_2 failed with 22
            yujian Jian Yu added a comment -

            Yes, Alex. In mmp.sh test_1(), the failover service is likely started before the primary one. I just pushed a patch to check for and stop the failover service.
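
            A minimal sketch of the idea behind that change, assuming the test-framework stop() helper quoted later in this ticket is harmless to call on a facet that is not currently mounted; the helper name cleanup_mmp_facets is hypothetical and the actual fix is in https://review.whamcloud.com/27253:

                # Hypothetical sketch only -- not the code from change 27253.
                # test_1 cannot know which facet won the MMP race, so stop both
                # the primary and the failover facet before the next test runs.
                # Facet names are taken from the test_1 log quoted below.
                cleanup_mmp_facets() {
                    local facet
                    for facet in mds1 mds1failover ost1 ost1failover; do
                        stop $facet || return ${PIPESTATUS[0]}
                    done
                }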

            gerrit Gerrit Updater added a comment -

            Jian Yu (jian.yu@intel.com) uploaded a new patch: https://review.whamcloud.com/27253
            Subject: LU-9487 tests: stop primary and failover services in mmp test 1
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0f5520c5413e32caa55275c4dc7310e1a03a60a5

            bzzz Alex Zhuravlev added a comment -

            I think the root cause is that test #2 expects the primary MDS to be mounted:
            stop_services primary || return ${PIPESTATUS[0]}
            but the mount sequence in mount_after_interval_sub() seems to be racy by design:
            start $facet $device $opts &
            ...
            start $failover_facet $device $opts

            The test was developed by Jian Yu; it'd be helpful to get any hints on this.
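
            To make the race concrete, here is an illustrative sketch (not the actual mount_after_interval_sub() body); it only uses the start() helper quoted above, and shows why the device can end up mounted on the failover facet, so that a later "stop_services primary" has nothing to unmount:

                # Illustration only. The first mount is backgrounded and the
                # second runs concurrently on the failover node; MMP guarantees
                # that only one of the two succeeds, but not which one.
                mmp_race_demo() {
                    local facet=$1 failover_facet=$2 device=$3 opts=$4

                    start $facet $device $opts &
                    local pid=$!

                    start $failover_facet $device $opts
                    local failover_rc=$?

                    wait $pid
                    local primary_rc=$?

                    # If the failover facet won, the primary mount fails (as in
                    # the "failed 22" lines above) and the device stays mounted
                    # on the failover node after the test.
                    echo "primary rc=$primary_rc, failover rc=$failover_rc"
                }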

            gerrit Gerrit Updater added a comment -

            Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27192
            Subject: LU-9487: debugging patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7881652b82308ef5193efa2cb85292ea540e1723

            bzzz Alex Zhuravlev added a comment -

            == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)
            Starting mds1failover: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
            Starting mds1: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
            Started lustre-MDT0000
            Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
            Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
            Starting ost1failover: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1failover
            Started lustre-OST0000
            Start of /dev/lvm-Role_OSS/P1 on ost1failover failed 22
            Stopping /mnt/lustre-ost1 (opts on trevis-45vm8
            PASS 1 (52s)

            For whatever reason the MDS wasn't stopped by test 1, though I see this in the scripts:
            stop $mds_facet $opts || return ${PIPESTATUS[0]}
            stop $oss_facet $opts || return ${PIPESTATUS[0]}

            Going to add some debugging.
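
            One reading consistent with the log above and with the mtab message in test_2 is that mds1failover won the MMP race and still holds lustre-MDT0000, so stopping the primary $mds_facet finds nothing mounted and the MDT survives into test_2. Below is a hedged sketch of the sort of debugging that could confirm this; do_node() and facet_active_host() are assumed to be the usual test-framework helpers, and the actual debugging patch is https://review.whamcloud.com/27192:

                # Hedged sketch only -- not the code from change 27192.
                # After test_1's cleanup, dump what is still mounted on both the
                # primary and the failover MDS node to see which facet kept
                # /dev/lvm-Role_MDS/P1.
                dump_mmp_mds_mounts() {
                    local facet node
                    for facet in mds1 mds1failover; do
                        node=$(facet_active_host $facet)
                        do_node $node "mount -t lustre; grep ' lustre ' /proc/mounts" || true
                    done
                }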

            bzzz Alex Zhuravlev added a comment -

            The log looks a bit strange...
            [ 307.590154] Lustre: DEBUG MARKER: ----============= acceptance-small: mmp ============---- Mon May 8 16:18:34 UTC 2017
            [ 307.956938] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
            [ 308.111498] Lustre: DEBUG MARKER: == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
            [ 336.896363] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)

            pjones Peter Jones added a comment -

            Alex is looking into this one

            jcasper James Casper (Inactive) added a comment -

            A clarification on this statement: "New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17)."

            This has not been seen in tag testing since November 10, 2016 (master branch, tags 51-56). There are 55 occurrences outside of tag testing (all in failover group testing).

            adilger Andreas Dilger added a comment -

            It looks like the MDS didn't clean up properly at the end of the previous test? This needs to be investigated more closely.

            jcasper James Casper (Inactive) added a comment -

            The test set that follows mmp (replay-single) also fails with the same failure message. replay-single is the last test set in the failover group.

    People

      Assignee: bzzz Alex Zhuravlev
      Reporter: jcasper James Casper (Inactive)
      Votes: 0
      Watchers: 6

    Dates

      Created:
      Updated:
      Resolved: