[LU-9487] mmp test_2: test_2 failed with 22 Created: 10/May/17  Updated: 03/Jun/17  Resolved: 03/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Major
Reporter: James Casper Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: None
Environment:

trevis-45, failover
EL7, master branch, v2.9.57, b3575, ldiskfs


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/adcc6618-5344-4cda-b96f-04bee182c49b

New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17).

The same failure message is seen with 7 subsequent subtests.

From test_log:

CMD: trevis-45vm3 mkdir -p /mnt/lustre-mds1failover; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
trevis-45vm3 (MDS1-1): mount.lustre: according to /etc/mtab /dev/mapper/lvm--Role_MDS-P1 is already mounted on /mnt/lustre-mds1failover
Start of /dev/lvm-Role_MDS/P1 on mds1failover failed 17
trevis-45vm7 (MDS1-2): mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/lustre-mds1 failed: Invalid argument
trevis-45vm7: This may have multiple causes.
trevis-45vm7: Are the mount options correct?
trevis-45vm7: Check the syslog for more info.
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
 mmp test_2: @@@@@@ FAIL: mount failure on failover pair mds1,mds1failover 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4927:error_noexit()
  = /usr/lib64/lustre/tests/mmp.sh:284:mount_after_interval_sub()
  = /usr/lib64/lustre/tests/mmp.sh:296:mount_after_interval()
  = /usr/lib64/lustre/tests/mmp.sh:446:test_2()
  = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
  = /usr/lib64/lustre/tests/mmp.sh:449:main()
Dumping lctl log to /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.*.1494260419.log
CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 /usr/sbin/lctl dk > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.debug_log.\$(hostname -s).1494260419.log;
         dmesg > /test_logs/2017-05-08/lustre-master-el7-x86_64--failover--1_17_1__3575__-70321399293120-051547/mmp.test_2.dmesg.\$(hostname -s).1494260419.log
Resetting fail_loc on all nodes...CMD: trevis-45vm1.trevis.hpdd.intel.com,trevis-45vm5,trevis-45vm6,trevis-45vm7,trevis-45vm8 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
done.
 mmp test_2: @@@@@@ FAIL: test_2 failed with 22 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4939:error()
  = /usr/lib64/lustre/tests/test-framework.sh:5215:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5254:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5101:run_test()
  = /usr/lib64/lustre/tests/mmp.sh:449:main()
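
For reference, the return codes 17 and 22 above are standard errno values: 17 is EEXIST, matching the "already mounted" complaint for mds1failover, and 22 is EINVAL, matching the "Invalid argument" failure for mds1. A quick way to confirm (assuming the errno(1) utility from moreutils is available; not part of the original report):

$ errno 17
EEXIST 17 File exists
$ errno 22
EINVAL 22 Invalid argument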


 Comments   
Comment by James Casper [ 10/May/17 ]

The test set that follows mmp (replay-single) also fails with the same failure message. replay-single is the last test set in the failover group.

Comment by Andreas Dilger [ 11/May/17 ]

It looks like the MDS didn't clean up properly at the end of the previous test? This needs to be investigated more closely.

Comment by James Casper [ 11/May/17 ]

A clarification on this statement: "New for tag 2.9.57 - not seen in past 6 months (Nov '16 - May '17)."

This has not been seen in tag testing since November 10, 2016 (master branch, tags 51-56). There are 55 occurrences outside of tag testing (all in failover group testing).

Comment by Peter Jones [ 16/May/17 ]

Alex is looking into this one

Comment by Alex Zhuravlev [ 17/May/17 ]

The log looks a bit strange:
[ 307.590154] Lustre: DEBUG MARKER: ----============= acceptance-small: mmp ============---- Mon May 8 16:18:34 UTC 2017
[ 307.956938] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
[ 308.111498] Lustre: DEBUG MARKER: == mmp test complete, duration -o sec ================================================================ 16:18:34 (1494260314)
[ 336.896363] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)

Comment by Alex Zhuravlev [ 18/May/17 ]

== mmp test 1: two mounts at the same time =========================================================== 16:19:03 (1494260343)
Starting mds1failover: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1failover
Starting mds1: /dev/lvm-Role_MDS/P1 /mnt/lustre-mds1
Started lustre-MDT0000
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 22
Starting ost1: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1
Starting ost1failover: /dev/lvm-Role_OSS/P1 /mnt/lustre-ost1failover
Started lustre-OST0000
Start of /dev/lvm-Role_OSS/P1 on ost1failover failed 22
Stopping /mnt/lustre-ost1 (opts:) on trevis-45vm8
PASS 1 (52s)

For whatever reason the MDS wasn't stopped by test 1, though I see this in the scripts:
stop $mds_facet $opts || return ${PIPESTATUS[0]}
stop $oss_facet $opts || return ${PIPESTATUS[0]}

Going to add some debugging.
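
A sketch of the kind of instrumentation such a debugging patch could add (an assumption for illustration, not the contents of the actual patch at https://review.whamcloud.com/27192):

# Hypothetical instrumentation around the teardown in mmp.sh test_1;
# do_facet runs a command on the node backing the given facet.
do_facet $MMP_MDS "grep lustre /proc/mounts" || true    # mount state before stop
stop $mds_facet $opts || return ${PIPESTATUS[0]}
stop $oss_facet $opts || return ${PIPESTATUS[0]}
do_facet $MMP_MDS "grep lustre /proc/mounts" || true    # anything left mounted?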

Comment by Gerrit Updater [ 18/May/17 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: https://review.whamcloud.com/27192
Subject: LU-9487: debugging patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7881652b82308ef5193efa2cb85292ea540e1723

Comment by Alex Zhuravlev [ 23/May/17 ]

I think the root cause is that test #2 expects the primary MDS to be mounted:
stop_services primary || return ${PIPESTATUS[0]}
but the mount process in mount_after_interval_sub() seems to be racy by design:
start $facet $device $opts &
...
start $failover_facet $device $opts

The test was developed by Jian Yu; it'd be helpful to get any hints on this.
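
To make the race concrete, a paraphrased sketch of the pattern (only the two start calls are quoted from mmp.sh above; the surrounding lines are assumptions for illustration):

# Paraphrased sketch of mount_after_interval_sub(), not the verbatim source.
start $facet $device $opts &             # mount on the primary node, backgrounded
mount_pid=$!
sleep $interval                          # assumed wait tied to the MMP check interval
start $failover_facet $device $opts      # mount the same device on the failover node
wait $mount_pid
# MMP guarantees at most one of the two mounts succeeds, but which one wins
# is timing-dependent, so test_2's "stop_services primary" can run against
# an unmounted primary while the failover mount still holds the device.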

Comment by Gerrit Updater [ 23/May/17 ]

Jian Yu (jian.yu@intel.com) uploaded a new patch: https://review.whamcloud.com/27253
Subject: LU-9487 tests: stop primary and failover services in mmp test 1
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0f5520c5413e32caa55275c4dc7310e1a03a60a5

Comment by Jian Yu [ 23/May/17 ]

Yes, Alex. In mmp.sh test_1(), it's likely that the failover service is started before the primary one. I just pushed a patch to check for and stop the failover service.
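
A minimal sketch of that approach (assumed from the patch subject; the mounted-check and the $mntpt variable are hypothetical, see https://review.whamcloud.com/27253 for the real change):

# Sketch only: after the racing mounts in test_1, stop whichever facet
# actually holds the device so the next test starts from a clean state.
for f in $facet $failover_facet; do          # e.g. mds1 and mds1failover
    if do_facet $f "grep -q $mntpt /proc/mounts"; then   # hypothetical mounted check
        stop $f || return ${PIPESTATUS[0]}
    fi
done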

Comment by Gerrit Updater [ 03/Jun/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/27253/
Subject: LU-9487 tests: stop primary and failover services in mmp test 1
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a23bb09a3bc447e4286218416d1dd5d096724bfc

Comment by Peter Jones [ 03/Jun/17 ]

Landed for 2.10
