[LU-5429] insanity test_1: FAIL: Start of /dev/lvm-Role_MDS/P1 on mds1 failed 5
Created: 30/Jul/14  Updated: 22/Nov/18  Resolved: 22/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Jian Yu
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

Lustre Branch: master
MDSCOUNT=4


Issue Links:
Related: is related to LU-4877 "(mdt_fix_reply()) ASSERTION( md_packed..." (Resolved)
Severity: 3
Rank (Obsolete): 15115

 Description   

While running the insanity test with MDSCOUNT=4 on the master branch, test 1 failed as follows (a reproduction sketch follows the test log):

== insanity test 1: MDS/MDS failure == 22:28:38 (1406672918)
CMD: shadow-23vm4 grep -c /mnt/mds1' ' /proc/mounts
Stopping /mnt/mds1 (opts:) on shadow-23vm4
CMD: shadow-23vm4 umount -d /mnt/mds1
CMD: shadow-23vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
Failover mds1 to shadow-23vm4
CMD: shadow-23vm4 grep -c /mnt/mds2' ' /proc/mounts
Stopping /mnt/mds2 (opts:) on shadow-23vm4
CMD: shadow-23vm4 umount -d /mnt/mds2
CMD: shadow-23vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
Reintegrating MDS2
22:29:17 (1406672957) waiting for shadow-23vm4 network 900 secs ...
22:29:17 (1406672957) network interface is UP
CMD: shadow-23vm4 hostname
CMD: shadow-23vm4 mkdir -p /mnt/mds2
CMD: shadow-23vm4 test -b /dev/lvm-Role_MDS/P2
Starting mds2:   /dev/lvm-Role_MDS/P2 /mnt/mds2
CMD: shadow-23vm4 mkdir -p /mnt/mds2; mount -t lustre   		                   /dev/lvm-Role_MDS/P2 /mnt/mds2
CMD: shadow-23vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/sbin:/usr/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all -lnet -lnd -pinger\" 4 
CMD: shadow-23vm4 e2label /dev/lvm-Role_MDS/P2 2>/dev/null
Started lustre-MDT0001
22:29:39 (1406672979) waiting for shadow-23vm4 network 900 secs ...
22:29:39 (1406672979) network interface is UP
CMD: shadow-23vm4 hostname
CMD: shadow-23vm4 mkdir -p /mnt/mds1
CMD: shadow-23vm4 test -b /dev/lvm-Role_MDS/P1
Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/mds1
CMD: shadow-23vm4 mkdir -p /mnt/mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/mds1
shadow-23vm4: mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/mds1 failed: Input/output error
shadow-23vm4: Is the MGS running?
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 5
 insanity test_1: @@@@@@ FAIL: test_1 failed with 5
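
For reference, a minimal sketch of how such a run is typically launched with the upstream Lustre test framework. The script path and the MDSCOUNT/ONLY variables follow standard lustre/tests conventions; the exact autotest invocation is an assumption here, not taken from this report:

# Run insanity test 1 with four MDTs (sketch, assuming a configured test cluster):
MDSCOUNT=4 ONLY=1 sh lustre/tests/insanity.sh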

Console log on MDS node:

16:29:54:LustreError: 15c-8: MGC10.1.5.25@tcp: The configuration from log 'lustre-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
16:29:54:Lustre: Evicted from MGS (at MGC10.1.5.25@tcp_0) after server handle changed from 0xcc4f15973e89079d to 0xcc4f15973e891c96
16:29:54:LustreError: 11730:0:(obd_mount_server.c:1299:server_start_targets()) failed to start server lustre-MDT0000: -5
16:29:54:LustreError: 11730:0:(obd_mount_server.c:1771:server_fill_super()) Unable to start targets: -5
16:29:54:LustreError: 11730:0:(obd_mount_server.c:1498:server_put_super()) no obd lustre-MDT0000
16:45:01:Lustre: server umount lustre-MDT0000 complete
16:45:01:LustreError: 11730:0:(obd_mount.c:1342:lustre_fill_super()) Unable to mount  (-5)
16:45:01:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
16:45:01:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  insanity test_1: @@@@@@ FAIL: test_1 failed with 5 
16:45:01:Lustre: DEBUG MARKER: insanity test_1: @@@@@@ FAIL: test_1 failed with 5
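
Here -5 is EIO: the MGC on the restarting node failed to fetch the lustre-MDT0000 configuration log from the MGS, after the MGS server handle changed and the node was evicted. A hedged diagnostic sketch using standard lctl commands (not taken from this report) for checking the MGS before retrying the mount:

# On the MGS node (10.1.5.25 in this log):
lctl dl | grep -i mgs          # an MGS device should be listed and in UP state
# From the failing MDS node, confirm LNet reachability of the MGS NID:
lctl ping 10.1.5.25@tcp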

Maloo report: https://testing.hpdd.intel.com/test_sets/f68fd6dc-177d-11e4-a76a-5254006e85c2



 Comments   
Comment by Jian Yu [ 30/Jul/14 ]

More instances on the master branch:
https://testing.hpdd.intel.com/test_sets/ac193a30-177d-11e4-a76a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e0f03cfa-177c-11e4-a76a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/03f91294-177d-11e4-a76a-5254006e85c2

Comment by Andreas Dilger [ 30/Jul/14 ]

It looks like the first two test runs (for patch http://review.whamcloud.com/11268, which is just master) are actually hitting LU-5420. The remaining runs (for http://review.whamcloud.com/11269, with the LU-4877 patch reverted) are possibly hitting a different problem, as described here.
