Details
-
Bug
-
Resolution: Cannot Reproduce
-
Minor
-
None
-
Lustre 2.1.1
-
None
-
Server: 2.1.1 in centos 6.2, kernel 2.6.32-220.4.1.el6, x86_64, lustre server 2.1.1-0.2nasS
Client: 2.1.1 in centos 6.2, unpatched kernel 2.6.32-220.4.1.el6, x86_64. lustre client 2.1.1-0.2nasC
1 mds/mgs (service360)
2 osses (service361,service362)
2 clients (service333, service334)
The lustre git repo can be found at
https://github.com/jlan/lustre-nas/tree/nas-2.1.1
-
3
-
6409
Description
Replay-dual test 16 failed:
== replay-dual test 16: fail MDS during recovery (3571) == 17:38:52 (1335573532)
Filesystem 1K-blocks Used Available Use% Mounted on
service360@o2ib:/lustre
3937056 205112 3531816 6% /mnt/nbp0-1
total: 25 creates in 0.04 seconds: 678.21 creates/second
total: 1 creates in 0.00 seconds: 389.26 creates/second
Failing mds1 on node service360
Stopping /mnt/mds1 (opts:)
affected facets: mds1
Failover mds1 to service360
17:39:07 (1335573547) waiting for service360 network 900 secs ...
17:39:07 (1335573547) network interface is UP
Starting mds1: -o errors=panic,acl /dev/sdb1 /mnt/mds1
service360: mount.lustre: mount /dev/sdb1 at /mnt/mds1 failed: Invalid argument
service360: This may have multiple causes.
service360: Are the mount options correct?
service360: Check the syslog for more info.
mount -t lustre /dev/sdb1 /mnt/mds1
Start of /dev/sdb1 on mds1 failed 22
replay-dual test_16: @@@@@@ FAIL: Restart of mds1 failed!
Dumping lctl log to /var/acc-sm/test_logs//1335573120/replay-dual.test_16.*.1335573548.log
tar: Removing leading `/' from member names
/var/acc-sm/test_logs//1335573120/replay-dual-1335573548.tar.bz2
FAIL 16 (45s)
The "Invalid argument" was about extents, a feature we do not enable on the MDS.
The replay-dual.test_16.dmesg.service360.1335573548.log suggested the underlying problem
was filesystem corruption:
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts:
LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
LDISKFS-fs (sdb1): ldiskfs_check_descriptors: Checksum for group 0 failed (27004!=29265)
LDISKFS-fs (sdb1): group descriptors corrupted!
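For reference, a group-descriptor checksum failure like the one above can be inspected offline with the standard e2fsprogs tools. The sketch below uses a small throwaway image file rather than the real /dev/sdb1 (so it is safe to run anywhere); the same dumpe2fs/e2fsck invocations apply to the MDT device, with e2fsck run with -n so nothing is modified:

```shell
# Self-contained illustration on a throwaway image file, NOT the real MDT device.
dd if=/dev/zero of=/tmp/mdt-demo.img bs=1M count=8 2>/dev/null

# mke2fs writes the group descriptors whose checksums ldiskfs verifies at mount
mke2fs -F -q -t ext3 /tmp/mdt-demo.img

# Dump the superblock; the feature list is where 'extent' would (or would not) appear
dumpe2fs -h /tmp/mdt-demo.img | grep -i 'features'

# Read-only check: on the corrupted MDT this should report the bad group descriptors
e2fsck -fn /tmp/mdt-demo.img

rm -f /tmp/mdt-demo.img
```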
LU-699 seemed to have encountered a similar data corruption problem in replay-dual test_1. I applied that patch and rebuilt the lustre server package, but the test still failed.
REPLAY_DUAL-16.tgz is attached.
The failure is 100% reproducible.
Could the data corruption be caused by trying to fail over the MDS to the same node?
In other words, is this a test-case problem or a real problem?
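For anyone trying to reproduce: the failure was hit through the standard Lustre test framework (acc-sm). Assuming the usual lustre/tests layout and an already-configured test filesystem, the single test can be re-run with something like:

```shell
# Hypothetical invocation sketch -- adjust the path and test config for the local setup.
cd /usr/lib64/lustre/tests   # or wherever lustre/tests is installed
ONLY=16 sh replay-dual.sh    # run only replay-dual test_16
```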