[LU-1362] replay-dual test_16 fails to remount mdt Created: 02/May/12  Updated: 05/Oct/12  Resolved: 05/Oct/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Server: 2.1.1 in centos 6.2, kernel 2.6.32-220.4.1.el6, x86_64, lustre server 2.1.1-0.2nasS
Client: 2.1.1 in centos 6.2, unpatched kernel 2.6.32-220.4.1.el6, x86_64, lustre client 2.1.1-0.2nasC

1 mds/mgs (service360)
2 osses (service361,service362)
2 clients (service333, service334)

The lustre git repo can be found at
https://github.com/jlan/lustre-nas/tree/nas-2.1.1


Attachments: File REPLAY_DUAL-16.tgz     File replay-dual-14b.tar.bz2     File replay-dual-14b.tar.bz2     File replay-dual-14b.tar.bz2     File replay-dual-14b.tar.bz2     File replay-dual-16.tar.bz2     File replay-dual-16.tar.bz2     File replay-dual-16.tar.bz2    
Severity: 3
Rank (Obsolete): 6409

 Description   

Replay-dual test 16 failed:

== replay-dual test 16: fail MDS during recovery (3571) == 17:38:52 (1335573532)
Filesystem 1K-blocks Used Available Use% Mounted on
service360@o2ib:/lustre
3937056 205112 3531816 6% /mnt/nbp0-1
total: 25 creates in 0.04 seconds: 678.21 creates/second
total: 1 creates in 0.00 seconds: 389.26 creates/second
Failing mds1 on node service360
Stopping /mnt/mds1 (opts
affected facets: mds1
Failover mds1 to service360
17:39:07 (1335573547) waiting for service360 network 900 secs ...
17:39:07 (1335573547) network interface is UP
Starting mds1: -o errors=panic,acl /dev/sdb1 /mnt/mds1
service360: mount.lustre: mount /dev/sdb1 at /mnt/mds1 failed: Invalid argument
service360: This may have multiple causes.
service360: Are the mount options correct?
service360: Check the syslog for more info.
mount -t lustre /dev/sdb1 /mnt/mds1
Start of /dev/sdb1 on mds1 failed 22
replay-dual test_16: @@@@@@ FAIL: Restart of mds1 failed!
Dumping lctl log to /var/acc-sm/test_logs//1335573120/replay-dual.test_16.*.1335573548.log
tar: Removing leading `/' from member names
/var/acc-sm/test_logs//1335573120/replay-dual-1335573548.tar.bz2
FAIL 16 (45s)

The "Invalid argument" was about extents, which we do not turn on on MDS.

The replay-dual.test_16.dmesg.service360.1335573548.log seemed to suggest the problem
was a corrupted filesystem:
LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts:
LDISKFS-fs warning (device sdb1): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
LDISKFS-fs (sdb1): ldiskfs_check_descriptors: Checksum for group 0 failed (27004!=29265)
LDISKFS-fs (sdb1): group descriptors corrupted!
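
For reference, one rough way to inspect the MDT device for the two problems the dmesg excerpt reports (missing extents feature and bad group descriptors) might be the commands below. This is only an illustrative sketch, not part of the original report; the device path /dev/sdb1 is taken from the log, and the commands would be run on the MDS while the target is unmounted:

dumpe2fs -h /dev/sdb1 | grep -i 'filesystem features'   # check whether "extent" is listed
e2fsck -fn /dev/sdb1                                     # read-only check of the group descriptors
# tune2fs -O extents /dev/sdb1                           # only if extents are actually wanted on the MDT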

LU-699 seemed to have encountered a data corruption problem in replay-dual test_1. I applied the patch and rebuilt the lustre server package, but the test still failed.

REPLAY_DUAL-16.tgz is attached.

The failure is 100% reproducible.
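
For completeness, a sketch of one way to rerun just this subtest with the test framework is shown below; the install path and environment setup are assumptions about this site and may differ:

# from the lustre/tests directory of the installed test framework
cd /usr/lib64/lustre/tests
ONLY=16 sh replay-dual.sh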

Could the data corruption problem be caused by trying to fail over the MDS to the same node?
In other words, is this a test-case problem or a real problem?



 Comments   
Comment by Peter Jones [ 02/May/12 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Jay Lan (Inactive) [ 03/May/12 ]

Attached two replay-dual-*.tar.bz2, one for test_14b and the other for test_16.

Comment by Jay Lan (Inactive) [ 03/May/12 ]

Sorry, I ended up attaching replay-dual-14b.tar.bz2 and replay-dual-16.tar.bz2 multiple times. Please clean the extra copies up.

This site formats the MDS without the extents option, but the OSSes with it. To get the "Invalid argument" errors out of the way in the testing, I added "noextents" to MDS_MOUNT_OPTIONS and reran the tests.
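
Concretely, the change amounts to something like the following in the site test configuration; the exact file and variable spelling are assumptions, and the option string simply mirrors the mount options shown in the log above:

# e.g. in the config sourced by the test framework (cfg/local.sh or a site override)
MDS_MOUNT_OPTIONS="-o errors=panic,acl,noextents"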

BTW, I always fail on tests 14b, 16, 20, and 21a. I used to think they were caused by the same problem, but it appears that test 14b and test 20 have one failure signature, while test 16 and test 21a share another. So I attached the test_logs of test 14b and test 16.

Comment by Lai Siyao [ 18/May/12 ]

This looks to be the same issue as LU-482. Jay, are you using a VM in your testing environment?

Comment by Jay Lan (Inactive) [ 18/May/12 ]

No, the test system was not running in a VM.

Comment by Lai Siyao [ 20/May/12 ]

Will it fail if the MDS failover node is a different node?

Comment by Jay Lan (Inactive) [ 22/May/12 ]

I do not have an extra machine to use as an MDS failover node.

Comment by Lai Siyao [ 29/May/12 ]

Jay, replay_barrier() calls mcreate after syncing the target, which looks suspicious. I have a patch at http://review.whamcloud.com/#change,2931; could you help verify it?
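
For context, the sequence being described is roughly as follows. This is a simplified sketch of the shape of replay_barrier(), not the actual test-framework.sh code, and the device/variable names are placeholders:

# flush the MDT, freeze further commits, then mcreate a file so its creation
# exists only in the client's replay log
do_facet mds1 sync
do_facet mds1 "$LCTL --device $MDT_SVC readonly"      # $MDT_SVC: the MDT service device (placeholder)
do_node $CLIENT1 mcreate $MOUNT/replay-barrier-file   # the mcreate call referred to above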

Comment by Jay Lan (Inactive) [ 30/May/12 ]

Hi Siyao, unfortunately I have to report that the patch did not help.

Comment by Brian Murrell (Inactive) [ 04/Oct/12 ]

Jay,

You didn't happen to be using an iSCSI device as your target, did you? Was it a Linux iSCSI target or one from some vendor?

Comment by Jay Lan (Inactive) [ 05/Oct/12 ]

The target was an ATA HDS725050KLA360.

Comment by Jay Lan (Inactive) [ 05/Oct/12 ]

I set up my test machines and reran the test.

The test passed with a 2.1.3 server against both a sles11sp1 2.1.3 client and a centos6.3 2.1.3 client.

The failure I reported was with a 2.1.1 centos client.

We can close the case.

Comment by Peter Jones [ 05/Oct/12 ]

ok thanks Jay!
