[LU-1275] Lustre 2.1.1 REPLAY_SINGLE test_0a FAIL: Restart of mds failed Created: 30/Mar/12  Updated: 05/Mar/14  Resolved: 05/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1, Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Minh Diep
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Server runs centos 6.2, ofed-1.5.4.1, Lustre 2.1.1.
Client runs sles11sp1, ofed-1.5.4.1, Lustre 1.8.6.
MGS and MDS use the same device. Two OSSes. Two clients.


Attachments: File nas.v3.sh     File ncli_nas.v3.sh     File replay-single.s360.0406.FAIL     File replay-single.s360.0406.PASS     File replay-single.test_0a.debug_log.service360.log.1     File replay-single.test_0a.debug_log.service360.log.2    
Severity: 3
Rank (Obsolete): 6096

 Description   

My acc-sm setup has been used to test 1.8.5, 1.8.6, and 1.8.7 successfully.
This is the first time I have run acc-sm against 2.1.1.
SANITY and SANITYN passed, but every test in REPLAY_SINGLE failed with
"@@@@@@ FAIL: Restart of mds failed".

== test 0a: empty replay == 12:05:12
Filesystem 1K-blocks Used Available Use% Mounted on
service360@o2ib:/lustre
3937056 205112 3531816 6% /mnt/nbp0-1
Failing mds on node service360
Stopping /mnt/mds (opts
affected facets: mds
df pid is 13509
Failover mds to service360
12:05:26 (1333134326) waiting for service360 network 900 secs ...
12:05:26 (1333134326) network interface is UP
Starting mds: -o errors=panic,acl /dev/sdb1 /mnt/mds
service360: mount.lustre: mount /dev/sdb1 at /mnt/mds failed: Invalid argument
service360: This may have multiple causes.
service360: Are the mount options correct?
service360: Check the syslog for more info.
mount -t lustre /dev/sdb1 /mnt/mds
Start of /dev/sdb1 on mds failed 22
replay-single test_0a: @@@@@@ FAIL: Restart of mds failed!

/var/log/messages on the MGS/MDS node showed:
...
Mar 30 12:05:10 service360 kernel: Lustre: MGC10.151.26.38@o2ib: Reactivating import
Mar 30 12:05:10 service360 kernel: LustreError: 11254:0:(llog_lvfs.c:473:llog_lvfs_next_block()) Invalid llog tail at log id 17/2375643311 offset 14432
Mar 30 12:05:10 service360 kernel: LustreError: 11254:0:(mgs_handler.c:783:mgs_handle()) MGS handle cmd=502 rc=-22
...
The replay-single.test_0a.debug_log.service360.log.[12] are attached.



 Comments   
Comment by Peter Jones [ 30/Mar/12 ]

Minh

Could you please help with this one?

Thanks

Peter

Comment by Jay Lan (Inactive) [ 03/Apr/12 ]

Could you please help with this?
The same test environment worked fine with 1.8.5 and 1.8.6.
There was a single test failure with 1.8.7 (see LU-1246),
many more failures between a 2.1.1 server and a 1.8.7 client (including this one),
and nothing works at all between a 2.1.1 server and a 2.1.1 client.

I am going to spend time converting to auster for the 2.1.1 server + 2.1.1 client setup,
but I really need help evaluating my 2.1.1 server + 1.8.7 client environment.

Comment by Minh Diep [ 03/Apr/12 ]

ok, looking into this

Comment by Minh Diep [ 03/Apr/12 ]

Can you show me the config file, or local.sh if you modified it?

Comment by Jay Lan (Inactive) [ 03/Apr/12 ]

The command used in testing was:

  ACC_SM_ONLY="REPLAY_SINGLE" NAME=ncli_nas.v3 RCLIENTS="service332" sh acceptance-small.sh

The ncli_nas.v3 will be attached.

Comment by Jay Lan (Inactive) [ 03/Apr/12 ]

I accidentally also attached nas.v3.sh. It was a wrapper. The end result was to
run the command I wrote in the previous comment. The configuration file is ncli_nas.v3.
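For context, a cfg/ncli-style configuration for this kind of setup usually just exports the node and device variables the test framework expects. A hypothetical minimal sketch (the node and device names below come from the logs in this ticket; everything else is illustrative, and the real attached ncli_nas.v3 may differ):

  # hypothetical sketch of an ncli_nas.v3-style config; the attached file may differ
  mds_HOST=service360          # combined MGS/MDS node (from the logs above)
  MDSDEV=/dev/sdb1             # MDT device mounted at /mnt/mds
  OSTCOUNT=2                   # two OSSes in this setup
  RCLIENTS="service332"        # remote client, as passed on the command line above
  PDSH="pdsh -S -w"            # remote shell used by the test framework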

Comment by Minh Diep [ 03/Apr/12 ]

Thanks. Did you run this on a client that was running 1.8.6?

Comment by Jay Lan (Inactive) [ 03/Apr/12 ]

Yes. It was started from service331, a client. All nodes (the MDS, two OSSes, and two clients) have the same configuration.

Comment by Minh Diep [ 03/Apr/12 ]

I don't have a system to try it out on right now. Could you manually run "mount -t lustre -o errors=panic,acl /dev/sdb1 /mnt/mds" on the MDS to see if it works?

Comment by Jay Lan (Inactive) [ 03/Apr/12 ]

I know for a fact that "mount -t lustre -o errors=panic,acl /dev/sdb1 /mnt/mds" works, because the command has been executed many times.

However, that got me thinking. In fact, I ran acceptance-small.sh in a for loop:

for i in SANITY SANITYN REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE REPLAY_DUAL INSANITY SANITY_QUOTA LNET_SELFTEST MMP; do
    mkdir $TMP/$i
    umount /mnt/nbp0-1 /mnt/nbp0-2 1> /dev/null 2>&1
    echo run $i >$TMP/${i}/${i}.output 2>&1
    case $i in
    SANITY|SANITYN|REPLAY_SINGLE|CONF_SANITY|RECOVERY_SMALL|REPLAY_OST_SINGLE|REPLAY_DUAL|INSANITY|LNET_SELFTEST|MMP)
        ACC_SM_ONLY="$i" NAME=ncli_nas.v3 RCLIENTS="service332" sh acceptance-small.sh >>$TMP/${i}/${i}.output 2>&1;;
    SANITY_QUOTA)
        ACC_SM_ONLY="$i" RCLIENTS="service332" MDSSIZE=4000000 OSTSIZE=4000000 NAME=ncli_nas.v3 sh acceptance-small.sh >>$TMP/${i}/${i}.output 2>&1;;
    *)
        echo "Test $i not supported.";;
    esac
done

So by the time REPLAY_SINGLE is executed, both SANITY and SANITYN have completed. That means it was not the same as starting from ground zero.

So I rebooted all the machines, ran "mount -t lustre" to make sure it worked, and unmounted it.
Then I ran acceptance-small.sh with REPLAY_SINGLE alone, without executing SANITY and SANITYN first. It succeeded!

Now, this wrapper worked when the Lustre server was 1.8.6 (or 1.8.7). Any suggestions for making it work when the server runs 2.1.1?

Comment by Jay Lan (Inactive) [ 03/Apr/12 ]

Since REPLAY_SINGLE can be executed successfully in a clean environment, you can close this ticket. I will figure out a way to work around my problem when testing with 2.x servers. Suggestions are welcome.

Comment by Minh Diep [ 03/Apr/12 ]

I need to reproduce this in the lab and investigate the cause. In the meantime, please try this: add MDSDEV1=/dev/sdb1 to the config file to see if it makes any difference. If you don't mind reformatting the FS before every test, you could also put export REFORMAT=true in the config file.
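
In concrete terms, that would be something like the following two lines in ncli_nas.v3 (a minimal sketch; the device path is the one from the failing mount above):

  # sketch of the suggested additions to the config file
  MDSDEV1=/dev/sdb1        # pin the MDT device explicitly
  export REFORMAT=true     # reformat the filesystem before each run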

I also suggest you explore the auster script, which has an option to send results back to our Maloo results database.

Comment by Jay Lan (Inactive) [ 04/Apr/12 ]

I have this line in my configuration file:
export REFORMAT="--reformat"

Would it have the same effect as "export REFORMAT=true"?

Comment by Minh Diep [ 06/Apr/12 ]

yes
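
Both values should behave the same, assuming the test framework only checks whether REFORMAT is set to a non-empty string before reformatting. A sketch of that assumed check (not the exact test-framework.sh code):

  # assumed behaviour: any non-empty REFORMAT value triggers a reformat
  if [ -n "$REFORMAT" ]; then
      formatall
  fi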

Comment by Jay Lan (Inactive) [ 06/Apr/12 ]

Attached are two files, cut from /var/log/messages on the MDS server between the beginning and end MARKERs of test 0a.

The *.FAIL file is from the run that failed, and the *.PASS file is from the run that passed.

Comment by Jay Lan (Inactive) [ 06/Apr/12 ]

On second thought, I do not feel comfortable declaring this a test issue (i.e., a problem with the test environment setup). It could also have resulted from the MDS behaving differently in different situations, which would represent a real problem.

We do not know enough to say either way.

Comment by John Fuchs-Chesney (Inactive) [ 05/Mar/14 ]

Jay – is this still an issue of concern to you?
Is there any further action you'd like us to take?
I'd like to mark this as resolved – am I OK to go ahead and do that?
Thanks,
~ jfc.

Comment by Jay Lan (Inactive) [ 05/Mar/14 ]

Yes, please. No longer a problem. Thanks!

Comment by John Fuchs-Chesney (Inactive) [ 05/Mar/14 ]

Thank you

Comment by John Fuchs-Chesney (Inactive) [ 05/Mar/14 ]

Not clear if this was a test issue – but time has moved on and it is no longer a problem.
