[LU-1875] Test failure on test suite recovery-random-scale, subtest test_fail_client_mds Created: 10/Sep/12  Updated: 19/Dec/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Cannot Reproduce Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 10233

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/9e892b2a-f92d-11e1-a1b8-52540035b04c.

The sub-test test_fail_client_mds failed with the following error:

test_fail_client_mds returned 7

client-28vm3: Waiting 2 secs for *.lustre-MDT0000.recovery_status recovery done. status: RECOVERING
client-28vm3: *.lustre-MDT0000.recovery_status status: RECOVERING
client-28vm3: Waiting -3 secs for *.lustre-MDT0000.recovery_status recovery done. status: RECOVERING
client-28vm3: *.lustre-MDT0000.recovery_status recovery not done in 662 sec. status: RECOVERING
mds1 recovery is not completed!
2012-09-06 16:07:47 Terminating clients loads ...
Duration:               86400
Server failover period: 900 seconds
Exited after:           496 seconds
Number of failovers before exit:
mds1 failed over 1 times
Status: FAIL: rc=7
CMD: client-28vm5,client-28vm6 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
pdsh@client-28vm1: client-28vm6: connect: No route to host
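
For context, the wait shown above is a polling loop on the MDT's recovery_status parameter that gives up once the allotted time (662 seconds here) runs out. A minimal sketch of that kind of loop, assuming lctl is available on the MDS node (illustrative shell only, not the test framework's actual wait_recovery_complete code):

#!/bin/bash
# Poll the MDT recovery status until it reports COMPLETE or we time out.
# The parameter path, timeout, and rc=7 mirror the log above.
PARAM="*.lustre-MDT0000.recovery_status"
TIMEOUT=662
while ((TIMEOUT > 0)); do
    STATUS=$(lctl get_param -n "$PARAM" | awk '/^status:/ {print $2}')
    [ "$STATUS" = "COMPLETE" ] && exit 0
    echo "Waiting $TIMEOUT secs for $PARAM recovery done. status: $STATUS"
    sleep 5
    ((TIMEOUT -= 5))
done
echo "$PARAM recovery not done in 662 sec. status: $STATUS"
exit 7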


 Comments   
Comment by Peter Jones [ 10/Sep/12 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 11/Sep/12 ]

https://maloo.whamcloud.com/test_sets/9e892b2a-f92d-11e1-a1b8-52540035b04c

The OST node (client-28vm4) somehow failed, and its 'dmesg' log contained no Lustre-related messages (perhaps Lustre was never loaded?). The MDT then failed to connect to the OST and rejected reconnection requests from clients, so recovery was stuck.
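
A quick way to check that hypothesis on a node in this state would be to verify the Lustre modules and device stack on the failed OSS, e.g. (illustrative triage commands, not taken from the original report):

# run on the suspect OST node (client-28vm4)
lsmod | grep -q lustre || echo "Lustre modules are not loaded"
lctl dl                  # lists configured Lustre devices; empty output means no OST was set up
dmesg | grep -i lustre   # should show mount/recovery messages if the OST ever started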

Comment by Sarah Liu [ 07/Jul/15 ]

server: lustre-master build # 3092 EL7
client: SLES11 SP3

https://testing.hpdd.intel.com/test_sets/095e7dfe-24ca-11e5-8427-5254006e85c2

Comment by Hongchao Zhang [ 04/Aug/15 ]

The new occurrence is the same issue as LU-6890:

CMD: shadow-17vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/mpi/gcc/openmpi/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh _wait_recovery_complete *.lustre:MDT0000.recovery_status 1475 
shadow-17vm4: error: get_param: */lustre:MDT0000/recovery_status: Found no match
mds1 recovery is not completed!
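
Note that the parameter pattern in that command uses a colon ("lustre:MDT0000") while the actual device name uses a dash ("lustre-MDT0000", as in the 2012 log above), which would explain why get_param finds no match. For illustration (hypothetical invocations, showing only the pattern difference):

lctl get_param -n *.lustre-MDT0000.recovery_status   # matches the MDT device
lctl get_param -n *.lustre:MDT0000.recovery_status   # error: Found no match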
Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
