[LU-885] recovery-mds-scale (FLAVOR=mds) fail, network is not available Created: 29/Nov/11  Updated: 03/Oct/19  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Sarah Liu
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

lustre-master build #353 RHEL6-x86_64 for both server and client


Attachments: File recovery-mds-scale-1322611170.tar.bz2    
Issue Links:
Related
is related to LU-893 system hang when running recovery-mds... Resolved
Severity: 3
Rank (Obsolete): 10215

 Description   

After running recovery-mds-scale FLAVOR=mds for about 2 hours (MDS failed over 14 times), the network is not available for the standby MDS server, and the node cannot be accessed afterwards, even after a power cycle. I have hit this same issue twice.
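For reference, a minimal sketch of how such a run can be launched. FLAVOR=mds, the 86400-second duration, and the 600-second failover period are visible in this ticket; the install path and the SERVER_FAILOVER_PERIOD variable name are assumptions based on the standard test framework, not confirmed here.

# assumed standard test-framework location on a RHEL6 server node
cd /usr/lib64/lustre/tests
# FLAVOR/DURATION/period values taken from this ticket; the variable name
# used for the failover period is an assumption
FLAVOR=mds DURATION=86400 SERVER_FAILOVER_PERIOD=600 sh recovery-mds-scale.sh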

 
==== Checking the clients loads AFTER  failover -- failure NOT OK
mds1 has failed over 14 times, and counting...
sleeping 501 seconds ... 
==== Checking the clients loads BEFORE failover -- failure NOT OK     ELAPSED=7904 DURATION=86400 PERIOD=600
Wait mds1 recovery complete before doing next failover ....
affected facets: mds1
client-6: *.lustre-MDT0000.recovery_status status: COMPLETE
Checking clients are in FULL state before doing next failover
client-18: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-12: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-18: cannot run remote command on client-12,client-13,client-17,client-18 with 
client-12: cannot run remote command on client-12,client-13,client-17,client-18 with 
client-17: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-12,client-13,client-17,client-18 with 
client-13: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-12,client-13,client-17,client-18 with 
Starting failover on mds1
Failing mds1 on node client-6
+ pm -h powerman --off client-6
Command completed successfully
affected facets: mds1
+ pm -h powerman --on client-6
Command completed successfully
Failover mds1 to client-2
15:35:30 (1322609730) waiting for client-2 network 900 secs ...
waiting ping -c 1 -w 3 client-2, 895 secs left ...
waiting ping -c 1 -w 3 client-2, 890 secs left ...
waiting ping -c 1 -w 3 client-2, 885 secs left ...
waiting ping -c 1 -w 3 client-2, 880 secs left ...
waiting ping -c 1 -w 3 client-2, 875 secs left ...
waiting ping -c 1 -w 3 client-2, 870 secs left ...
waiting ping -c 1 -w 3 client-2, 865 secs left ...
waiting ping -c 1 -w 3 client-2, 860 secs left ...
waiting ping -c 1 -w 3 client-2, 855 secs left ...
waiting ping -c 1 -w 3 client-2, 850 secs left ...
waiting ping -c 1 -w 3 client-2, 845 secs left ...
waiting ping -c 1 -w 3 client-2, 840 secs left ...
waiting ping -c 1 -w 3 client-2, 835 secs left ...
waiting ping -c 1 -w 3 client-2, 830 secs left ...
waiting ping -c 1 -w 3 client-2, 825 secs left ...
waiting ping -c 1 -w 3 client-2, 820 secs left ...
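
The countdown above is the network-availability wait after powering the failover node back on; in this run client-2 never answers and the 900-second deadline eventually expires. A simplified sketch of that wait loop is below; the function and variable names are illustrative, not the actual test-framework code.

#!/bin/bash
# Simplified sketch of the wait loop shown in the log above: keep pinging the
# failover node until it answers or the deadline expires.
wait_for_network() {
    local host=$1
    local max=${2:-900}      # overall deadline in seconds (900 in the log)
    local step=5             # the log counts down in 5-second steps
    local elapsed=0

    while (( elapsed < max )); do
        # -c 1: single probe, -w 3: give up after 3 seconds (as in the log)
        if ping -c 1 -w 3 "$host" >/dev/null 2>&1; then
            echo "$host network is available after $elapsed secs"
            return 0
        fi
        echo "waiting ping -c 1 -w 3 $host, $((max - elapsed)) secs left ..."
        sleep "$step"
        elapsed=$((elapsed + step))
    done
    echo "Network not available!"
    return 1
}

# Usage mirroring the failed run in this ticket:
# wait_for_network client-2 900 || echo "failover node never came back"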


 Comments   
Comment by Sarah Liu [ 01/Dec/11 ]

recovery-random-scale failed after running for about 4 hours (mds1 failed over 24 times) and hit a similar issue; the standby MDS server is not usable.

waiting ping -c 1 -w 3 client-7, 5 secs left ...
Network not available!
2011-12-01 01:39:58 Terminating clients loads ...
Duration: 86400
Server failover period: 600 seconds
Exited after: 14119 seconds
Number of failovers before exit:
mds1 failed over 24 times
Status: FAIL: rc=1

Comment by Oleg Drokin [ 03/Jan/12 ]

So it appears that the failover node is not coming up for some reason (pings not working and such). Somebody needs to reproduce this and then check what is going on at the failover node: did the VM fail to start? Is there some rootfs corruption so that it is stuck during boot, waiting for a root password or whatever?

Potentially a TT ticket if the node fails to start.
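
For example, one way to check what state the node is in while it is unreachable; the console and BMC tools below are assumptions (the ticket does not say what console access the lab has), and the BMC hostname/credentials are hypothetical:

pm -h powerman --query client-2        # does powerman think the plug is on?
# hypothetical BMC hostname and credentials
ipmitool -H client-2-bmc -U admin -P secret chassis power status
conman client-2                        # attach to the serial console, if conman is deployed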

Comment by Sarah Liu [ 03/Jan/12 ]

Actually, in both the recovery-mds-scale (FLAVOR=mds) and recovery-random-scale tests, the failover node was not accessible after rebooting several times. Robert was in the lab and helped with this. As he saw it, the nodes just needed a physical power-on, and nothing else seemed unusual.

Comment by Peter Jones [ 15/Jul/12 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 17/Jul/12 ]

This seems to be a problem related to the pm tool, since the nodes were not stuck and only needed a power-on.
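
For example, the pm behaviour could be double-checked by hand against one of the affected nodes; the --off/--on invocations match the log above, --query is the standard powerman status option, and the server name "powerman" is taken from the log (it may differ in other labs):

pm -h powerman --off client-2        # drop power on the standby MDS node
sleep 5
pm -h powerman --on client-2         # power it back on
pm -h powerman --query client-2      # confirm powerman reports the plug as on
ping -c 1 -w 3 client-2              # then verify the node actually answers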

Comment by Chris Gearing (Inactive) [ 26/Jul/12 ]

Hongchao, how did you replicate this, and have we seen this recently under autotest? If so, can you post links to the results here?

Comment by Peter Jones [ 30/Jul/12 ]

Adding Hongchao as a watcher so he sees Chris's question

Comment by Hongchao Zhang [ 30/Jul/12 ]

Hi Chris,
I can't replicate this issue, and there have been no new occurrences under autotest recently.

Comment by Peter Jones [ 31/Jul/12 ]

As per Sarah, this has not recurred for the last three tags, so removing as a blocker.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
