[LU-885] recovery-mds-scale (FLAVOR=mds) fail, network is not available Created: 29/Nov/11 Updated: 03/Oct/19 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: |
lustre-master build #353, RHEL6 x86_64 for both server and client |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10215 |
| Description |
|
After running recovery-mds-scale FLAVOR=mds for about 2 hours (mds1 failed over 14 times), the network is not available on the standby MDS server, and the server cannot be accessed after that even after a power cycle. I have hit this same issue twice.

==== Checking the clients loads AFTER failover -- failure NOT OK
mds1 has failed over 14 times, and counting...
sleeping 501 seconds ...
==== Checking the clients loads BEFORE failover -- failure NOT OK
ELAPSED=7904 DURATION=86400 PERIOD=600
Wait mds1 recovery complete before doing next failover ....
affected facets: mds1
client-6: *.lustre-MDT0000.recovery_status status: COMPLETE
Checking clients are in FULL state before doing next failover
client-18: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-12: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-18: cannot run remote command on client-12,client-13,client-17,client-18 with
client-12: cannot run remote command on client-12,client-13,client-17,client-18 with
client-17: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-17: cannot run remote command on client-12,client-13,client-17,client-18 with
client-13: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
client-13: cannot run remote command on client-12,client-13,client-17,client-18 with
Starting failover on mds1
Failing mds1 on node client-6
+ pm -h powerman --off client-6
Command completed successfully
affected facets: mds1
+ pm -h powerman --on client-6
Command completed successfully
Failover mds1 to client-2
15:35:30 (1322609730) waiting for client-2 network 900 secs ...
waiting ping -c 1 -w 3 client-2, 895 secs left ...
waiting ping -c 1 -w 3 client-2, 890 secs left ...
waiting ping -c 1 -w 3 client-2, 885 secs left ...
waiting ping -c 1 -w 3 client-2, 880 secs left ...
waiting ping -c 1 -w 3 client-2, 875 secs left ...
waiting ping -c 1 -w 3 client-2, 870 secs left ...
waiting ping -c 1 -w 3 client-2, 865 secs left ...
waiting ping -c 1 -w 3 client-2, 860 secs left ...
waiting ping -c 1 -w 3 client-2, 855 secs left ...
waiting ping -c 1 -w 3 client-2, 850 secs left ...
waiting ping -c 1 -w 3 client-2, 845 secs left ...
waiting ping -c 1 -w 3 client-2, 840 secs left ...
waiting ping -c 1 -w 3 client-2, 835 secs left ...
waiting ping -c 1 -w 3 client-2, 830 secs left ...
waiting ping -c 1 -w 3 client-2, 825 secs left ...
waiting ping -c 1 -w 3 client-2, 820 secs left ... |
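For context, the "waiting ping -c 1 -w 3 ..." countdown in the log comes from the test framework waiting for the failover node to answer pings before it mounts the MDT there. Below is a minimal sketch of such a wait loop, reconstructed only from what the log shows (the host name client-2, the 900-second budget, and the 5-second step are taken from the log); the actual implementation lives in the Lustre test framework and may differ:

#!/bin/bash
# Sketch of the "waiting for <host> network" loop seen in the log above.
# HOST and MAX_WAIT are assumptions taken from the log output (client-2, 900 secs).
HOST=client-2
MAX_WAIT=900

wait_for_network() {
    local host=$1
    local left=$2
    while (( left > 0 )); do
        # One ping with a 3-second deadline, matching "ping -c 1 -w 3" in the log.
        if ping -c 1 -w 3 "$host" > /dev/null 2>&1; then
            echo "$host network is up"
            return 0
        fi
        echo "waiting ping -c 1 -w 3 $host, $left secs left ..."
        sleep 5
        left=$(( left - 5 ))
    done
    echo "$host network did not come up within the timeout" >&2
    return 1
}

wait_for_network "$HOST" "$MAX_WAIT"

In the failing runs the loop above never succeeds: the standby MDS stays unreachable for the full 900 seconds, which is what makes the failover count as a test failure.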
| Comments |
| Comment by Sarah Liu [ 01/Dec/11 ] |
|
recovery-random-scale failed after running for about 4 hours (mds1 failed over 24 times) and hit a similar issue: the standby MDS server is not usable.

waiting ping -c 1 -w 3 client-7, 5 secs left ... |
| Comment by Oleg Drokin [ 03/Jan/12 ] |
|
So it appears that the failover node is not coming up for some reason (pings not working and such). Somebody needs to reproduce this and then check what is going on at the failover node: did the VM fail to start? Is there some rootfs corruption so that it is stuck during boot waiting for a root password or the like? Potentially a TT ticket if the node fails to start. |
| Comment by Sarah Liu [ 03/Jan/12 ] |
|
Actually, in both the recovery-mds-scale (FLAVOR=mds) and recovery-random-scale tests, the failover node was not accessible even after rebooting several times. Robert was in the lab and helped with this. From what he saw, the nodes needed a physical power-on, and nothing else seemed unusual. |
| Comment by Peter Jones [ 15/Jul/12 ] |
|
Hongchao, could you please look into this one? Thanks, Peter |
| Comment by Hongchao Zhang [ 17/Jul/12 ] |
|
This seems to be a problem related to the pm tool, since the nodes were not stuck and only needed a power-on. |
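If the pm (powerman) tool is suspected, one way to check it in isolation is to repeat by hand the same power cycle the test runs and then watch for the node to answer pings. The commands below are copied verbatim from the log in the description; client-6 stands in for whichever failover node is stuck, and this is only an illustrative check, not part of the test suite:

# Repeat the power cycle the test performed (commands taken from the log above).
pm -h powerman --off client-6
pm -h powerman --on client-6

# Then watch whether the node ever comes back, the same way the test does.
until ping -c 1 -w 3 client-6 > /dev/null 2>&1; do
    echo "still waiting for client-6 ..."
    sleep 5
done
echo "client-6 is reachable again"

If pm reports "Command completed successfully" but the node still needs a physical power-on, that would point at the power-control path rather than Lustre recovery itself.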
| Comment by Chris Gearing (Inactive) [ 26/Jul/12 ] |
|
Hongchao, how did you replicate this, and have we seen it recently under autotest? If so, can you post links to the results here? |
| Comment by Peter Jones [ 30/Jul/12 ] |
|
Adding Hongchao as a watcher so he sees Chris's question |
| Comment by Hongchao Zhang [ 30/Jul/12 ] |
|
Hi Chris, |
| Comment by Peter Jones [ 31/Jul/12 ] |
|
As per Sarah, this has not recurred for the last three tags, so removing it as a blocker. |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |