[LU-1543] Lustre Servers - MDS / OSS Died & fail over took over Created: 20/Jun/12  Updated: 10/Sep/12  Resolved: 10/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Fabio Verzelloni Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

MDS HW
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40Gb/s

MDT LSI 5480 Pikes Peak
SSDs SLC
----------------------------------------------------------------------------------------------------

OSS HW
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40Gb/s

OST LSI 7900
----------------------------------------------------------------------------------------------------

Router nodes
-------------------
12 router nodes - IB 40Gb/s

Clients
---------
Cray XE6 - Lustre 1.8.6

1 MDS + 1 failover MDS
12 OSS - 6 OSTs per OSS
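
For context, a minimal sketch of how an OST with a failover partner is typically formatted and mounted on Lustre 2.x. The device path and filesystem name are placeholders, and the NIDs are only assumed to follow the MDS/OSS addresses listed in the comments below; this is not taken from the site's actual configuration.

# Hypothetical example - device, fsname and NIDs are placeholders.
# Format an OST, declaring the failover partner's NID so clients can
# switch to the other node of the pair (e.g. weisshorn03/weisshorn04):
mkfs.lustre --fsname=scratch --ost --index=0 \
    --mgsnode=148.187.7.101@o2ib --mgsnode=148.187.7.102@o2ib \
    --failnode=148.187.7.104@o2ib /dev/mapper/ost0
# Mount the target on the primary OSS; after a failure, the partner
# node mounts the same shared device instead:
mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0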


Attachments: File 19_jun.log     File 20_jun.log     File weiss01.log     File weiss11.log    
Severity: 3
Rank (Obsolete): 6378

 Description   

Dear Support,
over the past weeks we have had problems with the MDS/OSS Lustre servers. Last week one of the OSS nodes died and fortunately the failover took over, and last night both the MDS (see log weiss01.log) and another OSS (see log weiss11.log) died; in that case as well the failover servers took over.
Yesterday's problem looks related between the two servers, as they died at essentially the same time, around 19:20-19:30.

As I said, the file system remained up and running because the failover servers took over, but with our old Lustre configuration (version 1.8.7, 1 MDS + 1 MDS failover, 4 OSS), even under heavy stress and with a lot of logged slowdowns due to heavy I/O load, the MDS and OSS did not die.

If you need access to our cluster, please let me know (fverzell@cscs.ch) so that we can arrange to create an account for you.

We also have a list of tickets that may be related to each other in some respects:

http://jira.whamcloud.com/browse/LU-1447
http://jira.whamcloud.com/browse/LU-1455
http://jira.whamcloud.com/browse/LU-1470
http://jira.whamcloud.com/browse/LU-1503

Regards
Fabio



 Comments   
Comment by Peter Jones [ 20/Jun/12 ]

Fabio

We will definitely take an overall view of all your issues when deciding the best approach. Getting remote access to the cluster in question will undoubtedly be useful. I will contact you directly to make those arrangements

Peter

Comment by Cliff White (Inactive) [ 20/Jun/12 ]

Can we get a list of the addresses of the Lustre servers?

Comment by Liang Zhen (Inactive) [ 21/Jun/12 ]

I think this could be a duplicate of LU-1280; the relevant fixes are:
http://review.whamcloud.com/#change,2452
http://review.whamcloud.com/#change,2827

Comment by Fabio Verzelloni [ 21/Jun/12 ]

The list of the Lustre servers is the following:

MDS + Failover
Weisshorn01- 148.187.7.101
Weisshorn02- 148.187.7.102

OSS + failover (each pair forms a failover couple: weisshorn03-04, 05-06, etc.)
Weisshorn03- 148.187.7.103
Weisshorn04- 148.187.7.104
Weisshorn05- 148.187.7.105
Weisshorn06- 148.187.7.106
Weisshorn07- 148.187.7.107
Weisshorn08- 148.187.7.108
Weisshorn09- 148.187.7.109
Weisshorn10- 148.187.7.110
Weisshorn11- 148.187.7.111
Weisshorn12- 148.187.7.112
Weisshorn13- 148.187.7.113
Weisshorn14- 148.187.7.114

Comment by Cliff White (Inactive) [ 05/Jul/12 ]

Has there been any word from Cray on access to the gnilnd source? Should we close this issue and revisit it after the planned software version change (LU-1503)?
Please let us know what more we can do to assist.

Comment by Fabio Verzelloni [ 06/Jul/12 ]

Cliff,
we are passing the request to Cray to see if we can get access to the code. I'll let you know ASAP.

Thanks
Fabio

Comment by Cory Spitz [ 06/Jul/12 ]

FYI, Cray is working on pushing the gnilnd code up into the Lustre tree. The tracking ticket is LU-1419.

Comment by James A Simmons [ 29/Aug/12 ]

Any updates?

Comment by Cory Spitz [ 29/Aug/12 ]

Well, the Cray LND code has been pushed to LU-1419, but not landed. I can't help with any other updates.

Comment by James A Simmons [ 29/Aug/12 ]

I meant: does Fabio still see the problem?

Comment by Cliff White (Inactive) [ 04/Sep/12 ]

What is the current state? Is there anything more we can do on this issue?

Comment by Cliff White (Inactive) [ 10/Sep/12 ]

I am going to close this issue. Please re-open if you have more information or questions.
