Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
Lustre 2.2.0
-
None
-
MDS HW
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40Gb/s
MDT LSI 5480 Pikes Peak
SSDs SLC
----------------------------------------------------------------------------------------------------
OSS HW
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40Gb/s
OST LSI 7900
----------------------------------------------------------------------------------------------------
Router nodes
-------------------
12 router nodes - IB 40Gb/s
Clients
---------
Cray XE6 - Lustre 1.8.6
1 MDS + 1 fail over
12 OSS - 6 OST per OSS
-
3
-
6378
Description
Dear Support,
Over the past weeks we have had problems with the MDS/OSS Lustre servers. Last week one of the OSSs died and, fortunately, the failover server took over. Last night both the MDS (see log weissh01.log) and another OSS (see log weiss11.log) died; in that case too, the failover servers took over.
Yesterday's two failures look related: both servers died at essentially the same time, around 19:20-19:30.
As I said, the file system remained up and running because the failover servers took over. However, with our old Lustre configuration (version 1.8.7, 1 MDS + 1 failover MDS, 4 OSS), the MDS and OSSs never died, even under huge stress and with many slowdown messages logged due to heavy I/O load.
If you need access to our cluster, please let me know (fverzell@cscs.ch) so that we can arrange to create an account.
We also currently have a list of tickets that might be related to each other in some respect:
http://jira.whamcloud.com/browse/LU-1447
http://jira.whamcloud.com/browse/LU-1455
http://jira.whamcloud.com/browse/LU-1470
http://jira.whamcloud.com/browse/LU-1503
Regards
Fabio