[LU-1470] MDS Crash & reboot Created: 04/Jun/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Fabio Verzelloni Assignee: Zhenyu Xu
Resolution: Incomplete Votes: 0
Labels: None
Environment:

MDS HW
----------------------------------------------------------------------------------------------------
Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
Vendor ID: AuthenticAMD
CPU family: 16
64 GB RAM
Interconnect IB 40Gb/s

MDT LSI 5480 Pikes Peak
SSDs SLC
----------------------------------------------------------------------------------------------------

OSS HW
----------------------------------------------------------------------------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
Vendor ID: GenuineIntel
CPU family: 6
64 GB RAM
Interconnect IB 40Gb/s

OST LSI 7900
----------------------------------------------------------------------------------------------------

1 MDS + 1 failover
12 OSS - 6 OST per OSS


Attachments: File cluster.log.gz     File messages-20120603    
Severity: 3
Rank (Obsolete): 4039

 Description   

The MDS hung and the failover node took over; /var/log/messages is attached.
The reason for the reboot is not clear to me.

Fabio



 Comments   
Comment by Peter Jones [ 04/Jun/12 ]

Bobijam will help with this one

Comment by Zhenyu Xu [ 04/Jun/12 ]

From the messages, the system received a heartbeat shutdown notice from the weisshorn02 node, which made the MDS reboot. Before that:

May 30 17:36:13 weisshorn02 heartbeat: [3277]: info: Heartbeat shutdown in progress. (3277)
May 30 17:36:14 weisshorn01 heartbeat: [5612]: info: Received shutdown notice from 'weisshorn02.admin.cscs.ch'.
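
To make it easier to find every such event in a large messages file, the lines above can be matched with a short script. This is only a sketch: the regular expression is inferred from the two heartbeat lines quoted above, and the function name `find_heartbeat_shutdowns` is made up for illustration, so other heartbeat versions or syslog formats may need a different pattern.

```python
import re

# Pattern inferred from the two log lines quoted above (an assumption,
# not an official heartbeat log format):
#   "<Mon> <day> <HH:MM:SS> <host> heartbeat: [<pid>]: <level>: <message>"
HEARTBEAT_LINE = re.compile(
    r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) heartbeat: \[\d+\]: (\w+): (.*)$"
)


def find_heartbeat_shutdowns(lines):
    """Return (timestamp, host, message) for heartbeat lines mentioning a shutdown."""
    events = []
    for line in lines:
        match = HEARTBEAT_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a heartbeat line in the assumed format
        timestamp, host, _level, message = match.groups()
        if "shutdown" in message.lower():
            events.append((timestamp, host, message))
    return events


# The two lines quoted in this comment, used as sample input.
SAMPLE = [
    "May 30 17:36:13 weisshorn02 heartbeat: [3277]: info: Heartbeat shutdown in progress. (3277)",
    "May 30 17:36:14 weisshorn01 heartbeat: [5612]: info: Received shutdown notice from 'weisshorn02.admin.cscs.ch'.",
]

for timestamp, host, message in find_heartbeat_shutdowns(SAMPLE):
    print(timestamp, host, message)
```

To scan the attached log instead of the sample, the lines of `messages-20120603` can be passed in, for example via `open("messages-20120603", errors="replace")`.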

Comment by Zhenyu Xu [ 04/Jun/12 ]

I cannot find evidence of an MDS crash. Would you mind collecting the MDS debug logs and telling me when the MDS crashed?

Also, the log you uploaded contains several messages showing "ldap_result() failed: Can't contact LDAP server". Does the network have a problem? Is it the same network that heartbeat uses?
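
One way to check whether the LDAP failures cluster around the heartbeat shutdown is to count how many of them fall shortly before it. The following is a rough sketch under several assumptions: the timestamps are standard syslog stamps without a year (2012 is assumed here), month names are in English, and the lines are identified only by the message text quoted in this ticket. The helper names and the 10-minute window are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Leading syslog timestamp, e.g. "May 30 17:36:13" (no year in the log).
SYSLOG_TS = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")

# Message fragments taken from this ticket; the rest of each line is not assumed.
LDAP_ERROR = "ldap_result() failed: Can't contact LDAP server"
HEARTBEAT_SHUTDOWN = "Heartbeat shutdown in progress"


def parse_ts(line, year=2012):
    """Parse the leading syslog timestamp; the year is an assumption."""
    match = SYSLOG_TS.match(line)
    if match is None:
        return None
    stamp = " ".join(match.group(1).split())  # collapse padding in "Jun  3"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def ldap_errors_before_shutdown(lines, window_minutes=10, year=2012):
    """Map each heartbeat shutdown time to the number of LDAP errors
    logged within `window_minutes` before it."""
    shutdowns, ldap_errors = [], []
    for line in lines:
        when = parse_ts(line, year)
        if when is None:
            continue  # line has no recognizable timestamp
        if HEARTBEAT_SHUTDOWN in line:
            shutdowns.append(when)
        elif LDAP_ERROR in line:
            ldap_errors.append(when)
    window = timedelta(minutes=window_minutes)
    return {
        shutdown: sum(1 for err in ldap_errors if timedelta(0) <= shutdown - err <= window)
        for shutdown in shutdowns
    }
```

A high count just before a shutdown would only hint at a shared network problem; it would not prove that heartbeat and LDAP use the same network.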

Generated at Sat Feb 10 01:16:55 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.