Details
- Type: Bug
- Resolution: Not a Bug
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
- Environment: RHEL 5.5 cluster; MDT and OSTs on LVM volumes over SAN, HP XP24k storage
- 3
- 4000
Description
Our customer experienced the MDT being remounted read-only after relocating the MDS from cluster node sklusp01b to sklusp01a. They also relocated OSS services at the same time.
When they noticed the read-only status, they tried to stop the MDS. The attempt was unsuccessful: the server became unresponsive, and the other cluster node (sklusp01b) fenced the sklusp01a MDS server and took over the MDT. The sklusp01b node was stopped after the take-over, and
they then ran fsck, which reported a huge number of errors. The repair was unsuccessful, and they ended up recreating the whole Lustre file system and restoring it from backup.
Is it possible to determine the root cause from the logs?
Attachments
Issue Links
- Trackbacks: Lustre 1.8.x known issues tracker — "While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA