[LU-2143] MDS became read-only repeatedly after e2fsck Created: 10/Oct/12  Updated: 06/Nov/13  Resolved: 06/Nov/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.8
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Lu Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File thislog2.rar    
Severity: 3
Rank (Obsolete): 5151

 Description   

Two of our MDS nodes have repeatedly gone read-only since we ran e2fsck once on Lustre 1.8.5. After the MDT has been mounted for a while, the kernel reports errors like:
Oct 8 20:16:44 mainmds kernel: LDISKFS-fs error (device cciss!c0d1): ldiskfs_ext_check_inode: bad header/extent in inode #50736178: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Oct 8 20:16:44 mainmds kernel: Aborting journal on device cciss!c0d1-8.
The MDS then becomes read-only.
We believe some on-disk structure in the MDT's local file system is corrupted, so we tried to repair it with e2fsck (following the procedure in the Lustre manual). However, we always end up in the same loop:
1. Run e2fsck, which may or may not fix some errors.
2. Mount the MDT. After some client operations it goes read-only and the whole system becomes unusable.
3. Run e2fsck again.

We have tried three different Lustre versions (1.8.5, 1.8.6, and 1.8.8-wc) with their corresponding e2fsprogs, but the problem persists.
We have also copied the MDT device with dd and mounted the replica, and the problem persists there as well. In addition, the hardware monitor has not reported any errors. This looks much more like an ldiskfs error than a hardware error.
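
To check whether the same inodes are reported after each e2fsck/mount cycle, the kernel messages could be tallied per device and inode. The following is only a minimal sketch: the regular expression is modelled on the single LDISKFS-fs error line quoted above and may not match other ldiskfs message variants, and the name collect_bad_inodes is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the one error line quoted in this ticket (an assumption:
# other ldiskfs messages may be worded differently and would be skipped).
LDISKFS_ERROR = re.compile(
    r"LDISKFS-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+): .*?inode #(?P<inode>\d+)"
)


def collect_bad_inodes(lines):
    """Count LDISKFS-fs errors per (device, inode number) pair."""
    counts = Counter()
    for line in lines:
        match = LDISKFS_ERROR.search(line)
        if match:
            counts[(match["device"], int(match["inode"]))] += 1
    return counts
```

Passing the lines of a syslog file to collect_bad_inodes and comparing the result between cycles would show whether the errors stay on the same inodes or move to new ones.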

Please find one kernel dump log attached.
Thank you!

