[LU-1547] MDT remounted read-only, MDS hung, MDT corrupted Created: 21/Jun/12  Updated: 11/Mar/14  Resolved: 11/Mar/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: HP Slovakia team (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Not a Bug Votes: 0
Labels: ldiskfs
Environment:

OS RHEL 5.5 cluster, MDT, OST on LVM volumes, SAN, storage HP XP24k


Attachments: Text File fsck.out     Text File sklusp01a-messages     Text File sklusp01b-messages    
Severity: 3
Rank (Obsolete): 4000

 Description   

Our customer experienced the MDT being remounted read-only after the MDS was relocated from the sklusp01b to the sklusp01a cluster node.
They also relocated the OSS services at the same time.
When they noticed the read-only status they tried to stop the MDS. The attempt was unsuccessful: the server became unresponsive, and the other cluster node (sklusp01b) fenced the sklusp01a MDS server and took over the MDT. The sklusp01b node was stopped after the take-over,
and they then ran fsck, which ended with a huge number of errors. The repair was unsuccessful, and they ended up recreating the whole Lustre filesystem and restoring from backup.
Is it possible to determine the root cause from the logs?



 Comments   
Comment by Peter Jones [ 21/Jun/12 ]

Niu is investigating this one

Comment by Niu Yawei (Inactive) [ 21/Jun/12 ]
Jun 17 23:03:08 sklusp01a kernel: LDISKFS-fs error (device dm-11): ldiskfs_lookup: unlinked inode 27720411 in dir #29287441
Jun 17 23:03:08 sklusp01a kernel: Remounting filesystem read-only

It looks like ldiskfs failed to find an inode, which is a filesystem inconsistency error and caused the read-only remount. I'm not sure if it's an ext4 problem; I'll investigate it more. Andreas, Johann, any comments? Thanks.
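
If the device is still accessible, the inode and directory named in the error could be inspected with debugfs (device name taken from the log above; debugfs opens the filesystem read-only by default):

    debugfs -R 'stat <27720411>' /dev/dm-11     # link count / deletion time of the unlinked inode
    debugfs -R 'ls -l <29287441>' /dev/dm-11    # directory that still references it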

Comment by HP Slovakia team (Inactive) [ 21/Jun/12 ]

New, possibly important information:
The MDT resides on a cluster LVM volume. Before the issue happened the MDS was running on node sklusp01b, and the customer
ran lvconvert -m1 on the MDT from the other node, sklusp01a, to create a cross-site mirror. When the resync finished they relocated the MDS to sklusp01a, and the issue happened about 10 minutes after the relocation. Is it OK to run the MDS on one node and lvconvert on another node for the same cluster LVM volume?
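
For reference, a rough sketch of that sequence (the VG/LV and PV names here are placeholders, not the customer's actual names):

    # on sklusp01a, while the MDT was still mounted on sklusp01b
    lvconvert -m1 vg_lustre/lv_mdt /dev/pv_siteA    # add the cross-site mirror leg
    lvs -a -o name,copy_percent,devices vg_lustre   # watch Cpy%Sync until 100%
    # then the MDS service was relocated to sklusp01a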

Comment by Johann Lombardi (Inactive) [ 21/Jun/12 ]

Could you please clarify what you intended to do with the lvconvert command?
I don't think it is safe to run such a command on one node while the volume is being accessed on another node (unless you use CLVM?).

It is likely the root cause of your problem.
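
A quick way to check whether the volume group is actually cluster-aware (the sixth character of the VG attribute string should be 'c' for a clustered VG; the VG name is a placeholder):

    vgs -o vg_name,vg_attr vg_lustre
    # a VG is marked clustered (with clvmd running) via:
    #   vgchange -cy vg_lustre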

Comment by Andreas Dilger [ 21/Jun/12 ]

The initial recovery appears to find a valid Lustre filesystem to mount:

Jun 17 22:52:52 sklusp01a kernel: Lustre: 11216:0:(mds_fs.c:677:mds_init_server_data()) RECOVERY: service l1-MDT0000, 56 recoverable clients, 0 delayed clients, last_transno 133173826553

Later on, it finds a single error in the filesystem when it is cleaning up the orphan inodes:

Jun 17 23:03:08 sklusp01a kernel: LDISKFS-fs error (device dm-11): ldiskfs_lookup: unlinked inode 27720411 in dir #29287441
Jun 17 23:03:08 sklusp01a kernel: Remounting filesystem read-only
Jun 17 23:03:08 sklusp01a kernel: LDISKFS-fs warning (device dm-11): kmmpd: kmmpd being stopped since filesystem has been remounted as readonly.

After failover to sklusp01b (which was quickly shut down), the MDS service is again started on sklusp01a and sees the same error:

Jun 18 00:25:03 sklusp01a kernel: LDISKFS-fs error (device dm-14): ldiskfs_lookup: unlinked inode 27720411 in dir #29287441
Jun 18 00:25:03 sklusp01a kernel: Remounting filesystem read-only

At least during these times, the filesystem was intact enough to be able to mount and read basic Lustre configuration files. I can't comment on the severity of the corruption seen by e2fsck, but the kernel only saw a relatively minor problem (directory entry for an open-unlinked inode was actually deleted, which may possibly relate to nlink problems previously fixed in https://bugzilla.lustre.org/show_bug.cgi?id=22177 for 1.8.3).

It also appears you have MMP enabled on this filesystem, which would normally prevent it from being mounted on two nodes at the same time. From the timestamps in the logs, it does not appear that the two MDS services were active at the same time on the two nodes.
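
(The MMP state can be confirmed with an MMP-aware e2fsprogs, e.g. the Lustre-patched one, using the device name from the log:)

    dumpe2fs -h /dev/dm-11 | grep -i mmp
    # should list the "mmp" feature flag plus the MMP block number and update interval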

Unfortunately, I'm not familiar enough with the details of CLVM and what lvconvert does in this case to comment on whether this is safe to do on a running system or not. It is possible that lvconvert and/or the mirror resync process incorrectly mirrored the LV between nodes, possibly leaving some part of the device inconsistent between the two MDS nodes. It is also possible (depending on how IO was being done by LVM to keep the mirrors in sync) that data was still in cache on sklusp01b, and not flushed to disk on sklusp01a at the time of failover.

The MDS does operations asynchronously in memory, and only flushes them to disk every few seconds at transaction commit time. Conversely, the OSS does writes synchronously to disk because this avoids too much memory pressure at high IO rates, so it may be that the same inconsistency would not be visible on the OSS due to the frequent sync of data to disk.

Having the output from e2fsck would allow a guess at what type of corruption was seen, and how it might be introduced.
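
(A read-only check that captures the output without touching the device would be something like the following, with the device path being a placeholder:)

    e2fsck -fn /dev/vg_lustre/lv_mdt 2>&1 | tee fsck.out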

Comment by HP Slovakia team (Inactive) [ 21/Jun/12 ]

I have attached the fsck -fn ... output. It was run after the failover to sklusp01b, once the MDS had been stopped.

Comment by HP Slovakia team (Inactive) [ 21/Jun/12 ]

To Johann's question:
The customer has disaster tolerant design with two datacenters (A and B) MDT and all 6 OSTs are on CLVM volumes mirorred across sites, MDS and OSSs are configured as cluster services. The MDS usually run on site A (sklusp01a), 3 OSSs on site A other 3 on B. During the weekend they had a planned power outage on site A. Before the PO they relocated the MDS and 3 OSSs from A to B and did lvconvert -m0 before the SAN, storage and servers were shut down. When the power outage was over, storage, SAN and servers were started again the CLVM volumes were converted back to mirror.
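
For reference, that round trip corresponds roughly to the following (VG/LV and PV names are placeholders):

    # before the planned outage, after relocating the services to site B
    lvconvert -m0 vg_lustre/lv_mdt /dev/pv_siteA   # drop the site-A mirror leg
    # after power was restored on site A
    lvconvert -m1 vg_lustre/lv_mdt /dev/pv_siteA   # re-add the leg and resync
    lvs -o name,copy_percent vg_lustre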
Akos

Comment by John Fuchs-Chesney (Inactive) [ 07/Mar/14 ]

Akos and HP Slovakia team,
Is there any further action required on this ticket, or can I mark it as resolved?
Many thanks,
~ jfc.

Comment by HP Slovakia team (Inactive) [ 11/Mar/14 ]

John,

The ticket can be closed. The issue was caused by a bug in CLVM.
Best regards,
Akos

Comment by Peter Jones [ 11/Mar/14 ]

Thanks Akos!
