[LU-4402] Ldiskfs errors ldiskfs_ext_find_extent, ldiskfs_ext_get_blocks, corruption Created: 20/Dec/13 Updated: 08/Feb/14 Resolved: 01/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Blake Caldwell | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL6.4/distro IB/2.6.32-358.18.1.el6 |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 12084 |
| Description |
|
Starting with an otherwise operating filesystem, we had an NFS issue on the management server that serves nfsroot to the nodes. This caused the nodes to hang on shell probes, ssh, etc., but Lustre appeared to work okay until the mount came back, at which point there was a spew of I/O errors. We had -o errors=panic, so the nodes rebooted and we have a crash dump as well. A few of the interesting/disturbing messages are below, and a complete log capture of the interval is attached. We rebooted every single one of our Lustre systems that mounted this nfsroot and started Lustre back up. At this point, an e2fsck seems prudent given the messages? Please advise. And for clarity's sake, this is from a completely separate system than
Dec 19 14:07:28 atlas-oss3b4 kernel: [1987655.565953] end_request: I/O error, dev dm-0, sector 2641342080
Dec 19 14:14:50 atlas-oss1a5 kernel: [691359.246757] LDISKFS-fs error (device dm-9): file system corruption: inode #591204 logical block 447 mapped to 137004702958273 (size 1)
Dec 19 14:21:50 atlas-oss1d3 kernel: [691780.266369] LDISKFS-fs error (device dm-2): ldiskfs_ext_find_extent: bad header/extent in inode #199528: invalid magic - magic 0, entries 0, max 0(0), depth 0(0
Dec 19 14:22:41 atlas-oss1d3 kernel: [691831.895277] LDISKFS-fs error (device dm-2): ldiskfs_ext_get_blocks: inode #199528: (comm ll_ost_io02_007) bad extent address iblock: 447, depth: 1 pblock 0 |
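A minimal sketch of the kind of non-destructive check being considered here (the mount point and device path are placeholders, not devices from this report); e2fsck should only be run against an unmounted OST or a snapshot of it, and the -n flag opens the device read-only and answers "no" to every repair prompt:
umount /mnt/lustre/ost0              # hypothetical OST mount point; the target must be offline first
e2fsck -fn /dev/mapper/<ost-device>  # -f forces a full check even if the fs looks clean, -n keeps it read-only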
| Comments |
| Comment by Alex Zhuravlev [ 20/Dec/13 ] |
|
> Dec 19 14:07:25 atlas-mgs2 kernel: [1988887.590310] sd 6:0:8:1: [sdw] Warning! Received an indication that the LUN assignments on this target have changed. The Linux SCSI layer does not automatically remap LUN assignments.
Sounds like at some point the underlying device changed? |
| Comment by Blake Caldwell [ 20/Dec/13 ] |
|
Well, I'm going to say that it's not directly correlated, but there's more to it. That host is not actually part of the filesystem. It connects to the same SAS-attached array that the MGS and MDS share, but does not have any LUNs mounted from it, so perhaps the message indicates an event on one of the other systems. We've grown accustomed to seeing those messages on RHEL hosts when something happens on the storage array (object storage arrays from a different brand do this too), and they have not indicated an actual LUN assignment change. It could be caused by an event on one of the other systems (atlas-mds1, atlas-mds3, atlas-mgs1), such as an I/O error after the nfsroot returned. |
| Comment by Alex Zhuravlev [ 20/Dec/13 ] |
|
Thanks for the clarification. The confusing thing is that a number of the systems started to observe corruptions: |
| Comment by Blake Caldwell [ 20/Dec/13 ] |
|
I agree that the number of errors is strange. atlas-oss1d1 and atlas-oss1b2 use physically separate storage arrays. Of those listed, only atlas-oss1d1 and atlas-oss1d4 share the same storage device. I checked the logs on the OST storage controllers, and all they saw was the hosts logging out and then back in when they were rebooted. The common piece that sticks out to me is that all of these systems had their nfsroot filesystems disrupted. They have recovered from this transparently a hundred times before. |
| Comment by Jodi Levi (Inactive) [ 07/Jan/14 ] |
|
Are there any next steps on this ticket given the information that has been posted? I.e., should this ticket be closed? |
| Comment by Blake Caldwell [ 08/Jan/14 ] |
|
What would be the best way of validating what this message is saying, assuming the inode still exists? Since this is the ldiskfs layer, how do we correlate to inode #395571? Debugfs? Is anything possible live, without a downtime for e2fsck?
Dec 19 14:07:34 atlas-oss2f4 kernel: [1987662.210829] LDISKFS-fs error (device dm-8): ldiskfs_ext_find_extent: bad header/extent in inode #395571: invalid magic - magic 5fa6, entries 39658, max 42407(0), depth 37176(0) |
| Comment by Alex Zhuravlev [ 14/Jan/14 ] |
|
you can try on the mounted filesystem: debugfs -R "stat <395571>" |
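A minimal sketch of that lookup with the device argument spelled out (the device path is a placeholder); debugfs opens the device read-only by default, so this is assumed to be safe while the OST is mounted:
debugfs -R "stat <395571>" /dev/mapper/<ost-device>   # dump the inode, including its extent tree and xattrs
debugfs -R "ncheck 395571" /dev/mapper/<ost-device>   # map the inode number back to a pathname in the ldiskfs namespace (on an OST, typically an object under O/)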
| Comment by Blake Caldwell [ 21/Jan/14 ] |
|
Thanks. It turns out we just had a downtime, and I was able to run e2fsck across all OSTs. There were only 2 problems found, and they did not correlate to the issues in the log messages. One problem was with inode 3, which I gather is a user quota file. The 2nd inode had a fid, but it could not be found with fid2path. If the i_size difference for inode 3 is tolerable, then I believe we have arrived at the end of the road with this case.
[root@atlas-oss2e4 ~]# e2fsck -f /dev/mapper/atlas-ddn2e-l22
[root@atlas-oss2i1 ~]# debugfs -R "stat <209>" /dev/mapper/atlas-ddn2i-l2
[root@atlas-oss2i1 ~]# debugfs -R "ncheck 209" /dev/mapper/atlas-ddn2i-l2 |
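For reference, a sketch of the fid2path lookup mentioned above, run from a Lustre client (the mount point and FID below are placeholders, not values from this ticket):
lfs fid2path /mnt/lustre [0x200000400:0x1:0x0]   # prints the pathname(s) that correspond to this FID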
| Comment by James Nunez (Inactive) [ 31/Jan/14 ] |
|
Blake, should we close this ticket, or is there something else you need resolved? Thanks, |
| Comment by Blake Caldwell [ 31/Jan/14 ] |
|
This can be closed. Thanks James. |