
[LU-8071] lvcreate --snapshot of MDT hangs in ldiskfs_journal_start_sb

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Environment: CentOS-6.7
      lustre-2.5.3
      lvm2-2.02.118-3.el6_7.4
      Also note that the MDT uses an external journal device.
    • Severity: 3

    Description

      Similar to LU-7616 "creation of LVM snapshot on ldiskfs based MDT hangs until MDT activity/use is halted", but opening a new case for tracking.

      The goal is to use LVM snapshots and tar to make file-level MDT backups. The procedure worked fine two or three times, then we triggered the following problem on a recent attempt.

      The MDS became extremely sluggish and all MDT threads went into D state when running the following command:

      lvcreate -l95%FREE -s -p r -n mdt_snap /dev/nbp9-vg/mdt9
      

      (the command never returned, and any further lv* commands hung as well)
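
      For reference, below is a minimal sketch of the intended snapshot-plus-tar backup flow. It assumes the volume names from this ticket, a hypothetical /mnt/mdt_snap mount point and /backup destination, and the usual ldiskfs file-level backup steps (saving EAs with getfattr, then tar); exact mount and tar options may need adjusting for this release, and with the external journal noted above a "noload" mount option may be required for the snapshot.

      # Hedged sketch of the intended file-level MDT backup via LVM snapshot.
      # /mnt/mdt_snap and /backup are hypothetical paths; LV names are from this ticket.
      lvcreate -l95%FREE -s -p r -n mdt_snap /dev/nbp9-vg/mdt9            # snapshot (the step that hung here)
      mkdir -p /mnt/mdt_snap
      mount -t ldiskfs -o ro,noload /dev/nbp9-vg/mdt_snap /mnt/mdt_snap   # mount the snapshot read-only
      cd /mnt/mdt_snap
      getfattr -R -d -m '.*' -e hex -P . > /backup/mdt9_ea.bak            # save extended attributes
      tar czf /backup/mdt9_backup.tgz --sparse .                          # file-level copy of the MDT
      cd / && umount /mnt/mdt_snap
      lvremove -f /dev/nbp9-vg/mdt_snap                                   # drop the snapshot when done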

      In the logs...

      Apr 25 17:09:35 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Not tainted)
      Apr 25 17:14:45 nbp9-mds ]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Apr 25 17:14:45 nbp9-mds kernel: ---[ end trace c9f3339c0e103edf ]---
      
      Apr 25 17:14:57 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Tainted: G        W  ---------------   )
      Apr 25 17:14:57 nbp9-mds kernel: Hardware name: AltixXE270
      Apr 25 17:14:57 nbp9-mds kernel: Modules linked in: dm_snapshot dm_bufio osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_round_robin scsi_dh_rdac dm_multipath microcode iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich mfd_core shpchp sg igb dca ptp pps_core tcp_bic ext3 jbd sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic pata_jmicron mptfc scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) memtrack(U) usb_storage radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan]
      Apr 25 17:14:57 nbp9-mds kernel: Pid: 85906, comm: mdt_rdpg00_042 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
      Apr 25 17:14:57 nbp9-mds kernel: Call Trace:
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff8107417a>] ? warn_slowpath_null+0x1a/0x20
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0a1c33e>] ? ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0d6069f>] ? osd_trans_start+0x1df/0x660 [osd_ldiskfs]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0ef3619>] ? lod_trans_start+0x1b9/0x250 [lod]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f7af07>] ? mdd_trans_start+0x17/0x20 [mdd]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f61ece>] ? mdd_close+0x6be/0xb80 [mdd]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e48be9>] ? mdt_mfd_close+0x4a9/0x1bc0 [mdt]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0899525>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa08c07f6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa089a53e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa06ebf05>] ? class_handle2object+0x95/0x190 [obdclass]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e4b6a2>] ? mdt_close+0x642/0xa80 [mdt]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt]
      Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: sdc - rdac checker reports path is down
      Apr 25 17:15:10 nbp9-mds multipathd: checker failed path 8:32 in map nbp9_MGS_MDS
      Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: remaining active paths: 1
      Apr 25 17:15:10 nbp9-mds multipathd: sdd: remove path (uevent)
      Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: failed in domap for removal of path sdd
      Apr 25 17:15:10 nbp9-mds multipathd: uevent trigger error
      Apr 25 17:15:10 nbp9-mds multipathd: sdc: remove path (uevent)
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      
      Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: map in use
      Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: can't flush
      

      The server had to be rebooted and e2fsck run to get it back into production.

      Attachments

        1. bt.all
          917 kB
        2. dmesg.out
          494 kB
        3. lostfound_nonzero.lst
          80 kB
        4. nagtest.toobig.stripes
          36 kB

    Activity

            ndauchy Nathan Dauchy (Inactive) added a comment - - edited

            At some point in the future we plan to migrate to RHEL/CentOS-7, but right now CentOS-6.7 is the priority. It looks like there is additional code that uses vfs_check_frozen in newer ext4/super.c source, but it is not in our 2.6.32-504.30 kernel, nor is it mentioned in the changelog for the very recent CentOS-6.7 2.6.32-573.22 kernel. Are you saying that we just need to incorporate all the existing fixes, or that a new Lustre patch is needed to perform a check unique to ldiskfs?

            If it would be more expedient to use some sort of external pause/resume commands before and after the LVM snapshot is created, then we could work that into the process as well.

            Also, just curious whether this should work on a running MDT or ldiskfs:

            service322 ~ # fsfreeze -f /mnt/lustre/nbptest-mdt
            fsfreeze: /mnt/lustre/nbptest-mdt: freeze failed: Operation not supported
            

            adilger Andreas Dilger added a comment -

            It appears that we may be missing calls in osd-ldiskfs to check whether the block device is currently (being) frozen before submitting IO to the underlying storage (i.e. all modifications blocked at a barrier while in the middle of creating a snapshot).

            In newer kernels (3.5+, e.g. RHEL7) this is handled with sb_start_write() and sb_end_write() at the upper layers, and sb_start_intwrite() and sb_end_intwrite() for filesystem-internal writes; these should be taken before i_mutex (see e.g. kernel commit 8e8ad8a57c75f3b for examples).

            In older kernels (e.g. RHEL6) vfs_check_frozen() is used to see if the device is currently frozen, with level=SB_FREEZE_WRITE to block new modifications from starting, and level=SB_FREEZE_TRANS to only block internal threads once the freeze barrier is complete.

            yong.fan nasf (Inactive) added a comment - - edited

            Quoting Nathan: "OK, I think we have a decent handle on the lost+found contents now, and the e2fsck and lfsck behaviour. It sounds like perhaps there is room for an improvement in lfsck to handle the conflicts more elegantly?"

            If two orphan OST-objects conflict with each other when rebuilding the lost LOV EA, it is NOT easy for LFSCK to know which one is right, because LFSCK cannot analyze the OST-object data. We found the trouble only because of the abnormal file size, based on human judgement. But from the LFSCK view, since the file/object size is in the valid range, it cannot say the current OST-objects A-D are invalid. To that degree, LFSCK needs some human assistance to make the right choice.

            As for the reason the stripe information was lost, I do not know; I only have guesses and suspicions. Rebooting the system during "lvcreate --snapshot" is the main suspect. For a large LOV EA (with tens or hundreds of stripes), the LOV EA is stored in a separate block. If the process is interrupted in the interval between inode creation and LOV EA storage, the inode may lose its LOV EA. On the other hand, the implementation of the LVM snapshot may also be a factor. For example, there may be some internal data relocation during the "lvcreate --snapshot"; if the process is interrupted, the system may be left in an internally inconsistent state.

            According to the stack traces in this ticket, there were some users accessing the system during "lvcreate --snapshot", right? If so, that would cause some copy-on-write, which may be blocked by the in-progress "lvcreate --snapshot" or handled concurrently with it, depending on the LVM implementation. Unfortunately, that is outside Lustre's scope, and I am not familiar with it.

            Honestly, I am not a fan of LVM snapshots, because of the many criticisms of their performance.

            Andreas, do you have any idea about the LVM/e2fsck trouble?


            ndauchy Nathan Dauchy (Inactive) added a comment -

            OK, I think we have a decent handle on the lost+found contents now, and the e2fsck and lfsck behaviour. It sounds like perhaps there is room for an improvement in lfsck to handle the conflicts more elegantly?

            What is still a significant unknown is how this corruption happened to begin with and what we can do to avoid it in the future. For now, we are no longer performing LVM snapshots of the MDT. Do you have a more complete explanation for the source of the lost striping information on the MDT? It is probably not just the LVM snapshot or e2fsck, as each has been used widely before. The server reboot while "lvcreate --snapshot" was running could have something to do with it, I suppose. Do the stack traces and ldiskfs_journal_start_sb messages provide any clues?

            yong.fan nasf (Inactive) added a comment - - edited

            There are 3 kinds of files under your Lustre lost+found directory:

            1) The file name contains the infix "-C-". That means multiple OST-objects claim the same MDT-object and the same slot in the LOV EA, so the layout LFSCK will create new MDT-object(s) to hold the conflicting OST-object(s). Searching your "lostfound_nonzero.lst" file, there are 4 files with "-C-" names. As described in the former comment, they correspond to the file "nagtest.toobig.stripes" and conflict with the OST-objects A-D. If you think the layout LFSCK made the wrong decision when it re-generated that file's LOV EA, we need to make a new patch to recover it. But since there is only one file with conflicting OST-objects, you can also recover it manually.

            2) The file name contains the infix "-R-". That means the orphan OST-object knows its parent MDT-object FID, but does not know its position (the file name) in the namespace, so LFSCK has to create a file with the known PFID. In such a case, if you can recognize something from the file data, you can rename the file back into the normal namespace (see the sketch after this list).

            3) The file name contains the infix "-O-". These are orphan MDT-objects. The backend e2fsck put them into the local /lost+found directory, which is invisible to the client. LFSCK scans the MDT-objects under the local /lost+found directory; if an MDT-object has a valid linkEA, LFSCK moves it back to the normal namespace, otherwise it is put under the global Lustre lost+found directory that is visible to the client. As in case 2) above, if you can recognize something from the file data, you can rename the file back into the normal namespace. On the other hand, I noticed that the files with "-O-" names all have size 5792, which is mysterious; unless e2fsck found some corrupted inodes, that should be almost impossible. So you have to check these files with more human knowledge, and if you think they are invalid, remove them directly.
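
            As a hedged illustration of the manual recovery for cases 2) and 3), with a made-up entry name and a destination path chosen only for the example (inspect the contents first, then rename the file back into the client-visible namespace):

            # Hypothetical "-R-" entry name and destination; verify the data before renaming.
            ls -l /nobackupp9/.lustre/lost+found/MDT0000/
            file '/nobackupp9/.lustre/lost+found/MDT0000/[0x20001d596:0x97:0x0]-R-0'
            mv '/nobackupp9/.lustre/lost+found/MDT0000/[0x20001d596:0x97:0x0]-R-0' \
               /nobackupp9/dtalcott/recovered.file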


            ndauchy Nathan Dauchy (Inactive) added a comment -

            Full listing of nonzero files in lost+found, gathered as:

            cd /nobackupp9/.lustre/lost+found/MDT0000
            find . -type f -a ! -size 0 | xargs ls -ld > /u/ndauchy/tmp/nbp9_diag/lostfound_nonzero.lst
            
            ndauchy Nathan Dauchy (Inactive) added a comment - - edited

            The closest match I could find in lost+found was:

            [0x20011c02e:0x1e13:0x0]-[0x20001d596:0x97:0x0]-3-C-0
            [0x20011c02e:0x69c7:0x0]-[0x20001d596:0x97:0x0]-0-C-0
            [0x20011c030:0x1219c:0x0]-[0x20001d596:0x97:0x0]-2-C-0
            [0x20011c030:0x13ffc:0x0]-[0x20001d596:0x97:0x0]-1-C-0
            

            And the file ownership is the same as /nobackupp9/dtalcott/nagtest.toobig.


            ndauchy Nathan Dauchy (Inactive) added a comment -

            To follow up on the question from yesterday, which you probably already guessed...

            There are 240 OSTs on the corrupted file system (across 16 OSSes).


            yong.fan nasf (Inactive) added a comment -

            By default, the repairing behaviour is recorded in the Lustre debug log via the label "D_LFSCK". But because the Lustre kernel debug log is kept in RAM only, if you did not dump it periodically it will have been overwritten. I am not sure whether you have such a log or not; a sketch for capturing it is at the end of this comment.

            On the other hand, the log "nagtest.toobig.stripes" shows 244 OST-objects claiming to be stripes of [0x20001d596:0x97:0x0], but the "lfs getstripe" output shows that the file "nagtest.toobig" only has 240 stripes. So at least 4 OST-objects are fake. I found that 4 OST-objects with size 251658240 bytes conflict with another 4 OST-objects with size 4194304 bytes, as shown below:

            A) 65.out.service162:F,2833,10491,1179,362692,251658240,lma,[0x100000000:0x93bfc2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461812807,1461812805,1461812807
            B) 104.out.service169:F,240,10491,1179,30814,251658240,lma,[0x100000000:0x9e78e3:0x0],0,fid,[0x20001d596:0x97:0x1],1,1461812807,1461812805,1461812807
            C) 121.out.service170:F,2375,10491,1179,304121,251658240,lma,[0x100000000:0x969e66:0x0],0,fid,[0x20001d596:0x97:0x2],2,1461812807,1461812805,1461812807
            D) 87.out.service168:F,4361,10491,1179,558333,251658240,lma,[0x100000000:0x97b343:0x0],0,fid,[0x20001d596:0x97:0x3],3,1461812807,1461812805,1461812807
            
            a) 56.out.service169:F,32528,10491,1179,4163662,4194304,lma,[0x100000000:0x512a2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461629209,1461629204,1461629209
            b) 207.out.service176:F,37845,10491,1179,4844182,4194304,lma,[0x100000000:0x5139a:0x0],0,fid,[0x20001d596:0x97:0x1],1,1461629209,1461629204,1461629209
            c) 137.out.service170:F,40964,10491,1179,5243393,4194304,lma,[0x100000000:0x512d2:0x0],0,fid,[0x20001d596:0x97:0x2],2,1461629209,1461629204,1461629209
            d) 236.out.service173:F,35030,10491,1179,4483887,4194304,lma,[0x100000000:0x51320:0x0],0,fid,[0x20001d596:0x97:0x3],3,1461629209,1461629204,1461629209
            

            The OST-objects A-D with size 251658240 bytes claim to be the first 4 stripes of the file "nagtest.toobig". The reason may be as follows: the file "nagtest.toobig" lost its LOV EA for some unknown reason, and either the OST-objects A-D or a-d have a corrupted PFID EA. Judging by the size attribute, the OST-objects A-D seem to have the corrupted PFID EA. During the LFSCK processing, the layout LFSCK tried to re-generate the "nagtest.toobig" LOV EA from these 244 orphan OST-objects. Unfortunately, the OST-objects A-D were found earlier than the OST-objects a-d, so the regenerated LOV EA contains the OST-object A-D information. Then, when the orphan OST-objects a-d were handled, the layout LFSCK found that they conflicted with others, so it should have created some new files with names like "[0x20001d596:0x97:0x0]-[0x100000000:0x512a2:0x0]-C-0" under the directory ".lustre/lost+found/MDT0000/". Please check to verify.
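
            Regarding the D_LFSCK note above, a minimal sketch of capturing the LFSCK repair records on the MDS follows. It assumes the debug mask is selectable as "lfsck" on this release, that the MDT target is named nbp9-MDT0000, and uses a hypothetical output path:

            # Hedged sketch: enlarge the in-RAM debug buffer, enable the LFSCK
            # debug mask (assumption: selectable as "lfsck" here), run the layout
            # scan, then dump the buffer before it wraps.
            lctl set_param debug_mb=512
            lctl set_param debug=+lfsck
            lctl lfsck_start -M nbp9-MDT0000 -t layout
            lctl dk /tmp/lfsck_debug.log        # dump the kernel debug log to a file
            grep -i lfsck /tmp/lfsck_debug.log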


            mhanafi Mahmoud Hanafi added a comment -

            Is there a way to log which files lfsck is making changes to?

            As for /nobackupp9/dtalcott/nagtest.toobig: I am attaching an object scan of all OSTs for this file. You can see that there were 2 stripe-0 objects. All the stripes of this file are 4194304 bytes in size, but the stripe on OST 65 is 251658240.

            #filename:type,blockgroup,uid,gid,inode,size,type,fid,stripe,ctime,atime,mtime
            56.out.service169:F,32528,10491,1179,4163662,4194304,lma,[0x100000000:0x512a2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461629209,1461629204,1461629209
            65.out.service162:F,2833,10491,1179,362692,251658240,lma,[0x100000000:0x93bfc2:0x0],0,fid,[0x20001d596:0x97:0x0],0,1461812807,1461812805,1461812807

            and OST 65 also holds another stripe of this file (index 216 in the scan below):
            #filename:type,blockgroup,uid,gid,inode,size,type,fid,stripe,ctime,atime,mtime
            65.out.service162:F,39432,10491,1179,5047326,4194304,lma,[0x100000000:0x51208:0x0],0,fid,[0x20001d596:0x97:0xd8],216,1461629208,1461629204,1461629208

            yong.fan nasf (Inactive) added a comment - - edited

            How many OSTs are in your system? Have you ever created the file nagtest.toobig with 240 stripes? If not, it may be because some OST-object(s) contained a bad PFID EA and claimed that their parent is the file nagtest.toobig, but LFSCK could NOT know whether that is right or not. Since nothing conflicted with them, nagtest.toobig's LOV EA was enlarged to contain those OST-object(s), which is why the size increased.

            From the "lfs getstripe" output we know every OST-object's ID, and we can use that ID to check the object on the related OST. For example:

            obdidx objid objid group
            65 9682882 0x93bfc2 0

            Here "65" corresponding to the OST0041 (HEX), the "9682882" corresponding to the sub-dir "9682882 % 32 = 2" under /O, you can check such OST-object via: debugfs -c -R "stat /O/0/d2/9682882" $OST0041_dev

            On the other hand, the OST-objects with IDs "332xx" look abnormal. Usually the OST-object IDs on different OSTs are quite scattered, but in your case they are adjacent numbers. That makes me suspect that the LOV EA was over-written by some patterned data.


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: ndauchy Nathan Dauchy (Inactive)
              Votes: 0
              Watchers: 12
