[LU-6895] sanity-lfsck test 4 hung: bad entry in directory: rec_len is smaller than minimal - inode=3925999616 Created: 22/Jul/15 Updated: 13/Jan/16 Resolved: 18/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jian Yu | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre Tag: 2.7.56 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-lfsck test 4 hung as follows:

Starting client: shadow-23vm5: -o user_xattr,flock shadow-23vm8@tcp:/lustre /mnt/lustre
CMD: shadow-23vm5 mkdir -p /mnt/lustre
CMD: shadow-23vm5 mount -t lustre -o user_xattr,flock shadow-23vm8@tcp:/lustre /mnt/lustre

On client shadow-23vm5:
[ 9794.882548] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[ 9794.888889] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock shadow-23vm8@tcp:/lustre /mnt/lustre
[ 9794.910641] LustreError: 11-0: lustre-MDT0000-mdc-ffff880071b43400: operation mds_connect to node 10.1.5.29@tcp failed: rc = -30
[ 9799.909475] LustreError: 11-0: lustre-MDT0000-mdc-ffff880071b43400: operation mds_connect to node 10.1.5.29@tcp failed: rc = -30

On MDS shadow-23vm8:
[ 9799.279381] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
[ 9799.343571] LustreError: 9974:0:(tgt_lastrcvd.c:929:tgt_client_new()) lustre-MDT0000: unable to set MULTI RPCS incompatibility flag
[ 9801.154230] LustreError: 10003:0:(qsd_entry.c:291:qsd_update_index()) lustre-MDT0000: failed to update global index copy for id 0, rc:-30
[ 9801.154261] Lustre: 10003:0:(qsd_reint.c:487:qsd_reint_main()) lustre-MDT0000: reint global for [0x200000006:0x1010000:0x0] failed. -30
[ 9804.341804] LustreError: 9974:0:(tgt_lastrcvd.c:952:tgt_client_new()) lustre-MDT0000: Failed to write client lcd at idx 0, rc -30
[ 9809.342270] LustreError: 9974:0:(tgt_lastrcvd.c:952:tgt_client_new()) lustre-MDT0000: Failed to write client lcd at idx 0, rc -30
[ 9814.342330] LustreError: 9974:0:(tgt_lastrcvd.c:952:tgt_client_new()) lustre-MDT0000: Failed to write client lcd at idx 0, rc -30

Maloo report: https://testing.hpdd.intel.com/test_sets/2642815a-2f37-11e5-bc70-5254006e85c2 |
| Comments |
| Comment by Oleg Drokin [ 23/Jul/15 ] |
|
In MDS logs we can see:

[ 9798.706830] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:939: inode #25142: block 7408: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0
[ 9798.706838] Aborting journal on device dm-0-8.

Now the big question is: how come we have this corruption all of a sudden in this presumably pretty fresh filesystem? |
| Comment by Alex Zhuravlev [ 27/Jul/15 ] |
|
I've been running sanity-lfsck on SLES for 24+ hours, but still can't reproduce the issue. |
| Comment by Jian Yu [ 27/Jul/15 ] |
|
For the master branch, the latest occurrences are:

Here is a for-test-only patch trying to reproduce the failure on the latest master branch: http://review.whamcloud.com/15737 |
| Comment by Jian Yu [ 28/Jul/15 ] |
|
More failure instances on the master branch with the RHEL 7.1 distro: |
| Comment by Alex Zhuravlev [ 13/Aug/15 ] |
|
is there a way to access disk image left after a failed test? |
| Comment by Jian Yu [ 13/Aug/15 ] |
|
Hi Alex, I'm afraid not for autotest, because the test nodes were re-provisioned for testing other patches. |
| Comment by Jian Yu [ 15/Aug/15 ] |
|
Failed to provision an RHEL 7.1 server node (TEI-3793) and an SLES11 SP3 server node (not supported) on the eagle cluster. There are no free nodes on the onyx and shadow clusters. |
| Comment by Gerrit Updater [ 20/Aug/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16035 |
| Comment by nasf (Inactive) [ 20/Aug/15 ] |
|
Hope the above patch can fix the issue of "LustreError: 108955:0:(osd_handler.c:5497:osd_dirent_check_repair()) ASSERTION( !(lma->lma_compat & LMAC_NOT_IN_OI) ) failed:" |
| Comment by Gerrit Updater [ 20/Aug/15 ] |
|
Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/16036 |
| Comment by Andreas Dilger [ 25/Aug/15 ] |
|
The patch here may also relate to |
| Comment by nasf (Inactive) [ 26/Aug/15 ] |
|
The patch http://review.whamcloud.com/16035 (or http://review.whamcloud.com/16044) only avoids the LBUG of "osd_dirent_check_repair()) ASSERTION( !(lma->lma_compat & LMAC_NOT_IN_OI) ) failed:". I do not think it will resolve the MDS crash in |
| Comment by Jian Yu [ 28/Aug/15 ] |
|
In addition, the original failure of this ticket still needs to be resolved after landing nasf's patches. |
| Comment by nasf (Inactive) [ 28/Aug/15 ] |
|
I have checked all the failure instances and found that they all failed at the same child entry: inode=3925999616

https://testing.hpdd.intel.com/test_logs/279ce69e-2f37-11e5-bc70-5254006e85c2/show_text
[ 9798.706830] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:939: inode #25142: block 7408: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0

https://testing.hpdd.intel.com/test_logs/b22e5a94-26d7-11e5-b3d7-5254006e85c2/show_text
[11790.484265] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:1270: inode #25135: block 8434: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0

https://testing.hpdd.intel.com/test_logs/c8e97dc8-256e-11e5-b033-5254006e85c2/show_text
[12907.839356] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:939: inode #25136: block 7408: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0

https://testing.hpdd.intel.com/test_logs/61b52aca-00bc-11e5-9650-5254006e85c2/show_text
[20101.639390] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:954: inode #25139: block 7407: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0

"printf %x 3925999616" prints ea020000, which looks quite like:

#define LDISKFS_XATTR_MAGIC 0xEA020000

I cannot say that the inode# is just the LDISKFS_XATTR_MAGIC, because I cannot establish any relationship between them. But since all the failure instances failed at the same child entry, we should pay more attention to it. |
| Comment by Andreas Dilger [ 28/Aug/15 ] |
|
So it seems that the directory block and an external xattr block are getting mixed up:

struct ext4_dir_entry_2 {
	__le32	inode;			/* Inode number */
	__le16	rec_len;		/* Directory entry length */
	__u8	name_len;		/* Name length */
	__u8	file_type;
	char	name[EXT4_NAME_LEN];	/* File name */
};

struct ext4_xattr_header {
	__le32	h_magic;	/* magic number for identification */
	__le32	h_refcount;	/* reference count */
	__le32	h_blocks;	/* number of disk blocks used */
	__le32	h_hash;		/* hash value of all attributes */
	__le32	h_checksum;	/* crc32c(uuid+id+xattrblock) */
				/* id = inum if refcount=1, blknum otherwise */
	__u32	h_reserved[3];	/* zero right now */
};

The h_refcount=0x00000001 is only used for external xattr blocks, and it matches rec_len=0x0001, name_len=0x00.
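For illustration, here is a small user-space sketch (my own, not kernel code) that reinterprets the leading bytes of such an xattr header as a directory entry; on a little-endian host it reproduces exactly the values reported by htree_dirblock_to_tree() in the logs:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct dirent_head {		/* leading fields of ext4_dir_entry_2 */
	uint32_t inode;
	uint16_t rec_len;
	uint8_t  name_len;
	uint8_t  file_type;
};

struct xattr_head {		/* leading fields of ext4_xattr_header */
	uint32_t h_magic;	/* LDISKFS_XATTR_MAGIC = 0xEA020000 */
	uint32_t h_refcount;	/* 1 for a non-shared external xattr block */
};

int main(void)
{
	struct xattr_head xh = { .h_magic = 0xEA020000, .h_refcount = 1 };
	struct dirent_head de;

	/* Both structs are 8 bytes with no padding, so this models a
	 * directory-entry read landing on an xattr block. */
	memcpy(&de, &xh, sizeof(de));

	/* Prints: inode=3925999616 rec_len=1 name_len=0 */
	printf("inode=%u rec_len=%u name_len=%u\n",
	       (unsigned)de.inode, (unsigned)de.rec_len,
	       (unsigned)de.name_len);
	return 0;
}

(Compiles with any C99 compiler; the output matches the corrupted dirent in the logs.) |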
| Comment by Gerrit Updater [ 15/Sep/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16035/ |
| Comment by Gerrit Updater [ 16/Sep/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16439 |
| Comment by Gerrit Updater [ 16/Sep/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16440 |
| Comment by Gerrit Updater [ 17/Sep/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/16468 |
| Comment by Andreas Dilger [ 01/Oct/15 ] |
|
Need to add a test for this before closing this bug. Best would be to add a full e2fsck and LFSCK run in conf-sanity test_32, and fix that test to also run on ZFS. |
| Comment by Alex Zhuravlev [ 01/Oct/15 ] |
|
Andreas, I don't think we have a fix for the issue yet. AFAIU, Fan Yong's patch can potentially help in the case of the 2nd and subsequent dir entries, but the failure is always at offset 0, i.e. the 1st entry is corrupted. |
| Comment by Peter Jones [ 05/Oct/15 ] |
|
Does this issue affect both ZFS and ldiskfs or just the latter? |
| Comment by Jian Yu [ 05/Oct/15 ] |
|
Hi Peter, It only affects ldiskfs because OI Scrub is not implemented for ZFS. |
| Comment by Peter Jones [ 05/Oct/15 ] |
|
Ah ok - thanks! |
| Comment by James Nunez (Inactive) [ 09/Oct/15 ] |
Adding LFSCK and e2fsck runs to conf-sanity test_32 is being tracked under LU-7212. |
| Comment by Gerrit Updater [ 14/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16440/ |
| Comment by Gerrit Updater [ 14/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16439/ |
| Comment by Joseph Gmitter (Inactive) [ 14/Oct/15 ] |
|
Landed for 2.8 |
| Comment by Gerrit Updater [ 14/Oct/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16821 |
| Comment by Gerrit Updater [ 28/Oct/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16821/ |
| Comment by Joseph Gmitter (Inactive) [ 28/Oct/15 ] |
|
The latest patch, the one for which this ticket was reopened, has landed. Fixed for 2.8 |
| Comment by Gerrit Updater [ 14/Nov/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17197 |
| Comment by nasf (Inactive) [ 01/Dec/15 ] |
|
We need more work on this bug. |
| Comment by Gerrit Updater [ 01/Dec/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17403 |
| Comment by nasf (Inactive) [ 10/Dec/15 ] |
|
For a newly formatted Lustre system, there is no guarantee that all the on-disk inodes/blocks are initialized. Instead, they are only marked as free in the related inode/block bitmaps; an inode is initialized when it is allocated to some object. Such processing accelerates the format, but it may cause trouble for MDT file-level backup/restore. For example:

The sysadmin backs up the MDT via server-side file-level tar, then reformats the MDT device, and then restores the MDT from the backup. Assume some object_A with FID FID_A was mapped to inode_B before the backup, and that OI mapping was recorded in the OI file. After the restore, another inode_C is assigned to object_A, but until OI scrub rebuilds the related OI mapping, the stale mapping "FID_A => inode_B" is still in the OI file and can be found by OI lookup.

Generally, even if someone finds the stale OI mapping, it is not trouble, because the OSD will verify whether the FID-in-LMA of inode_B matches FID_A or not. But if inode_B has NOT been allocated after the restore, then, because we did not initialize inode_B during the reformat, the FID-in-LMA of inode_B is still the same as before the backup. That means the verification will wrongly succeed, and the free inode_B will be treated as the valid object_A.

So there are two key issues:
1) Some FID-based operation may access a stale OI mapping after MDT file-level backup/restore.
2) OSD-ldiskfs may get a non-allocated inode for the given ino#/generation.

As long as we can resolve one of them, the trouble in this ticket can be resolved.

The solutions for 1):
1.1) Avoid FID-based operations before OI scrub is done. That is not easy, because fid2path cannot be covered by a name-based operation.
1.2) Remove the OI files after MDT file-level backup/restore. That is the more direct solution. Another benefit is that even when OI scrub has rebuilt the OI files, it only guarantees that all the FIDs' OI mappings have been refreshed; it does not clean out the stale FIDs' OI mappings, because OI scrub only does inode-table based scanning, not OI file scanning. Removing the OI files resolves the related trouble completely.

The solutions for 2):
2.1) A new ldiskfs patch to make ldiskfs_iget() return -ENOENT when loading a non-allocated inode, by checking the inode bitmap.
2.2) Check the inode's validity inside the OSD via the related inode bitmap.

Generally, fewer ldiskfs patches is better, since that saves a lot of effort on kernel upgrades, so 2.1) is not the best solution. As for 2.2), it is not good to access the inode bitmap directly in the OSD unless we have to. Relatively, solution 1.2) is more efficient and beneficial. I will make a patch for that; see the sketch below.
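To make this concrete, below is a toy user-space model (all names and structures are invented for illustration; this is not the OSD-ldiskfs code) of how a stale OI mapping plus a free-but-uninitialized inode cheats the FID-in-LMA verification:

#include <stdbool.h>
#include <stdio.h>

struct toy_fid { unsigned long long seq; unsigned oid; };

struct toy_inode {
	bool           allocated;	/* what the inode bitmap says */
	struct toy_fid lma_fid;		/* FID-in-LMA in the inode body */
};

/*
 * Inode table after reformat + restore: ino 2 is FREE in the bitmap,
 * but its on-disk body still carries the pre-backup FID because the
 * reformat did not wipe unallocated inodes.
 */
static struct toy_inode itable[3] = {
	[1] = { .allocated = true,  .lma_fid = { 0x200000401ULL, 0x1 } },
	[2] = { .allocated = false, .lma_fid = { 0x200000401ULL, 0x2 } },
};

/* Stale mapping restored from the pre-backup OI file: FID_A -> ino 2. */
static unsigned oi_lookup(struct toy_fid fid)
{
	return fid.oid == 0x2 ? 2 : 1;
}

static bool fid_eq(struct toy_fid a, struct toy_fid b)
{
	return a.seq == b.seq && a.oid == b.oid;
}

int main(void)
{
	struct toy_fid fid_a = { 0x200000401ULL, 0x2 };
	unsigned ino = oi_lookup(fid_a);	/* stale OI mapping */
	struct toy_inode *inode = &itable[ino];	/* "iget" that never
						 * checks the bitmap */

	if (fid_eq(inode->lma_fid, fid_a))
		printf("ino %u accepted although allocated=%d: "
		       "FID-in-LMA verification was cheated\n",
		       ino, inode->allocated);
	return 0;
}

Running it prints that ino 2 is accepted even though the bitmap says it is free, which is exactly issue 2) above. |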
| Comment by Alex Zhuravlev [ 10/Dec/15 ] |
|
if an inode isn't allocated, then the corresponding bit in the bitmap should be 0 ? |
| Comment by nasf (Inactive) [ 10/Dec/15 ] |
|
Yes, the inode bitmap marks the inode as free after the restore, but the inode itself is not in an initialised state, because it had been used before the backup. And ldiskfs_iget() does not check the inode bitmap. |
| Comment by Alex Zhuravlev [ 10/Dec/15 ] |
|
well, we've got osd_iget() which can check the bitmaps (at least in case of broken OI) ? |
| Comment by Alex Zhuravlev [ 10/Dec/15 ] |
|
also, I don't understand one thing - after restore we know that OI is stale, right? then how is it possible that someone is able to use it before it's rebuilt ? |
| Comment by nasf (Inactive) [ 10/Dec/15 ] |
I have made a patch to check the inode bitmap when ldiskfs_iget() returns the corrupted directory inode. It shows that the inode bitmap marks the returned inode as free. |
| Comment by nasf (Inactive) [ 10/Dec/15 ] |
It is LFSCK locating the .lustre/lost+found/MDTxxxx object. Originally, such locating is FID-based directly, so osd_fid_lookup() finds the old OI mapping in the OI file, then ldiskfs_iget() returns the old inode, and the subsequent osd_check_lma() finds that the FID-in-LMA matches the given one, so it regards the inode as valid. But in fact, it is a free inode. |
| Comment by Alex Zhuravlev [ 10/Dec/15 ] |
|
I find it wrong that we let anyone use a stale OI; it's just a matter of time until we hit a similar issue again in the future. |
| Comment by nasf (Inactive) [ 10/Dec/15 ] |
|
It is not that case. Originally, after the file-level backup/restore, OI scrub would rebuild the OI files. But such rebuilding is based on the original OI files, which means that during the rebuilding the OI file may contain some valid OI mappings and some stale OI mappings. From the outside, it is not easy to know which mapping is valid and which is invalid, so osd_fid_lookup() verifies the mapping by checking the FID-in-LMA. Unfortunately, if the target inode has been neither allocated nor initialized, then such verification will be cheated. That is why I made a patch to remove the OI files completely for the file-level backup/restore case. That avoids such trouble from the root. |
| Comment by Alex Zhuravlev [ 10/Dec/15 ] |
|
IIRC, we used the FS's UUID (which is unique) to recognize the backup/restore case, which invalidates most of the OI entries. |
| Comment by nasf (Inactive) [ 11/Dec/15 ] |
|
Yes, that is what we did. When initialising the OSD, it checks whether the FS UUID stored in the OI_scrub file matches the current FS UUID. If not, then it is quite possible that the system has been restored. At that time, almost all of the OI mappings in the OI files are invalid. But as OI scrub runs, more and more OI mappings are updated in the OI files, so the OI files will contain both valid and invalid OI mappings. The new patch will remove the OI files completely when MDT file-level backup/restore is detected. That guarantees that during the OI scrub, all the OI mappings in the OI files are valid; a sketch of the detection step follows.
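For reference, a minimal sketch of that detection step (the struct and function names are illustrative stand-ins, not the actual osd-scrub code), assuming the OI_scrub file records the FS UUID seen at the last scrub:

#include <stdbool.h>
#include <string.h>

/* Illustrative stand-ins, not the real osd-scrub structures. */
struct scrub_file { unsigned char sf_uuid[16]; /* UUID at last scrub */ };
struct superblock { unsigned char s_uuid[16];  /* UUID of current FS */ };

/* A reformat generates a fresh UUID, so a mismatch between the UUID
 * recorded in OI_scrub and the current superblock UUID indicates a
 * likely file-level backup/restore. */
static bool detect_restore(const struct scrub_file *sf,
                           const struct superblock *sb)
{
	return memcmp(sf->sf_uuid, sb->s_uuid, sizeof(sf->sf_uuid)) != 0;
}

On a mismatch, the new patch removes the OI files entirely before OI scrub starts rebuilding them. |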
| Comment by Gerrit Updater [ 18/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17403/ |
| Comment by Joseph Gmitter (Inactive) [ 18/Dec/15 ] |
|
Landed for 2.8 |