[LU-6895] sanity-lfsck test 4 hung: bad entry in directory: rec_len is smaller than minimal - inode=3925999616 Created: 22/Jul/15  Updated: 13/Jan/16  Resolved: 18/Dec/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Critical
Reporter: Jian Yu Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre Tag: 2.7.56
Distro/Arch: SLES11SP3/x86_64 (both client and server)


Issue Links:
Blocker
is blocking LU-6843 Support RHEL7 on b2_5_fe Resolved
Duplicate
is duplicated by LU-7572 sanity-scrub 4c : server kernel BUG a... Resolved
Related
is related to LU-7169 conf-sanity 84 restart mds1 failed Resolved
is related to LU-7315 In osd_dirent_check_repair(), Pointer... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-lfsck test 4 hung as follows:

Starting client: shadow-23vm5:  -o user_xattr,flock shadow-23vm8@tcp:/lustre /mnt/lustre
CMD: shadow-23vm5 mkdir -p /mnt/lustre
CMD: shadow-23vm5 mount -t lustre -o user_xattr,flock shadow-23vm8@tcp:/lustre /mnt/lustre

On client shadow-23vm5:

[ 9794.882548] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[ 9794.888889] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock shadow-23vm8@tcp:/lustre /mnt/lustre
[ 9794.910641] LustreError: 11-0: lustre-MDT0000-mdc-ffff880071b43400: operation mds_connect to node 10.1.5.29@tcp failed: rc = -30
[ 9799.909475] LustreError: 11-0: lustre-MDT0000-mdc-ffff880071b43400: operation mds_connect to node 10.1.5.29@tcp failed: rc = -30

On MDS shadow-23vm8:

[ 9799.279381] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace
[ 9799.343571] LustreError: 9974:0:(tgt_lastrcvd.c:929:tgt_client_new()) lustre-MDT0000: unable to set MULTI RPCS incompatibility flag
[ 9801.154230] LustreError: 10003:0:(qsd_entry.c:291:qsd_update_index()) lustre-MDT0000: failed to update global index copy for id 0, rc:-30
[ 9801.154261] Lustre: 10003:0:(qsd_reint.c:487:qsd_reint_main()) lustre-MDT0000: reint global for [0x200000006:0x1010000:0x0] failed. -30
[ 9804.341804] LustreError: 9974:0:(tgt_lastrcvd.c:952:tgt_client_new()) lustre-MDT0000: Failed to write client lcd at idx 0, rc -30
[ 9809.342270] LustreError: 9974:0:(tgt_lastrcvd.c:952:tgt_client_new()) lustre-MDT0000: Failed to write client lcd at idx 0, rc -30
[ 9814.342330] LustreError: 9974:0:(tgt_lastrcvd.c:952:tgt_client_new()) lustre-MDT0000: Failed to write client lcd at idx 0, rc -30

Maloo report: https://testing.hpdd.intel.com/test_sets/2642815a-2f37-11e5-bc70-5254006e85c2



 Comments   
Comment by Oleg Drokin [ 23/Jul/15 ]

In MDS logs we can see:

[ 9798.706830] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:939: inode #25142: block 7408: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0
[ 9798.706838] Aborting journal on device dm-0-8.

Now the big question is - how come we have this corruption all of a sudden in this presumably pretty fresh filesystem?

Comment by Alex Zhuravlev [ 27/Jul/15 ]

I've been running sanity-lfsck on SLES for 24+ hours, but still can't reproduce the issue.

Comment by Jian Yu [ 27/Jul/15 ]

For master branch, the latest occurrences are:
2015-07-19 https://testing.hpdd.intel.com/test_sets/2642815a-2f37-11e5-bc70-5254006e85c2 (2.7.56)
2015-07-10 https://testing.hpdd.intel.com/test_sets/b064bb40-26d7-11e5-b3d7-5254006e85c2 (2.7.56)
2015-07-07 https://testing.hpdd.intel.com/test_sets/c86b0010-256e-11e5-b033-5254006e85c2 (2.7.56)
2015-05-21 https://testing.hpdd.intel.com/test_sets/60586868-00bc-11e5-9650-5254006e85c2 (2.7.54)

Here is a for-test-only patch trying to reproduce the failure on the latest master branch: http://review.whamcloud.com/15737

Comment by Jian Yu [ 28/Jul/15 ]

More failure instances on master branch against RHEL 7.1 distro:
https://testing.hpdd.intel.com/test_sessions/5b78dad6-34d7-11e5-a27f-5254006e85c2

Comment by Alex Zhuravlev [ 13/Aug/15 ]

is there a way to access disk image left after a failed test?

Comment by Jian Yu [ 13/Aug/15 ]

Hi Alex, I don't think so for autotest, because the test nodes were re-provisioned for testing other patches.
I'll try to manually reproduce the failure on eagle cluster.

Comment by Jian Yu [ 15/Aug/15 ]

Failed to provision a RHEL 7.1 server node (TEI-3793) and a SLES11 SP3 server node (not supported) on the eagle cluster. There are no free nodes on the onyx and shadow clusters.

Comment by Gerrit Updater [ 20/Aug/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16035
Subject: LU-6895 osd-ldiskfs: handle REMOTE_PARENT_DIR properly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b50a0450dd24dd3e59537296d82885bebd8a665f

Comment by nasf (Inactive) [ 20/Aug/15 ]

Hope the above patch can fix the issue of "LustreError: 108955:0:(osd_handler.c:5497:osd_dirent_check_repair()) ASSERTION( !(lma->lma_compat & LMAC_NOT_IN_OI) ) failed:"

Comment by Gerrit Updater [ 20/Aug/15 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/16036
Subject: LU-6895 debug: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 60c7612c6b63b414167b7353f55f72184607c340

Comment by Andreas Dilger [ 25/Aug/15 ]

The patch here may also relate to LU-6974, but I am not sure whether it will resolve that issue.

Comment by nasf (Inactive) [ 26/Aug/15 ]

The patch http://review.whamcloud.com/16035 (or http://review.whamcloud.com/16044) only avoids the LBUG of "osd_dirent_check_repair()) ASSERTION( !(lma->lma_compat & LMAC_NOT_IN_OI) ) failed:". I do not think it will resolve the issue of MDS crash in LU-6974.

Comment by Jian Yu [ 28/Aug/15 ]

In addition, the original failure of this ticket still needs to be resolved after landing nasf's patches.

Comment by nasf (Inactive) [ 28/Aug/15 ]

I have checked all the failure instances, and found that they all failed at the same child entry: inode=3925999616

https://testing.hpdd.intel.com/test_logs/279ce69e-2f37-11e5-bc70-5254006e85c2/show_text
[ 9798.706830] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:939: inode #25142: block 7408: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0
https://testing.hpdd.intel.com/test_logs/b22e5a94-26d7-11e5-b3d7-5254006e85c2/show_text
[11790.484265] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:1270: inode #25135: block 8434: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0
https://testing.hpdd.intel.com/test_logs/c8e97dc8-256e-11e5-b033-5254006e85c2/show_text
[12907.839356] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:939: inode #25136: block 7408: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0
https://testing.hpdd.intel.com/test_logs/61b52aca-00bc-11e5-9650-5254006e85c2/show_text
[20101.639390] LDISKFS-fs error (device dm-0): htree_dirblock_to_tree:954: inode #25139: block 7407: comm lfsck: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=3925999616, rec_len=1, name_len=0

printf %x 3925999616
ea020000

It looks quite like:

#define LDISKFS_XATTR_MAGIC             0xEA020000

I cannot say for certain that the inode# is the LDISKFS_XATTR_MAGIC, because I cannot establish any relationship between them. But since all the failure instances failed at the same child entry, it deserves more attention.

Comment by Andreas Dilger [ 28/Aug/15 ]

So it seems that the directory block and an external xattr block are getting mixed up:

struct ext4_dir_entry_2 {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __u8    name_len;               /* Name length */
        __u8    file_type;
        char    name[EXT4_NAME_LEN];    /* File name */
};
struct ext4_xattr_header {
        __le32  h_magic;        /* magic number for identification */
        __le32  h_refcount;     /* reference count */
        __le32  h_blocks;       /* number of disk blocks used */
        __le32  h_hash;         /* hash value of all attributes */
        __le32  h_checksum;     /* crc32c(uuid+id+xattrblock) */
                                /* id = inum if refcount=1, blknum otherwise */
        __u32   h_reserved[3];  /* zero right now */
};

The h_refcount=0x00000001 is only used for external xattr blocks, and matches rec_len=0x0001, name_len=0x00.
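
As a quick standalone illustration (userspace types, little-endian host assumed; struct layouts reduced to their leading fields), reading the start of an external xattr block as a directory entry reproduces exactly the logged values:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* On-disk layouts from above, reduced to their leading fields. */
struct dirent_head {
        uint32_t inode;      /* __le32 inode */
        uint16_t rec_len;    /* __le16 rec_len */
        uint8_t  name_len;
        uint8_t  file_type;
} __attribute__((packed));

struct xattr_head {
        uint32_t h_magic;    /* 0xEA020000 */
        uint32_t h_refcount; /* 1 for a non-shared external xattr block */
} __attribute__((packed));

int main(void)
{
        unsigned char block[8];
        struct xattr_head xh = { 0xEA020000u, 1u };
        struct dirent_head de;

        memcpy(block, &xh, sizeof(xh));   /* block starts as an xattr header... */
        memcpy(&de, block, sizeof(de));   /* ...but is parsed as a dirent       */

        /* prints: inode=3925999616 rec_len=1 name_len=0 */
        printf("inode=%u rec_len=%u name_len=%u\n",
               (unsigned)de.inode, (unsigned)de.rec_len, (unsigned)de.name_len);
        return 0;
}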

Comment by Gerrit Updater [ 15/Sep/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16035/
Subject: LU-6895 osd-ldiskfs: handle REMOTE_PARENT_DIR properly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6bf1d78ed87dc25cef0e8a48f1ef251f9bb017a8

Comment by Gerrit Updater [ 16/Sep/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16439
Subject: LU-6895 scrub: not trigger scrub if inode removed by race
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 01e7291500e05688e6d79d18df0b89f08eaa4796

Comment by Gerrit Updater [ 16/Sep/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16440
Subject: LU-6895 lfsck: not destroy directory when fix FID-in-dirent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 79d3bca8df59010ad686c02ce319499fe81401c6

Comment by Gerrit Updater [ 17/Sep/15 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/16468
Subject: LU-6895 tests: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5de0f2d029766c345abd4d6c0e089c94881cbc88

Comment by Andreas Dilger [ 01/Oct/15 ]

Need to add a test for this before closing this bug. Best would be to add a full e2fsck and LFSCK run in conf-sanity test_32, and fix that test to also run on ZFS.

Comment by Alex Zhuravlev [ 01/Oct/15 ]

Andreas, I don't think we have a fix for the issue yet. AFAIU, Fan Yong's patch can potentially help in the case of the 2nd and subsequent dir entries, but here it is always offset 0, i.e. the 1st entry is corrupted.

Comment by Peter Jones [ 05/Oct/15 ]

Does this issue affect both ZFS and ldiskfs or just the latter?

Comment by Jian Yu [ 05/Oct/15 ]

Hi Peter,

It only affects ldiskfs because OI Scrub is not implemented for ZFS.

Comment by Peter Jones [ 05/Oct/15 ]

Ah ok - thanks!

Comment by James Nunez (Inactive) [ 09/Oct/15 ]

Need to add a test for this before closing this bug. Best would be to add a full e2fsck and LFSCK run in conf-sanity test_32, and fix that test to also run on ZFS.

Adding LFSCK and e2fsck is being tracked under LU-7212.

Comment by Gerrit Updater [ 14/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16440/
Subject: LU-6895 lfsck: not destroy directory when fix FID-in-dirent
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eebc3da214dfcbc01ba637f0925bfe8635b26138

Comment by Gerrit Updater [ 14/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16439/
Subject: LU-6895 scrub: not trigger scrub if inode removed by race
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d8612e307f28cc81df5d93e45130b9f04837ee27

Comment by Joseph Gmitter (Inactive) [ 14/Oct/15 ]

Landed for 2.8

Comment by Gerrit Updater [ 14/Oct/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/16821
Subject: LU-6895 lfsck: conflict lu_dirent_attrs members
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 720a9fbeeb4edcb4a34022880401fc4291693b23

Comment by Gerrit Updater [ 28/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16821/
Subject: LU-6895 lfsck: conflict lu_dirent_attrs members
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 44247c22b8a93360f2734ded593da6e570083c5b

Comment by Joseph Gmitter (Inactive) [ 28/Oct/15 ]

The latest patch, the one that caused the reopen, has now landed.

Fixed for 2.8

Comment by Gerrit Updater [ 14/Nov/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17197
Subject: LU-6895 ldiskfs: debug patch for kernel-3.x dir corruption
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9bc645bb858bf593145efc0906a1c974fa96e1e4

Comment by nasf (Inactive) [ 01/Dec/15 ]

We need more work on this bug.

Comment by Gerrit Updater [ 01/Dec/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/17403
Subject: LU-6895 lfsck: avoid accessing bad OI during lfsck start
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fd012d15cac12d9ef39f279abe0c4ea40284fe29

Comment by nasf (Inactive) [ 10/Dec/15 ]

For a newly formatted Lustre system, there is no guarantee that all the on-disk inodes/blocks are initialized. Instead, they are only marked as free in the related inode/block bitmaps, and an inode is initialized when it is allocated to some object. This speeds up the format, but it can cause trouble for MDT file-level backup/restore. For example:

The sysadmin backs up the MDT via server-side file-level tar, reformats the MDT device, and then restores the MDT from the backup. Assume some object_A with FID_A was mapped to inode_B before the backup, and that OI mapping was recorded in the OI file. After the restore, another inode_C is assigned to object_A, but until OI scrub rebuilds the related OI mapping, the stale mapping "FID_A => inode_B" is still in the OI file and can be found by OI lookup. Generally, finding a stale OI mapping is not a problem, because the OSD will verify whether the FID-in-LMA of inode_B matches FID_A. But if inode_B is NOT allocated after the restore, then because we did not initialize inode_B during the reformat, its FID-in-LMA is still the same as before the backup; it matches FID_A, so the OSD concludes that inode_B is still assigned to object_A after the restore. That is wrong. Further, although inode_B is not allocated, some of the blocks that were assigned to it may have since been allocated to others, so accessing those blocks via inode_B may read invalid data and trigger assertions such as the one in this issue.
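
As an illustration, here is a toy userspace sketch of the lookup path that gets fooled; all types, values and helpers are illustrative stand-ins, not the real osd_fid_lookup()/ldiskfs_iget() code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy model only. */
struct fid       { unsigned long long seq, oid; };
struct toy_inode { bool allocated; struct fid lma_fid; };

/* inode_B (index 1) was freed by the reformat but never wiped, so its old
 * LMA still carries FID_A from before the backup. */
static struct toy_inode itable[] = {
        { true,  { 0x200000007ULL, 0x1 } },
        { false, { 0x200000007ULL, 0x2 } },   /* inode_B: free, stale LMA */
};

/* The stale OI file still maps FID_A -> inode_B after the restore. */
static int oi_lookup(const struct fid *f) { return f->oid == 0x2 ? 1 : -1; }

static int toy_fid_lookup(const struct fid *fid)
{
        int ino = oi_lookup(fid);                    /* 1) OI lookup: stale mapping */
        if (ino < 0)
                return -1;
        struct toy_inode *inode = &itable[ino];      /* 2) iget(): no bitmap check  */
        if (memcmp(&inode->lma_fid, fid, sizeof(*fid)) != 0)
                return -1;
        return ino;                                  /* 3) LMA matches: a free inode
                                                      *    is treated as valid      */
}

int main(void)
{
        struct fid fid_a = { 0x200000007ULL, 0x2 };
        int ino = toy_fid_lookup(&fid_a);

        /* prints: accepted ino=1, allocated=0 */
        printf("accepted ino=%d, allocated=%d\n", ino,
               ino >= 0 ? (int)itable[ino].allocated : -1);
        return 0;
}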

So there are two key issues:

1) Some FID-based operations may access a stale OI mapping after MDT file-level backup/restore.

2) OSD-ldiskfs may get a non-allocated inode for the given ino#/generation.

As long as we resolve one of them, the trouble in this ticket is resolved.

The solutions for 1):

1.1) Avoid FID-based operations before OI scrub is done. That is not easy, because fid2path cannot be covered by name-based operations.

1.2) Remove the OI files after MDT file-level backup/restore. It is the more direct solution. Another benefit: even when OI scrub rebuilds the OI files, it only guarantees that the OI mappings of existing FIDs are refreshed; it does not clean up stale FIDs' OI mappings, because OI scrub performs inode-table based scanning, not OI file scanning. Removing the OI files resolves the related trouble completely.

The solutions for 2):

2.1) A new ldiskfs patch to make ldiskfs_iget() return -ENOENT when loading a non-allocated inode, by checking the inode bitmap.

2.2) Check the inode's validity inside the OSD via the related inode bitmap.

Generally, fewer ldiskfs patches is better; it saves a lot of effort on kernel upgrades, so 2.1) is not the best solution. As for 2.2), it is not good to access the inode bitmap directly in the OSD unless we have to.

Comparatively, solution 1.2) is more efficient and beneficial. I will make a patch for that.

Comment by Alex Zhuravlev [ 10/Dec/15 ]

If an inode isn't allocated, then the corresponding bit in the bitmap should be 0?

Comment by nasf (Inactive) [ 10/Dec/15 ]

Yes, the inode table bitmap marks the inode as free after the restore, but the inode itself has not been re-initialized, because it had been used before the backup and its old content is still on disk. And ldiskfs_iget() does not check the inode table bitmap.

Comment by Alex Zhuravlev [ 10/Dec/15 ]

Well, we've got osd_iget(), which can check the bitmaps (at least in the case of a broken OI)?

Comment by Alex Zhuravlev [ 10/Dec/15 ]

Also, I don't understand one thing: after the restore we know that the OI is stale, right? Then how is it possible that someone is able to use it before it's rebuilt?

Comment by nasf (Inactive) [ 10/Dec/15 ]

Well, we've got osd_iget(), which can check the bitmaps (at least in the case of a broken OI)?

I made a patch to check the inode table bitmap when ldiskfs_iget() returns the crashed directory inode. It shows that the inode table bitmap marks the returned inode as free.

Comment by nasf (Inactive) [ 10/Dec/15 ]

Also, I don't understand one thing: after the restore we know that the OI is stale, right? Then how is it possible that someone is able to use it before it's rebuilt?

It is the LFSCK locating the .lustre/lost+found/MDTxxxx object. Originally such locating is directly FID-based: osd_fid_lookup() finds the old OI mapping in the OI file, then ldiskfs_iget() returns the old inode, and the subsequent osd_check_lma() finds that the FID-in-LMA matches the given FID, so the inode is regarded as valid. But in fact it is a free inode.

Comment by Alex Zhuravlev [ 10/Dec/15 ]

I find it wrong that we let anyone use a stale OI. It's just a matter of time before we hit a similar issue again in the future.

Comment by nasf (Inactive) [ 10/Dec/15 ]

That is not the case. Originally, after the file-level backup/restore, OI scrub rebuilds the OI files, but such rebuilding is based on the original OI files. This means that during the rebuild the OI file may contain some valid OI mappings and some stale ones, and from the outside it is not easy to know which is which. So osd_fid_lookup() verifies via the FID-in-LMA check. Unfortunately, if the target inode is neither allocated nor re-initialized, that verification is fooled. That is why I made a patch to remove the OI files completely for the file-level backup/restore case; that avoids the trouble at its root.

Comment by Alex Zhuravlev [ 10/Dec/15 ]

IIRC, we use the FS's UUID (which is unique) to recognize the backup/restore case, which invalidates most of the OI entries.

Comment by nasf (Inactive) [ 11/Dec/15 ]

Yes, that is what we do. When the OSD initialises, it checks whether the FS UUID stored in the OI_scrub file matches the current FS UUID. If not, it is quite likely the system has been restored, and at that point almost all of the OI mappings in the OI files are invalid. But as the OI scrub runs, more and more OI mappings are updated in the OI files, so the OI files will contain both valid and invalid OI mappings.

The new patch removes the OI files completely when MDT file-level backup/restore is detected. That guarantees that during OI scrub, all the OI mappings in the OI files are valid.
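
A rough sketch of that behaviour (illustrative userspace code only, not the actual osd-scrub implementation):

#include <stdio.h>
#include <string.h>

/* Illustrative only: names do not match the real osd-scrub code. */
struct toy_oi_scrub_file { char fs_uuid[40]; };

static void remove_oi_files(void)
{
        /* drop the OI files so OI scrub rebuilds every mapping from scratch */
        printf("OI files removed\n");
}

static void toy_scrub_setup(struct toy_oi_scrub_file *sf, const char *cur_uuid)
{
        if (strcmp(sf->fs_uuid, cur_uuid) != 0) {
                /* UUID mismatch => MDT was restored from a file-level backup,
                 * so every existing OI mapping is suspect. */
                remove_oi_files();
                snprintf(sf->fs_uuid, sizeof(sf->fs_uuid), "%s", cur_uuid);
        }
}

int main(void)
{
        struct toy_oi_scrub_file sf = { "uuid-before-backup" };
        toy_scrub_setup(&sf, "uuid-after-reformat");
        return 0;
}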

Comment by Gerrit Updater [ 18/Dec/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17403/
Subject: LU-6895 lfsck: drop bad OI files after MDT file-level restore
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4c45115bdf2c5145ca3779c38a06e90024f013f0

Comment by Joseph Gmitter (Inactive) [ 18/Dec/15 ]

Landed for 2.8
