[LU-14600] sanity-lfsck test_30: f0 is not recovered Created: 09/Apr/21  Updated: 16/Apr/21  Resolved: 16/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-11446 ldiskfs inodes nlink mismatch with DNE Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84000ede-1df1-4f8f-90a0-44f5afc1ea05

test_30 failed with the following error:

stat: cannot stat '/mnt/lustre/d30.sanity-lfsck/foo/f0': No such file or directory
(18) f0 is not recovered

This only started failing on 2021-04-08, so it is very likely a regression due to a recent landing.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-lfsck test_30 - (18) f0 is not recovered



 Comments   
Comment by Andreas Dilger [ 09/Apr/21 ]

What is very strange is that this failure started on the same day on several different branches (master, b2_12, b_es5_2, and b_es6_0), which makes it unlikely (though not impossible) that it was caused by the same patch landing on all of these branches at once.

Testing on 2021-04-09 looks like it has started passing again, so it is possible that this was some kind of date-related bug (unlikely, but it happened with LU-13314), or a strange hiccup in the test environment that only caused this particular test to fail that day.

Comment by Arshad Hussain [ 12/Apr/21 ]

Seen on master. https://testing.whamcloud.com/test_sets/6fd26096-1e3c-4fbc-8685-d79e6f682975

Comment by Etienne Aujames [ 13/Apr/21 ]

Seen on b2_12:
https://testing.whamcloud.com/test_sessions/0b38adbc-8f78-4b49-8163-19007be5e8c7
https://testing.whamcloud.com/test_sessions/eb8ea52e-1afe-4364-956f-1d424be88c97

Comment by Andreas Dilger [ 14/Apr/21 ]

This test only started failing on 2021-04-08, and it is now a 100% failure in review-dne-part-2 and full sessions for both master and b2_12 (the review-dne-zfs-part-2 sessions are passing because this test is ldiskfs-only).

There were several patches landed to b2_12 on 2021-04-06:

f735003c0f LU-14355 ptlrpc: do not output error when imp_sec is freed
0596a16841 LU-12506 changelog: support large number of MDT
7f04890a1b LU-13609 mgs: fix config_log buffer handling
0850c7b14a LU-13649 mdd: orhpan cleanup fix
5610ef9a7a LU-1538 tests: standardize test script init - sanity
7531c5d25c LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
2fd278af4c LU-11518 ldlm: lru code cleanup
eaee7c3cd6 LU-11518 osc: cancel osc_lock list traversal once found the lock is being used 

Patches landed to master on 2021-04-06 are:

622e4c6e04 LU-14547 test: skip sanityn 109 for local setup
14a1102268 LU-14552 ptlrpc: NULL pointer dereference in ptlrpc_watchdog_fire
f9d837b479 LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
3f8a6fd7d6 LU-14538 gss: make namespace optional in lgss_keyring
9cc7128b9b LU-14522 ldlm: reprocess locks if enqueue failed
1d3c585194 LU-14487 lustre: remove references to Sun Trademark.
642682a39e LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
f37bce8a57 LU-14119 osd: add mount option "resetoi"
99d00b97ef LU-14119 osd: delete stale OI mapping entry
f5136e8195 LU-14119 osd-zfs: enable LUDA_VERIFY
bf47526261 LU-14119 mdc: set fid2path RPC interruptible
771308ada3 LU-14291 ptlrpc: format UPDATE messages in server-only code
67d17dd590 LU-14195 libcfs: switch to kfree_sensitive
d7249d9d70 LU-13783 libcfs: provide fallback kallsyms_lookup_name()
3d101645a5 LU-14132 lod: do not initialize sub llogs twice
00141b1a74 LU-11776 utils: add support lfs find with mdt hash flag
4126fbb30c LU-13397 lfs: mirror resync to keep sparseness
77f5bb4dac LU-6142 lustre: convert IFTODT to S_DT
f38f09e02a LU-14090 mgs: no local logs flag
2a34dc95bd LU-12142 clio: fix hang on urgent cached pages
1058867c00 LU-12142 readahead: limit over reservation
b4391fcdaf LU-10632 tests: recovery-small test_26 idle_timeout

so the only patch common to these two branches is the LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3] patch (update from kernel 4.18.0-240.1.1.el8), but that only affects the client for b2_12 testing.
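
The comparison above amounts to intersecting the ticket numbers in the two landing lists. A minimal Python sketch, with the ticket IDs copied by hand from the lists in this comment (the variable names are illustrative only):

```python
# Ticket IDs of the patches landed on 2021-04-06, copied from the lists above.
b2_12_landed = [
    "LU-14355", "LU-12506", "LU-13609", "LU-13649",
    "LU-1538", "LU-14450", "LU-11518", "LU-11518",
]
master_landed = [
    "LU-14547", "LU-14552", "LU-14540", "LU-14538", "LU-14522",
    "LU-14487", "LU-14450", "LU-14119", "LU-14119", "LU-14119",
    "LU-14119", "LU-14291", "LU-14195", "LU-13783", "LU-14132",
    "LU-11776", "LU-13397", "LU-6142", "LU-14090", "LU-12142",
    "LU-12142", "LU-10632",
]

# Tickets that had a patch landed on both branches that day.
common = sorted(set(b2_12_landed) & set(master_landed))
print(common)
```

Matching on ticket number rather than commit hash is deliberate here, since the same change gets a different hash on each branch.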

Comparing the test environment for the last passing and first failing run on b2_12 shows the server kernel version is different:
[https://testing.whamcloud.com/test_sessions/b4203201-088d-47c8-87be-e08ef8e31cf1|2021-04-07 05:11:32]: pass Kernel version 3.10.0-1160.15.2.el7_lustre.x86_64
[https://testing.whamcloud.com/test_sessions/bb45aa98-1c36-492d-a0f6-4b61135fae40|2021-04-09 22:57:14]: fail Kernel version 3.10.0-1160.21.1.el7_lustre.x86_64

However, on master, both the passing and failing runs are using the same RHEL8 kernel on the servers:
[https://testing.whamcloud.com/test_sessions/e72881e6-7c88-4fe3-888c-a431d2ad5810|2021-04-08 04:23:19]: pass Kernel version 4.18.0-240.15.1.el8_lustre.x86_64
[https://testing.whamcloud.com/test_sessions/94bf8f9f-b129-46aa-bdde-8405a3216f63|2021-04-08 11:00:47]: fail Kernel version 4.18.0-240.15.1.el8_lustre.x86_64

so it looks like this is somehow caused by a test environment change that happened on 2021-04-08 between 04:23 and 11:00.

Comment by Andreas Dilger [ 14/Apr/21 ]

The only other common element among the test sessions is e2fsprogs, and it appears that the landing of patch https://review.whamcloud.com/43169 "LU-11446 e2fsck: check trusted.link when fixing nlink" has caused this test to start failing.

The passing sessions are reporting e2fsprogs with "e2fsck 1.45.6.wc5 (09-Feb-2021)" and failing ones report "e2fsck 1.45.6.wc6 (09-Apr-2021)".
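
The same elimination can be written down as a field-by-field comparison of a passing and a failing master session. This is only a sketch: the dictionaries are filled in by hand from the values quoted in this ticket, on the assumption that those sessions reported the e2fsck versions given above, the key names are made up for illustration, and nothing here queries Maloo.

```python
# Environment of a passing and a failing master session, transcribed by
# hand from this ticket (key names are hypothetical, not Maloo fields).
passing = {
    "server_kernel": "4.18.0-240.15.1.el8_lustre.x86_64",
    "e2fsck": "1.45.6.wc5 (09-Feb-2021)",
}
failing = {
    "server_kernel": "4.18.0-240.15.1.el8_lustre.x86_64",
    "e2fsck": "1.45.6.wc6 (09-Apr-2021)",
}


def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


print(changed_fields(passing, failing))
```

With the kernel identical on both sides, e2fsck is the only field left in the result.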

Reading the test description for sanity-lfsck.sh::test_30() makes it clear that this is the cause, because the fix to e2fsck is specifically to avoid the entry being moved to lost+found:

Inject failure stub on MDT0 to simulate the case that
directory d0 has no linkEA entry, then the LFSCK will
move it into .lustre/lost+found/MDTxxxx/ later.
:
 Pass 4: Checking reference counts
-Unattached inode 183
-Connect to /lost+found? yes
-
-Inode 183 ref count is 2, should be 1.  Fix? yes
-
 Unattached inode 187
 Connect to /lost+found? yes
 
 Inode 187 ref count is 2, should be 1.  Fix? yes

-Unattached inode 192
-Connect to /lost+found? yes
-
-Inode 192 ref count is 2, should be 1.  Fix? yes
-
-Unattached inode 193
-Connect to /lost+found? yes
-
-Inode 193 ref count is 2, should be 1.  Fix? yes
-
 Unattached inode 199
 Connect to /lost+found? yes
 
 Inode 199 ref count is 2, should be 1.  Fix? yes
 
 Inode 20106 ref count is 1, should be 2.  Fix? yes
 
 Inode 20108 ref count is 3, should be 2.  Fix? yes
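
To see which inodes are reattached in only one of the two e2fsck runs, the diff above can be split back into its two sides. A rough Python sketch, assuming the excerpt is a unified-style diff in which a leading "-" marks lines present only in the first output, a leading "+" marks lines present only in the second, and anything else is common to both. The sample text is an abbreviated copy of the excerpt and the function name is hypothetical:

```python
import re

# Abbreviated copy of the diff excerpt above (ref count lines omitted).
DIFF = """\
 Pass 4: Checking reference counts
-Unattached inode 183
-Connect to /lost+found? yes
 Unattached inode 187
 Connect to /lost+found? yes
-Unattached inode 192
-Connect to /lost+found? yes
-Unattached inode 193
-Connect to /lost+found? yes
 Unattached inode 199
 Connect to /lost+found? yes
"""

UNATTACHED = re.compile(r"Unattached inode (\d+)")


def unattached_by_side(diff_text):
    """Return (first, second): unattached inode numbers in each output."""
    first, second = [], []
    for line in diff_text.splitlines():
        match = UNATTACHED.search(line)
        if not match:
            continue
        inode = int(match.group(1))
        if not line.startswith("+"):
            first.append(inode)   # "-" and context lines are in the first output
        if not line.startswith("-"):
            second.append(inode)  # "+" and context lines are in the second output
    return first, second


first, second = unattached_by_side(DIFF)
# Inodes connected to lost+found in the first output only.
print(sorted(set(first) - set(second)))
```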

Comment by Andreas Dilger [ 15/Apr/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43324
Subject: LU-14600 e2fsck: check trusted.link after linking inode
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 2
Commit: b83f196e93f3cb982f720aeced810cbf650cde04

Comment by Gerrit Updater [ 15/Apr/21 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43324/
Subject: LU-14600 e2fsck: check trusted.link after linking inode
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: 87164b117be3bdeb1becf1960b80687637eda08f

Comment by Andreas Dilger [ 15/Apr/21 ]

This patch appears to have fixed the problem - all four review-dne-part-2 sanity-lfsck runs started after 4am MT have passed.

What is still needed here is an e2fsck test case for this - unreferenced inodes with xattrs that need to be relinked to lost+found.

Comment by Gerrit Updater [ 15/Apr/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43335
Subject: LU-14600 e2fsck: trusted.link unref inode test case
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set: 1
Commit: e12d23d0241a1d09e05c1ef129f201c8c1515ffa

Comment by Gerrit Updater [ 16/Apr/21 ]

Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43335/
Subject: LU-14600 e2fsck: trusted.link unref inode test case
Project: tools/e2fsprogs
Branch: master-lustre
Current Patch Set:
Commit: f46cb5c041147772639c56d993a4313e2655399d

Comment by Andreas Dilger [ 16/Apr/21 ]

Fixed in 1.45.6.wc7

Comment by Gerrit Updater [ 16/Apr/21 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43352
Subject: LU-14600 misc: update to e2fsprogs-1.45.6.wc7
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1538b8f6d8c31143e059dc95c3724d01d9f93a13
