[LU-14600] sanity-lfsck test_30: f0 is not recovered Created: 09/Apr/21 Updated: 16/Apr/21 Resolved: 16/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84000ede-1df1-4f8f-90a0-44f5afc1ea05
test_30 failed with the following error:
stat: cannot stat '/mnt/lustre/d30.sanity-lfsck/foo/f0': No such file or directory
(18) f0 is not recovered
This only started failing on 2021-04-08, so it is very likely a regression due to a recent landing. |
| Comments |
| Comment by Andreas Dilger [ 09/Apr/21 ] |
|
What is very strange is that this failure started happening on several different branches on the same day (master, b2_12, b_es5_2, and b_es6_0), which makes it unlikely (though not impossible) that it was caused by the same patch landing on all four branches at once.
|
| Comment by Arshad Hussain [ 12/Apr/21 ] |
|
Seen on master. https://testing.whamcloud.com/test_sets/6fd26096-1e3c-4fbc-8685-d79e6f682975 |
| Comment by Etienne Aujames [ 13/Apr/21 ] |
|
Seen on b2_12: https://testing.whamcloud.com/test_sessions/0b38adbc-8f78-4b49-8163-19007be5e8c7 |
| Comment by Andreas Dilger [ 14/Apr/21 ] |
|
This test only started failing on 2021-04-08, and it is now a 100% failure for review-dne-part-2 and full sessions on both master and b2_12 (the review-dne-zfs-part-2 sessions are passing because this test is ldiskfs-only).
There were several patches landed to b2_12 on 2021-04-06:
f735003c0f LU-14355 ptlrpc: do not output error when imp_sec is freed
0596a16841 LU-12506 changelog: support large number of MDT
7f04890a1b LU-13609 mgs: fix config_log buffer handling
0850c7b14a LU-13649 mdd: orhpan cleanup fix
5610ef9a7a LU-1538 tests: standardize test script init - sanity
7531c5d25c LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
2fd278af4c LU-11518 ldlm: lru code cleanup
eaee7c3cd6 LU-11518 osc: cancel osc_lock list traversal once found the lock is being used
Patches landed to master on 2021-04-06 are:
622e4c6e04 LU-14547 test: skip sanityn 109 for local setup
14a1102268 LU-14552 ptlrpc: NULL pointer dereference in ptlrpc_watchdog_fire
f9d837b479 LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
3f8a6fd7d6 LU-14538 gss: make namespace optional in lgss_keyring
9cc7128b9b LU-14522 ldlm: reprocess locks if enqueue failed
1d3c585194 LU-14487 lustre: remove references to Sun Trademark.
642682a39e LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
f37bce8a57 LU-14119 osd: add mount option "resetoi"
99d00b97ef LU-14119 osd: delete stale OI mapping entry
f5136e8195 LU-14119 osd-zfs: enable LUDA_VERIFY
bf47526261 LU-14119 mdc: set fid2path RPC interruptible
771308ada3 LU-14291 ptlrpc: format UPDATE messages in server-only code
67d17dd590 LU-14195 libcfs: switch to kfree_sensitive
d7249d9d70 LU-13783 libcfs: provide fallback kallsyms_lookup_name()
3d101645a5 LU-14132 lod: do not initialize sub llogs twice
00141b1a74 LU-11776 utils: add support lfs find with mdt hash flag
4126fbb30c LU-13397 lfs: mirror resync to keep sparseness
77f5bb4dac LU-6142 lustre: convert IFTODT to S_DT
f38f09e02a LU-14090 mgs: no local logs flag
2a34dc95bd LU-12142 clio: fix hang on urgent cached pages
1058867c00 LU-12142 readahead: limit over reservation
b4391fcdaf LU-10632 tests: recovery-small test_26 idle_timeout
so the only patch common to these two branches is the kernel update (LU-14450 appears in both lists).
Comparing the test environment for the last passing and first failing run on b2_12 shows the server kernel version is different:
However, on master, both the passing and failing runs are using the same RHEL8 kernel on the servers:
so it looks like this is caused somehow by a test environment change that happened on 2021-04-08 between 04:23 and 11:00. |
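The comparison of the two landing lists can be reproduced mechanically. The following is a minimal sketch, not part of the original analysis: it pastes the two commit lists from the comment above into strings and intersects the LU ticket numbers. The names `B2_12`, `MASTER`, and `tickets` are illustrative. Matching is done on the ticket number rather than the commit hash, because the same change has a different hash on each branch.

```python
import re

# Commit lists copied from the comment above (hash, ticket, subject).
B2_12 = """\
f735003c0f LU-14355 ptlrpc: do not output error when imp_sec is freed
0596a16841 LU-12506 changelog: support large number of MDT
7f04890a1b LU-13609 mgs: fix config_log buffer handling
0850c7b14a LU-13649 mdd: orhpan cleanup fix
5610ef9a7a LU-1538 tests: standardize test script init - sanity
7531c5d25c LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
2fd278af4c LU-11518 ldlm: lru code cleanup
eaee7c3cd6 LU-11518 osc: cancel osc_lock list traversal once found the lock is being used
"""

MASTER = """\
622e4c6e04 LU-14547 test: skip sanityn 109 for local setup
14a1102268 LU-14552 ptlrpc: NULL pointer dereference in ptlrpc_watchdog_fire
f9d837b479 LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
3f8a6fd7d6 LU-14538 gss: make namespace optional in lgss_keyring
9cc7128b9b LU-14522 ldlm: reprocess locks if enqueue failed
1d3c585194 LU-14487 lustre: remove references to Sun Trademark.
642682a39e LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
f37bce8a57 LU-14119 osd: add mount option "resetoi"
99d00b97ef LU-14119 osd: delete stale OI mapping entry
f5136e8195 LU-14119 osd-zfs: enable LUDA_VERIFY
bf47526261 LU-14119 mdc: set fid2path RPC interruptible
771308ada3 LU-14291 ptlrpc: format UPDATE messages in server-only code
67d17dd590 LU-14195 libcfs: switch to kfree_sensitive
d7249d9d70 LU-13783 libcfs: provide fallback kallsyms_lookup_name()
3d101645a5 LU-14132 lod: do not initialize sub llogs twice
00141b1a74 LU-11776 utils: add support lfs find with mdt hash flag
4126fbb30c LU-13397 lfs: mirror resync to keep sparseness
77f5bb4dac LU-6142 lustre: convert IFTODT to S_DT
f38f09e02a LU-14090 mgs: no local logs flag
2a34dc95bd LU-12142 clio: fix hang on urgent cached pages
1058867c00 LU-12142 readahead: limit over reservation
b4391fcdaf LU-10632 tests: recovery-small test_26 idle_timeout
"""


def tickets(log):
    """Return the set of LU ticket IDs mentioned in a commit list."""
    return set(re.findall(r"\bLU-\d+\b", log))


# Tickets that landed on both branches on 2021-04-06.
common = tickets(B2_12) & tickets(MASTER)
print(sorted(common))
```

With the lists as quoted in this ticket, the intersection contains only LU-14450, the RHEL8.3 kernel update.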
| Comment by Andreas Dilger [ 14/Apr/21 ] |
|
The only other common element among the test sessions is e2fsprogs, and it appears that the landing of patch https://review.whamcloud.com/43169 "LU-11446 e2fsck: check trusted.link when fixing nlink" has caused this test to start failing. The passing sessions are reporting e2fsprogs with "e2fsck 1.45.6.wc5 (09-Feb-2021)" and failing ones report "e2fsck 1.45.6.wc6 (09-Apr-2021)".
Reading the test description for sanity-lfsck.sh::test_30() makes it clear that this is the cause, because the fix to e2fsck is specifically to avoid the entry being moved to lost+found:
    Inject failure stub on MDT0 to simulate the case that directory d0 has no linkEA entry, then the LFSCK will move it into .lustre/lost+found/MDTxxxx/ later.
     :
     Pass 4: Checking reference counts
    -Unattached inode 183
    -Connect to /lost+found? yes
    -
    -Inode 183 ref count is 2, should be 1. Fix? yes
    -
     Unattached inode 187
     Connect to /lost+found? yes
     Inode 187 ref count is 2, should be 1. Fix? yes
    -Unattached inode 192
    -Connect to /lost+found? yes
    -
    -Inode 192 ref count is 2, should be 1. Fix? yes
    -
    -Unattached inode 193
    -Connect to /lost+found? yes
    -
    -Inode 193 ref count is 2, should be 1. Fix? yes
    -
     Unattached inode 199
     Connect to /lost+found? yes
     Inode 199 ref count is 2, should be 1. Fix? yes
     Inode 20106 ref count is 1, should be 2. Fix? yes
     Inode 20108 ref count is 3, should be 2. Fix? yes |
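To see which inodes stopped being reattached to lost+found, the two e2fsck logs can be compared by inode number. This is a minimal sketch under one assumption: the lines marked "-" in the Pass 4 diff above come from the passing run (1.45.6.wc5) and are absent from the failing run (1.45.6.wc6). The `BEFORE` and `AFTER` strings are abbreviated excerpts rebuilt from that diff, and `unattached_inodes` is an illustrative helper, not part of e2fsprogs or the Lustre test framework.

```python
import re

# Abbreviated Pass 4 excerpts rebuilt from the diff above.
# Assumption: BEFORE is the passing run (wc5), AFTER the failing run (wc6).
BEFORE = """\
Pass 4: Checking reference counts
Unattached inode 183
Connect to /lost+found? yes
Unattached inode 187
Connect to /lost+found? yes
Unattached inode 192
Connect to /lost+found? yes
Unattached inode 193
Connect to /lost+found? yes
Unattached inode 199
Connect to /lost+found? yes
"""

AFTER = """\
Pass 4: Checking reference counts
Unattached inode 187
Connect to /lost+found? yes
Unattached inode 199
Connect to /lost+found? yes
"""


def unattached_inodes(log):
    """Return the inode numbers e2fsck reported as unattached."""
    return {int(n) for n in re.findall(r"^Unattached inode (\d+)$", log, re.M)}


# Inodes reattached to lost+found before the change but not after it.
no_longer_reattached = unattached_inodes(BEFORE) - unattached_inodes(AFTER)
print(sorted(no_longer_reattached))
```

For the excerpt quoted in this ticket, the difference is inodes 183, 192, and 193.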
| Comment by Andreas Dilger [ 15/Apr/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43324 |
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43324/ |
| Comment by Andreas Dilger [ 15/Apr/21 ] |
|
This patch appears to have fixed the problem - all four review-dne-part-2 sanity-lfsck runs started after 4am MT have passed. What is still needed here is an e2fsck test case for this - unreferenced inodes with xattrs that need to be relinked to lost+found. |
| Comment by Gerrit Updater [ 15/Apr/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43335 |
| Comment by Gerrit Updater [ 16/Apr/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43335/ |
| Comment by Andreas Dilger [ 16/Apr/21 ] |
|
Fixed in e2fsprogs 1.45.6.wc7 |
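When triaging similar failures, it can help to check the e2fsck banner reported by a test session against the release implicated here. The sketch below is illustrative only: it parses banners of the form quoted in this ticket (for example "e2fsck 1.45.6.wc6 (09-Apr-2021)") and flags only 1.45.6.wc6, the single release this ticket identifies as failing. The helpers `parse_banner` and `is_affected` are hypothetical names, and the sketch makes no claim about any other e2fsprogs version.

```python
import re

# Matches banners such as "e2fsck 1.45.6.wc6 (09-Apr-2021)".
BANNER_RE = re.compile(r"e2fsck (\d+(?:\.\d+)*)\.wc(\d+)")


def parse_banner(banner):
    """Return ((major, minor, patch), wc_release), or None if not recognized."""
    m = BANNER_RE.search(banner)
    if m is None:
        return None
    upstream = tuple(int(part) for part in m.group(1).split("."))
    return upstream, int(m.group(2))


def is_affected(banner):
    """True only for 1.45.6.wc6, the release this ticket reports as failing."""
    return parse_banner(banner) == ((1, 45, 6), 6)


for banner in ("e2fsck 1.45.6.wc5 (09-Feb-2021)",
               "e2fsck 1.45.6.wc6 (09-Apr-2021)"):
    print(banner, "->", "affected" if is_affected(banner) else "not flagged")
```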
| Comment by Gerrit Updater [ 16/Apr/21 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43352 |