sanity-lfsck test_30: f0 is not recovered

Details

    • Bug
    • Resolution: Fixed
    • Minor

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/84000ede-1df1-4f8f-90a0-44f5afc1ea05

      test_30 failed with the following error:

      stat: cannot stat '/mnt/lustre/d30.sanity-lfsck/foo/f0': No such file or directory
      (18) f0 is not recovered
      

      This only started failing on 2021-04-08, so is very likely a regression due to a recent landing.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_30 - (18) f0 is not recovered

          Activity

            [LU-14600] sanity-lfsck test_30: f0 is not recovered

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43352
            Subject: LU-14600 misc: update to e2fsprogs-1.45.6.wc7
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1538b8f6d8c31143e059dc95c3724d01d9f93a13

            Comment above added by Gerrit Updater (gerrit).

            Fixed in 1.45.6.wc7

            Comment above added by Andreas Dilger (adilger).

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43335/
            Subject: LU-14600 e2fsck: trusted.link unref inode test case
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set:
            Commit: f46cb5c041147772639c56d993a4313e2655399d

            Comment above added by Gerrit Updater (gerrit).

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43335
            Subject: LU-14600 e2fsck: trusted.link unref inode test case
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 1
            Commit: e12d23d0241a1d09e05c1ef129f201c8c1515ffa

            Comment above added by Gerrit Updater (gerrit).

            This patch appears to have fixed the problem - all four review-dne-part-2 sanity-lfsck runs started after 4am MT have passed.

            What is still needed here is an e2fsck test case for this - unreferenced inodes with xattrs that need to be relinked to lost+found.

            Comment above added by Andreas Dilger (adilger).

            Andreas Dilger (adilger@whamcloud.com) merged in patch https://review.whamcloud.com/43324/
            Subject: LU-14600 e2fsck: check trusted.link after linking inode
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set:
            Commit: 87164b117be3bdeb1becf1960b80687637eda08f

            Comment above added by Gerrit Updater (gerrit).

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43324
            Subject: LU-14600 e2fsck: check trusted.link after linking inode
            Project: tools/e2fsprogs
            Branch: master-lustre
            Current Patch Set: 2
            Commit: b83f196e93f3cb982f720aeced810cbf650cde04

            Comment above added by Andreas Dilger (adilger).

            The only other common element among the test sessions is e2fsprogs, and it appears that the landing of patch https://review.whamcloud.com/43169 "LU-11446 e2fsck: check trusted.link when fixing nlink" has caused this test to start failing.

            The passing sessions report "e2fsck 1.45.6.wc5 (09-Feb-2021)" and the failing ones report "e2fsck 1.45.6.wc6 (09-Apr-2021)".
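            The version check above can be sketched in a few lines of Python. This is only an illustration: the regular expression is an assumption based on the two banner strings quoted in this comment, and `parse_banner` is a hypothetical helper, not part of e2fsprogs or the test framework.

```python
import re

# Banner format assumed from the strings quoted in this ticket,
# e.g. "e2fsck 1.45.6.wc5 (09-Feb-2021)".
BANNER_RE = re.compile(r"e2fsck (\d+)\.(\d+)\.(\d+)\.wc(\d+) \(([^)]+)\)")


def parse_banner(banner):
    """Return ((major, minor, patch, wc), release_date) for an e2fsck banner."""
    match = BANNER_RE.search(banner)
    if match is None:
        raise ValueError("unrecognised e2fsck banner: %r" % banner)
    version = tuple(int(part) for part in match.groups()[:4])
    return version, match.group(5)


passing = parse_banner("e2fsck 1.45.6.wc5 (09-Feb-2021)")
failing = parse_banner("e2fsck 1.45.6.wc6 (09-Apr-2021)")

# Tuples compare element by element, so the wc suffix decides here.
print(passing[0] < failing[0])  # True
```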

            Reading the test description for sanity-lfsck.sh::test_30() makes it clear that this is the cause, because the fix to e2fsck is specifically to avoid the entry being moved to lost+found:

            Inject failure stub on MDT0 to simulate the case that
            directory d0 has no linkEA entry, then the LFSCK will
            move it into .lustre/lost+found/MDTxxxx/ later.
            :
             Pass 4: Checking reference counts
            -Unattached inode 183
            -Connect to /lost+found? yes
            -
            -Inode 183 ref count is 2, should be 1.  Fix? yes
            -
             Unattached inode 187
             Connect to /lost+found? yes
             
             Inode 187 ref count is 2, should be 1.  Fix? yes
            
            -Unattached inode 192
            -Connect to /lost+found? yes
            -
            -Inode 192 ref count is 2, should be 1.  Fix? yes
            -
            -Unattached inode 193
            -Connect to /lost+found? yes
            -
            -Inode 193 ref count is 2, should be 1.  Fix? yes
            -
             Unattached inode 199
             Connect to /lost+found? yes
             
             Inode 199 ref count is 2, should be 1.  Fix? yes
             
             Inode 20106 ref count is 1, should be 2.  Fix? yes
             
             Inode 20108 ref count is 3, should be 2.  Fix? yes
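            To make the difference between the two e2fsck logs easier to see, the following is a minimal sketch that extracts the "Unattached inode" numbers from Pass 4 output and reports which inodes appear in one log but not the other. The two sample logs are abbreviated reconstructions from the diff above, assuming the usual convention that "-" lines exist only in the first log; `unattached_inodes` and `only_in_first` are hypothetical helper names.

```python
import re

# Matches the Pass 4 message exactly as it appears in the log above.
UNATTACHED_RE = re.compile(r"^Unattached inode (\d+)$")


def unattached_inodes(e2fsck_output):
    """Collect inode numbers that e2fsck reported as unattached."""
    inodes = set()
    for line in e2fsck_output.splitlines():
        match = UNATTACHED_RE.match(line.strip())
        if match:
            inodes.add(int(match.group(1)))
    return inodes


def only_in_first(first_log, second_log):
    """Inodes reported as unattached in first_log but not in second_log."""
    return sorted(unattached_inodes(first_log) - unattached_inodes(second_log))


# Abbreviated logs reconstructed from the diff (assumption: "-" lines
# belong to the first log only, unmarked lines belong to both).
FIRST_LOG = """Pass 4: Checking reference counts
Unattached inode 183
Unattached inode 187
Unattached inode 192
Unattached inode 193
Unattached inode 199
"""
SECOND_LOG = """Pass 4: Checking reference counts
Unattached inode 187
Unattached inode 199
"""

print(only_in_first(FIRST_LOG, SECOND_LOG))  # [183, 192, 193]
```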
            
            Comment above added by Andreas Dilger (adilger).

            This test only started failing on 2021-04-08, and it is now a 100% failure for review-dne-part-2 and full sessions on both master and b2_12 (the review-dne-zfs-part-2 sessions are passing because this test is ldiskfs-only).

            There were several patches landed to b2_12 on 2021-04-06:

            f735003c0f LU-14355 ptlrpc: do not output error when imp_sec is freed
            0596a16841 LU-12506 changelog: support large number of MDT
            7f04890a1b LU-13609 mgs: fix config_log buffer handling
            0850c7b14a LU-13649 mdd: orhpan cleanup fix
            5610ef9a7a LU-1538 tests: standardize test script init - sanity
            7531c5d25c LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
            2fd278af4c LU-11518 ldlm: lru code cleanup
            eaee7c3cd6 LU-11518 osc: cancel osc_lock list traversal once found the lock is being used 
            

            Patches landed to master on 2021-04-06 are:

            622e4c6e04 LU-14547 test: skip sanityn 109 for local setup
            14a1102268 LU-14552 ptlrpc: NULL pointer dereference in ptlrpc_watchdog_fire
            f9d837b479 LU-14540 o2iblnd: Use REMOTE_DROPPED for ECONNREFUSED
            3f8a6fd7d6 LU-14538 gss: make namespace optional in lgss_keyring
            9cc7128b9b LU-14522 ldlm: reprocess locks if enqueue failed
            1d3c585194 LU-14487 lustre: remove references to Sun Trademark.
            642682a39e LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
            f37bce8a57 LU-14119 osd: add mount option "resetoi"
            99d00b97ef LU-14119 osd: delete stale OI mapping entry
            f5136e8195 LU-14119 osd-zfs: enable LUDA_VERIFY
            bf47526261 LU-14119 mdc: set fid2path RPC interruptible
            771308ada3 LU-14291 ptlrpc: format UPDATE messages in server-only code
            67d17dd590 LU-14195 libcfs: switch to kfree_sensitive
            d7249d9d70 LU-13783 libcfs: provide fallback kallsyms_lookup_name()
            3d101645a5 LU-14132 lod: do not initialize sub llogs twice
            00141b1a74 LU-11776 utils: add support lfs find with mdt hash flag
            4126fbb30c LU-13397 lfs: mirror resync to keep sparseness
            77f5bb4dac LU-6142 lustre: convert IFTODT to S_DT
            f38f09e02a LU-14090 mgs: no local logs flag
            2a34dc95bd LU-12142 clio: fix hang on urgent cached pages
            1058867c00 LU-12142 readahead: limit over reservation
            b4391fcdaf LU-10632 tests: recovery-small test_26 idle_timeout
            

            So the only patch common to both branches is the LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3] patch (an update from kernel 4.18.0-240.1.1.el8), but that only affects the client for b2_12 testing.
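            The comparison of the two landing lists can be reproduced mechanically. The sketch below assumes the lists are in the one-line "hash LU-nnnn subject" form shown above and intersects the ticket numbers; the sample input is an abbreviated subset of the two lists, and `tickets` and `common_tickets` are hypothetical helper names.

```python
import re

# One line per commit: abbreviated hash, ticket number, then the subject.
TICKET_RE = re.compile(r"^\s*[0-9a-f]+\s+(LU-\d+)\s")


def tickets(oneline_log):
    """Return the set of LU ticket numbers found in a one-line commit list."""
    found = set()
    for line in oneline_log.splitlines():
        match = TICKET_RE.match(line)
        if match:
            found.add(match.group(1))
    return found


def common_tickets(first_log, second_log):
    """Tickets that have a commit in both lists."""
    return sorted(tickets(first_log) & tickets(second_log))


# Abbreviated subsets of the b2_12 and master lists quoted above.
B2_12 = """f735003c0f LU-14355 ptlrpc: do not output error when imp_sec is freed
7531c5d25c LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
2fd278af4c LU-11518 ldlm: lru code cleanup
"""
MASTER = """622e4c6e04 LU-14547 test: skip sanityn 109 for local setup
642682a39e LU-14450 kernel: kernel update RHEL8.3 [4.18.0-240.15.1.el8_3]
f37bce8a57 LU-14119 osd: add mount option "resetoi"
"""

print(common_tickets(B2_12, MASTER))  # ['LU-14450']
```

            Matching on the ticket number rather than the commit hash matters here, because the same change has a different hash on each branch (7531c5d25c on b2_12, 642682a39e on master).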

            Comparing the test environment for the last passing and first failing run on b2_12 shows that the server kernel version is different:
            2021-04-07 05:11:32 (https://testing.whamcloud.com/test_sessions/b4203201-088d-47c8-87be-e08ef8e31cf1): pass, kernel version 3.10.0-1160.15.2.el7_lustre.x86_64
            2021-04-09 22:57:14 (https://testing.whamcloud.com/test_sessions/bb45aa98-1c36-492d-a0f6-4b61135fae40): fail, kernel version 3.10.0-1160.21.1.el7_lustre.x86_64

            However, on master, both the passing and failing runs are using the same RHEL8 kernel on the servers:
            2021-04-08 04:23:19 (https://testing.whamcloud.com/test_sessions/e72881e6-7c88-4fe3-888c-a431d2ad5810): pass, kernel version 4.18.0-240.15.1.el8_lustre.x86_64
            2021-04-08 11:00:47 (https://testing.whamcloud.com/test_sessions/94bf8f9f-b129-46aa-bdde-8405a3216f63): fail, kernel version 4.18.0-240.15.1.el8_lustre.x86_64

            So it looks like this is somehow caused by a test environment change that happened on 2021-04-08 between 04:23 and 11:00.
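            Bracketing the change between the last passing and first failing session can be sketched as follows. The two records are the master sessions listed above; the (timestamp, passed) tuple layout and the `regression_window` helper are assumptions made for illustration, not a Maloo interface.

```python
from datetime import datetime


def regression_window(sessions):
    """Return (last_pass, first_fail) for the first pass-to-fail transition.

    sessions is an iterable of (timestamp, passed) tuples in any order.
    Returns None if no failing session follows a passing one.
    """
    last_pass = None
    for when, passed in sorted(sessions):
        if passed:
            last_pass = when
        elif last_pass is not None:
            return last_pass, when
    return None


# The two master sessions quoted above.
master_sessions = [
    (datetime(2021, 4, 8, 11, 0, 47), False),
    (datetime(2021, 4, 8, 4, 23, 19), True),
]

start, end = regression_window(master_sessions)
print(start, end)  # 2021-04-08 04:23:19 2021-04-08 11:00:47
```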

            Comment above added by Andreas Dilger (adilger).

            eaujames Etienne Aujames added a comment - Seen on b2_12:
            https://testing.whamcloud.com/test_sessions/0b38adbc-8f78-4b49-8163-19007be5e8c7
            https://testing.whamcloud.com/test_sessions/eb8ea52e-1afe-4364-956f-1d424be88c97

            People

              adilger Andreas Dilger
              maloo Maloo