Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17385

sanity-lfsck test_26a: only 3 of 4 MDTs are in completed

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e80cc085-ac08-4f47-b354-22551a7da132

      test_26a failed with the following error:

      (7) only 3 of 4 MDTs are in completed
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64
      servers: https://build.whamcloud.com/job/lustre-master-patchless/840 - 4.18.0-425.10.1.el8_7.x86_64

      <<Please provide additional information about the failure here>>

      First started on 2023-12-20 for full runs, may be related to recent patch landing.

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-lfsck test_26a - (7) only 3 of 4 MDTs are in completed

      Attachments

        Issue Links

          Activity

            [LU-17385] sanity-lfsck test_26a: only 3 of 4 MDTs are in completed

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57687
            Subject: LU-17385 tests: always_except sanity-lfsck/24
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 34660438769d955968b65816e97e5f45f404198a

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57687 Subject: LU-17385 tests: always_except sanity-lfsck/24 Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 34660438769d955968b65816e97e5f45f404198a

            If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10.

            thanks, let me try that..

            bzzz Alex Zhuravlev added a comment - If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10. thanks, let me try that..

            >cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty

            it is a different issue, LFSCK reconnects the orphan inode into the namespace but somehow loses the layout. If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10.

            zam Alexander Zarochentsev added a comment - >cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty it is a different issue, LFSCK reconnects the orphan inode into the namespace but somehow loses the layout. If I insert "sleep 10" after the "start mds1" before LFSCK start, the test completes successfully 10/10.
            bzzz Alex Zhuravlev added a comment - - edited

            23d still fail often locally:

            == sanity-lfsck test 23d: LFSCK can repair a dangling name entry to a remote object ========================================================== 04:50:54 (1712551854)
            Stopping /mnt/lustre-mds1 (opts:) on tmp.LJd1G0O9yk
            debugfs 1.46.2.wc5 (26-Mar-2022)
            
            Starting mds1: -o localrecov  /dev/mapper/mds1_flakey /mnt/lustre-mds1
            Started lustre-MDT0000
            cat: /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo: Bad address
            Started LFSCK on the device lustre-MDT0001: scrub namespace
            Started LFSCK on the device lustre-MDT0000: scrub layout
            cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty
             sanity-lfsck test_23d: @@@@@@ FAIL: file body has changed 
              Trace dump:
              = ./../tests/test-framework.sh:6975:error()
              = sanity-lfsck.sh:4252:test_23d()
              = ./../tests/test-framework.sh:7315:run_one()
              = ./../tests/test-framework.sh:7375:run_one_logged()
              = ./../tests/test-framework.sh:7201:run_test()
              = sanity-lfsck.sh:4254:main()
            Dumping lctl log to /tmp/ltest-logs/sanity-lfsck.test_23d.*.1712551889.log
            Dumping logs only on local client.
            FAIL 23d (36s)
            
            bzzz Alex Zhuravlev added a comment - - edited 23d still fail often locally: == sanity-lfsck test 23d: LFSCK can repair a dangling name entry to a remote object ========================================================== 04:50:54 (1712551854) Stopping /mnt/lustre-mds1 (opts:) on tmp.LJd1G0O9yk debugfs 1.46.2.wc5 (26-Mar-2022) Starting mds1: -o localrecov /dev/mapper/mds1_flakey /mnt/lustre-mds1 Started lustre-MDT0000 cat: /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo: Bad address Started LFSCK on the device lustre-MDT0001: scrub namespace Started LFSCK on the device lustre-MDT0000: scrub layout cmp: EOF on /mnt/lustre/d23d.sanity-lfsck/mdt1dir/foo which is empty sanity-lfsck test_23d: @@@@@@ FAIL: file body has changed Trace dump: = ./../tests/test-framework.sh:6975:error() = sanity-lfsck.sh:4252:test_23d() = ./../tests/test-framework.sh:7315:run_one() = ./../tests/test-framework.sh:7375:run_one_logged() = ./../tests/test-framework.sh:7201:run_test() = sanity-lfsck.sh:4254:main() Dumping lctl log to /tmp/ltest-logs/sanity-lfsck.test_23d.*.1712551889.log Dumping logs only on local client. FAIL 23d (36s)
            pjones Peter Jones added a comment -

            Merged for 2.16

            pjones Peter Jones added a comment - Merged for 2.16

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53591/
            Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 028ed64d90cfdeb908fb5574aacf2f71c259e2c2

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53591/ Subject: LU-17385 tests: sanity-lfsck 23d fix and enable Project: fs/lustre-release Branch: master Current Patch Set: Commit: 028ed64d90cfdeb908fb5574aacf2f71c259e2c2

            "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53591
            Subject: LU-17385 tests: sanity-lfsck 23d fix and enable
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 2f4f656947703d8b44a7ea49a8b2c84020591307

            gerrit Gerrit Updater added a comment - "Alexander Zarochentsev <alexander.zarochentsev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53591 Subject: LU-17385 tests: sanity-lfsck 23d fix and enable Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 2f4f656947703d8b44a7ea49a8b2c84020591307

            hongchao.zhang, thanks for the analysis, I am submitting a patch to test the idea.

            zam Alexander Zarochentsev added a comment - hongchao.zhang , thanks for the analysis, I am submitting a patch to test the idea.

            The sanity-lfsck test_23d is now skipped, so any patch that is fixing it needs to remove the always_except line.

            adilger Andreas Dilger added a comment - The sanity-lfsck test_23d is now skipped, so any patch that is fixing it needs to remove the always_except line.

            People

              zam Alexander Zarochentsev
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: