
Interop: recovery-small test 143 fails with 'MDD orphan cleanup thread not quit'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.16.0
    • Affects Version/s: Lustre 2.14.0
    • Severity: 3

    Description

      recovery-small test_143 has been failing in interop testing since 19 April 2020 with Lustre server versions < 2.13.53.62 and Lustre client versions >= 2.13.53.62. The failure does not occur with Lustre 2.12.5 and 2.12.6 servers, but it does occur with 2.13.0 servers.
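      The version range above can be checked mechanically. The following is a minimal sketch, assuming purely numeric dotted versions with at most four components, each below 256. The helper names `ver_code` and `in_reported_range` are hypothetical and are not taken from test-framework.sh. The check encodes only the server/client range from the first sentence; it does not model the observation that 2.12.5 and 2.12.6 servers pass.

```shell
#!/bin/bash
# Hypothetical helper: pack a dotted version "a.b.c.d" into one integer so
# that versions can be compared numerically. Missing components count as 0.
# Assumes every component is a plain decimal number below 256.
ver_code() {
	local a b c d
	IFS=. read -r a b c d <<< "$1"
	# 10# forces base 10 so a component like "08" is not read as octal.
	echo $(( (10#${a:-0} << 24) + (10#${b:-0} << 16) + (10#${c:-0} << 8) + 10#${d:-0} ))
}

# Version boundary quoted in the description.
CUTOFF_VERSION=2.13.53.62

# Succeeds when the server is older than the cutoff and the client is at
# or beyond it, which is the combination the description reports.
in_reported_range() {
	local server=$1 client=$2
	local cutoff
	cutoff=$(ver_code "$CUTOFF_VERSION")
	(( $(ver_code "$server") < cutoff && $(ver_code "$client") >= cutoff ))
}

# Example: a 2.13.0 server with a 2.14.0 client falls inside the range.
if in_reported_range 2.13.0 2.14.0; then
	echo "2.13.0 server / 2.14.0 client: inside the reported range"
fi
```

      Packing the components into a single integer keeps the comparison to one arithmetic test, at the cost of rejecting version strings that carry non-numeric suffixes.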

      Looking at the suite_log for the latest failure at https://testing.whamcloud.com/test_sets/8adef6a4-82c3-4286-811b-c3600c371395, we can see that an MDD orphan thread is still running after recovery completes:

      trevis-17vm4: trevis-17vm4.trevis.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
      trevis-17vm4: *.lustre-MDT0000.recovery_status status: COMPLETE
      CMD: trevis-17vm4 pgrep orph_.*-MDD | wc -l
      Waiting 90s for '0'
      CMD: trevis-17vm4 pgrep orph_.*-MDD | wc -l
      …
      CMD: trevis-17vm4 pgrep orph_.*-MDD | wc -l
      Update not seen after 90s: want '0' got '1'
       recovery-small test_143: @@@@@@ FAIL: MDD orphan cleanup thread not quit 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
        = /usr/lib64/lustre/tests/recovery-small.sh:3030:test_143()
      

          Activity

            [LU-14330] Interop: recovery-small test 143 fails with 'MDD orphan cleanup thread not quit'
            pjones Peter Jones added a comment -

            Merged for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56559/
            Subject: LU-14330 tests: wait for orphan thread to exit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4969259babf237bd361106019fab0ba0e28ee82d

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56559
            Subject: LU-14330 tests: wait for orphan thread to exit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e235d3c91c7dfafada4be69183241b0aba9b27e6

            adilger Andreas Dilger added a comment -

            This same issue hit once in conf-sanity test_32d:
            https://testing.whamcloud.com/test_sets/ed0b439b-9e97-4326-8717-9d4a5fb00735

            onyx-64vm8: Pool t32fs.interop created
            CMD: onyx-64vm8 pgrep orph_.*-MDD
             conf-sanity test_32d: @@@@@@ FAIL: MDD orphan cleanup thread not quit

            adilger Andreas Dilger added a comment -

            The recovery-small test_142 and test_143 were previously sanity test_811, but were moved to recovery-small because they are more recovery related and restarting the MDS in sanity is too slow.

            This was done in patch https://review.whamcloud.com/36602 "LU-12846 mdd: return error while delete failed" (commit v2_13_53-57-g688d5da6a8), which is why the failure was not seen with earlier versions (those test numbers did not exist before that commit).

            People

              Assignee: adilger Andreas Dilger
              Reporter: jamesanunez James Nunez (Inactive)