[LU-16336] LFSCK should fix inconsistencies caused by recovery abort

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Lustre 2.16.0

    Description

      In LU-16159, the update logs are canceled upon recovery, which will cause inconsistencies in the filesystem. LFSCK should be able to fix these inconsistencies.

      This is visible in tests like replay-single test_70b that sometimes leave an undeletable directory behind after test completion (LU-10616). There are various workarounds (e.g. LU-16335 to use "lfs rm_entry" to unlink the directory from the namespace, or EX-6692 to reformat the filesystem), but it would be much better to have LFSCK fix these directories and/or allow them to actually be unlinked from the filesystem.
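      For reference, the LU-16335 workaround mentioned above amounts to roughly the following sketch. It assumes `lfs` is on the PATH of a mounted Lustre client; the mount point and directory name are hypothetical placeholders, and the helper only prints the command unless `dry_run` is turned off. Note that `lfs rm_entry` removes only the name entry, so it hides the directory rather than repairing it.

```python
import shlex
import subprocess


def rm_entry_command(directory):
    """Build the argv that removes a directory's name entry via lfs rm_entry."""
    return ["lfs", "rm_entry", directory]


def unlink_entry(directory, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    argv = rm_entry_command(directory)
    print(shlex.join(argv))
    if not dry_run:
        # Needs a mounted Lustre client; raises CalledProcessError on failure.
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    # Hypothetical leftover directory from replay-single test_70b.
    unlink_entry("/mnt/lustre/d70b.replay-single", dry_run=True)
```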

          Activity

            laisiyao Lai Siyao added a comment -

            Yes, LU-14470 can help with the create failure case. Beyond that, we need to consider replay of other distributed transactions as well, e.g. migration and restripe. Besides, if client replay is also aborted, it may still leave dangling name entries.

            I haven't tested this yet, but IMHO LFSCK won't simply move dangling name entries to lost+found.


            adilger Andreas Dilger added a comment -

            We've seen that the rm_entry workaround to "hide" the bad entry is only temporary: running LFSCK on the filesystem restores the broken entry to .lustre/lost+found/<fsname>-MDT0000, where it is again undeletable.

            We either need to be able to delete such a directory with missing stripes using "rmdir" if we are sure the MDT is available but the stripe is missing, or have LFSCK fix the missing stripe in the directory so that it can be removed normally.

            Is it possible that patch https://review.whamcloud.com/47385 "LU-14470 dne: striped mkdir replay by request" will avoid such recovery failures by allowing the client to recover the broken directory even when the MDT recovery is aborted?
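            To check whether a given filesystem shows the behaviour described in this comment, one could start a namespace LFSCK on the MDT and then read its status, roughly as sketched below. This is only an illustration: the filesystem name `testfs` is a hypothetical placeholder, the commands are meant to be run on the MDS node, and the helper only builds and prints them rather than executing anything.

```python
import shlex


def lfsck_namespace_commands(fsname, mdt_index=0):
    """Return the lctl commands to start a namespace LFSCK and read its status."""
    # Lustre target names use a four-digit hexadecimal index, e.g. MDT000a.
    device = f"{fsname}-MDT{mdt_index:04x}"
    return [
        ["lctl", "lfsck_start", "-M", device, "-t", "namespace"],
        ["lctl", "get_param", "-n", f"mdd.{device}.lfsck_namespace"],
    ]


if __name__ == "__main__":
    # "testfs" is an assumed filesystem name, not taken from this ticket.
    for argv in lfsck_namespace_commands("testfs"):
        print(shlex.join(argv))
```

            Once the status output reports that the scan has completed, the restored entry can be looked for under the lost+found path quoted above and removal with "rmdir" retried.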

            People

              laisiyao Lai Siyao
              laisiyao Lai Siyao