  1. Lustre
  2. LU-17412

lustre snapshot: write barrier stuck at "failed" state

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Upstream
    • 3

    Description

      When we create a Lustre snapshot (lctl snapshot_create), a global write barrier is used internally to avoid taking an inconsistent snapshot of the filesystem.

      Creating a snapshot after mounting another snapshot puts the barrier into a "failed" state. This state cannot be cleared until the MGS of the actual filesystem is remounted. Until then, any operation that involves the barrier (lctl snapshot_{create,destroy}, etc.) fails.

      # lctl barrier_stat testfs
      state: failed
      timeout: 0 seconds
      
      # lctl barrier_rescan testfs
      Fail to rescan barrier bitmap for testfs: Invalid argument
      # lctl barrier_thaw testfs
      Fail to thaw barrier for testfs: Invalid argument
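
A script that drives snapshots could check the barrier state before attempting any snapshot operation. The following is a minimal sketch, assuming the `lctl barrier_stat` output has the two-line `state:` / `timeout:` layout shown above; the helper name `barrier_state` is hypothetical and not part of Lustre.

```shell
#!/bin/bash
# Hypothetical helper (not part of Lustre): read `lctl barrier_stat`
# output on stdin and print the value of the "state:" field.
barrier_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Example with the output captured in this ticket. On a live system the
# input would instead come from: lctl barrier_stat testfs
sample='state: failed
timeout: 0 seconds'

if [ "$(printf '%s\n' "$sample" | barrier_state)" = "failed" ]; then
    echo "barrier is in the failed state"
fi
```

In the failed state, barrier_rescan and barrier_thaw return "Invalid argument" as shown above, so a check like this can only detect the condition, not clear it.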

      Bisecting through recent commits on the master branch shows that the commit below causes the issue; I am not able to reproduce the issue without this commit.

          LU-17142 mgc: reconnection without pinger

      Overall, it looks like the issue is due to LU-17142.

      I'm able to reproduce this consistently using this script (reproducer.sh).
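
The contents of reproducer.sh are not shown here. The following is only a sketch of the sequence described in this ticket (create a snapshot, mount it, create a second snapshot, then query the barrier), not the attached script. The snapshot names `snap1` and `snap2`, the `FSNAME` and `DRY_RUN` variables, and the `run` and `reproduce` helpers are assumptions for illustration. By default it only prints the commands; running them for real assumes a snapshot-capable Lustre setup and execution on the MGS node.

```shell
#!/bin/bash
# Hypothetical sketch of the reproduction sequence; NOT the attached reproducer.sh.
FSNAME=${FSNAME:-testfs}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = actually run them

# Print a command and, unless in dry-run mode, execute it.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

reproduce() {
    run lctl snapshot_create -F "$FSNAME" -n snap1   # first snapshot
    run lctl snapshot_mount  -F "$FSNAME" -n snap1   # mount it
    run lctl snapshot_create -F "$FSNAME" -n snap2   # second create, after the mount
    run lctl barrier_stat "$FSNAME"                  # per this ticket: "state: failed"
}

reproduce
```

Setting DRY_RUN=0 executes the commands instead of printing them.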

            People

              wc-triage WC Triage
              akash-b Akash B