Lustre / LU-17412

lustre snapshot: write barrier stuck at "failed" state


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • Upstream
    • 3

      When a Lustre snapshot is created (lctl snapshot_create), a global write barrier is used internally so that the snapshot of the filesystem is consistent.

      Creating a snapshot after mounting another snapshot causes the barrier to enter the "failed" state. This state cannot be cleared until the MGS of the actual (non-snapshot) filesystem is remounted. Until then, every operation that involves the barrier fails (lctl snapshot_create, lctl snapshot_destroy, etc.).
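
The sequence described above can be sketched as a small shell function. This is only an illustration of the order of operations, not the reporter's reproducer.sh (its contents are not shown here). The function name `repro_barrier_failure`, the snapshot names `snap_a` and `snap_b`, and the `LCTL` override variable are hypothetical, and the sketch assumes the snapshot feature is already configured and that it runs on the MGS node.

```shell
#!/bin/sh
# Hypothetical sketch of the reported sequence; NOT the reporter's reproducer.sh.
# Set LCTL="echo lctl" for a dry run that only prints the commands.
repro_barrier_failure() {
    fs=$1                       # filesystem name, e.g. testfs
    lctl_cmd=${LCTL:-lctl}      # left unquoted below so "echo lctl" splits into two words

    # 1. Create a first snapshot and mount it.
    $lctl_cmd snapshot_create -F "$fs" -n snap_a || return 1
    $lctl_cmd snapshot_mount -F "$fs" -n snap_a || return 1

    # 2. Create a second snapshot while the first one is mounted.
    #    Per the report, this is the step that leaves the barrier in "failed".
    $lctl_cmd snapshot_create -F "$fs" -n snap_b

    # 3. Inspect the barrier state afterwards.
    $lctl_cmd barrier_stat "$fs"
}
```

A dry run such as `LCTL="echo lctl" repro_barrier_failure testfs` prints the four commands without touching a filesystem.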

      # lctl barrier_stat testfs
      state: failed
      timeout: 0 seconds
      
      # lctl barrier_rescan testfs
      Fail to rescan barrier bitmap for testfs: Invalid argument
      # lctl barrier_thaw testfs
      Fail to thaw barrier for testfs: Invalid argument
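
For scripted checks, the "state:" field of the barrier_stat output shown above can be extracted with a small helper. This is a minimal sketch that assumes the output format is exactly as printed in this report; the function names `barrier_state` and `barrier_is_failed` are hypothetical and not part of Lustre.

```shell
#!/bin/sh
# Hypothetical helper: read `lctl barrier_stat <fsname>` output on stdin
# and print the value of the "state:" line (leading whitespace is ignored).
barrier_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Hypothetical helper: exit 0 only if the state read from stdin is "failed".
barrier_is_failed() {
    [ "$(barrier_state)" = "failed" ]
}
```

It could be used as `lctl barrier_stat testfs | barrier_is_failed && echo "barrier is stuck in failed state"`.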

      I bisected the recent commits on the master branch. The commit below introduces the issue, and I am not able to reproduce it without this commit.

          LU-17142 mgc: reconnection without pinger

      Overall, it looks like the issue is due to LU-17142.

      I can consistently reproduce this using this script (reproducer.sh).

            akash-b Akash B