[LU-17412] lustre snapshot: write barrier stuck at "failed" state - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: Upstream
Labels:
- ZFS
- snapshots

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

When we create a Lustre snapshot (lctl snapshot_create), a global write barrier is used internally to avoid taking an inconsistent snapshot of the filesystem.

Creating a snapshot after mounting another snapshot puts the barrier into a "failed" state. This state cannot be cleared until the MGS for the actual filesystem is remounted. Until then, any operation involving the barrier (lctl snapshot_{create,destroy}, etc.) fails.

# lctl barrier_stat testfs
state: failed
timeout: 0 seconds

# lctl barrier_rescan testfs
Fail to rescan barrier bitmap for testfs: Invalid argument
# lctl barrier_thaw testfs
Fail to thaw barrier for testfs: Invalid argument
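
For scripted checks, the barrier state can be pulled out of the barrier_stat output shown above. The following is a minimal sketch, not part of the original report: the helper names barrier_state and barrier_is_failed are hypothetical, and it assumes the output keeps the "state: <value>" layout seen in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: read `lctl barrier_stat` output on stdin and print
# the value of the "state:" field (e.g. "failed").
barrier_state() {
    awk -F': *' '$1 == "state" { print $2; exit }'
}

# Hypothetical wrapper: succeeds (exit 0) when the barrier of filesystem $1
# reports "failed". Assumes lctl is in PATH and this runs on the MGS node.
barrier_is_failed() {
    [ "$(lctl barrier_stat "$1" 2>/dev/null | barrier_state)" = "failed" ]
}
```

On the MGS, something like `barrier_is_failed testfs && echo "barrier stuck"` could then be used to detect the condition before attempting further snapshot operations.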

Bisecting through recent commits on the master branch shows that the commit below introduced the issue; I am not able to reproduce it without this commit.

    LU-17142 mgc: reconnection without pinger

Overall, it looks like the issue is due to LU-17142.

I'm able to consistently reproduce this using the attached script (reproducer.sh).
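
The contents of reproducer.sh are not shown here, so the following is only a sketch of the sequence described above (create a snapshot, mount it, create a second snapshot), not the attached script. The snapshot names snap1 and snap2, the function name, and the LCTL override variable are assumptions for illustration; it assumes a ZFS-backed filesystem with snapshots already configured and is meant to run on the MGS.

```shell
#!/bin/sh
# Allow the lctl binary to be overridden (e.g. LCTL=echo for a dry run).
LCTL=${LCTL:-lctl}

# Hypothetical reproduction sketch for filesystem $1.
reproduce_barrier_failure() {
    fsname=$1
    # Step 1: create and mount a first snapshot (names are illustrative).
    "$LCTL" snapshot_create -F "$fsname" -n snap1 || return 1
    "$LCTL" snapshot_mount -F "$fsname" -n snap1 || return 1
    # Step 2: create a second snapshot while the first is mounted. Per the
    # description, this is the step that leaves the barrier in "failed".
    "$LCTL" snapshot_create -F "$fsname" -n snap2
    # Step 3: report the resulting barrier state.
    "$LCTL" barrier_stat "$fsname"
}
```

Running `LCTL=echo` followed by `reproduce_barrier_failure testfs` prints the commands instead of executing them, which is a safe way to review the sequence before trying it on a test system.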

Attachments

reproducer.sh
0.4 kB
10/Jan/24 4:03 PM

Issue Links

is related to

LU-17142 MGC long time connection

Resolved

People

Assignee: Akash B

Reporter: Akash B

Votes: 0

Watchers: 7

Dates

Created: 10/Jan/24 4:05 PM

Updated: 11/Sep/24 4:53 PM