[LU-17412] lustre snapshot: write barrier stuck at "failed" state Created: 10/Jan/24  Updated: 11/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Akash B Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: ZFS, snapshots

Attachments: File reproducer.sh    
Issue Links:
Related
is related to LU-17142 MGC long time connection Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When we create a Lustre snapshot (lctl snapshot_create), a global write barrier is used internally to avoid an inconsistent snapshot of the filesystem.

Creating a snapshot after mounting another snapshot causes the barrier to get into a "failed" state. This state cannot be cleared until the MGS remounts the actual filesystem. Any operation involving the barrier (lctl snapshot_{create, destroy}, etc.) fails because of this.

# lctl barrier_stat testfs
state: failed
timeout: 0 seconds

# lctl barrier_rescan testfs
Fail to rescan barrier bitmap for testfs: Invalid argument
# lctl barrier_thaw testfs
Fail to thaw barrier for testfs: Invalid argument
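
The check above can also be scripted, for example to test the barrier before attempting a snapshot operation. This is only a sketch: it assumes barrier_stat prints a "state: <value>" line exactly as shown above, and the LCTL variable and the helper names barrier_state and barrier_failed are illustrative, not part of Lustre.

```shell
# Command used to reach lctl; overridable for testing (illustrative variable).
LCTL="${LCTL:-lctl}"

# barrier_state FSNAME: print the value of the "state:" line from barrier_stat.
barrier_state() {
    "$LCTL" barrier_stat "$1" | awk -F': *' '$1 == "state" { print $2 }'
}

# barrier_failed FSNAME: exit status 0 when the barrier reports "failed".
barrier_failed() {
    [ "$(barrier_state "$1")" = "failed" ]
}

# Example use (not run here):
#   barrier_failed testfs && echo "barrier is in failed state"
```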

Bisecting through recent commits on the master branch shows that the commit below causes the issue; I am not able to reproduce it without this commit.

    LU-17142 mgc: reconnection without pinger

Overall, it looks like the issue is due to LU-17142.

I'm able to consistently reproduce this using the attached script (reproducer.sh).



 Comments   
Comment by Andreas Dilger [ 10/Jan/24 ]

aboyko, any ideas here?

Comment by Alexander Boyko [ 11/Jan/24 ]

The situation is as follows:
lctl snapshot_mount xxx - starts the MGC reconnection, which leads to the MGS locks being dropped.
lctl snapshot_create xxx -> lctl barrier_freeze -> the MGS sends a glimpse AST to the MGCs, finds no locks, and treats this as an error.
Clients need some time to enqueue MGS locks.
A simple 3-5 second delay between snapshot_mount and snapshot_create helps.
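
A minimal sketch of that workaround, assuming the snapshots are addressed with the -F (fsname) and -n (snapshot name) options of lctl. The function name mount_then_create and the LCTL and SETTLE_SECS variables are illustrative, and the 5-second default is simply the upper end of the range mentioned above, not a tuned value.

```shell
# Command used to reach lctl; overridable for testing (illustrative variable).
LCTL="${LCTL:-lctl}"
# Seconds to wait so the MGCs can re-enqueue their MGS locks (assumed value).
SETTLE_SECS="${SETTLE_SECS:-5}"

# mount_then_create FSNAME MOUNTED_SNAPSHOT NEW_SNAPSHOT
# Mounts one snapshot, waits, then creates another snapshot.
mount_then_create() {
    fsname=$1
    mounted=$2
    new=$3
    "$LCTL" snapshot_mount -F "$fsname" -n "$mounted" || return 1
    sleep "$SETTLE_SECS"
    "$LCTL" snapshot_create -F "$fsname" -n "$new"
}

# Example use (not run here):
#   mount_then_create testfs snap1 snap2
```

A fixed delay only makes the race less likely; it does not confirm that the locks have actually been re-enqueued.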

Generated at Sat Feb 10 03:35:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.