[LU-17412] lustre snapshot: write barrier stuck at "failed" state Created: 10/Jan/24 Updated: 11/Jan/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Akash B | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ZFS, snapshots | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
When we create a Lustre snapshot (lctl snapshot_create), a global write barrier is used internally to avoid an inconsistent snapshot of the filesystem. Creating a snapshot after mounting another snapshot causes the barrier to get into a "failed" state. This state cannot be cleared until an MGS remount of the actual filesystem. Any operation involving the barrier (lctl snapshot_{create,destroy}, etc.) fails because of this.

# lctl barrier_stat testfs
state: failed
timeout: 0 seconds
# lctl barrier_rescan testfs
Fail to rescan barrier bitmap for testfs: Invalid argument
# lctl barrier_thaw testfs
Fail to thaw barrier for testfs: Invalid argument

Bisecting through recent commits on the master branch shows that the commit below causes the issue; I'm not able to reproduce the issue without this commit.

LU-17142 mgc: reconnection without pinger

Overall looks like the issue is due to

I'm able to consistently reproduce this using this script (reproducer.sh |
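For anyone scripting around this problem, the check for the stuck state could be sketched as below. This is only an illustration, not the attached reproducer: it assumes the `lctl barrier_stat` output has the two-line `state:` / `timeout:` form shown in the description, and the helper names `barrier_state` and `barrier_is_failed` are hypothetical (they are not part of Lustre).

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): read `lctl barrier_stat <fsname>`
# output on stdin and print the value of the "state:" line.
barrier_state() {
    awk -F': *' '$1 == "state" { print $2; exit }'
}

# Hypothetical helper: exit status 0 only when the reported state is "failed".
barrier_is_failed() {
    [ "$(barrier_state)" = "failed" ]
}
```

On a node where the barrier can be queried, this might be used as `lctl barrier_stat testfs | barrier_is_failed && echo "barrier is stuck in failed state"`. Parsing stdin rather than calling lctl inside the helpers keeps them usable against saved output as well.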
| Comments |
| Comment by Andreas Dilger [ 10/Jan/24 ] |
|
aboyko, any ideas here? |
| Comment by Alexander Boyko [ 11/Jan/24 ] |
|
The situation is as follows: |