Description
When a Lustre snapshot is created (lctl snapshot_create), a global write barrier is used internally to avoid taking an inconsistent snapshot of the filesystem.
Creating a snapshot after mounting another snapshot causes the barrier to enter the "failed" state. This state cannot be cleared until the MGS remounts an actual filesystem. Until then, every operation that involves the barrier fails, including lctl snapshot_create and lctl snapshot_destroy.
# lctl barrier_stat testfs
state: failed
timeout: 0 seconds
# lctl barrier_rescan testfs
Fail to rescan barrier bitmap for testfs: Invalid argument
# lctl barrier_thaw testfs
Fail to thaw barrier for testfs: Invalid argument
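The failed state reported by lctl barrier_stat can also be checked from a script. The following is only a minimal sketch: the helper names barrier_state and barrier_is_failed are hypothetical (they are not part of Lustre), and the parsing assumes the "state: <value>" field format shown in the output above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Lustre) that parse `lctl barrier_stat`
# output supplied on stdin. Assumes a "state: <value>" field is present.

# Print the value of the first "state:" field found on stdin.
barrier_state() {
    sed -n 's/.*state: *\([A-Za-z_]*\).*/\1/p' | head -n 1
}

# Exit 0 when the barrier state read from stdin is "failed", non-zero otherwise.
barrier_is_failed() {
    [ "$(barrier_state)" = "failed" ]
}
```

A possible use is: lctl barrier_stat testfs | barrier_is_failed && echo "barrier is in failed state"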
I bisected recent commits on the master branch. The commit below introduces the issue, and I am not able to reproduce the issue without it.
LU-17142 mgc: reconnection without pinger
Overall, it looks like the issue is caused by LU-17142.
I am able to reproduce this consistently using this script (reproducer.sh).
Attachments
Issue Links
- is related to LU-17142 MGC long time connection (Resolved)