Description
When we create a Lustre snapshot (lctl snapshot_create), a global write barrier is used internally to avoid taking an inconsistent snapshot of the filesystem.
Creating a snapshot after mounting another snapshot causes the barrier to get into a "failed" state. This state cannot be cleared without an MGS remount of the actual filesystem. Until then, any operation involving the barrier fails (lctl snapshot_create, lctl snapshot_destroy, etc.).
# lctl barrier_stat testfs
state: failed
timeout: 0 seconds
# lctl barrier_rescan testfs
Fail to rescan barrier bitmap for testfs: Invalid argument
# lctl barrier_thaw testfs
Fail to thaw barrier for testfs: Invalid argument
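For scripting a check around this symptom, here is a minimal Python sketch that parses text shaped like the barrier_stat output shown above and reports whether the barrier is in a bad state. It assumes the output is a series of "key: value" lines, as in the transcript. The helper names parse_barrier_stat and barrier_is_stuck are hypothetical and are not part of Lustre or lctl.

```python
def parse_barrier_stat(output: str) -> dict:
    """Parse 'key: value' lines (as printed by barrier_stat above) into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'key: value' shape
            fields[key.strip()] = value.strip()
    return fields


def barrier_is_stuck(output: str, bad_states=("failed",)) -> bool:
    """Return True if the reported state is one of bad_states (an assumed set)."""
    return parse_barrier_stat(output).get("state") in bad_states


if __name__ == "__main__":
    sample = "state: failed\ntimeout: 0 seconds\n"
    print(parse_barrier_stat(sample))
    print(barrier_is_stuck(sample))
```

The set of states treated as bad is a parameter, so a caller could pass ("failed", "expired") if both should count.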
Bisecting through recent commits on the master branch points to the commit below. With it the issue reproduces, and without it I am not able to reproduce the issue.
LU-17142 mgc: reconnection without pinger
Overall, it looks like the issue is due to LU-17142.
I'm able to reproduce this consistently using this script (reproducer.sh).
Issue Links
- is related to LU-17142 MGC long time connection (Resolved)
test_801a log:
The current failure in test_801a occurs because the barrier is expected to be in a "failed" state when OBD_FAIL_BARRIER_FAILURE is set.
The purpose of this patch is to retry the operation (barrier_freeze) until the barrier timer expires, and only then fail. Given the test snippet for OBD_FAIL_BARRIER_FAILURE, we expect the barrier state to be "failed", and the operation to be retried until timer expiry.
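To make the intended behaviour concrete, here is a minimal Python sketch of "retry until the timer expires, then fail". It is an illustration only, not the patch itself. The names retry_until_deadline and try_freeze, the polling interval, and the injectable clock and sleep arguments are all assumptions made for the example.

```python
import time
from typing import Callable


def retry_until_deadline(try_freeze: Callable[[], bool],
                         timeout_s: float,
                         interval_s: float = 1.0,
                         clock: Callable[[], float] = time.monotonic,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call try_freeze() repeatedly until it succeeds or timeout_s has elapsed.

    Returns True on success and False once the deadline has passed.
    try_freeze is a hypothetical stand-in for the barrier_freeze attempt.
    """
    deadline = clock() + timeout_s
    while True:
        if try_freeze():
            return True
        if clock() >= deadline:
            return False  # timer expired: stop retrying and report failure
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated operation that succeeds on the third attempt.
    attempts = iter([False, False, True])
    print(retry_until_deadline(lambda: next(attempts), timeout_s=5.0, interval_s=0.01))
```

The sketch deliberately returns only success or failure. Which barrier state is reported once the timer runs out is the open question for this test.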
Should we also consider the "expired" state a valid failure state in this test case? Any thoughts?
Let me know if any other modifications are necessary.