Details
Task
Resolution: Unresolved
Medium
Description
The RAFT log may grow large, possibly exceeding the maximum size of an llog catalog, and a large RAFT log takes a long time to load and replicate. The RAFT protocol uses a periodic snapshot mechanism to bound the log size. Since Lustre filesystem snapshots are not always enabled, RAFT snapshots can instead be supported by capturing the complete state to disk:
- add a function to save the state to disk; the state includes MGS changes, FLDB, and quota
- add a function to load a snapshot from disk
- add functions to send and handle the snapshot RPC: the state may be large, so it is split across multiple RPCs
The snapshot is to be captured periodically; as a first step, it can be taken upon RAFT node startup.
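
The multi-RPC snapshot transfer listed above could look roughly like the sketch below. It is illustrative only and not taken from the Lustre tree: `struct snap_chunk`, `struct snap_recv`, `snap_send_all()`, `snap_recv_chunk()`, and `snap_demo()` are hypothetical names, the chunk header fields (term, last index, offset, done flag) are an assumption, and the real RPC layer is replaced by a plain callback. The sender splits a serialized state buffer into fixed-size chunks; the receiver accepts chunks only at the next expected offset and marks the snapshot complete on the final chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header for one snapshot chunk (not a Lustre structure). */
struct snap_chunk {
	uint64_t sc_term;        /* sender's current term */
	uint64_t sc_last_index;  /* last log index covered by the snapshot */
	uint64_t sc_offset;      /* byte offset of this chunk in the snapshot */
	uint32_t sc_len;         /* number of valid bytes in sc_data */
	uint32_t sc_done;        /* non-zero on the final chunk */
	const uint8_t *sc_data;  /* payload, points into the sender's buffer */
};

/* Receiver-side reassembly state. */
struct snap_recv {
	uint8_t *sr_buf;      /* destination buffer */
	size_t sr_cap;        /* capacity of sr_buf */
	size_t sr_received;   /* bytes accepted so far, i.e. next expected offset */
	int sr_complete;      /* set once the final chunk has been accepted */
};

/* Stand-in for "send one RPC"; returns 0 on success. */
typedef int (*snap_send_fn)(const struct snap_chunk *chunk, void *arg);

/* Split buf into chunks of at most chunk_size bytes and send each in order. */
static int snap_send_all(const uint8_t *buf, size_t len, size_t chunk_size,
			 uint64_t term, uint64_t last_index,
			 snap_send_fn send, void *arg)
{
	size_t off = 0;

	assert(chunk_size > 0 && chunk_size <= UINT32_MAX);
	do {
		size_t n = len - off < chunk_size ? len - off : chunk_size;
		struct snap_chunk c = {
			.sc_term = term,
			.sc_last_index = last_index,
			.sc_offset = off,
			.sc_len = (uint32_t)n,
			.sc_done = (off + n == len),
			.sc_data = buf + off,
		};
		int rc = send(&c, arg);

		if (rc != 0)
			return rc;
		off += n;
	} while (off < len);
	return 0;
}

/* Accept a chunk only if it starts at the next expected offset and fits. */
static int snap_recv_chunk(const struct snap_chunk *c, void *arg)
{
	struct snap_recv *r = arg;

	if (r->sr_complete || c->sc_offset != r->sr_received)
		return -1;  /* duplicate, out-of-order, or late chunk */
	if (c->sc_len > r->sr_cap - r->sr_received)
		return -1;  /* would overflow the destination buffer */
	if (c->sc_len > 0)
		memcpy(r->sr_buf + r->sr_received, c->sc_data, c->sc_len);
	r->sr_received += c->sc_len;
	if (c->sc_done)
		r->sr_complete = 1;
	return 0;
}

/* Round trip: 1000 bytes of state sent in 256-byte chunks. Returns 0 if OK. */
static int snap_demo(void)
{
	uint8_t state[1000], copy[1000];
	struct snap_recv r = { copy, sizeof(copy), 0, 0 };
	size_t i;

	for (i = 0; i < sizeof(state); i++)
		state[i] = (uint8_t)(i * 7);
	if (snap_send_all(state, sizeof(state), 256, 3, 42,
			  snap_recv_chunk, &r) != 0)
		return -1;
	if (!r.sr_complete || r.sr_received != sizeof(state))
		return -1;
	return memcmp(state, copy, sizeof(state)) == 0 ? 0 : -1;
}
```

Rejecting any chunk whose offset differs from the next expected offset keeps the receiver simple: a sender that loses a reply can restart from the receiver's last acknowledged offset, and a resent chunk is refused rather than applied twice. A real implementation would also need to persist partial state and validate the term, which this sketch leaves out.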