Type: New Feature
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0
Snapshot is an important feature for Lustre. As a first step, we will use the snapshot functionality of the ZFS backend to implement Lustre snapshots.
Snapshots provide fast recovery of files from a previous checkpoint (without recourse to offline backup). Snapshots are cheap online backups, provided the hardware itself is not compromised. Recovery of lost files from a snapshot is usually considerably faster than from any offline backup or remote replica. Note that snapshots do not improve storage reliability and are just as exposed to hardware failure as any other storage volume.
Snapshots address the need to take a checkpoint of the file system, and have two historic purposes: preparing a file system for a backup, and fast recovery of files from a previous state without recourse to an offline backup. The latter is increasingly used in environments where the cost associated with any downtime is significant; consider the time required to restore a dataset from a tape library. In many cases, restoring from tape will exceed the SLA for operations.
A common pattern is for a file system to be checkpointed every two hours. If an error occurs in the “live” data (accidental data loss, corruption, etc.), then it is straightforward to revert to a previous snapshot, either wholesale or by copying back the original data. Snapshots do require that the underlying hardware is not compromised.
Stabilising the file system for a backup is probably less relevant when the file system size reaches into petabytes. LTO drives, for example, can only record at a maximum rate of 576-900 GB/hour. As file system capacities increase, the ability to take a backup and, more importantly, to restore from a backup within the conditions of an SLA diminishes.
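To put those rates in perspective, a rough calculation may help. The sketch below assumes a 1 PB file system (decimal, 1,000,000 GB) streamed to a single drive at the sustained maximum rate. Both are illustrative assumptions rather than figures from a real deployment, and the function name is hypothetical.

```python
def backup_hours(capacity_gb: float, rate_gb_per_hour: float) -> float:
    """Hours needed to stream capacity_gb at a sustained rate_gb_per_hour."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return capacity_gb / rate_gb_per_hour


# Assumption: 1 PB file system, decimal units (1 PB = 1,000,000 GB).
CAPACITY_GB = 1_000_000

# Maximum LTO recording rates quoted above, in GB/hour.
RATES_GB_PER_HOUR = (576, 900)

for rate in RATES_GB_PER_HOUR:
    hours = backup_hours(CAPACITY_GB, rate)
    print(f"{rate} GB/hour: {hours:,.0f} hours (~{hours / 24:.0f} days)")
```

Under these assumptions a single drive needs roughly 46 to 72 days for one full pass. Multiple drives in parallel shorten this proportionally, but the same arithmetic applies to the restore path.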
Let us not underestimate the utility of snapshots when planning maintenance. Taking a snapshot immediately prior to a system upgrade is a sensible precaution, and making that mechanism accessible and reliable adds value to any system maintenance workflow.