Another Approach to Lustre Metadata Redundancy

Executive Summary

This document proposes a comprehensive approach to implementing metadata redundancy in the Lustre filesystem, addressing the critical need for fault tolerance in metadata services. The solution provides redundancy for filesystem configurations, service data, and file/directory metadata through a phased implementation.

Goal

Implement comprehensive redundancy for all three types of Lustre filesystem metadata:
1. Filesystem configurations (currently managed by the MGS)
2. System service data (the FLD, quota, and flock services, currently on MDT0)
3. File and directory inodes (distributed across MDTs)

Key Design Components

1. Fault Tolerant MGS (Management Service)
- Separating the MGS from MDT0 is a prerequisite, to reduce implementation complexity
- Implement a Raft-based consensus protocol for the MGS cluster
- Multiple MGS instances provide a distributed, fault-tolerant service
- The MGS acts as the coordinator for rebuild operations, which is itself a service

2. Service Data Migration
- Migrate the FLD, quota, and flock services from MDT0 to the MGS
- Leverage MGS fault tolerance to protect these critical services
- Simplify the MDT architecture by making MDT0 identical to the other MDTs

3. File Metadata Redundancy
- FID Structure Enhancement (see the sketch following this section):
  * Reserve the first 4 bits of the FID sequence for the replica ID
  * Support up to 16 replicas per file
  * Example FID structure for 3 replicas:
    . R0: [0x200000401:0x1:0x0]
    . R1: [0x1000000200000401:0x1:0x0]
    . R2: [0x2000000200000401:0x1:0x0]
- lu_seq_range Structure Change:
  * Extend struct lu_seq_range to hold 16 replica indices and the actual replica count:

        struct lu_seq_range {
                __u64 lsr_start;
                __u64 lsr_end;
                __u32 lsr_flags;
                __u32 lsr_count;
                __u32 lsr_index[16];
        };

  * The target MDT allocates sequences for all replicas in one request
  * Replica MDT indices are chosen by the target MDT instance; e.g. in an 8-MDT system, if R0 is on MDT1, R1 may be placed on MDT5, R2 on MDT7, and R3 on MDT3
- Rebuild:
  * The replica FID is used to determine whether a replica was located on the failed MDT
  * The remaining MDTs scan their local files, check the replica locations, and recreate the replicas that belonged to the failed MDT on other MDTs
  * The FLD database needs to be rebuilt after the file rebuild finishes
  * Multiple MDT failures are handled the same way: an MDT failure during a rebuild does not interrupt the current rebuild, but triggers a new rebuild after the current one finishes
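The following is a minimal sketch, in C, of how a replica ID could be packed into and extracted from the top 4 bits of the FID sequence described above. struct lu_fid mirrors the existing Lustre FID layout, but the macro and helper names (LU_FID_REPLICA_BITS, lu_fid_replica_pack, lu_fid_replica_unpack) are illustrative assumptions, not existing Lustre interfaces; the program simply reproduces the R1/R2 FIDs from the example.

    /*
     * Sketch only: packing the replica ID into the top 4 bits of the FID
     * sequence, as proposed above.  The replica helpers and macro names
     * are hypothetical, not part of the current Lustre tree.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    struct lu_fid {
        uint64_t f_seq;     /* sequence number */
        uint32_t f_oid;     /* object ID within the sequence */
        uint32_t f_ver;     /* version */
    };

    #define LU_FID_REPLICA_BITS   4
    #define LU_FID_REPLICA_SHIFT  (64 - LU_FID_REPLICA_BITS)
    #define LU_FID_REPLICA_MAX    (1 << LU_FID_REPLICA_BITS)   /* 16 replicas */

    /* Build the FID of replica 'rid' from the R0 FID. */
    static struct lu_fid lu_fid_replica_pack(const struct lu_fid *r0, unsigned int rid)
    {
        struct lu_fid fid = *r0;

        fid.f_seq |= (uint64_t)rid << LU_FID_REPLICA_SHIFT;
        return fid;
    }

    /* Extract the replica ID and recover the R0 FID from any replica FID. */
    static unsigned int lu_fid_replica_unpack(struct lu_fid *r0, const struct lu_fid *fid)
    {
        unsigned int rid = fid->f_seq >> LU_FID_REPLICA_SHIFT;

        *r0 = *fid;
        r0->f_seq &= (1ULL << LU_FID_REPLICA_SHIFT) - 1;
        return rid;
    }

    int main(void)
    {
        /* R0 FID from the example above: [0x200000401:0x1:0x0] */
        struct lu_fid r0 = { .f_seq = 0x200000401ULL, .f_oid = 0x1, .f_ver = 0x0 };
        struct lu_fid back;
        unsigned int rid;

        /* Prints R1 [0x1000000200000401:0x1:0x0] and R2 [0x2000000200000401:0x1:0x0]. */
        for (rid = 1; rid < 3; rid++) {
            struct lu_fid fid = lu_fid_replica_pack(&r0, rid);

            printf("R%u: [0x%" PRIx64 ":0x%x:0x%x] (rid=%u)\n", rid, fid.f_seq,
                   fid.f_oid, fid.f_ver, lu_fid_replica_unpack(&back, &fid));
        }
        return 0;
    }

Because the replica ID lives only in the high bits of the sequence, any MDT scanning its local objects can recover the R0 FID and the replica ID without extra lookups, which is what the rebuild scan relies on.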
Use Cases

Enabling Metadata Redundancy:
1. Prerequisites: a fresh Lustre filesystem with multiple MDTs
2. The administrator sets the replica count via 'lctl conf_param <fsname>.sys.mr_count=<count>' on the MGS
3. MDT0 inserts FLD records for the other replicas of '/' and creates them on the other MDTs

Create:
1. The target MDT allocates a FID for R0, and packs the replica ID into it for the other replicas
2. A distributed transaction ensures atomic creation
3. Transparent to clients: clients know only R0 in normal mode

Access:
1. Transparent to clients: clients know only R0 in normal mode

Delete MDT:
1. The administrator initiates MDT removal via 'lctl --device <MDT> delete' on the MGS
2. The system enters read-only mode
3. The automated rebuild process starts
4. The remaining MDTs scan their local files and recreate the files that belonged to the failed MDT on other MDTs
5. The system returns to normal mode after the rebuild

Access File During Rebuild (a sketch of this fallback appears at the end of this document):
1. The failed MDT (e.g. MDT0) is deleted
2. The rebuild starts and revokes the ldlm locks held by clients
3. The client revalidates '/' with FID [0x200000007:0x1:0x0], which still points to MDT0; since MDT0 has been deleted, the client looks up the R1 FID of '/' [0x1000000200000007:0x1:0x0] to locate its MDT, and revalidates '/' with the R1 FID there
4. Pathname components are resolved one by one; if an intermediate directory is located on MDT0, the lookup is retried with another replica FID of that directory, and the operation is finally sent to the MDT where one of the file's replicas is located
5. If the operation is a write operation, it fails with -EROFS

Access File After Rebuild:
1. The rebuild finishes and all ldlm locks held by clients are revoked
2. The client revalidates '/', which is now located on an MDT other than MDT0
3. Pathname resolution and operation handling are the same as before, i.e. the client only needs to access R0

Implementation Phases

Phase 1: Service Migration
- Migrate the configuration, FLD, quota, and flock services to the MGS

Phase 2: File Replication
- Implement replicated file creation and modification

Phase 3: Rebuild Framework
- Add a command to delete a failed MDT and trigger a rebuild
- Develop the rebuild coordinator in the MGS
- Add the file scanning and recreation logic
- Implement FLD rebuild support

Phase 4: Fault Tolerant MGS
- Implement the Raft consensus protocol
- Add MGS cluster management
- Develop the leader election mechanism
- Implement state replication

Issues & Risks

Compatibility:
- Not backward compatible with older clients
- No downgrade path to non-redundant versions
- Applications may encounter errors during a rebuild

Performance Impact:
- Write operations require distributed transactions
- A rebuild may take a long time

Complexity:
- Rebuild complexity is similar to that of LFSCK
- DoM file replication and rebuild need special handling
- Phase 4 is a huge change and there may be unknown issues, but without it the configuration/FLD/quota/flock services remain a single point of failure

Future Enhancements
- Support a per-directory replica count
- Support a per-file replica count
- Optimize distributed transaction performance
- Optimize rebuild performance
- Add monitoring and reporting tools
- Transparent rebuild
- Automatic failure detection
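As a complement to the "Access File During Rebuild" use case, the following is a minimal sketch, in C, of the client-side fallback: when the MDT holding a replica has been deleted, the client derives the next replica FID, asks the FLD for its MDT, and sends the request there; writes fail with -EROFS while the filesystem is read-only. All helpers (fid_for_replica, fld_lookup_mdt, mdt_is_deleted, send_md_request) and their toy bodies are hypothetical stand-ins, not real Lustre client interfaces.

    /*
     * Sketch only: client-side replica fallback during a rebuild.  Real
     * client code would use the FLD client and MDC import state instead
     * of these toy stubs.
     */
    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct lu_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
    };

    #define LU_FID_REPLICA_SHIFT 60     /* replica ID lives in the top 4 bits */

    /* Derive the FID of replica 'rid' from the R0 FID. */
    static struct lu_fid fid_for_replica(const struct lu_fid *r0, unsigned int rid)
    {
        struct lu_fid fid = *r0;

        fid.f_seq |= (uint64_t)rid << LU_FID_REPLICA_SHIFT;
        return fid;
    }

    /* Toy stubs: pretend MDT0 failed and map a sequence to an MDT index. */
    static bool mdt_is_deleted(uint32_t mdt) { return mdt == 0; }
    static int fld_lookup_mdt(uint64_t seq, uint32_t *mdt)
    {
        *mdt = (uint32_t)(seq >> LU_FID_REPLICA_SHIFT) % 8;  /* placeholder mapping */
        return 0;
    }
    static int send_md_request(uint32_t mdt, const struct lu_fid *fid, bool write)
    {
        if (write)
            return -EROFS;          /* filesystem is read-only during rebuild */
        printf("sending lookup for [0x%llx:0x%x:0x%x] to MDT%u\n",
               (unsigned long long)fid->f_seq, fid->f_oid, fid->f_ver, mdt);
        return 0;
    }

    /*
     * Resolve one pathname component during a rebuild: try R0 first, then
     * the remaining replicas, until one of them lives on a surviving MDT.
     */
    static int md_request_during_rebuild(const struct lu_fid *r0,
                                         unsigned int replica_count, bool write)
    {
        unsigned int rid;

        for (rid = 0; rid < replica_count; rid++) {
            struct lu_fid fid = fid_for_replica(r0, rid);
            uint32_t mdt;
            int rc = fld_lookup_mdt(fid.f_seq, &mdt);

            if (rc != 0)
                return rc;
            if (mdt_is_deleted(mdt))
                continue;           /* this replica was on the failed MDT */
            return send_md_request(mdt, &fid, write);
        }
        return -ENODEV;             /* no surviving replica found */
    }

    int main(void)
    {
        /* R0 FID of '/' from the example above: [0x200000007:0x1:0x0] */
        struct lu_fid root = { .f_seq = 0x200000007ULL, .f_oid = 0x1, .f_ver = 0x0 };

        return md_request_during_rebuild(&root, 3, false) == 0 ? 0 : 1;
    }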