Type: Technical task
Affects Version/s: None
Fix Version/s: Lustre 2.8.0
When an MDT restarts after a crash, it processes all of the records in its local update llog. It batches all of the updates that share the same ur_master_index and ur_batchid and sends them in an OUT_UPDATE RPC to each of the remote targets that were part of the operation. In the meantime, all of the other MDTs are notified; they also check their own local update llogs and send all of the related records to the failover MDT.
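The batching step can be illustrated with a minimal sketch. This is not Lustre code: only ur_master_index and ur_batchid come from the description above, while the UpdateRecord class, its target_index and payload fields, and batch_updates() are hypothetical names used for illustration. It assumes each record can be attributed to a single remote target.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    ur_master_index: int  # index of the MDT that is master for the operation
    ur_batchid: int       # batch id shared by all updates of one operation
    target_index: int     # hypothetical: remote MDT the update applies to
    payload: str          # hypothetical: opaque update body


def batch_updates(records):
    """Group records so that each (target, master index, batch id)
    combination becomes the body of one OUT_UPDATE-style RPC."""
    rpcs = defaultdict(list)
    for rec in records:
        key = (rec.target_index, rec.ur_master_index, rec.ur_batchid)
        rpcs[key].append(rec.payload)
    return dict(rpcs)


# Illustrative llog contents (made-up data).
llog = [
    UpdateRecord(0, 7, 1, "create name entry"),
    UpdateRecord(0, 7, 2, "create object"),
    UpdateRecord(0, 7, 1, "set attributes"),
    UpdateRecord(0, 8, 1, "unlink name entry"),
]
batches = batch_updates(llog)
```

Here the two batch-7 updates destined for target 1 end up in the same RPC body, while the batch-8 update for the same target is sent separately.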
An MDT that receives updates from another MDT checks whether the corresponding updates are already recorded in its local update llog.
If the update was already committed, then the MDT will reply with an arbitrary pb_transno < pb_last_committed.
If the updates do not exist in the update llog, the MDT compares the master transno in the update record with the transno in last_rcvd. If the transno in the update record is smaller than the one in last_rcvd, the master already sent the update to this MDT, the update has already been executed and committed, and the update log record has been deleted, so the MDT again returns an arbitrary smaller transno as above. If the transno in the update record is larger, the MDT replays the update with a new transno.
In all cases, the MDT replies to the sender with the transno.
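The receiver-side decision described above can be sketched as follows. This is an illustrative model rather than the actual implementation: ReceiverMDT, handle_update() and the field names are hypothetical, presence in the local update llog is treated as "already committed" for simplicity, and the handling of equal transnos (not covered by the description) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiverMDT:
    last_committed: int     # models pb_last_committed
    last_rcvd_transno: int  # transno of the sending master stored in last_rcvd
    last_transno: int       # highest transno assigned so far on this MDT
    # (master index, batch id) pairs found in the local update llog
    llog_batches: set = field(default_factory=set)

    def handle_update(self, master_index, batchid, master_transno):
        """Return (reply transno, reason) for an update sent by another MDT."""
        if (master_index, batchid) in self.llog_batches:
            # Already recorded locally: reply with an arbitrary
            # transno below last_committed.
            return self.last_committed - 1, "already committed"
        if master_transno <= self.last_rcvd_transno:
            # Already executed and committed, llog record since deleted.
            # Treating the equal case this way is an assumption.
            return self.last_committed - 1, "executed, record deleted"
        # Not seen before: replay the update under a new transno.
        self.last_transno += 1
        self.llog_batches.add((master_index, batchid))
        return self.last_transno, "replayed"


# Made-up state for illustration.
mdt = ReceiverMDT(last_committed=100, last_rcvd_transno=50,
                  last_transno=120, llog_batches={(0, 7)})
in_llog = mdt.handle_update(0, 7, 60)
older = mdt.handle_update(0, 8, 40)
replayed = mdt.handle_update(0, 9, 60)
```

In the first two cases the reply transno is below last_committed, so the sender can treat the update as committed; only the third case produces a new transno.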
If the sender is the recovering MDT, which is the master for this operation, it will build the in-memory operation state to track the remote updates, and when all of the remote updates have committed, it can cancel the local update record.
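As a rough sketch of the in-memory operation state on the master, the class below tracks which remote targets have not yet committed and marks the local update record as cancellable once none remain. OperationState and on_reply() are hypothetical names, and the rule that a reply counts as committed when its transno is at or below the target's last committed transno is an assumption made for this sketch.

```python
class OperationState:
    """Tracks the remote updates of one operation on the master MDT."""

    def __init__(self, batchid, remote_targets):
        self.batchid = batchid
        self.pending = set(remote_targets)  # targets not yet committed
        self.cancelled = False              # stands in for llog cancel

    def on_reply(self, target, transno, last_committed):
        """Process a reply from a remote target and report whether the
        local update record can now be cancelled."""
        if transno <= last_committed:
            # Assumed rule: the update is committed on this target.
            self.pending.discard(target)
        if not self.pending:
            self.cancelled = True
        return self.cancelled


# Made-up replies for illustration.
op = OperationState(batchid=7, remote_targets={1, 2})
after_first = op.on_reply(1, 99, 100)    # target 1 already committed
after_second = op.on_reply(2, 121, 100)  # target 2 replayed, not committed yet
after_third = op.on_reply(2, 121, 130)   # target 2 has now committed
```

The record only becomes cancellable after the last outstanding target reports a committed transno.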
The client then sends replay/resend requests to the failover MDT.
The master MDT will check whether the request exists in the update log by the request xid.
If it does not exist, the MDT compares the request transno with its own transno and replays the request only if the request transno is bigger than the last transno (lcd_last_transno) of this MDT.
If it does exist, the recovery between MDTs has already handled this case, so the MDT returns an arbitrary smaller transno and the client can remove the request from its replay list.
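The client replay check can be summarised in a small sketch. The function name, its parameters, and the returned action strings are hypothetical. The description does not say what is returned when a request is neither in the update log nor newer than lcd_last_transno, so that branch only reports that no replay happens.

```python
def handle_client_replay(xid, req_transno, update_log_xids,
                         lcd_last_transno, last_committed):
    """Decide what the master MDT does with a client replay/resend request.

    Returns (action, reply_transno); reply_transno is None where the
    description above does not specify a value.
    """
    if xid in update_log_xids:
        # Recovery between MDTs already handled this request: reply with
        # an arbitrary smaller transno so the client drops it from its
        # replay list.
        return "already_recovered", last_committed - 1
    if req_transno > lcd_last_transno:
        # Unknown and newer than anything seen from this client: replay.
        return "replay", None
    # Unknown but not newer: not replayed (reply not specified above).
    return "no_replay", None


# Made-up values for illustration.
known = handle_client_replay(10, 50, {10}, 40, 100)
newer = handle_client_replay(11, 50, {10}, 40, 100)
stale = handle_client_replay(12, 30, {10}, 40, 100)
```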
If there are any failures during the two steps above, the lfsck daemon will be triggered to fix the filesystem.
For more details, please refer to the HLD for DNE phase II.