[LU-16096] recovery: handle compatibility during upgrade for new replay data format Created: 16/Aug/22  Updated: 20/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Critical
Reporter: Qian Yingjin Assignee: Qian Yingjin
Resolution: Unresolved Votes: 0
Labels: LTS15, statahead

Issue Links:
Related
is related to LU-15975 Statahead_V1 Features Resolved
is related to LU-14139 batched statahead processing Resolved
Rank (Obsolete): 9223372036854775807

 Description   

The batched RPC protocol changes the on-disk format of the client reply data file "REPLY_DATA" used for recovery, so we need to handle compatibility carefully during upgrade for this new reply data format.

The new format is introduced in https://review.whamcloud.com/#/c/46799/.

The new format is as follows:

struct lsd_reply_data
{ 
__u64 lrd_transno; /* transaction number */
__u64 lrd_xid; /* transmission id */
__u64 lrd_data; /* per-operation data */
__u32 lrd_result; /* request result */
__u32 lrd_client_gen; /* client generation */
+__u32 lrd_batch_idx; /* sub request index in a batched RPC */
+__u32 lrd_padding[7]; /* unused fields. */ 
};
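As a rough check of the size change, the sketch below restates both layouts with fixed-width user-space types (`uint64_t`/`uint32_t` stand in for `__u64`/`__u32`, which is an assumption made so it builds outside the kernel). The struct names `lsd_reply_data_v1` and `lsd_reply_data_v2` follow the naming used later in this ticket; the helper `lrd_size_for_version()` is hypothetical and not part of Lustre.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old on-disk record layout (32 bytes). */
struct lsd_reply_data_v1 {
	uint64_t lrd_transno;    /* transaction number */
	uint64_t lrd_xid;        /* transmission id */
	uint64_t lrd_data;       /* per-operation data */
	uint32_t lrd_result;     /* request result */
	uint32_t lrd_client_gen; /* client generation */
};

/* New on-disk record layout (64 bytes): V1 plus batch index and padding. */
struct lsd_reply_data_v2 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
	uint32_t lrd_batch_idx;  /* sub request index in a batched RPC */
	uint32_t lrd_padding[7]; /* unused fields */
};

static_assert(sizeof(struct lsd_reply_data_v1) == 32, "V1 record is 32 bytes");
static_assert(sizeof(struct lsd_reply_data_v2) == 64, "V2 record is 64 bytes");

/* Hypothetical helper: record size for a format version, 0 if unknown. */
static size_t lrd_size_for_version(int version)
{
	switch (version) {
	case 1:
		return sizeof(struct lsd_reply_data_v1);
	case 2:
		return sizeof(struct lsd_reply_data_v2);
	default:
		return 0;
	}
}
```

The record size doubles from 32 to 64 bytes, so an old-format file cannot be reinterpreted in place with the new layout.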

The proposed solution is as follows:

Add several flags in the magic number field of the reply data header:

LRH_MAGIC_V1: 0xbdabda01 - the magic number of the old format for client reply data.

LRH_MAGIC: 0xbdabda02 - the magic number of the new format for the client reply data.

LRH_FLAG_BACKUP_DONE: 0x00000004 - indicates that the target has finished backing up the "REPLY_DATA" file in the old format.
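Because the flag shares the magic word with the format identifier, a reader has to separate the two. The following is a minimal sketch of that decoding, using the three values listed above; the helper names `lrh_magic_base()` and `lrh_backup_done()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LRH_MAGIC_V1         0xbdabda01u /* old reply data format */
#define LRH_MAGIC            0xbdabda02u /* new reply data format */
#define LRH_FLAG_BACKUP_DONE 0x00000004u /* old-format backup completed */

/* The flag bit must not overlap either magic value. */
static_assert((LRH_MAGIC_V1 & LRH_FLAG_BACKUP_DONE) == 0, "flag overlaps V1");
static_assert((LRH_MAGIC & LRH_FLAG_BACKUP_DONE) == 0, "flag overlaps V2");

/* Hypothetical helper: strip the flag to get the format magic. */
static uint32_t lrh_magic_base(uint32_t magic)
{
	return magic & ~LRH_FLAG_BACKUP_DONE;
}

/* Hypothetical helper: true if the backup step was recorded as done. */
static bool lrh_backup_done(uint32_t magic)
{
	return (magic & LRH_FLAG_BACKUP_DONE) != 0;
}
```

With these values, LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE is 0xbdabda05, which matches neither plain magic number.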

 

During target setup, the target initializes the reply data in @tgt_init()->tgt_reply_data_init():

  1. If the "REPLY_DATA" file is found to be in the old format (according to the magic number in the reply data header, i.e. LRH_MAGIC_V1), the target backs up the "REPLY_DATA" file into the file "REPLY_DATA_BAK".
  2. After the backup finishes, the target sets the magic number field of the reply data header to LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE and syncs the change to persistent storage.
  3. The target then converts the old-format reply data from the backup file "REPLY_DATA_BAK" into the original reply data file "REPLY_DATA".
  4. After the conversion finishes, the target sets the magic number @lrh_magic of the reply data header to LRH_MAGIC and @lrh_reply_size to the new record size, syncs the change to disk, and then deletes the backup file "REPLY_DATA_BAK".
  5. Finally, the target starts recovery processing as normal with the new-format reply data.
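Since each stage of the steps above is recorded in the persisted magic word, a target that restarts mid-upgrade could presumably pick up where it left off. The sketch below illustrates that decision only; the enum and function names are hypothetical, and the actual backup and conversion I/O is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define LRH_MAGIC_V1         0xbdabda01u
#define LRH_MAGIC            0xbdabda02u
#define LRH_FLAG_BACKUP_DONE 0x00000004u

/* Hypothetical names for the next action at target setup. */
enum upgrade_step {
	STEP_BACKUP,  /* steps 1-2: copy REPLY_DATA to REPLY_DATA_BAK, set flag */
	STEP_CONVERT, /* steps 3-4: rewrite REPLY_DATA from the backup */
	STEP_RECOVER, /* step 5: file is already in the new format */
	STEP_INVALID, /* unrecognized magic */
};

/* Decide where to resume based only on the persisted magic word. */
static enum upgrade_step next_upgrade_step(uint32_t lrh_magic)
{
	if (lrh_magic == LRH_MAGIC)
		return STEP_RECOVER;
	if (lrh_magic == (LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE))
		return STEP_CONVERT;
	if (lrh_magic == LRH_MAGIC_V1)
		return STEP_BACKUP;
	return STEP_INVALID;
}
```

A crash before the flag is synced restarts the backup, while a crash during conversion restarts from the intact backup file.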

 



 Comments   
Comment by Andreas Dilger [ 16/Aug/22 ]

I don't understand why there is a need to make a backup of the reply_data file. It would seem better to complete the replay of all records in the file (if possible), or wait until the clients are evicted, and then reset the file to the new format.

Comment by Qian Yingjin [ 16/Aug/22 ]

We need to make a backup of the reply_data file for the following reasons:

  1. We want to extend the data structure @lsd_reply_data, so we cannot convert the old records to the new format in place within the original reply_data file. It is better to make a backup first.
  2. The target may reboot repeatedly during recovery.
  3. The target does not free the reply data corresponding to the highest transno of each export. This ensures that the on-disk reply data is kept and that the last committed transno can be restored from disk in case of target recovery.
  4. Even if some clients are evicted, we still need to keep the reply data for the other clients that replay successfully.

For these reasons, I think it is better to complete the format conversion before recovery.

Comment by Gerrit Updater [ 19/Aug/22 ]

"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48260
Subject: LU-16096 recovery: upgrade compatibility for new reply data
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d1480f69b71fe647af47e4ec4b502e76caf96695

Comment by Qian Yingjin [ 19/Aug/22 ]

After discussing this with Lai, he thinks that when the reply data is found to be in the old format, we can simply drop it (by truncating the file to size 0) and rewrite the reply data header in the new format.

This approach is much simpler, but it will cause recovery to fail and the clients to be evicted.

Comment by Andreas Dilger [ 19/Aug/22 ]

The reply_data file is not the primary recovery state, since clients are listed in last_rcvd and only "extra" recovery records are in reply_data. It should be possible to complete the client recovery using the existing file (without conversion), and then truncate the file after recovery has finished and rewrite the header to use the new magic and size.

The clients will still be listed in the last_rcvd file and do not need to be evicted, and then new records will be written in the new format.
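The ordering suggested here (finish recovery on the old file, then reset it) can be sketched against a purely in-memory model. Everything below is illustrative: `struct reply_file` and `reply_file_upgrade()` are hypothetical, the `lrh_header_size` field is an assumption (only `lrh_magic` and `lrh_reply_size` are named in this ticket), and the value 64 for both sizes is inferred from the new record layout rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LRH_MAGIC_V1 0xbdabda01u
#define LRH_MAGIC    0xbdabda02u

/* Assumed header layout; lrh_header_size is an illustrative assumption. */
struct reply_header {
	uint32_t lrh_magic;
	uint32_t lrh_header_size;
	uint32_t lrh_reply_size;
};

/* Hypothetical in-memory stand-in for the reply_data file. */
struct reply_file {
	struct reply_header hdr;
	uint64_t nr_records; /* records stored after the header */
};

/*
 * Reset an old-format file once recovery is finished: drop all records
 * and rewrite the header with the new magic and record size.
 * Returns 0 on success, -1 if recovery is still in progress.
 */
static int reply_file_upgrade(struct reply_file *rf, int recovery_done)
{
	assert(rf != NULL);

	if (!recovery_done)
		return -1;  /* old records are still needed for replay */
	if (rf->hdr.lrh_magic != LRH_MAGIC_V1)
		return 0;   /* nothing to upgrade */

	rf->nr_records = 0; /* models truncating the file */
	memset(&rf->hdr, 0, sizeof(rf->hdr));
	rf->hdr.lrh_magic = LRH_MAGIC;
	rf->hdr.lrh_header_size = 64;
	rf->hdr.lrh_reply_size = 64;
	return 0;
}
```

Because the client list lives in last_rcvd, dropping the records at this point does not affect which clients are known to the target.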

Comment by Gerrit Updater [ 19/Aug/22 ]

"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48261
Subject: LU-16096 recovery: upgrade reply data after recovery finish
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ab15fa93d32ec1629041c1df4a773d946759f648

Comment by Gerrit Updater [ 29/Nov/22 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49268
Subject: LU-16096 recovery: upgrade reply data after recovery finish
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 235730ade05896237d3b6fafae6d8db07fea0283

Comment by Gerrit Updater [ 31/Jan/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48261/
Subject: LU-16096 recovery: upgrade reply data after recovery finish
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bbf0017fdea52f094c190f14fd82b9f5d0902c90

Comment by Andreas Dilger [ 07/Feb/23 ]

I've been bitten a few times recently by the landing of patch https://review.whamcloud.com/48261 "LU-16096 recovery: upgrade reply data after recovery finish" when switching between branches without reformatting the filesystem (with added debugging to show why the mount was failing):

[Tue Feb  7 15:26:54 2023] LustreError: 1230:0:(tgt_lastrcvd.c:2206:tgt_reply_data_init()) testfs-MDT0000: invalid reply_data header size: 64 != 32
[Tue Feb  7 15:26:54 2023] LustreError: 1230:0:(obd_config.c:776:class_setup()) setup testfs-MDT0000 failed (-22)
[Tue Feb  7 15:26:54 2023] LustreError: 1230:0:(obd_config.c:2024:class_config_llog_handler()) MGC192.168.10.99@tcp: cfg command failed: rc = -22
[Tue Feb  7 15:26:54 2023] Lustre:    cmd=cf003 0:testfs-MDT0000  1:testfs-MDT0000_UUID  2:0  3:testfs-MDT0000-mdtlov  4:f  
[Tue Feb  7 15:26:54 2023] LustreError: 15b-f: MGC192.168.10.99@tcp: Configuration from log testfs-MDT0000 failed from MGS -22. Check client and MGS are on compatible version.
[Tue Feb  7 15:26:54 2023] LustreError: 1173:0:(tgt_mount.c:1444:server_start_targets()) failed to start server testfs-MDT0000: -22
[Tue Feb  7 15:26:54 2023] LustreError: 1173:0:(tgt_mount.c:2081:server_fill_super()) Unable to start targets: -22
[Tue Feb  7 15:26:54 2023] LustreError: 1173:0:(obd_config.c:829:class_cleanup()) Device 5 not setup
[Tue Feb  7 15:26:55 2023] Lustre: server umount testfs-MDT0000 complete

It would be useful to backport a patch to b2_15 and b_es6_0 to allow mounting the filesystem with the new reply_data record size during recovery, if that is possible. It should certainly be possible if lrd_batch_idx == 0 is in the records (i.e. no clients doing WBC), which should at least be true for 2.16 clients. This would avoid lots of support problems in the field if the MDS is upgraded to 2.16+ and then downgraded because of problems with WBC or some other new feature.

I'm not sure whether it would be possible for a 2.15 server to finish recovery with actual WBC client records (seems unlikely), but that becomes less critical if at least the 2.15->2.16->2.15 upgrade/downgrade path is handled. It may be necessary to land support into 2.16 for the MDS to properly handle WBC record recovery, even if the WBC feature is not yet implemented there, again to allow upgrade/downgrade to work.

At a minimum, the 2.16 MDS should be able to ignore such records (e.g. with "abort_recov_mdt") if it doesn't understand the format of the record, so that it isn't necessary for the user to manually mount the MDT and truncate reply_data to recover from this problem.

Comment by Andreas Dilger [ 08/Feb/23 ]

Yingjin, I guess the separate question is whether 2.16 with batched statahead actually needs the larger reply_data format with lrd_batch_idx? If not, then we should strongly consider disabling the automatic update of reply_data to the new format in 2.16, and then only enable it in 2.17 when WBC is actually using it. That would allow 2.16 to be able to downgrade from 2.17+ (and do batch RPC recovery) without causing unnecessary incompatibility. I think we would still need some basic interop in 2.15.x to handle the larger record size, but if batched statahead isn't using lrd_batch_idx it should be quite simple (allow reading the larger records, then truncating the reply_data file and reverting to the V1 record size, or staying with the same record size if that is simpler).

Comment by Gerrit Updater [ 08/Feb/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49939
Subject: LU-16096 tgt: improve messages for reply_data
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bdbc114438ae79e5e3f7fff30808c9a5f9096158

Comment by Andreas Dilger [ 24/Feb/23 ]

Hi Yingjin, could you please look into this.

Comment by Qian Yingjin [ 27/Feb/23 ]

Hi Andreas,
This only happens when downgrading an MDT server of a Lustre file system from the latest master to b_es6_0 or b2_15.
Batched statahead does not need the larger reply_data format with lrd_batch_idx.

we would still need some basic interop in 2.15.x to handle the larger record size

This requires patching 2.15.x to recognize the reply_data V2 format and then convert new-format data into the old V1 format. (Please note that we cannot simply truncate the new-format data; we must convert it first because the record sizes are different.)
If this is acceptable, I will make a patch. Then the automatic update of reply_data to the new format in 2.16 can be kept, I think.
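A minimal sketch of such a V2-to-V1 conversion follows, assuming user-space fixed-width types in place of `__u64`/`__u32`. The helper `lrd_v2_to_v1()` is hypothetical, and its refusal to convert records with a non-zero lrd_batch_idx is an assumption based on the earlier remark that conversion should be possible when lrd_batch_idx == 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lsd_reply_data_v1 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
};

struct lsd_reply_data_v2 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
	uint32_t lrd_batch_idx;
	uint32_t lrd_padding[7];
};

/*
 * Hypothetical helper: repack `count` V2 records into V1 records.
 * Returns the number converted, or -1 if a record carries a non-zero
 * lrd_batch_idx that the V1 layout cannot represent.
 */
static long lrd_v2_to_v1(const struct lsd_reply_data_v2 *in, size_t count,
			 struct lsd_reply_data_v1 *out)
{
	size_t i;

	assert(in != NULL && out != NULL);

	for (i = 0; i < count; i++) {
		if (in[i].lrd_batch_idx != 0)
			return -1;
		out[i].lrd_transno = in[i].lrd_transno;
		out[i].lrd_xid = in[i].lrd_xid;
		out[i].lrd_data = in[i].lrd_data;
		out[i].lrd_result = in[i].lrd_result;
		out[i].lrd_client_gen = in[i].lrd_client_gen;
	}
	return (long)count;
}
```

Each record has to be copied field by field because the V1 records are packed at a 32-byte stride while the V2 records sit at a 64-byte stride.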

Comment by Andreas Dilger [ 27/Feb/23 ]

The reason this is important is to allow downgrade from a newer version of Lustre to 2.15.

I think if batched statahead does not require the use of the new reply data format, then it would make sense to allow the new format to be *read* but not actually do the upgrade until the version that requires it to be enabled (I guess when actual WBC is enabled).

For master, I think that just means disabling the reply_data upgrade, and then re-enabling it after the statahead patches land and b2_16 is branched. This will at least allow downgrade from 2.16 to 2.15 (without reply_data upgrade), and from 2.17+ to 2.16, but downgrading from 2.17+ to 2.15.2 would not be possible.

Separately, I think it would be less complex to patch the older maintenance branches to understand the new format without adding the code to do the upgrade. I don't think it would be hard to read the new format and ignore the added fields.

Comment by Qian Yingjin [ 27/Feb/23 ]

Please note that the master branch (b2_16) writes reply_data in the new format (in @tgt_reply_data_write), and the record size written by tgt_reply_data_write is enlarged in the current master branch.
This means that we must also upgrade reply_data, unless we patch master to write reply_data records in the old V1 format.

So I think the better solution may be to leave the current master unchanged and add downgrade support for switching from master to 2.15:

This requires patching 2.15.x to recognize the reply_data V2 format and then convert new-format data into the old V1 format. (Please note that we cannot simply truncate the new-format data; we must convert it first because the record sizes are different.)

Comment by Andreas Dilger [ 27/Feb/23 ]

Since master does not actually need the lrd_batch_idx field for statahead, it makes sense to me that the lsd_reply_data_v2 format only be enabled after the 2.16 release is made. That does mean that master/2.16 would be writing the lsd_reply_data_v1 format for now, but able to mount a filesystem that has lsd_reply_data_v2 records (based on the magic).
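One way to picture "able to mount based on the magic" is a header check that accepts either format and derives the record size to read from the magic, instead of requiring one fixed size. This is only a sketch: `reply_data_read_size()` and the `LRD_SIZE_*` constants are hypothetical, and the 32/64 sizes are taken from the record layouts discussed in this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define LRH_MAGIC_V1 0xbdabda01u /* lsd_reply_data_v1 records */
#define LRH_MAGIC    0xbdabda02u /* lsd_reply_data_v2 records */

#define LRD_SIZE_V1 32u
#define LRD_SIZE_V2 64u

static_assert(LRD_SIZE_V2 > LRD_SIZE_V1, "V2 extends V1");

/*
 * Hypothetical check: accept either known format and require the
 * recorded size to match the format named by the magic. Returns the
 * record size to use when reading, or -EINVAL on a bad header.
 */
static int reply_data_read_size(uint32_t lrh_magic, uint32_t lrh_reply_size)
{
	uint32_t expected;

	switch (lrh_magic) {
	case LRH_MAGIC_V1:
		expected = LRD_SIZE_V1;
		break;
	case LRH_MAGIC:
		expected = LRD_SIZE_V2;
		break;
	default:
		return -EINVAL;
	}
	if (lrh_reply_size != expected)
		return -EINVAL;
	return (int)expected;
}
```

Under this scheme the format used for writing new records stays a separate decision, which is what lets a release keep writing V1 while still reading V2.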

Comment by Gerrit Updater [ 14/Apr/23 ]

"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50636
Subject: LU-16096 target: use lsd_reply_data_v1 format by default
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4b4c1ea61bcc2029b4db9e4ca106d42eac2257a4

Comment by Gerrit Updater [ 18/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49939/
Subject: LU-16096 tgt: improve messages for reply_data
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b2f05051c4239e845434ea9e183d889e74a5db57

Comment by James A Simmons [ 01/Jun/23 ]

Is this work complete?

Comment by Andreas Dilger [ 20/Jun/23 ]

James,
the lack of compatibility is causing the clean-downgrade and clean-downgrade-zfs tests to fail with "invalid header in reply_data":
https://testing.whamcloud.com/test_sets/26ed0085-1e4b-4c6d-acc7-ca63a5775066

LustreError: 74648:0:(tgt_lastrcvd.c:2070:tgt_reply_data_init()) lustre-MDT0000: invalid header in reply_data
LustreError: 74648:0:(obd_config.c:774:class_setup()) setup lustre-MDT0000 failed (-22)
 LustreError: 74648:0:(obd_config.c:2029:class_config_llog_handler()) MGC10.240.26.9@tcp: cfg command failed: rc = -22
Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f  

LustreError: 15b-f: MGC10.240.26.9@tcp: Configuration from log lustre-MDT0000 failed from MGS -22. Check client and MGS are on compatible version.
LustreError: 74442:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server lustre-MDT0000: -22
LustreError: 74442:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -22
LustreError: 74442:0:(obd_config.c:827:class_cleanup()) Device 5 not setup
LustreError: 74521:0:(ldlm_lockd.c:2500:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1686822829 with bad export cookie 6862479036220380163
LustreError: 166-1: MGC10.240.26.9@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: server umount lustre-MDT0000 complete
LustreError: 74442:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -22

The patch https://review.whamcloud.com/50636 "LU-16096 target: use lsd_reply_data_v1 format by default" still needs to land so that upgrade/downgrade continues to work.

Generated at Sat Feb 10 03:23:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.