[LU-16096] recovery: handle compatibility during upgrade for new replay data format Created: 16/Aug/22 Updated: 20/Jun/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.16.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Improvement | Priority: | Critical |
| Reporter: | Qian Yingjin | Assignee: | Qian Yingjin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | LTS15, statahead | ||
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The batched RPC protocol changes the on-disk format of the client reply data file "REPLY_DATA" used for recovery, so compatibility with the new reply data format must be handled carefully during upgrade. The new format is introduced in https://review.whamcloud.com/#/c/46799/ and is as follows:
struct lsd_reply_data {
	__u64 lrd_transno;	/* transaction number */
	__u64 lrd_xid;		/* transmission id */
	__u64 lrd_data;		/* per-operation data */
	__u32 lrd_result;	/* request result */
	__u32 lrd_client_gen;	/* client generation */
	__u32 lrd_batch_idx;	/* added: sub request index in a batched RPC */
	__u32 lrd_padding[7];	/* added: unused fields */
};
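The size change is what drives the compatibility problem, so it may help to spell it out. The following is a minimal userspace sketch, not the kernel definition: it uses `uint64_t`/`uint32_t` in place of `__u64`/`__u32`, and borrows the names `lsd_reply_data_v1` and `lsd_reply_data_v2` from the later comments in this ticket. It only checks that the old record is 32 bytes and the new one is 64 bytes, which matches the "64 != 32" error reported below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old on-disk record (userspace stand-in for the kernel struct). */
struct lsd_reply_data_v1 {
	uint64_t lrd_transno;		/* transaction number */
	uint64_t lrd_xid;		/* transmission id */
	uint64_t lrd_data;		/* per-operation data */
	uint32_t lrd_result;		/* request result */
	uint32_t lrd_client_gen;	/* client generation */
};

/* New on-disk record: same leading fields plus the two added ones. */
struct lsd_reply_data_v2 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
	uint32_t lrd_batch_idx;		/* sub request index in a batched RPC */
	uint32_t lrd_padding[7];	/* unused fields */
};

/* 3 * 8 + 2 * 4 = 32 bytes; the added 4 + 7 * 4 = 32 bytes double it. */
static_assert(sizeof(struct lsd_reply_data_v1) == 32, "v1 record is 32 bytes");
static_assert(sizeof(struct lsd_reply_data_v2) == 64, "v2 record is 64 bytes");

/* The added fields start exactly where the old record ends. */
static_assert(offsetof(struct lsd_reply_data_v2, lrd_batch_idx) ==
	      sizeof(struct lsd_reply_data_v1), "v2 extends v1 in place");
```

Because the new fields are appended, the first 32 bytes of a new record have the same layout as an old record.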
The proposed solution is as follows: add several flags to the magic number field of the reply data header:
LRH_MAGIC_V1: 0xbdabda01 - the magic number of the old format for the client reply data.
LRH_MAGIC: 0xbdabda02 - the magic number of the new format for the client reply data.
LRH_FLAG_BACKUP_DONE: 0x00000004 - indicates that the target has finished backing up the "REPLY_DATA" file in the old format.
During target setup, the reply data is initialized in @tgt_init()->tgt_reply_data_init().
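As an illustration of how the magic field could be interpreted at setup time, here is a minimal sketch. The three constants are taken from the description above; the helper names `lrh_format_of()` and `lrh_backup_done()`, the enum, and the assumption that the flag is OR-ed into the magic value are hypothetical and are not taken from the Lustre source.

```c
#include <assert.h>
#include <stdint.h>

#define LRH_MAGIC_V1		0xbdabda01u	/* old reply data format */
#define LRH_MAGIC		0xbdabda02u	/* new reply data format */
#define LRH_FLAG_BACKUP_DONE	0x00000004u	/* old-format backup finished */

/* The flag bit must not overlap either magic value (assumed encoding). */
static_assert((LRH_MAGIC_V1 & LRH_FLAG_BACKUP_DONE) == 0, "flag clashes with V1");
static_assert((LRH_MAGIC & LRH_FLAG_BACKUP_DONE) == 0, "flag clashes with V2");

/* Hypothetical classification of the on-disk format. */
enum lrh_format { LRH_FMT_UNKNOWN = 0, LRH_FMT_V1 = 1, LRH_FMT_V2 = 2 };

/* Mask off the flag bit, then compare against the known magic values. */
static enum lrh_format lrh_format_of(uint32_t magic)
{
	switch (magic & ~LRH_FLAG_BACKUP_DONE) {
	case LRH_MAGIC_V1:
		return LRH_FMT_V1;
	case LRH_MAGIC:
		return LRH_FMT_V2;
	default:
		return LRH_FMT_UNKNOWN;
	}
}

/* Nonzero if the header says the old-format file was already backed up. */
static int lrh_backup_done(uint32_t magic)
{
	return (magic & LRH_FLAG_BACKUP_DONE) != 0;
}
```

A setup routine could then branch on the result: proceed for a known format, and reject the header otherwise.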
|
| Comments |
| Comment by Andreas Dilger [ 16/Aug/22 ] |
|
I don't understand why there is a need to make a backup of the reply_data file. It would seem better to complete the replay of all records in the file (if possible), or wait until the clients are evicted, and then reset the file to the new format. |
| Comment by Qian Yingjin [ 16/Aug/22 ] |
|
The reason we need to make a backup of the reply_data file is as follows:
For the above reason, I think it would be better to complete the format conversion before recovery. |
| Comment by Gerrit Updater [ 19/Aug/22 ] |
|
|
| Comment by Qian Yingjin [ 19/Aug/22 ] |
|
I discussed this with Lai, who thinks that when the reply data is found to be in the old format, we can simply drop it (by truncating the file to size 0) and rewrite the reply data header in the new format. This approach is much simpler, but recovery will fail and the clients will be evicted. |
| Comment by Andreas Dilger [ 19/Aug/22 ] |
|
The reply_data file is not the primary recovery state, since clients are listed in last_rcvd and only "extra" recovery records are in reply_data. It should be possible to complete the client recovery using the existing file (without conversion), and then truncate the file after recovery has finished and rewrite the header to use the new magic and size. The clients will still be listed in the last_rcvd file and do not need to be evicted, and new records will then be written in the new format. |
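The "truncate after recovery, then rewrite the header" step can be sketched in userspace as below. This is only an illustration on a plain file descriptor: the server would use its own object-storage API rather than `ftruncate()`/`pwrite()`, the header layout `reply_header_sketch` and its field names are assumptions, byte order is ignored, and `reply_data_reset_to_v2()` is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LRH_MAGIC		0xbdabda02u	/* new-format magic (from the ticket) */
#define REPLY_DATA_V2_SIZE	64u		/* new record size (from the ticket) */

/* Assumed header layout, padded to one new-format record. */
struct reply_header_sketch {
	uint32_t lrh_magic;
	uint32_t lrh_header_size;
	uint32_t lrh_reply_size;
	uint8_t  lrh_pad[REPLY_DATA_V2_SIZE - 12];
};
static_assert(sizeof(struct reply_header_sketch) == REPLY_DATA_V2_SIZE,
	      "header padded to one record");

/* Drop old records and write a new-format header; 0 on success, -1 on error. */
static int reply_data_reset_to_v2(int fd, int recovery_finished)
{
	struct reply_header_sketch hdr;

	/* Old-format records may still be needed until recovery completes. */
	if (!recovery_finished)
		return -1;

	if (ftruncate(fd, 0) != 0) {
		perror("ftruncate");
		return -1;
	}

	memset(&hdr, 0, sizeof(hdr));
	hdr.lrh_magic = LRH_MAGIC;
	hdr.lrh_header_size = sizeof(hdr);
	hdr.lrh_reply_size = REPLY_DATA_V2_SIZE;

	if (pwrite(fd, &hdr, sizeof(hdr), 0) != (ssize_t)sizeof(hdr)) {
		perror("pwrite");
		return -1;
	}
	return 0;
}
```

The `recovery_finished` guard reflects the ordering suggested above: the old records are only discarded once no client still depends on them.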
| Comment by Gerrit Updater [ 19/Aug/22 ] |
|
"Yingjin Qian <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/48261 |
| Comment by Gerrit Updater [ 29/Nov/22 ] |
|
|
| Comment by Gerrit Updater [ 31/Jan/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48261/ |
| Comment by Andreas Dilger [ 07/Feb/23 ] |
|
I've been bitten a few times recently by the landing of patch https://review.whamcloud.com/48261 "LU-16096 recovery: upgrade reply data after recovery finish" when switching between branches without reformatting the filesystem (with added debugging to show why the mount was failing):
[Tue Feb 7 15:26:54 2023] LustreError: 1230:0:(tgt_lastrcvd.c:2206:tgt_reply_data_init()) testfs-MDT0000: invalid reply_data header size: 64 != 32
[Tue Feb 7 15:26:54 2023] LustreError: 1230:0:(obd_config.c:776:class_setup()) setup testfs-MDT0000 failed (-22)
[Tue Feb 7 15:26:54 2023] LustreError: 1230:0:(obd_config.c:2024:class_config_llog_handler()) MGC192.168.10.99@tcp: cfg command failed: rc = -22
[Tue Feb 7 15:26:54 2023] Lustre: cmd=cf003 0:testfs-MDT0000 1:testfs-MDT0000_UUID 2:0 3:testfs-MDT0000-mdtlov 4:f
[Tue Feb 7 15:26:54 2023] LustreError: 15b-f: MGC192.168.10.99@tcp: Configuration from log testfs-MDT0000 failed from MGS -22. Check client and MGS are on compatible version.
[Tue Feb 7 15:26:54 2023] LustreError: 1173:0:(tgt_mount.c:1444:server_start_targets()) failed to start server testfs-MDT0000: -22
[Tue Feb 7 15:26:54 2023] LustreError: 1173:0:(tgt_mount.c:2081:server_fill_super()) Unable to start targets: -22
[Tue Feb 7 15:26:54 2023] LustreError: 1173:0:(obd_config.c:829:class_cleanup()) Device 5 not setup
[Tue Feb 7 15:26:55 2023] Lustre: server umount testfs-MDT0000 complete
It would be useful to backport a patch to b2_15 and b_es6_0 to allow mounting the filesystem with the new reply_data record size during recovery, if that is possible. It should certainly be possible if lrd_batch_idx == 0 is in the records (i.e. no clients doing WBC), which should at least be true for 2.16 clients. This would avoid lots of support problems in the field if the MDS is upgraded to 2.16+ and then downgraded because of problems with WBC or some other new feature.
I'm not sure whether it would be possible for a 2.15 server to finish recovery with actual WBC client records (seems unlikely), but that becomes less critical if at least the 2.15->2.16->2.15 upgrade/downgrade path is handled. It may be necessary to land support into 2.16 for the MDS to properly handle WBC record recovery, even if the WBC feature is not yet implemented there, again to allow upgrade/downgrade to work. At a minimum, the 2.16 MDS should be able to ignore such records (e.g. with "abort_recov_mdt") if it doesn't understand the format of the record, so that it isn't necessary for the user to manually mount the MDT and truncate reply_data to recover from this problem. |
| Comment by Andreas Dilger [ 08/Feb/23 ] |
|
Yingjin, I guess the separate question is whether 2.16 with batched statahead actually needs the larger reply_data format with lrd_batch_idx? If not, then we should strongly consider disabling the automatic update of reply_data to the new format in 2.16, and then only enable it in 2.17 when WBC is actually using it. That would allow 2.16 to be able to downgrade from 2.17+ (and do batch RPC recovery) without causing unnecessary incompatibility. I think we would still need some basic interop in 2.15.x to handle the larger record size, but if batched statahead isn't using lrd_batch_idx it should be quite simple (allow reading the larger records, then truncating the reply_data file and reverting to the V1 record size, or staying with the same record size if that is simpler). |
| Comment by Gerrit Updater [ 08/Feb/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49939 |
| Comment by Andreas Dilger [ 24/Feb/23 ] |
|
Hi Yingjin, could you please look into this. |
| Comment by Qian Yingjin [ 27/Feb/23 ] |
|
Hi Andreas,
This requires patching 2.15.x to recognize the reply_data V2 format and then convert the new-format data into the old V1 format. (Please note that we cannot simply truncate the new-format data; we must convert it first, because the record sizes are different.) |
| Comment by Andreas Dilger [ 27/Feb/23 ] |
|
The reason this is important is to allow downgrade from a newer version of Lustre to 2.15. I think if batched statahead does not require the use of the new reply data format, then it would make sense to allow the new format to be *read* but not actually do the upgrade until the version that requires it to be enabled (I guess when actual WBC is enabled). For master, I think that just means disabling the reply_data upgrade, and then re-enabling it after the statahead patches land and b2_16 is branched. This will at least allow downgrade from 2.16 to 2.15 (without reply_data upgrade), and from 2.17+ to 2.16, but downgrading from 2.17+ to 2.15.2 would not be possible. Separately, I think it would be less complex to patch the older maintenance branches to understand the new format, but not to include the code that does the upgrade. I don't think it would be hard to read the new format and ignore the added fields. |
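Reading a new-format record and ignoring the added fields could look roughly like the sketch below. It is a userspace illustration with `uint64_t`/`uint32_t` stand-ins for the kernel types; `reply_data_v2_to_v1()` is a hypothetical helper, and refusing records with a nonzero lrd_batch_idx is an assumption based on the earlier remark that downgrade should certainly be possible when lrd_batch_idx == 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old 32-byte record. */
struct lsd_reply_data_v1 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
};

/* New 64-byte record. */
struct lsd_reply_data_v2 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
	uint32_t lrd_batch_idx;
	uint32_t lrd_padding[7];
};

/*
 * Copy the fields shared by both formats and drop the added ones.
 * Returns 0 on success, or -1 if the record carries a batch index
 * that the old format cannot represent.
 */
static int reply_data_v2_to_v1(const struct lsd_reply_data_v2 *src,
			       struct lsd_reply_data_v1 *dst)
{
	assert(src != NULL && dst != NULL);

	if (src->lrd_batch_idx != 0)
		return -1;

	dst->lrd_transno = src->lrd_transno;
	dst->lrd_xid = src->lrd_xid;
	dst->lrd_data = src->lrd_data;
	dst->lrd_result = src->lrd_result;
	dst->lrd_client_gen = src->lrd_client_gen;
	return 0;
}
```

How a server should treat a rejected record (abort recovery for that client, or skip the record) is a policy question this sketch leaves open.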
| Comment by Qian Yingjin [ 27/Feb/23 ] |
|
Please note that in the master branch (b2_16) we write the reply_data in the new format (in @tgt_reply_data_write), and the record size written by tgt_reply_data_write is enlarged in the current master branch... Thus I think the better solution here may be to leave the current master unchanged, but add downgrade support for switching from master to 2.15:
|
| Comment by Andreas Dilger [ 27/Feb/23 ] |
|
Since master does not actually need the lrd_batch_idx field for statahead, it makes sense to me that the lsd_reply_data_v2 format only be enabled after the 2.16 release is made. That does mean that master/2.16 would be writing the lsd_reply_data_v1 format for now, but able to mount a filesystem that has lsd_reply_data_v2 records (based on the magic). |
| Comment by Gerrit Updater [ 14/Apr/23 ] |
|
"Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50636 |
| Comment by Gerrit Updater [ 18/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49939/ |
| Comment by James A Simmons [ 01/Jun/23 ] |
|
Is this work complete? |
| Comment by Andreas Dilger [ 20/Jun/23 ] |
|
James,
LustreError: 74648:0:(tgt_lastrcvd.c:2070:tgt_reply_data_init()) lustre-MDT0000: invalid header in reply_data
LustreError: 74648:0:(obd_config.c:774:class_setup()) setup lustre-MDT0000 failed (-22)
LustreError: 74648:0:(obd_config.c:2029:class_config_llog_handler()) MGC10.240.26.9@tcp: cfg command failed: rc = -22
Lustre: cmd=cf003 0:lustre-MDT0000 1:lustre-MDT0000_UUID 2:0 3:lustre-MDT0000-mdtlov 4:f
LustreError: 15b-f: MGC10.240.26.9@tcp: Configuration from log lustre-MDT0000 failed from MGS -22. Check client and MGS are on compatible version.
LustreError: 74442:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server lustre-MDT0000: -22
LustreError: 74442:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -22
LustreError: 74442:0:(obd_config.c:827:class_cleanup()) Device 5 not setup
LustreError: 74521:0:(ldlm_lockd.c:2500:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1686822829 with bad export cookie 6862479036220380163
LustreError: 166-1: MGC10.240.26.9@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Lustre: server umount lustre-MDT0000 complete
LustreError: 74442:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -22
The patch https://review.whamcloud.com/50636 "LU-16096 target: use lsd_reply_data_v1 format by default" still needs to land so that upgrade/downgrade continues to work. |