  1. Lustre
  2. LU-16096

recovery: handle compatibility during upgrade for new replay data format

Details

    • Improvement
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • Lustre 2.16.0

    Description

      The batched RPC protocol changes the on-disk format of the client reply data file "REPLY_DATA" used for recovery, so compatibility must be handled carefully during the upgrade to this new reply data format.

      The new format is introduced in https://review.whamcloud.com/#/c/46799/.

      The new format is as follows:

      struct lsd_reply_data
      { 
      __u64 lrd_transno; /* transaction number */
      __u64 lrd_xid; /* transmission id */
      __u64 lrd_data; /* per-operation data */
      __u32 lrd_result; /* request result */
      __u32 lrd_client_gen; /* client generation */
      +__u32 lrd_batch_idx; /* sub request index in a batched RPC */
      +__u32 lrd_padding[7]; /* unused fields. */ 
      };
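To make the size change concrete, here is a minimal user-space sketch of the two record layouts. It assumes `uint64_t`/`uint32_t` from `<stdint.h>` as stand-ins for the kernel's `__u64`/`__u32`; the struct names follow the `lsd_reply_data_v1`/`lsd_reply_data_v2` naming used later in this ticket. The 32 and 64 byte sizes match the "64 != 32" mismatch reported in the mount log further down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old (V1) record layout: the fields without a "+" in the listing above. */
struct lsd_reply_data_v1 {
	uint64_t lrd_transno;    /* transaction number */
	uint64_t lrd_xid;        /* transmission id */
	uint64_t lrd_data;       /* per-operation data */
	uint32_t lrd_result;     /* request result */
	uint32_t lrd_client_gen; /* client generation */
};

/* New (V2) record layout: V1 plus the batch index and padding. */
struct lsd_reply_data_v2 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
	uint32_t lrd_batch_idx;  /* sub request index in a batched RPC */
	uint32_t lrd_padding[7]; /* unused fields */
};

/* The V1 record is a strict prefix of the V2 record. */
static_assert(sizeof(struct lsd_reply_data_v1) == 32, "V1 record is 32 bytes");
static_assert(sizeof(struct lsd_reply_data_v2) == 64, "V2 record is 64 bytes");
static_assert(offsetof(struct lsd_reply_data_v2, lrd_batch_idx) ==
	      sizeof(struct lsd_reply_data_v1), "V2 extends V1 at the end");
```

Because the record size doubles, a reader that assumes the wrong stride will misinterpret every record after the first one.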
      

      The proposed solution is as follows:

      Add several flags in the magic number field of the reply data header:

      LRH_MAGIC_V1: 0xbdabda01 - the magic number of the old format for client reply data.

      LRH_MAGIC: 0xbdabda02 - the magic number of the new format for the client reply data.

      LRH_FLAG_BACKUP_DONE: 0x00000004 - indicates that the target has finished backing up the old-format "REPLY_DATA".
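As an illustration of how these three values can be told apart, the sketch below classifies a header magic. The constants are the ones listed above; the `enum lrh_state` type and the `lrh_classify()` helper are hypothetical names invented for this example and are not taken from the Lustre source.

```c
#include <assert.h>
#include <stdint.h>

#define LRH_MAGIC_V1         0xbdabda01u /* old-format magic */
#define LRH_MAGIC            0xbdabda02u /* new-format magic */
#define LRH_FLAG_BACKUP_DONE 0x00000004u /* backup of old REPLY_DATA finished */

/* Hypothetical classification of the header state. */
enum lrh_state {
	LRH_STATE_OLD,       /* old format, backup not yet done */
	LRH_STATE_BACKED_UP, /* old format, backup done, conversion pending */
	LRH_STATE_NEW,       /* new format */
	LRH_STATE_INVALID    /* unrecognized magic */
};

static enum lrh_state lrh_classify(uint32_t magic)
{
	if (magic == LRH_MAGIC_V1)
		return LRH_STATE_OLD;
	if (magic == (LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE))
		return LRH_STATE_BACKED_UP;
	if (magic == LRH_MAGIC)
		return LRH_STATE_NEW;
	return LRH_STATE_INVALID;
}
```

The flag bit 0x4 does not overlap the low bits that distinguish the two magics (0x01 and 0x02), so LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE evaluates to 0xbdabda05 and cannot be confused with either plain magic.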

       

      During target setup, the reply data is initialized in @tgt_init()->tgt_reply_data_init():

      1. If "REPLY_DATA" is found to be in the old format (according to the magic number in the reply data header, LRH_MAGIC_V1), the target backs up the "REPLY_DATA" file into the file "REPLY_DATA_BAK".
      2. After the backup finishes, the target sets the magic number field of the reply data header to LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE and syncs the change to persistent storage.
      3. The target converts the old-format reply data from the backup file "REPLY_DATA_BAK" into the original reply data file "REPLY_DATA".
      4. After the conversion finishes, the target sets the magic number @lrh_magic of the reply data header to LRH_MAGIC and @lrh_reply_size to the record size of the new format, and syncs the change to disk. It then deletes the backup file "REPLY_DATA_BAK".
      5. The target then starts recovery processing as normal with the new-format reply data.
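The steps above can be sketched as a small in-memory simulation. This is not the tgt_reply_data_init() implementation: `struct sim_file`, `struct sim_target`, `SIM_MAX_RECS` and `sim_reply_data_upgrade()` are hypothetical names, the "files" are plain buffers, and the sync-to-disk points are only marked by comments. One plausible reason for the backup file (an assumption, not stated in the ticket) is that new records are twice as large, so converting in place would overwrite old records that have not been read yet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LRH_MAGIC_V1         0xbdabda01u
#define LRH_MAGIC            0xbdabda02u
#define LRH_FLAG_BACKUP_DONE 0x00000004u
#define SIM_MAX_RECS         8 /* hypothetical capacity of a simulated file */

struct lsd_reply_data_v1 {
	uint64_t lrd_transno, lrd_xid, lrd_data;
	uint32_t lrd_result, lrd_client_gen;
};

struct lsd_reply_data_v2 {
	uint64_t lrd_transno, lrd_xid, lrd_data;
	uint32_t lrd_result, lrd_client_gen;
	uint32_t lrd_batch_idx;
	uint32_t lrd_padding[7];
};

/* Hypothetical in-memory stand-in for an on-disk reply data file. */
struct sim_file {
	int      exists;
	uint32_t magic;      /* plays the role of lrh_magic */
	uint32_t reply_size; /* plays the role of lrh_reply_size */
	uint32_t nrecs;
	unsigned char recs[SIM_MAX_RECS * sizeof(struct lsd_reply_data_v2)];
};

struct sim_target {
	struct sim_file reply_data; /* "REPLY_DATA" */
	struct sim_file backup;     /* "REPLY_DATA_BAK" */
};

/* Returns 0 on success, -1 if the files are in an unexpected state. */
static int sim_reply_data_upgrade(struct sim_target *t)
{
	struct sim_file *rd = &t->reply_data;
	struct sim_file *bak = &t->backup;
	uint32_t i;

	if (rd->magic == LRH_MAGIC)
		return 0; /* already in the new format */

	if (rd->magic == LRH_MAGIC_V1) {
		*bak = *rd; /* step 1: back up the old-format file */
		bak->exists = 1;
		/* step 2: mark the backup as complete (then sync) */
		rd->magic = LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE;
	}

	/* A restart after step 2 resumes here and converts from the backup. */
	if (rd->magic != (LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE) ||
	    !bak->exists || bak->nrecs > SIM_MAX_RECS)
		return -1;

	/* step 3: widen each V1 record from the backup into a V2 record */
	for (i = 0; i < bak->nrecs; i++) {
		struct lsd_reply_data_v2 v2;

		memset(&v2, 0, sizeof(v2)); /* lrd_batch_idx and padding = 0 */
		memcpy(&v2, bak->recs + i * sizeof(struct lsd_reply_data_v1),
		       sizeof(struct lsd_reply_data_v1));
		memcpy(rd->recs + i * sizeof(v2), &v2, sizeof(v2));
	}
	rd->nrecs = bak->nrecs;

	/* step 4: switch the header to the new format (then sync) ... */
	rd->reply_size = sizeof(struct lsd_reply_data_v2);
	rd->magic = LRH_MAGIC;
	/* ... and delete the backup */
	memset(bak, 0, sizeof(*bak));
	return 0;
}
```

Since the magic only changes after each phase has completed, an interrupted upgrade can be retried: the header state tells the target whether to redo the backup or to restart the conversion from the intact backup copy.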

       

          Activity

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49939/
            Subject: LU-16096 tgt: improve messages for reply_data
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b2f05051c4239e845434ea9e183d889e74a5db57

            gerrit Gerrit Updater added a comment -

            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50636
            Subject: LU-16096 target: use lsd_reply_data_v1 format by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4b4c1ea61bcc2029b4db9e4ca106d42eac2257a4

            adilger Andreas Dilger added a comment -

            Since master does not actually need the lrd_batch_idx field for statahead, it makes sense to me that the lsd_reply_data_v2 format only be enabled after the 2.16 release is made. That does mean that master/2.16 would be writing the lsd_reply_data_v1 format for now, but would be able to mount a filesystem that has lsd_reply_data_v2 records (based on the magic).

            qian_wc Qian Yingjin added a comment -

            Please note that in the master branch (b2_16) we write the reply_data in the new format (in @tgt_reply_data_write), and the record size written by tgt_reply_data_write is enlarged in the current master branch...
            This means that we must also upgrade reply_data unless we patch master to write reply_data records in the old V1 format...

            Thus I think the better solution here may be that we do not change the current master, but add downgrade support for switching from master to 2.15:

            This requires patching 2.15.x to recognize the reply_data V2 format and then convert the new-format data into the old V1 format. (Please note that we cannot simply truncate the new-format data; we must convert it first because the record sizes are different.)
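To illustrate why truncation alone is not enough, here is a minimal sketch of a V2 to V1 conversion over a buffer of records. It assumes the buffer holds only the record area (without the reply data header), and `reply_data_v2_to_v1()` is a hypothetical helper, not a function from the 2.15.x or master source. Truncating the file without this step would keep the 64-byte stride, so a V1 reader would treat the second half of every V2 record as a record of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct lsd_reply_data_v1 {
	uint64_t lrd_transno, lrd_xid, lrd_data;
	uint32_t lrd_result, lrd_client_gen;
};

struct lsd_reply_data_v2 {
	uint64_t lrd_transno, lrd_xid, lrd_data;
	uint32_t lrd_result, lrd_client_gen;
	uint32_t lrd_batch_idx;
	uint32_t lrd_padding[7];
};

/*
 * Compact an array of V2 records into V1 records in place and return the
 * new length in bytes, i.e. the size the record area should be truncated to.
 * Record i moves from offset i * 64 to offset i * 32; walking forward is
 * safe because each destination lies at or before its own source and
 * before the source of every later record.
 */
static size_t reply_data_v2_to_v1(unsigned char *recs, size_t len)
{
	const size_t v1 = sizeof(struct lsd_reply_data_v1);
	const size_t v2 = sizeof(struct lsd_reply_data_v2);
	size_t nrecs = len / v2;
	size_t i;

	for (i = 0; i < nrecs; i++)
		memmove(recs + i * v1, recs + i * v2, v1); /* keep the V1 prefix */

	return nrecs * v1;
}
```

Note that the conversion drops lrd_batch_idx, so it only preserves all information when that field is zero in every record.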

            adilger Andreas Dilger added a comment -

            The reason this is important is to allow downgrade from a newer version of Lustre to 2.15.

            I think if batched statahead does not require the use of the new replay data format, then it would make sense to allow the new format to be *read* but not actually do the upgrade until the version that requires it to be enabled (I guess when actual WBC is enabled).

            For master, I think that just means disabling the reply_data upgrade, and then re-enabling it after the statahead patches land and b2_16 is branched. This will at least allow downgrade from 2.16 to 2.15 (without reply_data upgrade), and from 2.17+ to 2.16, but downgrading from 2.17+ to 2.15.2 would not be possible.

            Separately, I think it would be less complex to patch the older maintenance branches to understand the new format but not to include the code to do the upgrade. I don't think it would be hard to read the new format and ignore the added fields.
            qian_wc Qian Yingjin added a comment - - edited

            Hi Andreas,
            This only happens when an MDT server of a Lustre file system is downgraded from the latest master to b_es6_0 or b2_15.
            Batched statahead does not need the larger reply_data format with lrd_batch_idx.

            "we would still need some basic interop in 2.15.x to handle the larger record size"

            This requires patching 2.15.x to recognize the reply_data V2 format and then convert the new-format data into the old V1 format. (Please note that we cannot simply truncate the new-format data; we must convert it first because the record sizes are different.)
            If this is acceptable, I will make a patch. Then the automatic update of reply_data to the new format in 2.16 can be kept, I think.

            adilger Andreas Dilger added a comment -

            Hi Yingjin, could you please look into this?

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49939
            Subject: LU-16096 tgt: improve messages for reply_data
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bdbc114438ae79e5e3f7fff30808c9a5f9096158

            adilger Andreas Dilger added a comment -

            Yingjin, I guess the separate question is whether 2.16 with batched statahead actually needs the larger reply_data format with lrd_batch_idx? If not, then we should strongly consider disabling the automatic update of reply_data to the new format in 2.16, and then only enable it in 2.17 when WBC is actually using it. That would allow 2.16 to be able to downgrade from 2.17+ (and do batch RPC recovery) without causing unnecessary incompatibility. I think we would still need some basic interop in 2.15.x to handle the larger record size, but if batched statahead isn't using lrd_batch_idx it should be quite simple (allow reading the larger records, then truncating the reply_data file and reverting to the V1 record size, or staying with the same record size if that is simpler).

            adilger Andreas Dilger added a comment - - edited

            I've been bitten a few times recently by the landing of patch https://review.whamcloud.com/48261 "LU-16096 recovery: upgrade reply data after recovery finish" when switching between branches without reformatting the filesystem (with added debugging to show why the mount was failing):

            [Tue Feb  7 15:26:54 2023] LustreError: 1230:0:(tgt_lastrcvd.c:2206:tgt_reply_data_init()) testfs-MDT0000: invalid reply_data header size: 64 != 32
            [Tue Feb  7 15:26:54 2023] LustreError: 1230:0:(obd_config.c:776:class_setup()) setup testfs-MDT0000 failed (-22)
            [Tue Feb  7 15:26:54 2023] LustreError: 1230:0:(obd_config.c:2024:class_config_llog_handler()) MGC192.168.10.99@tcp: cfg command failed: rc = -22
            [Tue Feb  7 15:26:54 2023] Lustre:    cmd=cf003 0:testfs-MDT0000  1:testfs-MDT0000_UUID  2:0  3:testfs-MDT0000-mdtlov  4:f  
            [Tue Feb  7 15:26:54 2023] LustreError: 15b-f: MGC192.168.10.99@tcp: Configuration from log testfs-MDT0000 failed from MGS -22. Check client and MGS are on compatible version.
            [Tue Feb  7 15:26:54 2023] LustreError: 1173:0:(tgt_mount.c:1444:server_start_targets()) failed to start server testfs-MDT0000: -22
            [Tue Feb  7 15:26:54 2023] LustreError: 1173:0:(tgt_mount.c:2081:server_fill_super()) Unable to start targets: -22
            [Tue Feb  7 15:26:54 2023] LustreError: 1173:0:(obd_config.c:829:class_cleanup()) Device 5 not setup
            [Tue Feb  7 15:26:55 2023] Lustre: server umount testfs-MDT0000 complete
            

            It would be useful to backport a patch to b2_15 and b_es6_0 to allow mounting the filesystem with the new reply_data record size during recovery, if that is possible. It should certainly be possible if lrd_batch_idx == 0 in all of the records (i.e. no clients doing WBC), which should at least be true for 2.16 clients. This would avoid lots of support problems in the field if the MDS is upgraded to 2.16+ and then downgraded because of problems with WBC or some other new feature.

            I'm not sure whether it would be possible for a 2.15 server to finish recovery with actual WBC client records (seems unlikely), but that becomes less critical if at least the 2.15->2.16->2.15 upgrade/downgrade path is handled. It may be necessary to land support into 2.16 for the MDS to properly handle WBC record recovery, even if the WBC feature is not yet implemented there, again to allow upgrade/downgrade to work.

            At a minimum, the 2.16 MDS should be able to ignore such records (e.g. with "abort_recov_mdt") if it doesn't understand the format of the record, so that it isn't necessary for the user to manually mount the MDT and truncate reply_data to recover from this problem.

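The mount failure in the log and the suggested relaxation can be contrasted in a short sketch. Both functions are hypothetical illustrations rather than the actual tgt_reply_data_init() code: the strict check rejects any record size other than the V1 size with -EINVAL (22 on Linux, matching the rc = -22 in the log), while the relaxed check additionally accepts V2-sized records as long as lrd_batch_idx is zero in every record, which is the condition named in the comment above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct lsd_reply_data_v1 {
	uint64_t lrd_transno, lrd_xid, lrd_data;
	uint32_t lrd_result, lrd_client_gen;
};

struct lsd_reply_data_v2 {
	uint64_t lrd_transno, lrd_xid, lrd_data;
	uint32_t lrd_result, lrd_client_gen;
	uint32_t lrd_batch_idx;
	uint32_t lrd_padding[7];
};

/* Strict check: only the V1 record size (32 bytes) is accepted. */
static int reply_size_check_strict(uint32_t on_disk_size)
{
	return on_disk_size == sizeof(struct lsd_reply_data_v1) ? 0 : -EINVAL;
}

/*
 * Relaxed check (a sketch of the proposed interop): also accept V2-sized
 * records, provided no record carries a non-zero batch index.
 */
static int reply_size_check_relaxed(uint32_t on_disk_size,
				    const unsigned char *recs, size_t nrecs)
{
	size_t i;

	if (on_disk_size == sizeof(struct lsd_reply_data_v1))
		return 0;
	if (on_disk_size != sizeof(struct lsd_reply_data_v2))
		return -EINVAL;

	for (i = 0; i < nrecs; i++) {
		struct lsd_reply_data_v2 rec;

		/* copy out to avoid unaligned access into the raw buffer */
		memcpy(&rec, recs + i * sizeof(rec), sizeof(rec));
		if (rec.lrd_batch_idx != 0)
			return -EINVAL; /* record written by a batched RPC */
	}
	return 0;
}
```

With a check along these lines, an older server could proceed with the first 32 bytes of each record and only refuse to mount when a record genuinely depends on the new field.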

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48261/
            Subject: LU-16096 recovery: upgrade reply data after recovery finish
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bbf0017fdea52f094c190f14fd82b9f5d0902c90

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48261/ Subject: LU-16096 recovery: upgrade reply data after recovery finish Project: fs/lustre-release Branch: master Current Patch Set: Commit: bbf0017fdea52f094c190f14fd82b9f5d0902c90

            People

              qian_wc Qian Yingjin
              qian_wc Qian Yingjin
              Votes: 0
              Watchers: 7