LU-16096

recovery: handle compatibility during upgrade for new replay data format

Details

    • Improvement
    • Resolution: Fixed
    • Critical
    • Lustre 2.16.0
    • Lustre 2.16.0
    • 9223372036854775807

    Description

      The batched RPC protocol changes the on-disk format of the client reply data file "REPLY_DATA" used for recovery, so compatibility must be handled carefully when upgrading to the new reply data format.

      The new format is introduced in https://review.whamcloud.com/#/c/46799/.

      The new format is as follows (lines marked with "+" are the newly added fields):

      struct lsd_reply_data {
              __u64 lrd_transno;      /* transaction number */
              __u64 lrd_xid;          /* transmission id */
              __u64 lrd_data;         /* per-operation data */
              __u32 lrd_result;       /* request result */
              __u32 lrd_client_gen;   /* client generation */
      +       __u32 lrd_batch_idx;    /* sub request index in a batched RPC */
      +       __u32 lrd_padding[7];   /* unused fields */
      };
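The size change can be made concrete with a small sketch. It assumes the old record is simply the first five fields (32 bytes) and the new record is 64 bytes, as the field list above implies. The names lsd_reply_data_v1 and lsd_reply_data_v2 are taken from the comments on this ticket; the helper reply_data_v1_to_v2() is hypothetical, and zero-filling lrd_batch_idx for converted records is an assumption. Byte-order handling is omitted.

```c
#include <stdint.h>
#include <string.h>

/* Old on-disk record: the fields without the "+" marker above. */
struct lsd_reply_data_v1 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
};

/* New on-disk record: adds the batch index and padding. */
struct lsd_reply_data_v2 {
	uint64_t lrd_transno;
	uint64_t lrd_xid;
	uint64_t lrd_data;
	uint32_t lrd_result;
	uint32_t lrd_client_gen;
	uint32_t lrd_batch_idx;
	uint32_t lrd_padding[7];
};

/* The record size doubles, so every record offset in the file moves. */
_Static_assert(sizeof(struct lsd_reply_data_v1) == 32, "v1 record size");
_Static_assert(sizeof(struct lsd_reply_data_v2) == 64, "v2 record size");

/* Hypothetical helper: widen one old record into the new layout. */
static void reply_data_v1_to_v2(const struct lsd_reply_data_v1 *src,
				struct lsd_reply_data_v2 *dst)
{
	/* Zero the new fields first (assumed default for old records). */
	memset(dst, 0, sizeof(*dst));
	dst->lrd_transno = src->lrd_transno;
	dst->lrd_xid = src->lrd_xid;
	dst->lrd_data = src->lrd_data;
	dst->lrd_result = src->lrd_result;
	dst->lrd_client_gen = src->lrd_client_gen;
}
```

Because the per-record size changes from 32 to 64 bytes, an old file cannot be reinterpreted in place; each record has to be rewritten at its new offset.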
      

      The proposed solution is as follows:

      Add several flags in the magic number field of the reply data header:

      LRH_MAGIC_V1: 0xbdabda01 - the magic number of the old format for client reply data.

      LRH_MAGIC: 0xbdabda02 - the magic number of the new format for the client reply data.

      LRH_FLAG_BACKUP_DONE: 0x00000004 - indicates that the target has finished backing up the old-format "REPLY_DATA".

       

      During target setup, the reply data is initialized in @tgt_init()->tgt_reply_data_init():

      1. If "REPLY_DATA" is found to be in the old format (according to the magic number LRH_MAGIC_V1 in the reply data header), the target backs up the "REPLY_DATA" file to the file "REPLY_DATA_BAK".
      2. After the backup finishes, the target sets the magic number field of the reply data header to LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE and syncs the change to persistent storage.
      3. The target converts the old-format reply data from the backup file "REPLY_DATA_BAK" into the original reply data file "REPLY_DATA".
      4. After the conversion finishes, the target sets the magic number @lrh_magic in the reply data header to LRH_MAGIC and @lrh_reply_size to the new record size, and syncs the change to disk. It then deletes the backup file "REPLY_DATA_BAK".
      5. The target then starts recovery processing as normal with the new-format reply data.
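The steps above imply that the header magic alone tells the target where to resume after an interrupted upgrade. The following is a minimal sketch of that decision, not the actual tgt_reply_data_init() code: the magic and flag values come from this description, while the enum and the function reply_data_next_action() are hypothetical names introduced for illustration. Treating LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE as "resume conversion from the backup" is an inference from steps 2 to 4.

```c
#include <stdint.h>

#define LRH_MAGIC_V1		0xbdabda01u	/* old reply data format */
#define LRH_MAGIC		0xbdabda02u	/* new reply data format */
#define LRH_FLAG_BACKUP_DONE	0x00000004u	/* REPLY_DATA_BAK is complete */

/* Hypothetical action codes for the upgrade sequence. */
enum reply_data_action {
	RD_BACKUP_THEN_CONVERT,	/* steps 1-4: untouched old-format file */
	RD_RESUME_CONVERT,	/* steps 3-4: backup done, conversion not finished */
	RD_READY,		/* step 5: already in the new format */
	RD_INVALID,		/* unknown magic, refuse to set up the target */
};

/* Hypothetical helper: pick the next step from the on-disk header magic. */
static enum reply_data_action reply_data_next_action(uint32_t lrh_magic)
{
	if (lrh_magic == LRH_MAGIC)
		return RD_READY;
	if (lrh_magic == LRH_MAGIC_V1)
		return RD_BACKUP_THEN_CONVERT;
	if (lrh_magic == (LRH_MAGIC_V1 | LRH_FLAG_BACKUP_DONE))
		return RD_RESUME_CONVERT;
	return RD_INVALID;
}
```

Syncing the flag in step 2 before any record is rewritten is what makes the sequence restartable: once the flag is on disk, "REPLY_DATA" may be partially converted, but "REPLY_DATA_BAK" is known to be a complete copy of the old data.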

       

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50636/
            Subject: LU-16096 target: use lsd_reply_data_v1 format by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5321d510878f8893a49778025c6981e46a66cdff


            adilger Andreas Dilger added a comment -

            James,
            the lack of compatibility is causing the clean-downgrade and clean-downgrade-zfs tests to fail with "invalid header in reply_data":
            https://testing.whamcloud.com/test_sets/26ed0085-1e4b-4c6d-acc7-ca63a5775066

            LustreError: 74648:0:(tgt_lastrcvd.c:2070:tgt_reply_data_init()) lustre-MDT0000: invalid header in reply_data
            LustreError: 74648:0:(obd_config.c:774:class_setup()) setup lustre-MDT0000 failed (-22)
            LustreError: 74648:0:(obd_config.c:2029:class_config_llog_handler()) MGC10.240.26.9@tcp: cfg command failed: rc = -22
            Lustre:    cmd=cf003 0:lustre-MDT0000  1:lustre-MDT0000_UUID  2:0  3:lustre-MDT0000-mdtlov  4:f

            LustreError: 15b-f: MGC10.240.26.9@tcp: Configuration from log lustre-MDT0000 failed from MGS -22. Check client and MGS are on compatible version.
            LustreError: 74442:0:(obd_mount_server.c:1425:server_start_targets()) failed to start server lustre-MDT0000: -22
            LustreError: 74442:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start targets: -22
            LustreError: 74442:0:(obd_config.c:827:class_cleanup()) Device 5 not setup
            LustreError: 74521:0:(ldlm_lockd.c:2500:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1686822829 with bad export cookie 6862479036220380163
            LustreError: 166-1: MGC10.240.26.9@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
            Lustre: server umount lustre-MDT0000 complete
            LustreError: 74442:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -22

            The patch https://review.whamcloud.com/50636 "LU-16096 target: use lsd_reply_data_v1 format by default" still needs to land so that upgrade/downgrade continues to work.

            simmonsja James A Simmons added a comment -

            Is this work complete?

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49939/
            Subject: LU-16096 tgt: improve messages for reply_data
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b2f05051c4239e845434ea9e183d889e74a5db57


            "Qian Yingjin <qian@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50636
            Subject: LU-16096 target: use lsd_reply_data_v1 format by default
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4b4c1ea61bcc2029b4db9e4ca106d42eac2257a4


            adilger Andreas Dilger added a comment -

            Since master does not actually need the lrd_batch_idx field for statahead, it makes sense to me that the lsd_reply_data_v2 format only be enabled after the 2.16 release is made. That does mean that master/2.16 would be writing the lsd_reply_data_v1 format for now, but would still be able to mount a filesystem that has lsd_reply_data_v2 records (based on the magic).

            qian_wc Qian Yingjin added a comment -

            Please note that in the master branch (b2_16) we write reply_data in the new format (in @tgt_reply_data_write), and the record size written by tgt_reply_data_write is larger in the current master branch...
            This means that we must also upgrade reply_data, unless we patch master to write reply_data records in the old V1 format...

            Thus I think the better solution here may be to leave the current master unchanged, but add downgrade support for switching from master to 2.15:

            This requires patching 2.15.x to recognize the reply_data V2 format and then convert the new-format data into the old V1 format. (Please note that we cannot simply truncate the new-format data; we must convert it first, because the record sizes differ.)
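To illustrate why truncation is not enough, here is a rough sketch of the repacking such a downgrade would need. It assumes 32-byte V1 records and 64-byte V2 records that share the same leading fields, as in the struct shown in the description. The function reply_data_repack_v2_to_v1() and the size macros are hypothetical, and the reply data header is deliberately left out because its layout is not given in this ticket.

```c
#include <stddef.h>
#include <string.h>

#define REPLY_DATA_V1_SIZE 32	/* assumed old record size */
#define REPLY_DATA_V2_SIZE 64	/* assumed new record size */

/*
 * Hypothetical helper: repack `count` V2 records from `src` into V1
 * records in `dst`, keeping only the leading fields both formats share.
 * Returns the number of bytes written to `dst`.
 */
static size_t reply_data_repack_v2_to_v1(const unsigned char *src,
					 size_t count, unsigned char *dst)
{
	for (size_t i = 0; i < count; i++) {
		/* Record i starts at i * 64 in V2 but at i * 32 in V1. */
		memcpy(dst + i * REPLY_DATA_V1_SIZE,
		       src + i * REPLY_DATA_V2_SIZE,
		       REPLY_DATA_V1_SIZE);
	}
	return count * REPLY_DATA_V1_SIZE;
}
```

Truncating a V2 file to the V1 length would keep only the first half of the records, and a V1 reader stepping in 32-byte strides would misread the added fields of each V2 record as a separate record.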


            adilger Andreas Dilger added a comment -

            The reason this is important is to allow downgrade from a newer version of Lustre to 2.15.

            I think if batched statahead does not require the use of the new reply data format, then it would make sense to allow the new format to be *read* but not actually do the upgrade until the version that requires it to be enabled (I guess when actual WBC is enabled).

            For master, I think that just means disabling the reply_data upgrade, and then re-enabling it after the statahead patches land and b2_16 is branched. This will at least allow downgrade from 2.16 to 2.15 (without reply_data upgrade), and from 2.17+ to 2.16, but downgrading from 2.17+ to 2.15.2 would not be possible.

            Separately, I think it would be less complex to patch the older maintenance branches to understand the new format, without adding the code to do the upgrade. I don't think it would be hard to read the new format and ignore the added fields.

            qian_wc Qian Yingjin added a comment - - edited

            Hi Andreas,
            This only happens when an MDT server of a Lustre file system is downgraded from the latest master to b_es6_0 or b2_15.
            Batched statahead does not need the larger reply_data format with lrd_batch_idx.

            "we would still need some basic interop in 2.15.x to handle the larger record size"

            This requires patching 2.15.x to recognize the reply_data V2 format and then convert the new-format data into the old V1 format. (Please note that we cannot simply truncate the new-format data; we must convert it first, because the record sizes differ.)
            If this is acceptable, I will make a patch. Then the automatic upgrade of reply_data to the new format in 2.16 can be kept, I think.


            People

              qian_wc Qian Yingjin
              qian_wc Qian Yingjin