
Preserve inode number after MDT migration

Details

    • Improvement
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.8.0

    Description

      During migration, the MDT FID of the migrated file is changed to reflect the new MDT the inode is stored on. However, it would be possible to keep the user-visible inode constant after migration by storing the original FID into the LMA as a new field. If present, this saved FID could be used to generate the inode number for userspace instead of the current FID so that it doesn't affect user tools such as backups.
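The idea in the description can be sketched as follows. This is a minimal illustration, not the Lustre implementation: the `flatten` mapping is only an assumed stand-in for whatever function turns a FID into a 64-bit inode number, and `saved_original` is a hypothetical name for the proposed new LMA field. The FID values are the ones that appear in the transcript later in this ticket.

```python
from typing import NamedTuple, Optional


class Fid(NamedTuple):
    seq: int  # sequence number, e.g. 0x200001b72
    oid: int  # object id within the sequence, e.g. 0x10c07
    ver: int  # version, 0 in the examples in this ticket


def flatten(fid: Fid) -> int:
    """Illustrative FID -> 64-bit inode number mapping (an assumption)."""
    ino = (fid.seq << 24) + ((fid.seq >> 24) & 0xFFFFFF0000) + fid.oid
    ino &= 2**64 - 1  # keep the result within 64 bits
    return ino or fid.oid


def user_visible_ino(current: Fid, saved_original: Optional[Fid]) -> int:
    """Prefer the FID saved at migration time, if present, over the current FID."""
    return flatten(saved_original if saved_original is not None else current)


before = Fid(0x200001b72, 0x10c07, 0)  # FID before MDT migration
after = Fid(0x240001b70, 0x2743, 0)    # FID after MDT migration

ino_before = user_visible_ino(before, None)
ino_after_plain = user_visible_ino(after, None)     # changes with the FID
ino_after_saved = user_visible_ino(after, before)   # stays constant
```

With a saved original FID the inode number reported after migration matches the one reported before it, while without the saved FID it follows the new FID.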

          Activity

            [LU-7607] Preserve inode number after MDT migration

            I was looking at whether we could use the FID stored in the LOV EA to preserve the "original" inode number of the file. It seems that the FID is stored in the LOV EA in each component and is also preserved over OST and MDT migration:

            tests# lfs setstripe -E 1M -L mdt -E 1G -c 3 -E eof /mnt/testfs/dir1/tt
            tests# dd if=/dev/zero of=/mnt/testfs/dir1/tt bs=1M count=2
            tests# lfs path2fid /mnt/testfs/dir1/tt
            [0x200001b72:0x10c07:0x0]
            tests# lfs getstripe -v /mnt/testfs/dir1/tt
            components:
              - lcme_id:             1
                lcme_extent.e_start: 0
                lcme_extent.e_end:   1048576
                sub_layout:
                  lmm_seq:           0x200001b72
                  lmm_object_id:     0x10c07
                  lmm_fid:           [0x200001b72:0x10c07:0x0]
                  lmm_stripe_count:  1
              - lcme_id:             2
                lcme_extent.e_start: 1048576
                lcme_extent.e_end:   1073741824
                sub_layout:
                  lmm_seq:           0x200001b72
                  lmm_object_id:     0x10c07
                  lmm_fid:           [0x200001b72:0x10c07:0x0]
              - lcme_id:             3
                lcme_extent.e_start: 1073741824
                lcme_extent.e_end:   EOF
                sub_layout:
                  lmm_seq:           0x200001b72
                  lmm_object_id:     0x10c07
                  lmm_fid:           [0x200001b72:0x10c07:0x0]
            tests# lfs migrate -c 3 /mnt/testfs/dir1/tt
            tests# lfs getstripe -v /mnt/testfs/dir1/tt
            lmm_seq:           0x200001b72
            lmm_object_id:     0x10c07
            lmm_fid:           [0x200001b72:0x10c07:0x0]
            lmm_stripe_count:  3
            tests# lfs migrate -m1 /mnt/testfs/dir1
            tests# lfs path2fid /mnt/testfs/dir1/tt
            [0x240001b70:0x2743:0x0]
            tests# lfs getstripe -v /mnt/testfs/dir1/tt
            lmm_seq:           0x200001b72
            lmm_object_id:     0x10c07
            lmm_fid:           [0x200001b72:0x10c07:0x0]
            lmm_stripe_count:  3
            

            There is a bug (LU-13426) that clobbers the FID if the layout contains a DOM component, but DOM migration is relatively new and can be fixed to preserve the FID properly.
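The check done by hand in the transcript above could be scripted roughly like this. It is a sketch that assumes the text output of `lfs path2fid` and `lfs getstripe -v` looks exactly as shown in this comment; the function names are hypothetical. A layout hit by the DOM problem mentioned above would defeat this heuristic.

```python
import re

FID_RE = re.compile(r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]")


def parse_fid(text):
    """Return (seq, oid, ver) as ints from a '[0x...:0x...:0x...]' string."""
    m = FID_RE.search(text)
    if m is None:
        raise ValueError(f"no FID found in {text!r}")
    return tuple(int(part, 16) for part in m.groups())


def layout_fids(getstripe_output):
    """Collect the distinct lmm_fid values from `lfs getstripe -v` text output."""
    fids = []
    for line in getstripe_output.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "lmm_fid":
            fid = parse_fid(value)
            if fid not in fids:
                fids.append(fid)
    return fids


def was_mdt_migrated(path2fid_output, getstripe_output):
    """True if the current FID differs from the single FID recorded in the layout."""
    recorded = layout_fids(getstripe_output)
    if len(recorded) != 1:
        raise ValueError("expected exactly one distinct lmm_fid in the layout")
    return parse_fid(path2fid_output) != recorded[0]


# Output copied from the transcript above (after `lfs migrate -m1`).
sample = """\
lmm_seq:           0x200001b72
lmm_object_id:     0x10c07
lmm_fid:           [0x200001b72:0x10c07:0x0]
lmm_stripe_count:  3
"""
migrated = was_mdt_migrated("[0x240001b70:0x2743:0x0]", sample)
```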

            adilger Andreas Dilger added the comment above.
            laisiyao Lai Siyao added a comment -

            Andreas, okay.


            Lai, it definitely makes sense to have an option to migrate the parent directory and filenames without migrating the file inodes. In that case there is no need for this feature to preserve the inode numbers, since they won't change.

            This feature is only needed when the inode itself is moved to a new MDT, which may be required when removing an MDT or when an MDT is very full, but not in the normal space-balancing case.

            I think storing the original FID in the inode is not too hard, and it will always be unique. Adding the old and new FID in the OI table is also useful. We can't return -EREMOTE to NFS clients, but it can be handled by the Lustre client so that it doesn't return -ESTALE to NFS.

            adilger Andreas Dilger added the comment above.

            Ben, there is already a MIGRT ChangeLog record for inode migration.

            adilger Andreas Dilger added the comment above.

            Could we emit the changes to the changelog? If someone wants to keep track of FID/inode changes over time, they could set up a listener to catch and archive them.

            The various calls to find a FID, etc. could simply call up to the userspace service to get historical info.
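A userspace history service along these lines might look like the sketch below. It assumes some listener feeds it one (old FID, new FID) pair per migration; how those pairs are extracted from changelog records is not shown, and the class and method names are hypothetical.

```python
class FidHistory:
    """Userspace archive of FID changes, fed by a hypothetical changelog listener."""

    def __init__(self):
        self._next = {}  # old FID -> FID that replaced it
        self._prev = {}  # new FID -> FID it replaced

    def record_migration(self, old_fid, new_fid):
        """Store one migration event."""
        self._next[old_fid] = new_fid
        self._prev[new_fid] = old_fid

    def _follow(self, table, fid):
        # Walk the chain until no further mapping exists; guard against loops.
        seen = {fid}
        while fid in table:
            fid = table[fid]
            if fid in seen:
                raise ValueError("cycle in FID history")
            seen.add(fid)
        return fid

    def current(self, fid):
        """FID in use now for an object once known as `fid`."""
        return self._follow(self._next, fid)

    def original(self, fid):
        """Earliest known FID for the object now known as `fid`."""
        return self._follow(self._prev, fid)


history = FidHistory()
history.record_migration("[0x200001b72:0x10c07:0x0]", "[0x240001b70:0x2743:0x0]")
```

A file that has been migrated several times is resolved by following the chain, so a lookup by any historical FID ends at the same current FID.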

            bevans Ben Evans (Inactive) added the comment above.
            laisiyao Lai Siyao added a comment -

            IMO the overhead of preserving the inode number is quite high. Rather than saving the original FID, I'd prefer to add an option to leave file inodes untouched during migration: for existing sub-files, the inode is not migrated and only the namespace is updated.

            If we still prefer migrating the inode, I'd suggest dropping support for preserving the inode number and instead adding a FID mapping to support NFS export. A special OI file would be added that maps each original FID to its new FID, and lu_object_find() would look up this OI file. If a new FID is found, the server replies to the client with -EREMOTE, and the client resends the request with the new FID to the correct MDT. For some requests, such as rename, the client may need to retry several times if more than one of the involved files has been migrated. A garbage collection thread on the server would remove aged mappings from this OI file.
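The redirect-and-retry flow proposed above can be sketched as follows. This is a toy model under stated assumptions: a dictionary stands in for the special OI file, a single `objects` dictionary stands in for all MDTs (so routing the resend to the correct MDT is not modelled), and every name here is hypothetical.

```python
import errno
import time


class FidMap:
    """Stand-in for the proposed OI file: original FID -> (new FID, time added)."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self._map = {}

    def add(self, old_fid, new_fid, now=None):
        self._map[old_fid] = (new_fid, time.time() if now is None else now)

    def lookup(self, fid):
        entry = self._map.get(fid)
        return entry[0] if entry else None

    def collect_garbage(self, now=None):
        """Drop mappings older than max_age and return how many were removed."""
        now = time.time() if now is None else now
        stale = [f for f, (_, added) in self._map.items() if now - added > self.max_age]
        for f in stale:
            del self._map[f]
        return len(stale)


def server_find(objects, fid_map, fid):
    """Return (0, object), (-EREMOTE, new_fid) for a redirect, or (-ENOENT, None)."""
    if fid in objects:
        return 0, objects[fid]
    new_fid = fid_map.lookup(fid)
    if new_fid is not None:
        return -errno.EREMOTE, new_fid
    return -errno.ENOENT, None


def client_find(objects, fid_map, fid, max_retries=4):
    """Resend with the new FID on each -EREMOTE, up to max_retries attempts."""
    for _ in range(max_retries):
        rc, result = server_find(objects, fid_map, fid)
        if rc != -errno.EREMOTE:
            return rc, result
        fid = result
    return -errno.ESTALE, None
```

Once the garbage collector has removed a mapping, a lookup by the old FID fails outright, which is the trade-off of expiring mappings by age.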


            Lai, I recall not too long ago we discussed the ability to save the old FID after migration. Is there anything that needs to be updated in this ticket to describe your proposal?

            adilger Andreas Dilger added the comment above.

            I see, thanks. For stat() we only need to fill the original FID into the mdt_body of the getattr request, cache it in ll_inode_info, and use it to fill stat->ino. But readdir() reads from the directory entries directly, so if we add a check of the original ino in the LMA to that path, it might slow down readdir a lot.

            di.wang Di Wang (Inactive) added the comment above.

            The point of keeping a consistent inode number is that some tools, such as backups, depend on the inode number to remain the same so they can do incremental backups. Otherwise, they can't tell the difference between migrate changing the inode number, or the file being deleted and a new file created with the same name. NFS servers in userspace would also use the inode number.

            NFS file handles generated in the kernel by Lustre are the same, but since we encode the FID into the Lustre file handle this wouldn't help - we'd need to allow the original FID to be looked up on the original MDT with a redirection to the new FID.
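The backup problem described above can be illustrated with a small sketch. It assumes a tool that keeps a {path: inode number} snapshot per run, which is a simplification of how real backup tools work; the function name and the inode numbers in the usage example are made up.

```python
def classify(previous, current):
    """Compare two {path: inode_number} snapshots as an inode-keyed tool might."""
    result = {}
    for path, ino in current.items():
        if path not in previous:
            result[path] = "new"
        elif previous[path] == ino:
            result[path] = "same"
        else:
            # Same name, different inode number: the tool cannot tell a
            # migrated file from one that was deleted and recreated.
            result[path] = "replaced"
    for path in previous.keys() - current.keys():
        result[path] = "deleted"
    return result


last_run = {"dir1/tt": 1001, "dir1/old": 1002}
this_run = {"dir1/tt": 2001, "dir1/fresh": 2002}
report = classify(last_run, this_run)
```

Here `dir1/tt` is reported as replaced even if only its inode number changed, so an incremental backup would copy it again in full.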

            adilger Andreas Dilger added the comment above.

            Actually, I am a bit confused now. Since migration keeps the namespace consistent, either stat() or readdir() should return the real (correct) ino. Why should we keep the original ino (or FID) after migration? I am probably missing something. Could you please explain the purpose of keeping a consistent ino here? Why do these external tools need a consistent ino? Thanks.

            di.wang Di Wang (Inactive) added the comment above.

            For the HSM FID problem I think a different solution is needed. Instead of storing the FID in the archive, it is better to store the archive identifier (UUID or whatever) in the Lustre inode as a part of the composite layout. That allows storing multiple versions of the file in the archive, as well as allowing partial HSM file restore with composite files.

            adilger Andreas Dilger added the comment above.

            People

              laisiyao Lai Siyao
              adilger Andreas Dilger