Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.3.0, Lustre 2.4.0
    • Lustre 2.3.0, Lustre 2.1.3, Lustre 2.1.6
    • b2_1 g636ddbf
    • 3
    • 4236

    Description

      I have a smallish filesystem to which I only allocated a 5GB MDT since the overall dataset was always intended to be very small. This filesystem is simply being used to add and remove files in a loop with something along the lines of:

      while true; do
          cp -a /lib /mnt/lustre/foo
          rm -rf /mnt/lustre/foo
      done
      

      It seems in doing this I have filled up my MDT with an "oi.16" file that is now 94% of the space of the MDT:

      # stat /mnt/lustre/mdt/oi.16 
        File: `/mnt/lustre/mdt/oi.16'
        Size: 4733702144	Blocks: 9254568    IO Block: 4096   regular file
      Device: fd05h/64773d	Inode: 13          Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2012-05-27 11:55:00.175323551 +0000
      Modify: 2012-05-27 11:55:00.175323551 +0000
      Change: 2012-05-27 11:55:00.175323551 +0000
      
      # df -k /mnt/lustre/mdt/
      Filesystem           1K-blocks      Used Available Use% Mounted on
      /dev/mapper/LustreVG-mdt0
                             5240128   5240128         0 100% /mnt/lustre/mdt
      
      # ls -ls /mnt/lustre/mdt/oi.16 
      4627284 -rw-r--r-- 1 root root 4733702144 May 27 11:55 /mnt/lustre/mdt/oi.16
      

      It seems the OI is leaking and not being reaped when files are removed.

        Activity

          [LU-1512] OI leaks

          brian Brian Murrell (Inactive) added a comment - I notice this is fixed for 2.3 and 2.4. Will anything be done for 2.1.x?

          pjones Peter Jones added a comment - Landed for 2.3 and 2.4

          adilger Andreas Dilger added a comment - I think the important thing is that it improves the update performance, and only hurts lookup performance for lookup by FID for objects that are not in cache. This should be only a very small fraction of operations.

          yong.fan nasf (Inactive) added a comment - OK, that means lookup-by-FID will check the new OI file first and, on a miss, then check the old OI file. It improves update performance at the cost of some lookup performance.

          Oleg, what's your suggestion? If you do not object, I will start the backport.

          adilger Andreas Dilger added a comment - For backup OI, I think it makes more sense to do the opposite: update only the new OI, and leave the old OI as only the backup. For creates, only add newly created FIDs into the new OI. For normal lookup by name, the existing OI rebuild will add the FID into the new OI already. Only in the case of a by-FID lookup that misses in the new OI do we need to do a lookup in the backup OI. For unlinked files, only delete the FID from the new OI.

          If there is an old (invalid) lookup by FID for a deleted file that is missed in the new OI, but found in the old OI, there will still be a chance to return an error if the inode is not found.

          I think this will reduce the amount of updates to disk, with changes only being made to the new OI file, and the old OI file will not be modified.

          As for when to do this, I think the OI rebuild should be ported to b2_1 first (subject to pre-approval from Oleg), and the "backup OI" handling can be done in Phase IV, since this is largely only a performance/usability improvement after the base OI scrub is available.
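The scheme proposed in this comment can be sketched as a toy in-memory model. This is only an illustration, not Lustre code: the class name `BackupModeOI`, the use of plain dicts for the OI files and the inode table, and the check that the inode still records the same FID are all assumptions made for the example.

```python
import errno


class BackupModeOI:
    """Toy model: dicts stand in for the on-disk OI files and the inode table."""

    def __init__(self, old_oi, inodes):
        self.new_oi = {}            # FID -> inode number; the only OI that is written
        self.old_oi = dict(old_oi)  # backup OI; never modified after this point
        self.inodes = dict(inodes)  # inode number -> FID recorded in the inode

    def create(self, fid, ino):
        # Newly created FIDs go into the new OI only.
        self.inodes[ino] = fid
        self.new_oi[fid] = ino

    def unlink(self, fid):
        # Delete from the new OI only; a stale entry may remain in the backup.
        ino = self.new_oi.pop(fid, None)
        if ino is None:
            ino = self.old_oi.get(fid)  # read-only use of the backup
        if ino is not None and self.inodes.get(ino) == fid:
            del self.inodes[ino]

    def lookup(self, fid):
        # Try the new OI first; fall back to the backup OI only on a miss.
        ino = self.new_oi.get(fid)
        if ino is None:
            ino = self.old_oi.get(fid)
        # A stale backup entry is caught here because its inode is gone
        # (or, by assumption, now records a different FID).
        if ino is None or self.inodes.get(ino) != fid:
            return -errno.ENOENT
        return ino
```

In this model every write lands in `new_oi`, so the backup is touched only by by-FID lookups that miss in the new OI, which is the trade-off the comment argues for.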

          yong.fan nasf (Inactive) added a comment - Then the current idea for "backup mode" OI scrub would be as follows:

          For create: it will insert the OI mapping into the old OI file first. If the target inode number is ahead of the OI scrub's current position, then OI scrub can add the mapping to the new OI file as well; otherwise the OI mapping should be inserted into the new OI file by the creator.

          For unlink: it will delete the OI mapping from the new OI file first (if it is there).

          For lookup: it will check the old OI file only. If there is no related OI mapping, then return -ENOENT; if the related OI mapping is found but the related inode fails to load, then return -EIO; if the related OI mapping is found but the loaded inode is not the expected one, then return -ENOENT.

          When should we do that? Now or LFSCK phase IV?
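The three lookup outcomes listed in this comment can be written down as a small decision function. This is a sketch under stated assumptions, not the osd implementation: `backup_mode_lookup` and the `load_inode` callback are hypothetical names, the old OI file is modelled as a dict, and "not the expected one" is interpreted as the inode recording a different FID.

```python
import errno


def backup_mode_lookup(fid, old_oi, load_inode):
    """Return the inode number for fid, or a negative errno value.

    old_oi     -- dict mapping FID -> inode number (stands in for the old OI file)
    load_inode -- hypothetical callback: returns the FID recorded in the inode,
                  or raises OSError if the inode cannot be loaded
    """
    ino = old_oi.get(fid)
    if ino is None:
        return -errno.ENOENT      # no related OI mapping
    try:
        inode_fid = load_inode(ino)
    except OSError:
        return -errno.EIO         # mapping found, but the inode failed to load
    if inode_fid != fid:
        return -errno.ENOENT      # mapping found, but the inode is not the expected one
    return ino
```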

          adilger Andreas Dilger added a comment - I see two options in that case. It would be possible to also delete FID entries from the backup OI, but this would hurt performance during OI scrub. It would instead be possible to detect this (hopefully rare) error during lookup, where a FID entry exists in the backup OI, but the inode is deleted or does not have a matching LMA FID, and return ENOENT or ESTALE as it would if no such entry existed in the first place.

          Since the FID entry would have been lost anyway during OI rebuild, this by-FID lookup is just a rare race condition that only happens during scrub.

          yong.fan nasf (Inactive) added a comment -

          > For unlinked files, there is only a need to delete the FID from the new OI file (if it is there yet). The old FID should no longer be referenced by any files, so there is no harm to leave it in the old OI file I think?

          It is not so simple. If we only delete the OI mapping from the new OI file and leave it in the old OI file, then what will happen if someone does a lookup-by-FID after the unlink operation? He/she will find the stale OI mapping in the old OI file, but the related object does not exist. In such a case, it is not easy to distinguish whether this is the normal case or the abnormal case of an object lost because of disk issues or system errors.

          adilger Andreas Dilger added a comment - The issue isn't about recovering from a crash, but rather if this is a "garbage collection" action that needs to be done on a regular basis, but the only way to do it is by deleting the OI file(s) and running an urgent scan, this will have serious performance impact, and block threads that are doing by-FID lookups.

          My goal is to allow this "maintenance" action to be done without significant performance impact or delay. I agree that this would be more complex, but I don't know how much more. If we always create a new OI file when running LFSCK, it will also solve the problem of stale FID entries in the OI file. But to do this, it is better to do it at the "background scrub" speed, and allow the cases of lookup-by-FID not being found in the new OI file to be handled from the backup OI file.

          For unlinked files, there is only a need to delete the FID from the new OI file (if it is there yet). The old FID should no longer be referenced by any files, so there is no harm to leave it in the old OI file I think?

          yong.fan nasf (Inactive) added a comment - The "backup OI" mode for OI scrub to rebuild the OI file will introduce more complexity. Because there may be concurrent create/unlink operations during the OI scrub, it needs to process both the old OI file and the new OI file and somehow keep them consistent, which will change the normal logic for lookup/create/unlink. Such changes may introduce race bugs.

          In fact, we do not care about a system crash during OI scrub, because we support resuming OI scrub from the breakpoint. We can guarantee that the OI file is eventually rebuilt correctly, even if the system crashes many times.

          Oleg, what is your suggestion for back-porting OI scrub to Lustre-2.1.x?
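The "resume from the breakpoint" argument can be illustrated with a minimal resumable scan. This is only a sketch: the function name `scrub`, the `checkpoint` dict standing in for persisted scrub state, and the `crash_after` parameter used to simulate a crash are all assumptions for the example, and the way Lustre actually records its scrub position is not described in this ticket.

```python
def scrub(inodes, new_oi, checkpoint, crash_after=None):
    """Rebuild new_oi by scanning inode numbers in ascending order.

    inodes      -- dict mapping inode number -> FID recorded in the inode
    new_oi      -- dict mapping FID -> inode number, filled in by the scan
    checkpoint  -- dict with key "pos": first inode number not yet processed
                   (stands in for scrub state saved on disk)
    crash_after -- if set, stop after this many inodes to simulate a crash

    Returns True when the scan completed, False when it was interrupted.
    """
    processed = 0
    for ino in sorted(i for i in inodes if i >= checkpoint["pos"]):
        if crash_after is not None and processed == crash_after:
            return False                   # simulated crash; checkpoint survives
        new_oi[inodes[ino]] = ino          # re-inserting the same mapping is harmless
        checkpoint["pos"] = ino + 1        # record progress after each inode
        processed += 1
    return True
```

Because each insertion is idempotent and the position only moves forward after the mapping is written, rerunning `scrub` after any number of interruptions ends with the same rebuilt mapping.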

          adilger Andreas Dilger added a comment - My preferred path would be for OI scrub to be backported to 2.1. This would allow fixing this issue (though not in an ideal manner, currently), and also improve maintenance/support for 2.1 itself (allowing recovery from all sorts of OI corruption, backup/restore, etc.).

          First, however, please ask Oleg if he would also be in favour of landing this code onto b2_1 as well. It is rather large for a maintenance release, though it could be argued for the above reasons that this is really necessary for making 2.1 more supportable in the future.

          The one reason I say that this doesn't really resolve the OI size problem very well is that it requires manually deleting the OI file(s), then running OI scrub in urgent mode, which will block threads if they cannot find the FID they are looking for, and cause high load on the MDS.

          It would be better to have some kind of "backup OI" mode where the new OI file is created, while the old one is used to find any missing FID. If the old OI file were kept around, this would also help during OI scrub in case the primary were lost or corrupted. Only in the case of backup/restore, where the old file is useless, would it make sense to delete it right away.

          People

            yong.fan nasf (Inactive)
            brian Brian Murrell (Inactive)
            Votes: 0
            Watchers: 15