LU-5726: MDS buffer not freed when deleting files

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.7.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.4.3
    • Labels: None
    • Environment: CentOS 6.5, kernel 2.6.32-358.23.2
    • Severity: 3
    • 16083

    Description

      When deleting large numbers of files, memory usage on the MDS server grows significantly. Attempts to reclaim memory by dropping caches free only some of the memory. The buffer usage continues to grow until eventually the MDS server starts OOMing.

      The rate at which the buffer usage grows seems to vary, but it appears to depend on the number of clients deleting files and the speed at which the files are deleted.
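
      As a reference point, here is a small stand-alone C sketch (hypothetical, not part of this ticket) that prints the Buffers and Inactive(file) counters from /proc/meminfo; these are the same counters discussed in the comments below and captured in the meminfo.before/meminfo.after attachments, so sampling them periodically shows the growth while clients delete files.

      #include <stdio.h>
      #include <string.h>

      /* Print the Buffers and Inactive(file) lines from /proc/meminfo so their
       * growth can be sampled while files are being deleted. */
      int main(void)
      {
              FILE *f = fopen("/proc/meminfo", "r");
              char line[256];

              if (f == NULL)
                      return 1;
              while (fgets(line, sizeof(line), f) != NULL)
                      if (strncmp(line, "Buffers:", 8) == 0 ||
                          strncmp(line, "Inactive(file):", 15) == 0)
                              fputs(line, stdout);
              fclose(f);
              return 0;
      }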

      Attachments

        1. lustre-debug-malloc.gz
          0.2 kB
        2. mds-crash-log-20140913
          47 kB
        3. meminfo.after
          1 kB
        4. meminfo.before
          1 kB
        5. slabinfo.after
          26 kB
        6. slabinfo.before
          26 kB


          Activity

            rmohr Rick Mohr added a comment -

            In response to Andreas' question:

            dumpe2fs 1.42.12.wc1 (15-Sep-2014)
            Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink extra_isize quota
            Journal features: journal_incompat_revoke

            Our file system has 90 OSTs.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13452/
            Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ffd42ff529f5823b5a04529e1db2ea3b32a9f59f

            niu Niu Yawei (Inactive) added a comment - Port to b2_5: http://review.whamcloud.com/13464

            gerrit Gerrit Updater added a comment -

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13464
            Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: 516a0cf6020fa169b0890ba6a51dc8295c1a44cd


            niu Niu Yawei (Inactive) added a comment -

            Andreas, ea_inode/large_xattr isn't enabled in my testing, but I also observed the "growing buffers" problem. I think this bug will be triggered as long as the inode has ea_in_inode.

            int
            ldiskfs_xattr_delete_inode(handle_t *handle, struct inode *inode,
                                    struct ldiskfs_xattr_ino_array **lea_ino_array)
            {
                    struct buffer_head *bh = NULL;
                    struct ldiskfs_xattr_ibody_header *header;
                    struct ldiskfs_inode *raw_inode;
                    struct ldiskfs_iloc iloc;
                    struct ldiskfs_xattr_entry *entry;
                    int error = 0;
            
                    if (!ldiskfs_test_inode_state(inode, LDISKFS_STATE_XATTR))
                            goto delete_external_ea;
            
                    error = ldiskfs_get_inode_loc(inode, &iloc);
            

            As long as LDISKFS_STATE_XATTR is set on the inode, the function grabs the bh.
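
            To make the leak concrete, here is a minimal, illustrative sketch of the missing cleanup (this is not the merged patch; see http://review.whamcloud.com/13452 for the authoritative change): the reference that ldiskfs_get_inode_loc() takes on iloc.bh has to be dropped with brelse() once the in-inode EAs have been handled.

                    /* Illustrative sketch only; exact placement differs in the
                     * real fix.  Without this brelse() the buffer head pinned by
                     * ldiskfs_get_inode_loc() is leaked on every unlink, which is
                     * what makes the MDS buffer usage grow. */
                    brelse(iloc.bh);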


            haisong Haisong Cai (Inactive) added a comment -

            We are running 2.4.3 and 2.5.3 with default MDT settings, so ea_inode is not enabled (here is the output from one of our MDTs):

            [root@puma-mds-10-5 ~]# dumpe2fs -h /dev/md0 | grep features
            dumpe2fs 1.42.7.wc1 (12-Apr-2013)
            Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery mmp flex_bg sparse_super large_file huge_file uninit_bg dir_nlink quota
            Journal features: journal_incompat_revoke

            In addition, all of our filesystems that hit this bug have fewer than 160 OSTs.

            Haisong


            adilger Andreas Dilger added a comment -

            Niu, Lai, excellent work finding and fixing this bug.

            A question for the users hitting this problem - is the ea_inode (also named large_xattr) feature enabled on the MDT filesystem? Running dumpe2fs -h /dev/{mdtdev} | grep features on the MDT device would list ea_inode in the Filesystem features: output. This feature is needed if there are more than 160 OSTs in the filesystem, or if many and/or large xattrs are being stored (e.g. lots of ACLs, user xattrs, etc).

            While I hope that is the case and we can close this bug, if the ea_inode feature is not enabled on your MDT, then this patch is unlikely to solve your problem.

            niu Niu Yawei (Inactive) added a comment - patch to master: http://review.whamcloud.com/13452

            gerrit Gerrit Updater added a comment -

            Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13452
            Subject: LU-5726 ldiskfs: missed brelse() in large EA patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1eb46ffbec85016db1054594094abde6d09a3616


            niu Niu Yawei (Inactive) added a comment -

            After quite a lot of testing and debugging with Lai, we found that a brelse() is missed in the ldiskfs large EA patch; I'll post a patch soon.


            minyard Tommy Minyard added a comment -

            We've been watching this ticket at TACC because we've noticed similar behavior with our Lustre 2.5.2 MDS for our /scratch filesystem, where we have to perform occasional purges. We have also had it crash with what looks like an OOM condition, especially after running a purge that removed millions of files. I mentioned it to Peter Jones during a call yesterday and he may have relayed some additional details.

            We took the opportunity during our maintenance on Tuesday to try a few things and have some additional information that might help track down this issue. From what we found, it appears that something in the kernel is not allowing the Inactive(file) portion of memory to be released and reused when needed, which is what the kernel should do. Before we did anything to the MDS during the maintenance, we had 95GB in Buffers (according to /proc/meminfo; the MDS box has 128GB total) but also 94GB in Inactive(file). To see if the buffer cache could be released, we set vm.drop_caches=3; it released some cached file memory, but it did not release the buffer memory like it usually does. We then unmounted the MDT and removed the Lustre modules, and the Buffers value dropped very low, but there was still 94GB in Inactive(file). We then ran some programs that would use the memory, but none of them could ever get back any of the 94GB held in Inactive(file). The only way we found to recover that memory was to reboot the server. So even though the usage shows up in Buffers, it seems that Inactive(file) is the portion the kernel can't recover after many files have been removed. Not sure if you have noticed the same behavior, but we thought this might help in tracking down the issue.

            We're running some tests on another testbed filesystem, so if there is additional information you would like, let us know. We definitely need to get this resolved, as it currently requires us to reboot the MDS after every purge to keep it from running out of memory.


            People

              Assignee: Niu Yawei (Inactive)
              Reporter: Rick Mohr
              Votes: 0
              Watchers: 22
