We've been watching this ticket at TACC, as we've noticed similar behavior with the Lustre 2.5.2 MDS for our /scratch filesystem, where we have to perform occasional purges. We have also had it crash with what looks like an OOM condition, especially after a purge that removed millions of files. I mentioned it to Peter Jones during a call yesterday, and he may have relayed some additional details.

We took the opportunity during our maintenance on Tuesday to try a few things, and we have some additional information that might help track down this issue. From what we found, it appears that something in the kernel is preventing the Inactive(file) portion of memory from being released and reused when needed, as the kernel normally would do.

Before we did anything to the MDS during the maintenance, we looked at memory usage and saw 95GB in Buffers (according to /proc/meminfo; the MDS box has 128GB total), with 94GB also counted as Inactive(file). To see whether this buffer cache could be released, we set vm.drop_caches=3; it released some cached file memory, but it did not release the buffer memory the way it usually does. We then unmounted the MDT and unloaded the Lustre modules, after which Buffers dropped to a very low value, yet 94GB was still reported as Inactive(file). We then ran some programs that would allocate memory, but none of them could ever reclaim any of the 94GB held in Inactive(file). The only way we found to recover that memory was to reboot the server.

So even though the usage shows up under Buffers, it is the Inactive(file) memory that the kernel can't seem to recover after many files have been removed. Not sure if you have noticed the same behavior, but we thought this might help in tracking down this issue.
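For reference, the checks described above amount to roughly the following (a sketch, not a transcript of our maintenance session; counter names are as they appear in /proc/meminfo on Linux, and the privileged commands are shown commented out):

```shell
# Snapshot the counters in question before and after a purge or unmount:
grep -E '^(MemTotal|Buffers|Cached|Inactive\(file\)):' /proc/meminfo

# Drop page cache, dentries, and inodes (requires root):
#   sysctl -w vm.drop_caches=3
# or equivalently:
#   echo 3 > /proc/sys/vm/drop_caches

# In our case Buffers only fell after unmounting the MDT and unloading
# the Lustre modules (mount point below is hypothetical):
#   umount /mnt/mdt
#   lustre_rmmod

# Re-check: Inactive(file) stayed at ~94GB regardless:
grep -E '^(Buffers|Inactive\(file\)):' /proc/meminfo
```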
We're running some tests on another testbed filesystem, so if there is any additional information you would like, let us know. We definitely need to get this resolved, as it currently requires us to reboot the MDS after every purge to prevent it from running out of memory.
Port to b2_5: http://review.whamcloud.com/13464