Lustre / LU-3338

IOC_MDC_GETFILESTRIPE can abuse vmalloc()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.6.0
    • Affects Version: Lustre 2.4.0
    • Environment: x86_64 client

    Description

      We've observed that when using Robinhood to scan the Sequoia filesystem, Lustre 2.3.65 (and earlier) clients thrash in vmalloc(). The issue is caused by Robinhood repeatedly calling IOC_MDC_GETFILESTRIPE to get striping information for the files it is scanning.

      Normally this wouldn't be an issue, but Sequoia's filesystem has a large number of OSTs (768), and we always allocate space in the reply buffer for the maximum number of OSTs. As a result, all of the reply buffers end up getting vmalloc()'ed instead of kmalloc()'ed by OBD_ALLOC_LARGE().

      It's worth noting that we see this behavior even though we're running with the fix for the recent kernel vmalloc() regression. Going forward we should keep this in mind and never ever use vmalloc() if we can at all avoid it. It's not just slow, it's downright dangerous to use in any concurrent context.

      I'll push a fix for this issue shortly for review.


          Activity

            bogl Bob Glossman (Inactive) added a comment - backport to b2_4: http://review.whamcloud.com/10303
            pjones Peter Jones added a comment -

            Landed for 2.6

            pjones Peter Jones added a comment -

            Lai

            Could you please look at Andreas's review feedback on this patch and see about revising it so that we can land it?

            Thanks

            Peter

            pjones Peter Jones added a comment -

            Mike

            Could you please look into this one?

            Thanks

            Peter


            adilger Andreas Dilger added a comment -

            Moving this off to 2.5.1 because the patch is consistently failing, and this is not a new problem but rather an optimization of existing code.

            behlendorf Brian Behlendorf added a comment -

            Thanks for the clarification, Andreas.

            Testing of the proposed patch on Friday showed there are still some cases, triggered by the Robinhood scan, which result in large allocations. I'm going to run the remaining cases down this week and refresh the patch with those additions. I'm not sure it's worth the extra complexity to send a minimum-sized buffer and add another flag to track it. I think we get most of what we need just by using the default, as long as we cap the default value at less than a page. But I'll play around with it a bit and propose another patch.

            adilger Andreas Dilger added a comment -

            To clarify, it is fine to start with the default LOV EA reply buffer for 2.4 clients, so long as the RPC layer will handle a larger reply if one is sent. This is needed for client compatibility with pre-2.4 MDSes.

            It would be possible to always allocate a minimum-sized reply buffer (e.g. 56 bytes == 1 stripe) for these RPCs on the assumption that the MDS will not send anything. If the MDS ever does send something in this buffer, the client can save a flag so that future RPCs get the default-sized reply buffer.

            adilger Andreas Dilger added a comment -

            With MDSes before 2.4, the unlink, rename, and close replies can result in the MDS inode dropping the last link on a file and the OST objects being destroyed. In pre-2.4 days the clients would do the OST object destroys, so they were sent the LOV EA in order to do this.

            With 2.4, the MDS does the OST object destroys, but only after the name unlink has committed locally. Without that, there is a chance that the just-unlinked name would survive an MDS crash while the OST objects had already been destroyed. While that isn't technically data loss (the user had already deleted the file), it can be confusing if they later find the partially-deleted file and it returns an error when accessed.

            behlendorf Brian Behlendorf added a comment -

            This whole issue goes deeper than I first thought. It seems many of the mdc requests (unlink, getattr, rename, close, ...) proactively allocate a reply buffer of cl_max_mds_easize, which may or may not end up being used. As described above, when the size of this reply buffer forces it to be vmalloc()'ed, you basically end up with a non-functional, very slow client which spins.

            Can someone explain to me why we need to add this reply buffer for all of these request types? For the moment I've updated the existing proposed patch to use cl_default_mds_easize for all these cases to avoid the issue.
            behlendorf Brian Behlendorf added a comment - http://review.whamcloud.com/6339

            People

              Assignee: laisiyao Lai Siyao
              Reporter: behlendorf Brian Behlendorf
              Votes: 0
              Watchers: 18
