Lustre / LU-3338

IOC_MDC_GETFILESTRIPE can abuse vmalloc()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.6.0
    • Affects Version: Lustre 2.4.0
    • Environment: x86_64 client

    Description

      We've observed that when using Robinhood to scan the Sequoia filesystem, Lustre 2.3.65 (and earlier) clients thrash in vmalloc(). The issue is caused by Robinhood repeatedly calling IOC_MDC_GETFILESTRIPE to get striping information for the files it is scanning.

      Normally this wouldn't be an issue, but Sequoia's filesystem has a large number of OSTs (768), and we always allocate space in the reply buffer for the maximum number of OSTs. As a result, all of the reply buffers end up getting vmalloc()'ed instead of kmalloc()'ed by OBD_ALLOC_LARGE().

      It's worth noting that we see this behavior even though we're running with the fix for the recent kernel vmalloc() regression. Going forward we should keep this in mind and never ever use vmalloc() if we can at all avoid it. It's not just slow, it's downright dangerous to use in any concurrent context.

      I'll push a fix for this issue shortly for review.


          Activity

            bogl Bob Glossman (Inactive) added a comment - backport to b2_4: http://review.whamcloud.com/10303
            pjones Peter Jones added a comment -

            Landed for 2.6

            pjones Peter Jones added a comment -

            Lai

            Could you please look at Andreas's review feedback on this patch and see about revising it so that we can land it?

            Thanks

            Peter

            pjones Peter Jones added a comment -

            Mike

            Could you please look into this one?

            Thanks

            Peter


            adilger Andreas Dilger added a comment -

            Moving this off to 2.5.1 because the patch is consistently failing, and this is not a new problem but rather an optimization of existing code.

            behlendorf Brian Behlendorf added a comment -

            Thanks for the clarification, Andreas.

            Testing of the proposed patch on Friday showed there are still some cases, triggered by the Robinhood scan, which result in large allocations. I'm going to run the remaining cases down this week and refresh the patch with those additions. I'm not sure it's worth the extra complexity to send a minimum-sized buffer and add another flag to track it. I think we get most of what we need just by using the default, as long as we cap the default value at less than a page. But I'll play around with it a bit and propose another patch.

            adilger Andreas Dilger added a comment -

            To clarify, it is fine to start with the default LOV EA reply buffer for 2.4 clients, so long as the RPC layer will handle a larger reply if one is sent. This is needed for client compatibility with pre-2.4 MDSes.

            It would be possible to always allocate a minimum-sized reply buffer (e.g. 56 bytes == 1 stripe) for these RPCs on the assumption that the MDS will not send anything. If the MDS ever does send something in this buffer, the client can save a flag so that future RPCs get the default-sized reply buffer.

            adilger Andreas Dilger added a comment -

            With MDSes before 2.4, the unlink, rename, and close replies can result in the MDS inode dropping the last link on a file and the OST objects being destroyed. In pre-2.4 days the clients would do the OST object destroys, so they were sent the LOV EA in order to do this.

            With 2.4, the MDS does the OST object destroys, but only after the name unlink has committed locally. Without that, there is a chance that the just-unlinked name would survive an MDS crash while the OST objects had already been destroyed. While that isn't technically data loss (the user had already deleted the file), it can be confusing if they later find the partially-deleted file and it returns an error when accessed.

            behlendorf Brian Behlendorf added a comment -

            This whole issue goes deeper than I first thought. It seems many of the mdc requests (unlink, getattr, rename, close, ...) proactively allocate a reply buffer of cl_max_mds_easize, which may or may not end up being used. As described above, when the size of this reply buffer forces it to be vmalloc()'ed, you basically end up with a non-functional, very slow client which spins.

            Can someone explain to me why we need to add this reply buffer for all of these request types? For the moment I've updated the existing proposed patch to use cl_default_mds_easize for all these cases to avoid the issue.
            behlendorf Brian Behlendorf added a comment - http://review.whamcloud.com/6339

            People

              Assignee: laisiyao Lai Siyao
              Reporter: behlendorf Brian Behlendorf
              Votes: 0
              Watchers: 18
