[LU-2432] ptlrpc_alloc_rqbd spinning on vmap_area_lock on MDS

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0

    Description

      vmalloc based allocations can potentially take a very long time to complete due to a regression in the kernel. As a result, I've seen our MDS "lock up" for certain periods of time while all of the cores spin on the vmap_area_lock down in ptlrpc_alloc_rqbd.

      For example:

          2012-11-01 11:34:28 Pid: 34505, comm: mdt02_051
          2012-11-01 11:34:28 
          2012-11-01 11:34:28 Call Trace:
          2012-11-01 11:34:28  [<ffffffff81273155>] ? rb_insert_color+0x125/0x160
          2012-11-01 11:34:28  [<ffffffff81149f1f>] ? __vmalloc_area_node+0x5f/0x190
          2012-11-01 11:34:28  [<ffffffff810609ea>] __cond_resched+0x2a/0x40
          2012-11-01 11:34:28  [<ffffffff814efa60>] _cond_resched+0x30/0x40
          2012-11-01 11:34:28  [<ffffffff8115fa88>] kmem_cache_alloc_node_notrace+0xa8/0x130
          2012-11-01 11:34:28  [<ffffffff8115fc8b>] __kmalloc_node+0x7b/0x100
          2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
          2012-11-01 11:34:28  [<ffffffff81149f1f>] __vmalloc_area_node+0x5f/0x190
          2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
          2012-11-01 11:34:28  [<ffffffff81149eb2>] __vmalloc_node+0xa2/0xb0
          2012-11-01 11:34:28  [<ffffffffa05a2a40>] ? cfs_cpt_vmalloc+0x20/0x30 [libcfs]
          2012-11-01 11:34:28  [<ffffffff8114a199>] vmalloc_node+0x29/0x30
          2012-11-01 11:34:28  [<ffffffffa05a2a40>] cfs_cpt_vmalloc+0x20/0x30 [libcfs]
          2012-11-01 11:34:28  [<ffffffffa0922ffe>] ptlrpc_alloc_rqbd+0x13e/0x690 [ptlrpc]
          2012-11-01 11:34:28  [<ffffffffa09235b5>] ptlrpc_grow_req_bufs+0x65/0x1b0 [ptlrpc]
          2012-11-01 11:34:28  [<ffffffffa0927fbd>] ptlrpc_main+0xd0d/0x19f0 [ptlrpc]
          2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
          2012-11-01 11:34:28  [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
          2012-11-01 11:34:28  [<ffffffffa09272b0>] ? ptlrpc_main+0x0/0x19f0 [ptlrpc]
          2012-11-01 11:34:28  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      
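      As a rough illustration of why these allocations end up in vmalloc at all: the per-service request buffers are large (roughly 110 KB on the MDS, per the MDS_BUFSIZE comment further down), so the allocation falls back from kmalloc to vmalloc and serializes on the global vmap_area_lock. The sketch below is a userspace stand-in; the cut-over threshold and helper name are assumptions for illustration, not the actual libcfs code.

          /* Userspace sketch of a kmalloc-vs-vmalloc cut-over like the one
           * that routes large request buffers through cfs_cpt_vmalloc().
           * The threshold and names are assumptions, not the real macros. */
          #include <stdio.h>
          #include <stdlib.h>

          #define PAGE_SIZE_BYTES   4096UL
          #define LARGE_ALLOC_LIMIT (4 * PAGE_SIZE_BYTES)   /* assumed cut-over */

          static void *sketch_alloc_rqbd_buffer(size_t size)
          {
              if (size <= LARGE_ALLOC_LIMIT)
                  return malloc(size);    /* small: contiguous, kmalloc-style */

              /* Large buffers take the vmalloc-style path.  In the kernel this
               * means mapping pages into vmalloc space under the single global
               * vmap_area_lock -- the lock the service threads above spin on. */
              printf("%zu bytes -> vmalloc path (serialized on vmap_area_lock)\n",
                     size);
              return malloc(size);
          }

          int main(void)
          {
              void *buf = sketch_alloc_rqbd_buffer(110 * 1024);   /* ~110 KB rqbd */

              free(buf);
              return 0;
          }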

      Here are a couple of links regarding the kernel regression:

Attachments

Issue Links

Activity


adilger Andreas Dilger added a comment -

Both http://review.whamcloud.com/4939 and http://review.whamcloud.com/4940 have landed, so I think this bug could be closed. There should only be a single thread calling vmalloc() now.

prakash Prakash Surya (Inactive) added a comment -

"all other threads can serve requests, they will not wait for buffer allocating at all"

Perfect, that's what I wanted to verify with you. Thanks for the clarification!

liang Liang Zhen (Inactive) added a comment -

I think we might not care about one thread (or very few threads) spinning, because each service has tens or even hundreds of threads, and servers normally have many CPU cores; all other threads can serve requests, they will not wait for buffer allocation at all.
The key issue of this ticket is that vmalloc can't be parallelized, so it's a waste if all threads/CPUs try to allocate buffers at the same time.
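To make this point concrete, here is a minimal pthread sketch of letting only one thread at a time enter the slow allocation path while the rest keep serving requests. The names and structure are illustrative only, not the code in http://review.whamcloud.com/4939.

    /* Sketch of "only one thread enters the allocation path": a trylock-style
     * guard lets at most one service thread pay for the slow vmalloc-backed
     * buffer growth while the others keep handling RPCs.  Names and structure
     * are illustrative, not the actual ptlrpc code. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t grow_lock = PTHREAD_MUTEX_INITIALIZER;

    static void grow_request_buffers(void)
    {
        usleep(100 * 1000);   /* stand-in for the slow vmalloc-backed allocation */
    }

    static void *service_thread(void *arg)
    {
        long id = (long)arg;

        if (pthread_mutex_trylock(&grow_lock) == 0) {
            /* only this thread pays the allocation cost */
            printf("thread %ld: growing request buffers\n", id);
            grow_request_buffers();
            pthread_mutex_unlock(&grow_lock);
        } else {
            /* everyone else keeps serving incoming requests */
            printf("thread %ld: serving requests, skipping the grow\n", id);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[8];
        long i;

        for (i = 0; i < 8; i++)
            pthread_create(&tid[i], NULL, service_thread, (void *)i);
        for (i = 0; i < 8; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }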

prakash Prakash Surya (Inactive) added a comment -

Liang, why would limiting the vmalloc calls to a single thread fix the issue? That one thread will still be affected by the regression. Will the other threads still be able to service requests despite needing more request buffers? Or will they all have to wait for this single thread to finish the allocations?

liang Liang Zhen (Inactive) added a comment -

I posted a patch for this:
http://review.whamcloud.com/#change,4939
and another patch to resolve the buffer utilization issue:
http://review.whamcloud.com/#change,4940

liang Liang Zhen (Inactive) added a comment -

I didn't realize that we still don't have the "big request buffer" fix; then this should be the right way to fix this problem and LU-2424.
I would suggest having 512K or 1M as the request buffer size. As Andreas said, a very large request buffer can't be reused if any of those (thousands or more) requests is pending on something, so it might have some other issues.
And I still think it's a nice improvement if we only allow one thread (per CPT) to enter the allocation path.

adilger Andreas Dilger added a comment -

My (imperfect) understanding is that the receive buffers cannot be re-used until all of the requests therein are processed. That means the buffers are filled from the start, processed, and then returned to the incoming buffer list. If the buffer is too large, then requests sitting in the buffer may wait too long to be processed, or the buffer still will not be fully utilized if there is an upper limit on how long a request will wait.
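A small sketch of the lifecycle described above, assuming each buffer simply tracks how many of its requests are still outstanding (the names are made up for illustration): the buffer cannot be reposted until the last request carved out of it completes, so one slow request pins the entire buffer.

    /* Sketch: a receive buffer can only be recycled once every request that
     * was delivered into it has been processed, so with a very large buffer
     * a single slow request keeps the whole buffer out of circulation. */
    #include <stdio.h>

    struct rx_buffer {
        unsigned int outstanding;   /* requests delivered but not yet processed */
    };

    /* called as each request carved out of the buffer finishes processing */
    static void request_done(struct rx_buffer *buf)
    {
        if (--buf->outstanding == 0)
            printf("buffer idle, repost it for incoming requests\n");
        else
            printf("%u request(s) still pending, buffer stays pinned\n",
                   buf->outstanding);
    }

    int main(void)
    {
        struct rx_buffer buf = { .outstanding = 3 };

        request_done(&buf);   /* 2 left: whole buffer still pinned */
        request_done(&buf);   /* 1 left: whole buffer still pinned */
        request_done(&buf);   /* last one done: buffer can be reused */
        return 0;
    }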

prakash Prakash Surya (Inactive) added a comment -

I wasn't at LAD, so I'm unaware of that discussion. But what trade-offs are being made between the number of buffers used and the size of each? I.e., why can't we just have one huge buffer, increasing the utilization to (BUFFER_SIZE-REQUEST_SIZE)/BUFFER_SIZE percent (trending towards 100% as BUFFER_SIZE grows large)? Granted, I don't understand the LNET code well, so I must be missing something which makes that obviously the wrong thing to do.

adilger Andreas Dilger added a comment -

We discussed at LAD that one problem with the request buffers is that the incoming LNET buffers (sorry, I don't have the correct LNET terms here) are allocated only large enough for the largest single request, though most requests are smaller than this. Unfortunately, as soon as a single RPC is waiting in the incoming buffer, there is no longer enough space in the buffer to receive a maximum-sized incoming request. This means that each buffer is only ever used for a single message, regardless of how many might fit.

A solution that was discussed was to make the request buffer be 2x as large as the maximum request size and/or rounded up to the next power-of-two boundary. That would at least increase the buffer utilization to 50%, and would likely allow tens of requests per LNET buffer.

It may be that the patch for LU-2424 will already address this issue?
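The trade-off described in these two comments can be made concrete with Prakash's formula: a buffer stops accepting messages once the space left is smaller than a maximum-sized request, so its guaranteed (worst-case) utilization is (BUFFER_SIZE - REQUEST_SIZE) / BUFFER_SIZE. A short calculation, assuming a 64 KB maximum request size purely for illustration:

    /* Worst-case (guaranteed) utilization for a few buffer sizes, assuming
     * the buffer stops accepting messages once it can no longer hold one
     * maximum-sized request.  The 64 KB maximum request size is an assumed
     * number, used only to make the trend visible. */
    #include <stdio.h>

    int main(void)
    {
        const double max_req = 64.0;                            /* KB, assumed */
        const double bufs[] = { 64.0, 128.0, 512.0, 1024.0 };   /* KB */

        for (int i = 0; i < 4; i++) {
            double util = (bufs[i] - max_req) / bufs[i] * 100.0;

            /* 64 KB -> 0%, 128 KB -> 50%, 512 KB -> 87.5%, 1 MB -> 93.75% */
            printf("buffer %6.0f KB: worst-case utilization %5.1f%%\n",
                   bufs[i], util);
        }
        return 0;
    }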
bobijam Zhenyu Xu added a comment -

svc->srv_buf_size can be MDS_BUFSIZE = (362 + LOV_MAX_STRIPE_COUNT * 56 + 1024) ~= 110KB for the MDS service; could it be problematic?
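For reference, plugging an assumed LOV_MAX_STRIPE_COUNT of 2000 into the expression above reproduces the ~110 KB figure:

    /* Rough size check of the MDS_BUFSIZE expression quoted above, assuming
     * LOV_MAX_STRIPE_COUNT = 2000 (an assumption for this sketch). */
    #include <stdio.h>

    #define LOV_MAX_STRIPE_COUNT 2000   /* assumed value */

    int main(void)
    {
        unsigned long mds_bufsize = 362 + LOV_MAX_STRIPE_COUNT * 56 + 1024;

        /* 362 + 112000 + 1024 = 113386 bytes, i.e. about 110 KB -- large
         * enough that the allocation in the stack trace above goes through
         * cfs_cpt_vmalloc() rather than a contiguous kmalloc. */
        printf("MDS_BUFSIZE = %lu bytes (~%lu KB)\n",
               mds_bufsize, mds_bufsize / 1024);
        return 0;
    }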

People

    Assignee: bobijam Zhenyu Xu
    Reporter: prakash Prakash Surya (Inactive)
    Votes: 0
    Watchers: 9

Dates

    Created:
    Updated:
    Resolved: