Lustre / LU-25

Blocking network request in ldlm shrinker

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.1.0
    • Lustre 1.8.6
    • None
    • 3
    • 5075

    Description

      We have now verified at least twice that the ldlm shrinker sometimes gets stuck waiting for a network transaction to complete. We are of the opinion that this is an architectural problem in Lustre. A shrinker should NEVER block on network transactions.

      In particular, the ldlm_cli_pool_shrink() shrinker is the one that has been giving us trouble in 1.8 for as long as a year. It was only a couple of months ago that it was finally brought to my attention, and the problem is intermittent enough that it has been difficult to catch as it happens. Now both Brian and I have caught the problem with crash, and we have some other SysRq console backtraces that the sysadmins have gotten us. The problem always looks like the following, at least from get_page_from_freelist() to ptlrpc_queue_wait():

      ptlrpc_queue_wait
      lustre_msg_set_opc
      ptlrpc_at_set_req_timeout
      lustre_msg_buf
      ldlm_cli_cancel_req
      ldlm_cli_cancel_list
      ldlm_cancel_passed_policy
      ldlm_cancel_lru
      ldlm_cli_pool_shrink
      ldlm_cli_pool_shrink
      ldlm_pool_shrink
      ldlm_namespace_move_locked
      ldlm_pools_shrink
      ldlm_pools_cli_shrink
      shrink_slab
      get_page_from_freelist
      __alloc_pages
      alloc_pages_current
      tcp_sendmsg
      inet_sendmsg
      do_sock_write
      sock_aio_write
      do_sync_write
      sys_getsockname
      vfs_write
      sys_write

      It is not clear why we are not getting the reply to the cancel request in a reasonable time. And it appears that this thread can be stuck in ptlrpc_queue_wait() for at least an hour. We are also not yet sure why that should block for so long.

      But hopefully we can agree that a blocking operation like that should NOT happen in the shrinker. You'll notice that the kernel entry point (sys_write() on a socket) did not even involve Lustre here. We have seen other entry points, such as page faults, that also needed to call get_page_from_freelist() and then got stuck in Lustre's ldlm client pool shrinker.

      Brian Behlendorf voiced concerns about this independently in comment #31 of Oracle bug 23598.

      We were hoping that Oleg would take a look at this. Eric might also be interested, since this appears to be an architectural flaw.

      Our immediate need is to fix this particular shrinker, but we should probably make it policy to never perform blocking network transactions in shrinkers.

      Brian suggested that a quick work-around is to just disable this shrinker, since shrinkers are "best effort". As he suggested in comment #31, the longer term fix may be to make the shrink asynchronous, handling the blocking network transactions in another thread.
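
The asynchronous approach suggested above could look roughly like the following. This is a minimal userspace sketch using pthreads, not Lustre or kernel code: struct cancel_queue, shrink_async(), cancel_worker(), cancel_queue_stop() and blocking_cancel_rpc() are all hypothetical names invented for illustration. The point it tries to show is only the division of labor: the shrinker path records how many locks should be cancelled and returns at once, while a separate thread performs the call that may sleep on the network.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a deferred-cancel queue; not a Lustre structure. */
struct cancel_queue {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    int  pending;    /* locks the shrinker has asked to cancel */
    int  cancelled;  /* locks the worker has finished cancelling */
    bool stop;       /* set to ask the worker to drain and exit */
};

#define CANCEL_QUEUE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, false }

/*
 * Stand-in for the blocking cancel RPC. In the trace above this is
 * where the thread ends up sleeping in ptlrpc_queue_wait(). Here it
 * is a no-op so the sketch stays self-contained.
 */
static void blocking_cancel_rpc(int nr)
{
    (void)nr;
}

/*
 * Shrinker path: never waits on the network. It only records the
 * request under a short critical section and wakes the worker.
 * Returns the number of cancels currently queued.
 */
static int shrink_async(struct cancel_queue *q, int nr_to_scan)
{
    int queued;

    pthread_mutex_lock(&q->lock);
    q->pending += nr_to_scan;
    queued = q->pending;
    pthread_cond_signal(&q->wake);
    pthread_mutex_unlock(&q->lock);
    return queued;
}

/* Worker thread: the only place allowed to block on the cancel RPC. */
static void *cancel_worker(void *arg)
{
    struct cancel_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->pending == 0 && !q->stop)
            pthread_cond_wait(&q->wake, &q->lock);
        if (q->pending == 0 && q->stop)
            break;                      /* drained and asked to exit */

        int batch = q->pending;
        q->pending = 0;
        pthread_mutex_unlock(&q->lock);
        blocking_cancel_rpc(batch);     /* may sleep; lock is not held */
        pthread_mutex_lock(&q->lock);
        q->cancelled += batch;
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Ask the worker to finish any queued cancels and then exit. */
static void cancel_queue_stop(struct cancel_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->stop = true;
    pthread_cond_signal(&q->wake);
    pthread_mutex_unlock(&q->lock);
}
```

In this model a caller would start cancel_worker() once with pthread_create(), call shrink_async() from the memory-pressure path, and call cancel_queue_stop() followed by pthread_join() at teardown. A kernel implementation would presumably use a spinlock and a kernel thread or workqueue instead of pthreads, and would have to decide what count to report back to the VM while cancels are still in flight; neither question is settled by this sketch.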

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)