Lustre / LU-4194

kfree fails kernel paging request under ldlm_lock_put()


    Description

      We have a lustre client dedicated to the task of running robinhood. This node is running lustre 2.4.0-18chaos and talking to our 55PB zfs-osd filesystem running a similar tag of lustre.

      The lustre client node that runs robinhood is hitting a "BUG: unable to handle kernel paging request at <pointer>" as a result of attempting to kfree() a bad pointer in lustre's ldlm_lock_put() function.

      There are two stacks that we have seen lead up to this, both ending in the same failed kfree.

      The first is from ldlm_bl_* threads:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_bl_thread_main
      

      The second was under a robinhood process:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_prep_elc_req
      ldlm_prep_enqueue_req
      mdc_intent_getattr_pack
      mdc_enqueue
      mdc_intent_lock
      lmv_intent_lookup
      lmv_intent_lock
      ll_lookup_it
      ll_lookup_nd
      do_lookup
      __link_path_walk
      path_walk
      do_path_lookup
      user_path_at
      vfs_fstatat
      vfs_lstat
      sys_newlstat
      system_call_fastpath
      

      Note that it was the second call to ldlm_cli_cancel_list() in ldlm_prep_elc_req() that we were under.

      The problem is pretty reproducible on our system. The client node usually crashes in less than an hour. But I am not aware of how to reproduce this elsewhere. We have other server clusters with robinhood instances on clients that are not crashing.
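
      To make the failure mode concrete, here is a minimal kernel-module sketch of this bug class (illustrative only, not code from the Lustre tree): passing a vmalloc()ed buffer to kfree() makes the slab code treat the address as if it came from the direct map, and the bogus struct page lookup faults with exactly this kind of "unable to handle kernel paging request" oops.

      #include <linux/module.h>
      #include <linux/mm.h>
      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      static int __init mismatch_init(void)
      {
              /* 24576 bytes matches the size of the buffer in this report */
              void *buf = vmalloc(24576);

              if (!buf)
                      return -ENOMEM;

              pr_info("buf=%p is_vmalloc_addr=%d\n", buf, is_vmalloc_addr(buf));

              /* BUG: this buffer must be released with vfree(buf). Calling
               * kfree() on a vmalloc address typically oopses inside the
               * page lookup that kfree() performs internally. */
              kfree(buf);
              return 0;
      }
      module_init(mismatch_init);

      MODULE_LICENSE("GPL");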


        Activity

          yujian Jian Yu added a comment -

          Patches http://review.whamcloud.com/8270 and http://review.whamcloud.com/8298 were cherry-picked to Lustre b2_5 branch.


          simmonsja James A Simmons added a comment -

          It looks like we just hit this bug with one of our 2.4 clients.
          pjones Peter Jones added a comment - - edited

          Landed for 2.6


          bfaccini Bruno Faccini (Inactive) added a comment -

          Great job!! BTW, the fact that it was an address from the VMALLOC range becomes obvious with the kmem sub-command translation.

          morrone Christopher Morrone (Inactive) added a comment -

          Robinhood has been running for two hours on the client with patch http://review.whamcloud.com/8298 without a problem. It looks like that was our problem.

          I will be away at SC'13 all of next week. If you could take over working on a finalized version of http://review.whamcloud.com/8298 while I am gone, that would be great.

          morrone Christopher Morrone (Inactive) added a comment -

          Brian Behlendorf provided me with a key insight:

          #define VMALLOC_START    _AC(0xffffc90000000000, UL)
          #define VMALLOC_END      _AC(0xffffe8ffffffffff, UL)
          

          So 0xffffc90046218000, the address that we are attempting to kfree, is in the vmalloc address range.

          And sure enough, it looks like ldlm_lock_put() is using OBD_FREE(), which is kfree-only. However, there are a couple of functions, at least mdc_finish_enqueue() and ll_layout_fetch(), that are allocating buffers using OBD_ALLOC_LARGE() and pointing to them from the ldlm_lock's l_lvb_data pointer.

          I am testing the following patch in which I attempt to use the alloc and free functions a bit more consistently.

          http://review.whamcloud.com/8298

          I am not entirely sure that we want to use OBD_ALLOC_LARGE all the time in ldlm_lock_create(). I don't know how often the buffers will be large enough to switch to vmalloc. The fix could certainly be implemented other ways. But this should suffice for testing.

          This code really needs some cleanup work.
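
          For reference, here is a sketch of the *_LARGE convention in play; the 4-page cutoff and the GFP flag are assumptions for illustration, not necessarily the exact OBD_ALLOC_LARGE definition:

          #include <linux/mm.h>
          #include <linux/slab.h>
          #include <linux/vmalloc.h>

          /* Assumed threshold: fall back to vmalloc() for allocations
           * large enough to make kmalloc() unreliable. */
          #define ALLOC_LARGE_CUTOFF      (4 * PAGE_SIZE)

          static inline void *alloc_large(size_t size)
          {
                  if (size > ALLOC_LARGE_CUTOFF)
                          return vmalloc(size);
                  return kmalloc(size, GFP_NOFS);
          }

          static inline void free_large(void *ptr)
          {
                  /* is_vmalloc_addr() tests the [VMALLOC_START, VMALLOC_END)
                   * range, so the free side never guesses wrong. */
                  if (is_vmalloc_addr(ptr))
                          vfree(ptr);
                  else
                          kfree(ptr);
          }

          The crash here is the asymmetric case: l_lvb_data gets allocated through the large variant (so possibly vmalloc) but released through the plain kfree() path in ldlm_lock_put(). Note that newer kernels provide kvfree() for exactly this address-based dispatch.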


          morrone Christopher Morrone (Inactive) added a comment - - edited

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?

          No, it does not. But I don't understand what I am seeing yet. I'll transcribe it here and maybe you can help me:

          crash> kmem 0xffffc90046218000
          VM_STRUCT                  ADDRESS  RANGE                          SIZE
          ffff885e3df9ed80      ffffc90046218000 - ffffc9004621e000          24576
          
              PAGE              PHYSICAL       MAPPING          INDEX CNT  FLAGS
          ffffea0149d59378    5e3d059000             0              0   1  c0000000000000
          

          bfaccini Bruno Faccini (Inactive) added a comment -

          And the patch that attempts to set l_lvb_type coherently is at http://review.whamcloud.com/#/c/8270/.
          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          Hello Chris, thanks again for this low-level VM debugging!

          I am not 100% sure, but according to your last inputs I think there is something contradictory here: how can the crash tool read the poisoned memory (meaning it was able to do the virt2phys translation), when the page-table entry address resolution (to find the corresponding/owning kmem-cache), which uses the same virt2phys translation to compute the page-frame number and the offset in the page table, fails?

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?


          morrone Christopher Morrone (Inactive) added a comment -

          The backtrace below kfree(), where we hit the page fault, looks a little like this:

          constant_test_bit
          test_bit
          virt_to_page
          virt_to_head_page
          virt_to_cache
          kfree
          

          At least crash claims that we are on the pointer dereference under constant_test_bit(), and looking forward and backwards a bit from kfree+0x173 seems to place us under virt_to_page().

          I started to try to calculate virt_to_page by hand, but that way lies madness. I am just going to assume that:

          virt_to_page(0xffffc90046218000) = 0xffffeae380f57540

          All that doesn't really change anything. But I feel somewhat enlightened.
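
          That assumption actually checks out arithmetically. A small userspace sketch of the sparsemem-vmemmap virt_to_page() calculation (the PAGE_OFFSET and vmemmap base values, and sizeof(struct page) == 56, are assumptions about this x86_64 kernel generation) reproduces the address exactly:

          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  const uint64_t PAGE_OFFSET = 0xffff880000000000ULL; /* direct-map base */
                  const uint64_t VMEMMAP     = 0xffffea0000000000ULL; /* struct page array */
                  const uint64_t v           = 0xffffc90046218000ULL; /* the bad pointer */

                  /* virt_to_page(v) = vmemmap + ((v - PAGE_OFFSET) >> PAGE_SHIFT).
                   * The subtraction only makes sense for direct-map addresses;
                   * for a vmalloc address it yields a garbage pfn, and the
                   * resulting struct page pointer can land in an unpopulated
                   * part of the vmemmap region, faulting when kfree() reads
                   * page->flags (the constant_test_bit() above). */
                  uint64_t pfn  = (v - PAGE_OFFSET) >> 12;
                  uint64_t page = VMEMMAP + pfn * 56;    /* sizeof(struct page) */

                  printf("virt_to_page(%#llx) = %#llx\n",
                         (unsigned long long)v, (unsigned long long)page);
                  /* Prints virt_to_page(0xffffc90046218000) = 0xffffeae380f57540,
                   * the same address asked about above in this thread. */
                  return 0;
          }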


          People

            bfaccini Bruno Faccini (Inactive)
            morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 8