
[LU-4194] kfree fails kernel paging request under ldlm_lock_put()

Details


    Description

      We have a lustre client dedicated to the task of running robinhood. This node is running lustre 2.4.0-18chaos and talking to our 55PB zfs-osd filesystem running a similar tag of lustre.

      The lustre client node that runs robinhood is hitting a "BUG: unable to handle kernel paging request at <pointer>" as a result of attempting to kfree() a bad pointer in lustre's ldlm_lock_put() function.

      There are two stacks that we have seen lead up to this, both ending in the same failed kfree.

      The first is from ldlm_bl_* threads:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_bl_thread_main
      

      The second was under a robinhood process:

      cfs_free
      ldlm_lock_put
      ldlm_cli_cancel_list
      ldlm_prep_elc_req
      ldlm_prep_enqueue_req
      mdc_intent_getattr_pack
      mdc_enqueue
      mdc_intent_lock
      lmv_intent_lookup
      lmv_intent_lock
      ll_lookup_it
      ll_lookup_nd
      do_lookup
      __link_path_walk
      path_walk
      do_path_lookup
      user_path_at
      vfs_fstatat
      vfs_lstat
      sys_newlstat
      system_call_fastpath
      

      Note that it was the second call to ldlm_cli_cancel_list() in ldlm_prep_elc_req() that we were under.

      The problem is pretty reproducible on our system. The client node usually crashes in less than an hour. But I am not aware of how to reproduce this elsewhere. We have other server clusters with robinhood instances on clients that are not crashing.


        Activity


          It looks like we just hit this bug with one of our 2.4 clients.

          simmonsja James A Simmons added a comment -

          Landed for 2.6

          pjones Peter Jones added a comment - - edited

          Great job!! BTW, the fact it was an address from the VMALLOC range becomes obvious with the kmem sub-command translation.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Robinhood has been running for two hours on the client with patch http://review.whamcloud.com/8298 without a problem. That looks like our problem.

          I will be away at SC'13 all of next week. If you could take over working on a finalized version of http://review.whamcloud.com/8298 while I am gone, that would be great.

          morrone Christopher Morrone (Inactive) added a comment -

          Brian Behlendorf provided me with a key insight:

          #define VMALLOC_START    _AC(0xffffc90000000000, UL)
          #define VMALLOC_END      _AC(0xffffe8ffffffffff, UL)
          

          So 0xffffc90046218000, the address that we are attempting to kfree, is in the vmalloc address range.

          And sure enough, it looks like ldlm_lock_put() is using OBD_FREE(), which is kfree-only. However, there are a couple of functions, at least mdc_finish_enqueue() and ll_layout_fetch(), that are allocating buffers using OBD_ALLOC_LARGE() and pointing to them from the ldlm_lock's l_lvb_data pointer.
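
          To make the mismatch concrete, here is a minimal sketch using plain kernel primitives (not the actual Lustre macros; lvb_len and LARGE_ALLOC_THRESHOLD are hypothetical names, and the vmalloc() fallback is simply the behaviour described above):

          void *lvb_data;                              /* stands in for lock->l_lvb_data  */
          size_t lvb_len;                              /* e.g. a large LVB/layout buffer  */

          /* Allocation side, as OBD_ALLOC_LARGE() is described to behave: */
          if (lvb_len > LARGE_ALLOC_THRESHOLD)
                  lvb_data = vmalloc(lvb_len);         /* lands in VMALLOC_START..VMALLOC_END */
          else
                  lvb_data = kmalloc(lvb_len, GFP_NOFS);

          /* A free that mirrors the allocation: */
          if (is_vmalloc_addr(lvb_data))
                  vfree(lvb_data);
          else
                  kfree(lvb_data);

          /* What ldlm_lock_put()'s OBD_FREE() effectively does instead: */
          kfree(lvb_data);                             /* faults when lvb_data came from vmalloc() */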

          I am testing the following patch in which I attempt to use the alloc and free functions a bit more consistently.

          http://review.whamcloud.com/8298

          I am not entirely sure that we want to use OBD_ALLOC_LARGE all the time in ldlm_lock_create(). I don't know how often the buffers will be large enough to switch to vmalloc. The fix could certainly be implemented other ways. But this should suffice for testing.

          This code really needs some cleanup work.

          morrone Christopher Morrone (Inactive) added a comment -

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?

          No, it does not. But I don't understand what I am seeing yet. I'll transcribe it here and maybe you can help me:

          crash> kmem 0xffffc90046218000
          VM_STRUCT                  ADDRESS  RANGE                          SIZE
          ffff885e3df9ed80      ffffc90046218000 - ffffc9004621e000         24576
          
              PAGE              PHYSICAL       MAPPING          INDEX CNT  FLAGS
          ffffea0149d59378    5e3d059000             0              0   1  c0000000000000
          
          morrone Christopher Morrone (Inactive) added a comment - - edited

          And the patch attempting to set l_lvb_type coherently is at http://review.whamcloud.com/#/c/8270/.

          bfaccini Bruno Faccini (Inactive) added a comment -

          Hello Chris, thanks again for this low-level VM debugging !

          I am not 100% sure, but according to your last inputs I think there is something contradictory here: how can the crash tool read the poisoned memory, meaning it was able to do the virt2phys translation, when the corresponding page-table entry address resolution (to find the corresponding/owning kmem-cache), which uses the same virt2phys translation to compute the page-frame number and find the offset in the page table, fails?

          Again, can you tell me what the "kmem 0xffffc90046218000" sub-command returns? Does it return a page-table entry address of 0xffffeae380f57540?

          bfaccini Bruno Faccini (Inactive) added a comment - - edited

          The backtrace below kfree(), where we hit the page fault failure, looks roughly like this:

          constant_test_bit
          test_bit
          virt_to_page
          virt_to_head_page
          virt_to_cache
          kfree
          

          At least crash claims that we are on the pointer dereference under constant_test_bit(), and looking forward and backwards a bit from kfree+0x173 seems to place us under virt_to_page().
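
          For context, the SLAB kfree() path those frames belong to looks roughly like this (an approximate sketch of the 2.6.32-era mm/slab.c, not verbatim kernel source):

          void kfree(const void *objp)
          {
                  struct kmem_cache *c;

                  if (unlikely(ZERO_OR_NULL_PTR(objp)))
                          return;

                  /* virt_to_cache() -> virt_to_head_page() -> virt_to_page():
                   * for a linear-map (kmalloc) address this is pure arithmetic,
                   *     page = vmemmap + ((addr - PAGE_OFFSET) >> PAGE_SHIFT)
                   * but for a vmalloc() address the same arithmetic produces a
                   * wild struct page pointer.  The first flags test on that
                   * struct page (test_bit()/constant_test_bit()) dereferences
                   * it, and that is where the paging request fails. */
                  c = virt_to_cache(objp);

                  __cache_free(c, (void *)objp);
          }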

          I started to try to calculate virt_to_page by hand, but that way lies madness. I am just going to assume that:

          virt_to_page(0xffffc90046218000) = 0xffffeae380f57540
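
          As a sanity check on that assumption: plugging in the usual x86_64 constants for this kernel (PAGE_OFFSET = 0xffff880000000000, vmemmap = 0xffffea0000000000, sizeof(struct page) = 0x38; assumed values, not read from the dump), the arithmetic kfree() would do for a supposedly linear-map address lands exactly on the faulting address:

              bogus "physical" address = 0xffffc90046218000 - 0xffff880000000000 = 0x410046218000
              pfn                      = 0x410046218000 >> 12                    = 0x410046218
              struct page pointer      = 0xffffea0000000000 + 0x410046218 * 0x38 = 0xffffeae380f57540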

          All that doesn't really change anything. But I feel somewhat enlightened.

          morrone Christopher Morrone (Inactive) added a comment -

          I'm glad that you have a theory, because I'm lost.

          In one of the recent tests, the kernel complains about a page fault at

          ffffeae380f57540

          The lustre log extracted from that crash dump tells me, from its last message, that the thread that hit the page fault Oops was doing the final ldlm_lock_put() on the ldlm_lock at this address:

          ffff885e3d09ce00

          The lock->l_lvb_data pointer is:

          ffffc90046218000

          When I use crash's "rd" command to dump the contents of the l_lvb_data buffer, it is indeed poisoned with OBD_FREE's repeated bytes of 0x5a. The pointer to the buffer has not, on the other hand, been poisoned yet.

          Which all probably just confirms that we are in kfree() under cfs_free() under ldlm_lock_put() at the time of the Oops.

          But where does the page fault address of ffffeae380f57540 come from? I find it in both the RAX and CR2 registers under cfs_free in crash's backtrace output. RAX is a general-purpose register, and CR2 is the Page Fault Linear Address register, which makes sense because we Oops under the failed page fault.

          I just wish that I understood where kfree() was getting that particular address.

          morrone Christopher Morrone (Inactive) added a comment -

          Thanks for the answers/confirmations.

          My earlier comment that "the fact that there is LVB data with a type of LVB_T_NONE comes from layout_fetch()" is wrong and must be inverted: it can NOT come from layout_fetch()! I now think it more likely comes from an unexpected/unrequested Layout being embedded in the LVB reply from the server, and this can happen at least in ldlm_handle_cp_callback().

          I am currently testing a fix for that.

          bfaccini Bruno Faccini (Inactive) added a comment -

          People

            Assignee: bfaccini Bruno Faccini (Inactive)
            Reporter: morrone Christopher Morrone (Inactive)
            Votes: 0
            Watchers: 8
