  Lustre / LU-5722

memory allocation deadlock under lu_cache_shrink()


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.7.0
    • Fix Version: Lustre 2.7.0
    • Environment: single-node testing on master (5c4f68be57 + http://review.whamcloud.com/11258 )
      kernel: 2.6.32-358.23.2.el6_lustre.gc9be53c.x86_64
      combined MDS+MGS+OSS, 2x MDT, 3x OST on LVM
    • 3
    • 16062

    Description

      While running sanity-benchmark.sh dbench, I hit the following memory allocation deadlock under mdc_read_page_remote():

      dbench D 0000000000000001 0 14532 1 0x00000004
      Call Trace:
      resched_task+0x68/0x80
      __mutex_lock_slowpath+0x13e/0x180
      mutex_lock+0x2b/0x50
      lu_cache_shrink+0x203/0x310 [obdclass]
      shrink_slab+0x11a/0x1a0
      do_try_to_free_pages+0x3f7/0x610
      try_to_free_pages+0x92/0x120
      __alloc_pages_nodemask+0x478/0x8d0
      alloc_pages_current+0xaa/0x110
      __page_cache_alloc+0x87/0x90
      mdc_read_page_remote+0x13c/0xd90 [mdc]
      do_read_cache_page+0x7b/0x180
      read_cache_page_async+0x19/0x20
      read_cache_page+0xe/0x20
      mdc_read_page+0x192/0x950 [mdc]
      lmv_read_page+0x1e0/0x1210 [lmv]
      ll_get_dir_page+0xbc/0x370 [lustre]
      ll_dir_read+0x9e/0x300 [lustre]
      ll_readdir+0x12a/0x4d0 [lustre]
      vfs_readdir+0xc0/0xe0
      sys_getdents+0x89/0xf0
      

      The page allocation is recursing into the Lustre and DLM slab shrinkers, and lu_cache_shrink() is blocking on a mutex that is already held. Presumably the allocation needs to use GFP_NOFS? I didn't actually check which locks were held, since the machine hung while I was trying to gather more information.
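      To illustrate why GFP_NOFS would break the cycle, here is a minimal user-space model of the recursion, a sketch rather than actual kernel code (all names, flag values, and the single-thread locking are hypothetical simplifications): under memory pressure a GFP_KERNEL allocation may enter reclaim and run the filesystem slab shrinkers, so if the shrinker needs a lock the allocating path already depends on, the thread deadlocks; GFP_NOFS tells reclaim to stay out of the filesystem.

```c
/* User-space model of the LU-5722 deadlock; all names are illustrative. */
#include <stdbool.h>
#include <stdlib.h>

#define GFP_KERNEL 0x1  /* reclaim may call back into the filesystem */
#define GFP_NOFS   0x2  /* reclaim must NOT call back into the filesystem */

static bool lu_sites_guard_held;  /* models the mutex lu_cache_shrink() takes */
static bool deadlocked;           /* set where the real kernel would hang */

/* Models lu_cache_shrink(): needs the guard mutex to scan the lu cache. */
static void shrinker(void)
{
    if (lu_sites_guard_held)
        deadlocked = true;  /* the real mutex_lock() would block forever */
}

/* Models __alloc_pages_nodemask(): under pressure, GFP_KERNEL
 * allocations enter reclaim (try_to_free_pages -> shrink_slab) and run
 * filesystem shrinkers; GFP_NOFS allocations skip them. */
static void *page_alloc(int gfp)
{
    if (gfp & GFP_KERNEL)
        shrinker();
    return malloc(4096);
}

/* Models the readdir path: allocates a page while a lock the shrinker
 * needs is held somewhere in the dependency chain. */
static void *read_page(int gfp)
{
    lu_sites_guard_held = true;
    void *page = page_alloc(gfp);
    lu_sites_guard_held = false;
    return page;
}
```

      With GFP_KERNEL the model "hangs" (shrinker blocks on the held guard); with GFP_NOFS reclaim never re-enters the filesystem and the allocation completes. In the kernel, the equivalent change is passing GFP_NOFS to the allocation (or, on newer kernels, bracketing the region with memalloc_nofs_save()/memalloc_nofs_restore()).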


            People

              Assignee: cliffw Cliff White (Inactive)
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: