
LU-1645: shrinker not shrinking/taking too long to shrink?


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.2
    • Labels: None
    • Environment: SLES 11SP1, kernel 2.6.32.54-0.3.1.20120223-nasa
    • Severity: 3
    • Rank (Obsolete): 7034

    Description

      We're seeing high load average on some Lustre clients accompanied by processes that are potentially stuck in the ldlm shrinker. Here's a sample stack trace:

      thread_return+0x38/0x34c
      wake_affine+0x357/0x3b0
      enqueue_sleeper+0x178/0x1c0
      enqueue_entity+0x158/0x1c0
      cfs_hash_bd_lookup_intent+0x27/0x110 [libcfs]
      cfs_hash_dual_bd_unlock+0x2c/0x80 [libcfs]
      cfs_hash_lookup+0x7a/0xa0 [libcfs]
      ldlm_pool_shrink+0x31/0xf0 [ptlrpc]
      cl_env_fetch+0x1d/0x60 [obdclass]
      cl_env_reexit+0xe/0x130 [obdclass]
      ldlm_pools_shrink+0x1d2/0x310 [ptlrpc]
      zone_watermark_ok+0x1b/0xd0
      get_page_from_freelist+0x17a/0x720
      apic_timer_interrupt+0xe/0x20
      smp_call_function_many+0x1c0/0x250
      drain_local_pages+0x0/0x10
      smp_call_function+0x20/0x30
      on_each_cpu+0x1d/0x40
      __alloc_pages_slowpath+0x278/0x5f0
      __alloc_pages_nodemask+0x13a/0x140
      __get_free_pages+0x9/0x50
      dup_task_struct+0x42/0x150
      copy_process+0xb4/0xe50
      do_fork+0x8c/0x3c0
      sys_rt_sigreturn+0x222/0x2a0
      stub_clone+0x13/0x20
      system_call_fastpath+0x16/0x1b

      FWIW, some of the traces have cfs_hash_bd_lookup_intent+0x27 as the top line. All of them include ldlm_pools_shrink in the call chain.
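
      For anyone triaging this, the client-side ldlm lock and pool state can be dumped while the load is high; the parameter paths below assume the Lustre 2.1 /proc layout:

      pfe11 ~ # lctl get_param ldlm.namespaces.*.lock_count
      pfe11 ~ # lctl get_param ldlm.namespaces.*.lru_size
      pfe11 ~ # lctl get_param ldlm.namespaces.*.pool.granted
      pfe11 ~ # lctl get_param ldlm.namespaces.*.pool.stats

      A namespace with a very large lock_count relative to its pool limit would be consistent with the shrinker having a lot of work to do on each pass.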

      About 3/4 of the memory is inactive (Inactive: 12045612 kB out of MemTotal: 16333060 kB, roughly 74%):

      pfe11 ~ # cat /proc/meminfo
      MemTotal: 16333060 kB
      MemFree: 344568 kB
      Buffers: 86844 kB
      Cached: 1488340 kB
      SwapCached: 4864 kB
      Active: 1523184 kB
      Inactive: 12045612 kB
      Active(anon): 9152 kB
      Inactive(anon): 7012 kB
      Active(file): 1514032 kB
      Inactive(file): 12038600 kB
      Unevictable: 3580 kB
      Mlocked: 3580 kB
      SwapTotal: 10388652 kB
      SwapFree: 10136240 kB
      Dirty: 244 kB
      Writeback: 976 kB
      AnonPages: 15600 kB
      Mapped: 20296 kB
      Shmem: 0 kB
      Slab: 870808 kB
      SReclaimable: 64868 kB
      SUnreclaim: 805940 kB
      KernelStack: 4312 kB
      PageTables: 14840 kB
      NFS_Unstable: 0 kB
      Bounce: 0 kB
      WritebackTmp: 0 kB
      CommitLimit: 18555180 kB
      Committed_AS: 1074912 kB
      VmallocTotal: 34359738367 kB
      VmallocUsed: 544012 kB
      VmallocChunk: 34343786784 kB
      HardwareCorrupted: 0 kB
      HugePages_Total: 0
      HugePages_Free: 0
      HugePages_Rsvd: 0
      HugePages_Surp: 0
      Hugepagesize: 2048 kB
      DirectMap4k: 7168 kB
      DirectMap2M: 16769024 kB

      We've seen this on two clients in the last two days, and I think we have several other undiagnosed cases in the recent past. The client that did it yesterday was generating OOM messages at the time; today's client did not.

      I have a crash dump, but I'm having trouble getting good stack traces out of it. I'll attach the output from sysrq-t to start. I can't share the crash dump due to our security policies, but I can certainly run commands against it for you, as necessary.

      If there's more information I can gather from a running system before we reboot it, let me know - I imagine we'll have another one soon.
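
      Concretely, here's the sort of thing I can run on the next affected client before reboot, and against the saved dump with crash(8); the vmlinux/vmcore paths are placeholders:

      # on the live client, as root:
      echo t > /proc/sysrq-trigger        # dump all task stacks to the kernel log
      cat /proc/meminfo /proc/slabinfo    # snapshot memory and slab state

      # against the dump:
      crash /path/to/vmlinux /path/to/vmcore
      crash> foreach bt                   # per-task backtraces; often cleaner than sysrq-t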

      Attachments

        Activity

          People

            Assignee: Bob Glossman (bogl, Inactive)
            Reporter: jason.rappleye@nasa.gov (rappleye, Inactive)
            Votes: 0
            Watchers: 3
