Lustre / LU-14408

very large lustre_inode_cache


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • 3.10.0-1160.4.1.1chaos.ch6.x86_64
      Server: lustre-2.12.6_3.llnl-1.ch6.x86_64
      Client: lustre-2.14.0-something
      starfish "agent"
    • 3
    • 9223372036854775807

    Description

      The ptlrpc_cache repeatedly grows very, very large on a node running starfish (a policy engine similar to robinhood).

      [root@solfish2:~]# cat /tmp/t4
       Active / Total Objects (% used)    : 508941033 / 523041216 (97.3%)
       Active / Total Slabs (% used)      : 11219941 / 11219941 (100.0%)
       Active / Total Caches (% used)     : 87 / 122 (71.3%)
       Active / Total Size (% used)       : 112878003.58K / 114522983.04K (98.6%)
       Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K
      
      OBJS      ACTIVE    USE   OBJ_SIZE  SLABS    OBJ/SLAB  CACHE_SIZE  NAME
      30545252  30067595  98%   1.12K     1092909  28        34973088K   ptlrpc_cache
      92347047  92347047  99%   0.31K     1810744  51        28971904K   bio-3
      92346672  92346672  100%  0.16K     1923889  48        15391112K   xfs_icr
      92409312  92409312  100%  0.12K     2887791  32        11551164K   kmalloc-128
      25717818  23912628  92%   0.19K     612329   42        4898632K    kmalloc-192
      25236420  24708346  97%   0.18K     573555   44        4588440K    xfs_log_ticket
      25286568  24717197  97%   0.17K     549708   46        4397664K    xfs_ili
      14103054  13252206  93%   0.19K     335787   42        2686296K    dentry
      ...
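      For reference, a snapshot like the one above can be captured with slabtop or read directly from /proc/slabinfo; a minimal sketch follows (how /tmp/t4 was actually produced is an assumption):

      # one-shot slabtop listing, sorted by cache size
      slabtop --once --sort=c > /tmp/t4
      # raw per-cache object counts behind the same numbers
      grep -E 'ptlrpc_cache|lustre_inode_cache' /proc/slabinfo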
       

      The ptlrpc_cache shrinks from GB to MB in size upon

      echo 2 > /proc/sys/vm/drop_caches

      This particular node has 128GB of RAM, so this represents a very large portion of the total.
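      A minimal sketch of watching the effect (ptlrpc_cache is the merged cache name as reported before slab_nomerge; counts will differ per node):

      # reclaimable slab usage before the drop
      grep ptlrpc_cache /proc/slabinfo
      # free reclaimable slab objects (dentries and inodes)
      echo 2 > /proc/sys/vm/drop_caches
      # usage after; on this node the cache went from tens of GB down to MB
      grep ptlrpc_cache /proc/slabinfo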

      After a suggestion by Oleg (see below), the node was rebooted with the kernel command line parameters slab_nomerge and slub_nomerge, which stop SLUB from merging caches that have compatible object sizes and flags. After doing that, it was found that the cache actually taking up all the space was the lustre_inode_cache; it had been reported under ptlrpc_cache only because the two caches were merged.
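      A rough sketch of how the parameters might be added on a RHEL 7 style node (the existing command line options are assumptions; the change takes effect on the next boot):

      # append to the default kernel command line
      grubby --update-kernel=ALL --args="slab_nomerge slub_nomerge"
      # confirm after reboot
      cat /proc/cmdline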

      Around the same time, I saw kthread_run() and fork() failures reported in the console log. Those failures turned out to be caused by sysctl kernel.pid_max being set too low, and were not related to the amount of memory that was in use or free.
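      For completeness, a minimal sketch of checking and raising that limit (the value 4194304 is just an example; persist it under /etc/sysctl.d/ if it turns out to help):

      # current limit and a rough count of PIDs/threads in use
      sysctl kernel.pid_max
      ps -eT | wc -l
      # raise the limit at runtime
      sysctl -w kernel.pid_max=4194304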

      Attachments

        Issue Links

          Activity

            People

              Assignee: Oleg Drokin (green)
              Reporter: Olaf Faaland (ofaaland)
              Votes: 0
              Watchers: 11
