Lustre / LU-14408

very large lustre_inode_cache


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • 3.10.0-1160.4.1.1chaos.ch6.x86_64
      Server: lustre-2.12.6_3.llnl-1.ch6.x86_64
      Client: lustre-2.14.0-something
      starfish "agent"
    • 3

      The ptlrpc_cache repeatedly grows very large (about 33 GiB in the slabtop output below) on a node running starfish (a policy engine similar to robinhood).

      [root@solfish2:~]# cat /tmp/t4
       Active / Total Objects (% used)    : 508941033 / 523041216 (97.3%)
       Active / Total Slabs (% used)      : 11219941 / 11219941 (100.0%)
       Active / Total Caches (% used)     : 87 / 122 (71.3%)
       Active / Total Size (% used)       : 112878003.58K / 114522983.04K (98.6%)
       Minimum / Average / Maximum Object : 0.01K / 0.22K / 8.00K
      
      OBJS      ACTIVE    USE   OBJ_SIZE  SLABS    OBJ/SLAB  CACHE_SIZE  NAME
      30545252  30067595  98%   1.12K     1092909  28        34973088K   ptlrpc_cache
      92347047  92347047  99%   0.31K     1810744  51        28971904K   bio-3
      92346672  92346672  100%  0.16K     1923889  48        15391112K   xfs_icr
      92409312  92409312  100%  0.12K     2887791  32        11551164K   kmalloc-128
      25717818  23912628  92%   0.19K     612329   42        4898632K    kmalloc-192
      25236420  24708346  97%   0.18K     573555   44        4588440K    xfs_log_ticket
      25286568  24717197  97%   0.17K     549708   46        4397664K    xfs_ili
      14103054  13252206  93%   0.19K     335787   42        2686296K    dentry
      ...
       

      The ptlrpc_cache shrinks from gigabytes to megabytes after running

      echo 2 > /proc/sys/vm/drop_caches
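To put numbers on the before/after effect, one could read the cache footprint straight from /proc/slabinfo. The following is only a sketch: `slab_kb` is a hypothetical helper name, it assumes the slabinfo 2.1 column layout (pages per slab in field 6, slab count in field 15), and it assumes the 4 KiB page size of x86_64.

```shell
#!/bin/sh
# slab_kb NAME: print the approximate footprint of one slab cache in KiB.
# Reads /proc/slabinfo-format text (version 2.1) on stdin.
# Footprint = num_slabs * pagesperslab * page size (assumed 4 KiB, x86_64).
slab_kb() {
    awk -v cache="$1" '$1 == cache { printf "%.0f\n", $15 * $6 * 4 }'
}

# Usage (as root, since /proc/slabinfo is root-readable only):
#   before=$(slab_kb ptlrpc_cache < /proc/slabinfo)
#   echo 2 > /proc/sys/vm/drop_caches
#   after=$(slab_kb ptlrpc_cache < /proc/slabinfo)
#   echo "ptlrpc_cache: ${before}K -> ${after}K"
```

Writing 2 to drop_caches asks the kernel to free reclaimable slab objects such as dentries and inodes, so a large drop here suggests the memory was held by cached objects and was not leaked.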

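      This particular node has 128GB of RAM, so the ptlrpc_cache alone accounts for roughly a quarter of it, and the slab caches together (about 109 GiB per the "Total Size" line above) account for most of it.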
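The share of RAM held by each cache can be worked out from the CACHE_SIZE column of the slabtop output. A minimal sketch, assuming slabtop's table layout as shown above (size in KiB in the second-to-last column, cache name in the last); `slab_share` is a hypothetical helper name:

```shell
#!/bin/sh
# slab_share TOTAL_KB: read slabtop table rows on stdin and print each
# cache name with its share of TOTAL_KB as a percentage.
slab_share() {
    awk -v total="$1" '
        # Only table rows have an integer "<n>K" in the second-to-last column;
        # the header and the summary lines do not match and are skipped.
        NF >= 2 && $(NF-1) ~ /^[0-9]+K$/ {
            size = $(NF-1)
            sub(/K$/, "", size)
            printf "%s %.1f\n", $NF, 100 * size / total
        }'
}

# Usage (as root), taking total memory in KiB from /proc/meminfo:
#   slabtop -o -s c | slab_share "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
```

Taking 128 GiB (134217728 KiB) as the total, the ptlrpc_cache row above comes out at about 26%.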

      After a suggestion by Oleg (see below), the node was rebooted with the kernel command line parameters slab_nomerge and slub_nomerge.  With slab cache merging disabled, the cache actually taking up the space turned out to be the lustre_inode_cache.
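When slab merging is enabled, the kernel may fold caches with compatible object sizes into one, so slabtop can report the memory under a different cache's name. Before trusting per-cache numbers, it may be worth confirming that the running kernel was booted with merging disabled. A small sketch (`has_nomerge` is a hypothetical helper name):

```shell
#!/bin/sh
# has_nomerge: read a kernel command line on stdin and succeed (exit 0)
# if either slab_nomerge or slub_nomerge appears as a separate parameter.
has_nomerge() {
    tr ' ' '\n' | grep -qx 'sl[au]b_nomerge'
}

# Usage:
#   if has_nomerge < /proc/cmdline; then
#       echo "slab merging disabled; per-cache sizes are reliable"
#   else
#       echo "slab merging enabled; cache names may be aliased"
#   fi
```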

      While this was happening, I also saw kthread_run() and fork() failures reported in the console log.  Those failures turned out to be caused by sysctl kernel.pid_max being too low, and were not related to the amount of memory in use or free.
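Since every thread, kernel threads included, consumes an ID from the PID space, one way to spot this condition is to compare the number of live threads with kernel.pid_max. A rough sketch (`pid_usage` is a hypothetical helper name; the limit on this node is not stated in the report):

```shell
#!/bin/sh
# pid_usage MAX USED: print how much of the PID space is in use.
#   MAX  = value of kernel.pid_max
#   USED = number of threads currently alive
pid_usage() {
    awk -v max="$1" -v used="$2" \
        'BEGIN { printf "%d/%d (%.0f%%)\n", used, max, 100 * used / max }'
}

# Usage:
#   pid_usage "$(cat /proc/sys/kernel/pid_max)" "$(ps -eL -o tid= | wc -l)"
```

A result close to 100% would point to PID exhaustion as the cause of fork() and kthread_run() failures even when plenty of memory is free.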

            green Oleg Drokin
            ofaaland Olaf Faaland