[LU-1576] client sluggish after running lpurge Created: 27/Jun/12  Updated: 27/Jul/12  Resolved: 27/Jul/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: Lustre 2.3.0, Lustre 2.1.3

Type: Bug Priority: Major
Reporter: Ned Bass Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

https://github.com/chaos/lustre/commits/2.1.1-13chaos


Severity: 3
Rank (Obsolete): 4566

 Description   

We periodically run lpurge on Lustre clients to keep filesystem capacity usage under control. lpurge recurses through the filesystem, generating a list of files that have not been accessed within some time threshold, and optionally removes them.

https://github.com/chaos/lustre-tools-llnl/blob/master/src/lpurge.c
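
For context, a minimal sketch of the kind of scan lpurge performs (illustrative only, not code taken from lpurge.c above; the nftw()-based walk and all names here are assumptions): walk the tree, compare each file's atime against a cutoff, and report candidates that a purge tool could then unlink.

/* Illustrative sketch only -- not taken from lpurge.c. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static time_t cutoff;

/* Called by nftw() for every entry; print regular files not accessed
 * since the cutoff.  A real purge tool would optionally unlink() them. */
static int check_atime(const char *path, const struct stat *sb,
                       int type, struct FTW *ftwbuf)
{
        if (type == FTW_F && sb->st_atime < cutoff)
                printf("%s\n", path);
        return 0;                       /* keep walking */
}

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <dir> <days>\n", argv[0]);
                return 1;
        }
        cutoff = time(NULL) - (time_t)atoi(argv[2]) * 24 * 60 * 60;
        /* FTW_PHYS: do not follow symbolic links while recursing */
        return nftw(argv[1], check_atime, 64, FTW_PHYS) ? 1 : 0;
}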

We have found that nodes running lpurge on a large number of files eventually become unusably slow. In some cases the node is evicted and lpurge terminates, but the slowness persists. There is noticeable keyboard lag, and delays in starting and running processes.

Here are some memory statistics from a slow node. In this example we see about 10G in the lustre_inode_cache slab and 30G in Inactive(file). Dropping caches clears out the slabs and the node becomes responsive again; however, Inactive(file) remains unchanged.

The backtraces below show processes stuck in the kernel shrinker, but the Lustre-related slabs don't shrink unless we drop caches manually.

# free
             total       used       free     shared    buffers     cached
Mem:      49416632   46140416    3276216          0     143212     749056
-/+ buffers/cache:   45248148    4168484
Swap:      4000232          0    4000232
# slabtop -o -s c | head
Active / Total Objects (% used)    : 21568317 / 21691269 (99.4%)
 Active / Total Slabs (% used)      : 1878088 / 1878091 (100.0%)
 Active / Total Caches (% used)     : 134 / 231 (58.0%)
 Active / Total Size (% used)       : 11945321.57K / 11964171.77K (99.8%)
 Minimum / Average / Maximum Object : 0.02K / 0.55K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
9035425 9035342  99%    1.06K 1290775        7  10326200K lustre_inode_cache
5804960 5804471  99%    0.19K 290248       20   1160992K dentry
5667330 5659606  99%    0.12K 188911       30    755644K size-128
 33005  33005 100%    8.00K  33005        1    264040K size-8192
141100 140027  99%    0.78K  28220        5    112880K ext3_inode_cache
406687 400332  98%    0.06K   6893       59     27572K size-64
232619 156570  67%    0.10K   6287       37     25148K buffer_head
 22296  21202  95%    1.00K   5574        4     22296K size-1024
  9336   9301  99%    2.00K   4668        2     18672K size-2048
 28217  21235  75%    0.55K   4031        7     16124K radix_tree_node
 74500  74356  99%    0.19K   3725       20     14900K size-192
   230    230 100%   32.12K    230        1     14720K kmem_cache
 20128  19625  97%    0.50K   2516        8     10064K size-512
  1161   1161 100%    6.65K   1161        1      9288K ll_obd_dev_cache

/proc/meminfo before and after 'echo 3 > /proc/sys/vm/drop_caches'

Before drop_caches            After drop_caches
MemTotal:       49416632 kB   MemTotal:       49416632 kB
MemFree:         3195016 kB   MemFree:        16576276 kB
Buffers:          143724 kB   Buffers:             416 kB
Cached:           836660 kB   Cached:            12572 kB
SwapCached:            0 kB   SwapCached:            0 kB
Active:           473304 kB   Active:            30836 kB
Inactive:       31535004 kB   Inactive:       31010004 kB
Active(anon):      22280 kB   Active(anon):      22356 kB
Inactive(anon):     1304 kB   Inactive(anon):     1304 kB
Active(file):     451024 kB   Active(file):       8480 kB
Inactive(file): 31533700 kB   Inactive(file): 31008700 kB
Unevictable:           0 kB   Unevictable:           0 kB
Mlocked:               0 kB   Mlocked:               0 kB
SwapTotal:       4000232 kB   SwapTotal:       4000232 kB
SwapFree:        4000232 kB   SwapFree:        4000232 kB
Dirty:                 4 kB   Dirty:                 0 kB
Writeback:             0 kB   Writeback:             0 kB
AnonPages:         23468 kB   AnonPages:         23472 kB
Mapped:            11988 kB   Mapped:            11992 kB
Shmem:               192 kB   Shmem:               192 kB
Slab:           12823932 kB   Slab:             409052 kB
SReclaimable:    1327712 kB   SReclaimable:      12436 kB
SUnreclaim:     11496220 kB   SUnreclaim:       396616 kB
KernelStack:        2768 kB   KernelStack:        2768 kB
PageTables:         3256 kB   PageTables:         3256 kB
NFS_Unstable:          0 kB   NFS_Unstable:          0 kB
Bounce:                0 kB   Bounce:                0 kB
WritebackTmp:          0 kB   WritebackTmp:          0 kB
CommitLimit:    28708548 kB   CommitLimit:    28708548 kB
Committed_AS:     135712 kB   Committed_AS:     135708 kB
VmallocTotal:   34359738367 kB   VmallocTotal:   34359738367 kB
VmallocUsed:     1180768 kB   VmallocUsed:     1180768 kB
VmallocChunk:   34332553664 kB   VmallocChunk:   34332553664 kB
HardwareCorrupted:     0 kB   HardwareCorrupted:     0 kB
AnonHugePages:         0 kB   AnonHugePages:         0 kB
HugePages_Total:       0      HugePages_Total:       0
HugePages_Free:        0      HugePages_Free:        0
HugePages_Rsvd:        0      HugePages_Rsvd:        0
HugePages_Surp:        0      HugePages_Surp:        0
Hugepagesize:       2048 kB   Hugepagesize:       2048 kB
DirectMap4k:        5312 kB   DirectMap4k:        5312 kB
DirectMap2M:     2082816 kB   DirectMap2M:     2082816 kB
DirectMap1G:    48234496 kB   DirectMap1G:    48234496 kB
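
For completeness, the manual workaround shown above (echo 3 > /proc/sys/vm/drop_caches) can also be done from C; a minimal sketch, assuming the standard procfs interface and root privileges:

/* Sketch of the drop_caches workaround used above: sync dirty data,
 * then write "3" (pagecache + reclaimable slab) to the procfs knob. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        FILE *f;

        sync();                                  /* flush dirty pages first */

        f = fopen("/proc/sys/vm/drop_caches", "w");
        if (f == NULL) {
                perror("/proc/sys/vm/drop_caches");
                return 1;
        }
        fputs("3\n", f);
        return fclose(f) ? 1 : 0;
}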

Finally, sysrq-l backtraces from example slow processes show them in shrink_inactive_list:

Process in.mrlogind

isolate_pages_global
shrink_inactive_list
shrink_zone
zone_reclaim
get_page_from_freelist
__alloc_pages_nodemask
kmem_getpages
cache_grow
cache_alloc_refill
kmem_cache_alloc
__alloc_skb
sk_stream_alloc_skb
tcp_sendmsg
sock_aio_write
do_sync_write
vfs_write
sys_write
system_call_fastpath

Process opcontrol

__isolate_lru_page
isolate_pages_global
shrink_inactive_list
shrink_zone
zone_reclaim
isolate_pages_global
get_page_from_freelist
__alloc_pages_nodemask
alloc_pages_current
__pte_alloc
copy_pte_range
kmem_getpages
cache_grow
cache_alloc_refill
kmem_cache_alloc
dup_mm
copy_process
do_fork
alloc_fd
fd_install
sys_clone
stub_clone
system_call_fastpath

LLNL-bugzilla-ID: 1661



 Comments   
Comment by Peter Jones [ 27/Jun/12 ]

Bobijam

Could you please look into this one?

Thanks

Peter

Comment by Zhenyu Xu [ 28/Jun/12 ]

If the clients are multi-core machines, the patches from LU-1282 could help relieve some memory pressure.

I'll still investigate the "Inactive" issue.

Comment by Zhenyu Xu [ 02/Jul/12 ]

Ned Bass,

What kernel do you use? We think this issue might happen on kernels newer than 2.6.18, and the patch at http://review.whamcloud.com/3255 could fix it.

LU-1576 llite: correct page usage count

If the kernel has add_to_page_cache_lru(), ll_pagevec_add() is defined
as an empty function, while page_cache_get(page) only makes sense when
ll_pagevec_add() actually does something.

This patch moves page_cache_get() into the ll_pagevec_add() macro
definition.
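
A sketch of the pattern that commit message describes (illustrative only, not the actual hunk from http://review.whamcloud.com/3255; the HAVE_ADD_TO_PAGE_CACHE_LRU guard and the macro's argument list are assumptions): page_cache_get() moves inside ll_pagevec_add(), so on kernels where the macro compiles away no extra page reference is taken, and the cached pages can be reclaimed without a manual drop_caches.

/* Illustrative kernel-side sketch, not the real patch; the guard name
 * and macro arguments are assumed. */
#include <linux/pagemap.h>
#include <linux/pagevec.h>

#ifdef HAVE_ADD_TO_PAGE_CACHE_LRU
/* add_to_page_cache_lru() already puts the page on the LRU, so no
 * pagevec work and, crucially, no extra page reference is taken here. */
#define ll_pagevec_add(pvec, page)  do { } while (0)
#else
/* Older kernels: the pagevec holds a reference until it is drained,
 * so the get lives inside the macro rather than in every caller. */
#define ll_pagevec_add(pvec, page)              \
do {                                            \
        page_cache_get(page);                   \
        pagevec_add(pvec, page);                \
} while (0)
#endif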

Comment by Christopher Morrone [ 03/Jul/12 ]

LU-1282 won't really help here. That is a one-time static allocation at mount time, and there are plenty of gigabytes left after it. Lowering that static usage would just delay the problem a bit.

We are using RHEL 6.2's 2.6.32 with some local patches.

Ned is on vacation this week.

Let me know when the patch is reviewed and available for b2_1, and I'll add it to our branch.

Comment by Christopher Morrone [ 06/Jul/12 ]

Looks like it landed on master, and it applied cleanly to our 2.1.1-llnl branch, so I pulled it in.

Comment by Jay Lan (Inactive) [ 25/Jul/12 ]

Hi Chris, is this patch in production at your site yet, or still under testing?
Thanks! Jay

Comment by Christopher Morrone [ 25/Jul/12 ]

It is in production in some places. I have not heard yet whether they have run lpurge on one of the updated systems. I will ask.

Comment by Peter Jones [ 27/Jul/12 ]

Landed for 2.1.3 and 2.3. Will reopen if further work is needed.
