[LU-1645] shrinker not shrinking/taking too long to shrink? Created: 18/Jul/12  Updated: 27/Mar/13  Resolved: 27/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: jason.rappleye@nasa.gov (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

SLES 11SP1 Kernel 2.6.32.54-0.3.1.20120223-nasa


Attachments: Text File bt-a.txt     Text File service31-sysrq-t.txt    
Severity: 3
Rank (Obsolete): 7034

 Description   

We're seeing high load average on some Lustre clients accompanied by processes that are potentially stuck in the ldlm shrinker. Here's a sample stack trace:

thread_return+0x38/0x34c
wake_affine+0x357/0x3b0
enqueue_sleeper+0x178/0x1c0
enqueue_entity+0x158/0x1c0
cfs_hash_bd_lookup_intent+0x27/0x110 [libcfs]
cfs_hash_dual_bd_unlock+0x2c/0x80 [libcfs]
cfs_hash_lookup+0x7a/0xa0 [libcfs]
ldlm_pool_shrink+0x31/0xf0 [ptlrpc]
cl_env_fetch+0x1d/0x60 [obdclass]
cl_env_reexit+0xe/0x130 [obdclass]
ldlm_pools_shrink+0x1d2/0x310 [ptlrpc]
zone_watermark_ok+0x1b/0xd0
get_page_from_freelist+0x17a/0x720
apic_timer_interrupt+0xe/0x20
smp_call_function_many+0x1c0/0x250
drain_local_pages+0x0/0x10
smp_call_function+0x20/0x30
on_each_cpu+0x1d/0x40
__alloc_pages_slowpath+0x278/0x5f0
__alloc_pages_nodemask+0x13a/0x140
__get_free_pages+0x9/0x50
dup_task_struct+0x42/0x150
copy_process+0xb4/0xe50
do_fork+0x8c/0x3c0
sys_rt_sigreturn+0x222/0x2a0
stub_clone+0x13/0x20
system_call_fastpath+0x16/0x1b

FWIW, some of the traces have cfs_hash_bd_lookup_intent+0x27 as the top line. All of them are somewhere in the ldlm shrinker path.
About 3/4 of the memory is inactive:

pfe11 ~ # cat /proc/meminfo
MemTotal: 16333060 kB
MemFree: 344568 kB
Buffers: 86844 kB
Cached: 1488340 kB
SwapCached: 4864 kB
Active: 1523184 kB
Inactive: 12045612 kB
Active(anon): 9152 kB
Inactive(anon): 7012 kB
Active(file): 1514032 kB
Inactive(file): 12038600 kB
Unevictable: 3580 kB
Mlocked: 3580 kB
SwapTotal: 10388652 kB
SwapFree: 10136240 kB
Dirty: 244 kB
Writeback: 976 kB
AnonPages: 15600 kB
Mapped: 20296 kB
Shmem: 0 kB
Slab: 870808 kB
SReclaimable: 64868 kB
SUnreclaim: 805940 kB
KernelStack: 4312 kB
PageTables: 14840 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18555180 kB
Committed_AS: 1074912 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 544012 kB
VmallocChunk: 34343786784 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7168 kB
DirectMap2M: 16769024 kB
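
For the record, the 3/4 figure above is just Inactive/MemTotal = 12045612/16333060, about 74%. A quick one-liner to check it on any node (just a sketch):

awk '/^MemTotal:/ {t=$2} /^Inactive:/ {i=$2} END {printf "Inactive/MemTotal = %.1f%%\n", 100*i/t}' /proc/meminfo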

We've seen this on two clients in the last two days, and I think we have several other undiagnosed cases in the recent past. The client that did it yesterday was generating OOM messages at the time; today's client did not.

I have a crash dump, but I'm having trouble getting good stack traces out of it. I'll attach the output from sysrq-t to start. I can't share the crash dump due to our security policies, but I can certainly run commands against it for you, as necessary.
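
For reference, this is roughly how I've been trying to pull traces out of the dump with the crash(8) utility (the vmlinux path below is an example, not our actual layout):

crash /usr/lib/debug/boot/vmlinux-2.6.32.54-0.3.1.20120223-nasa vmcore
  bt -a          # backtraces for the active task on each CPU
  foreach bt     # backtraces for every task in the dump
  kmem -i        # overall memory usage summary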

If there's more information I can gather from a running system before we reboot it, let me know - I imagine we'll have another one soon.
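
In the meantime, here's what I'm planning to capture on the next stuck client before rebooting (a sketch; the parameter names assume a stock Lustre 2.x client):

lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size
slabtop -o -s c | head -25        # top slab caches by total size
echo m > /proc/sysrq-trigger      # dump memory state to dmesg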



 Comments   
Comment by Peter Jones [ 18/Jul/12 ]

Bob will look into this one

Comment by Jay Lan (Inactive) [ 19/Jul/12 ]

I uploaded bt-a.txt, which contains the stack traces from when the crash dump was taken.
Note that CPU2 was in shrink_slab and CPU4 and CPU5 were in shrink_zone.

Comment by Bob Glossman (Inactive) [ 19/Jul/12 ]

There's a suspicion here that this may be an instance of a known bug, LU-1576. If you can reproduce the problem, you can try dropping caches with:

echo 3 > /proc/sys/vm/drop_caches

If that raises MemFree significantly and eliminates the OOMs, then it's probably the known bug.
If so, the patch in http://review.whamcloud.com/#change,3255 may help.
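
Something along these lines would capture the before/after comparison (paths are just examples):

cat /proc/meminfo > /tmp/meminfo.before
sync
echo 3 > /proc/sys/vm/drop_caches
cat /proc/meminfo > /tmp/meminfo.after
diff /tmp/meminfo.before /tmp/meminfo.after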

Comment by jason.rappleye@nasa.gov (Inactive) [ 19/Jul/12 ]

That looks promising. I've asked our operations staff to try that and collect /proc/meminfo before and after. I'll report back with the results after the next incident. Thanks!

Comment by jason.rappleye@nasa.gov (Inactive) [ 23/Jul/12 ]

lflush + drop caches didn't work. What's the next step in debugging this problem? I have a crash dump or two that might help, but you'll need to let me know what you need - per our security policies, I can't send them to you.

Comment by Peter Jones [ 27/Mar/13 ]

NASA reports that this no longer seems to be a problem, so this was quite possibly a duplicate: the issue mentioned above (LU-1576) is fixed in the release that is now in production.
