Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Minor
- None
- Affects Version: Lustre 2.1.2
- None
- Environment: SLES 11SP1 Kernel 2.6.32.54-0.3.1.20120223-nasa
- 3
- 7034
Description
We're seeing high load averages on some Lustre clients, accompanied by processes that appear to be stuck in the ldlm shrinker. Here's a sample stack trace:
thread_return+0x38/0x34c
wake_affine+0x357/0x3b0
enqueue_sleeper+0x178/0x1c0
enqueue_entity+0x158/0x1c0
cfs_hash_bd_lookup_intent+0x27/0x110 [libcfs]
cfs_hash_dual_bd_unlock+0x2c/0x80 [libcfs]
cfs_hash_lookup+0x7a/0xa0 [libcfs]
ldlm_pool_shrink+0x31/0xf0 [ptlrpc]
cl_env_fetch+0x1d/0x60 [obdclass]
cl_env_reexit+0xe/0x130 [obdclass]
ldlm_pools_shrink+0x1d2/0x310 [ptlrpc]
zone_watermark_ok+0x1b/0xd0
get_page_from_freelist+0x17a/0x720
apic_timer_interrupt+0xe/0x20
smp_call_function_many+0x1c0/0x250
drain_local_pages+0x0/0x10
smp_call_function+0x20/0x30
on_each_cpu+0x1d/0x40
__alloc_pages_slowpath+0x278/0x5f0
__alloc_pages_nodemask+0x13a/0x140
__get_free_pages+0x9/0x50
dup_task_struct+0x42/0x150
copy_process+0xb4/0xe50
do_fork+0x8c/0x3c0
sys_rt_sigreturn+0x222/0x2a0
stub_clone+0x13/0x20
system_call_fastpath+0x16/0x1b
FWIW, some of the traces have cfs_hash_bd_lookup_intent+0x27 as the top frame; all of them go through the ldlm shrinker.
About 3/4 of the memory is inactive:
pfe11 ~ # cat /proc/meminfo
MemTotal: 16333060 kB
MemFree: 344568 kB
Buffers: 86844 kB
Cached: 1488340 kB
SwapCached: 4864 kB
Active: 1523184 kB
Inactive: 12045612 kB
Active(anon): 9152 kB
Inactive(anon): 7012 kB
Active(file): 1514032 kB
Inactive(file): 12038600 kB
Unevictable: 3580 kB
Mlocked: 3580 kB
SwapTotal: 10388652 kB
SwapFree: 10136240 kB
Dirty: 244 kB
Writeback: 976 kB
AnonPages: 15600 kB
Mapped: 20296 kB
Shmem: 0 kB
Slab: 870808 kB
SReclaimable: 64868 kB
SUnreclaim: 805940 kB
KernelStack: 4312 kB
PageTables: 14840 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18555180 kB
Committed_AS: 1074912 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 544012 kB
VmallocChunk: 34343786784 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7168 kB
DirectMap2M: 16769024 kB
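For reference, the fractions work out as stated; a quick sanity check with values (in kB) hard-coded from the /proc/meminfo dump above:

```python
# Values (in kB) copied from the /proc/meminfo dump above.
mem_total = 16333060
inactive = 12045612
slab_unreclaim = 805940

inactive_frac = inactive / mem_total
print(f"Inactive: {inactive_frac:.1%} of RAM")        # roughly 3/4
print(f"SUnreclaim: {slab_unreclaim // 1024} MB")     # unreclaimable slab
```

Note that nearly all of the inactive memory is file-backed page cache (Inactive(file)), while SUnreclaim alone is most of the Slab total.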
We've seen this on two clients in the last two days, and I think we have several other undiagnosed cases in the recent past. The client that did it yesterday was generating OOM messages at the time; today's client did not.
I have a crash dump, but I'm having trouble getting good stack traces out of it. I'll attach the output from sysrq-t to start. I can't share the crash dump due to our security policies, but I can certainly run commands against it for you, as necessary.
If there's more information I can gather from a running system before we reboot it, let me know - I imagine we'll have another one soon.