Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Minor
- None
- Affects Version: Lustre 2.1.2
- None
- Environment: SLES 11SP1 Kernel 2.6.32.54-0.3.1.20120223-nasa
- 3
- 7034
Description
We're seeing high load averages on some Lustre clients, accompanied by processes that appear to be stuck in the ldlm shrinker. Here's a sample stack trace:
thread_return+0x38/0x34c
wake_affine+0x357/0x3b0
enqueue_sleeper+0x178/0x1c0
enqueue_entity+0x158/0x1c0
cfs_hash_bd_lookup_intent+0x27/0x110 [libcfs]
cfs_hash_dual_bd_unlock+0x2c/0x80 [libcfs]
cfs_hash_lookup+0x7a/0xa0 [libcfs]
ldlm_pool_shrink+0x31/0xf0 [ptlrpc]
cl_env_fetch+0x1d/0x60 [obdclass]
cl_env_reexit+0xe/0x130 [obdclass]
ldlm_pools_shrink+0x1d2/0x310 [ptlrpc]
zone_watermark_ok+0x1b/0xd0
get_page_from_freelist+0x17a/0x720
apic_timer_interrupt+0xe/0x20
smp_call_function_many+0x1c0/0x250
drain_local_pages+0x0/0x10
smp_call_function+0x20/0x30
on_each_cpu+0x1d/0x40
__alloc_pages_slowpath+0x278/0x5f0
__alloc_pages_nodemask+0x13a/0x140
__get_free_pages+0x9/0x50
dup_task_struct+0x42/0x150
copy_process+0xb4/0xe50
do_fork+0x8c/0x3c0
sys_rt_sigreturn+0x222/0x2a0
stub_clone+0x13/0x20
system_call_fastpath+0x16/0x1b
FWIW, some of the traces have cfs_hash_bd_lookup_intent+0x27 as the top frame; all of them go through the ldlm shrinker.
About 3/4 of the memory is inactive:
pfe11 ~ # cat /proc/meminfo
MemTotal: 16333060 kB
MemFree: 344568 kB
Buffers: 86844 kB
Cached: 1488340 kB
SwapCached: 4864 kB
Active: 1523184 kB
Inactive: 12045612 kB
Active(anon): 9152 kB
Inactive(anon): 7012 kB
Active(file): 1514032 kB
Inactive(file): 12038600 kB
Unevictable: 3580 kB
Mlocked: 3580 kB
SwapTotal: 10388652 kB
SwapFree: 10136240 kB
Dirty: 244 kB
Writeback: 976 kB
AnonPages: 15600 kB
Mapped: 20296 kB
Shmem: 0 kB
Slab: 870808 kB
SReclaimable: 64868 kB
SUnreclaim: 805940 kB
KernelStack: 4312 kB
PageTables: 14840 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18555180 kB
Committed_AS: 1074912 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 544012 kB
VmallocChunk: 34343786784 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7168 kB
DirectMap2M: 16769024 kB
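For reference, the fractions work out as stated; a quick sanity check with values (in kB) hard-coded from the /proc/meminfo dump above:

```python
# Values (in kB) copied from the /proc/meminfo dump above.
mem_total = 16333060
inactive = 12045612
slab_unreclaim = 805940

inactive_frac = inactive / mem_total
print(f"Inactive: {inactive_frac:.1%} of RAM")        # roughly 3/4
print(f"SUnreclaim: {slab_unreclaim // 1024} MB")     # unreclaimable slab
```

Note that nearly all of the inactive memory is file-backed page cache (Inactive(file)), while SUnreclaim alone is most of the Slab total.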
We've seen this on two clients in the last two days, and I think we have several other undiagnosed cases in the recent past. The client that did it yesterday was generating OOM messages at the time; today's client did not.
I have a crash dump, but I'm having trouble getting good stack traces out of it. I'll attach the output from sysrq-t to start. I can't share the crash dump due to our security policies, but I can certainly run commands against it for you, as necessary.
If there's more information I can gather from a running system before we reboot it, let me know - I imagine we'll have another one soon.