[LU-1645] shrinker not shrinking/taking too long to shrink? Created: 18/Jul/12 Updated: 27/Mar/13 Resolved: 27/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | jason.rappleye@nasa.gov (Inactive) | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | SLES 11 SP1, kernel 2.6.32.54-0.3.1.20120223-nasa |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7034 |
| Description |
|
We're seeing high load average on some Lustre clients accompanied by processes that are potentially stuck in the ldlm shrinker. Here's a sample stack trace:

    thread_return+0x38/0x34c

FWIW, some of the traces have cfs_hash_bd_lookup_intent+0x27 as the top line. All of them

About 3/4 of the memory is inactive:

    pfe11 ~ # cat /proc/meminfo

We've seen this on two clients in the last two days, and I think we have several other undiagnosed cases in the recent past. The client that did it yesterday was generating OOM messages at the time; today's client did not.

I have a crash dump, but I'm having trouble getting good stack traces out of it. I'll attach the output from sysrq-t to start. I can't share the crash dump due to our security policies, but I can certainly run commands against it for you, as necessary. If there's more information I can gather from a running system before we reboot it, let me know - I imagine we'll have another one soon. |
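The reporter offers to collect more data from a running system before the next reboot. As a rough sketch only (these commands are not from the ticket, and the exact proc paths on a 2.1.2 client may differ slightly), the data that usually helps with a suspected ldlm shrinker problem can be gathered like this:

    cat /proc/meminfo > meminfo.$(date +%s)            # memory snapshot
    lctl get_param ldlm.namespaces.*.lock_count        # per-namespace DLM lock counts
    lctl get_param ldlm.namespaces.*.lru_size          # current LRU size settings
    echo t > /proc/sysrq-trigger                       # sysrq-t: dump all task stacks to the kernel log
    dmesg > sysrq-t.$(date +%s)                        # capture the stack dump

Comparing lock_count over time shows whether the LRU is actually shrinking while the stuck processes sit in the shrinker path.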
| Comments |
| Comment by Peter Jones [ 18/Jul/12 ] |
|
Bob will look into this one |
| Comment by Jay Lan (Inactive) [ 19/Jul/12 ] |
|
I uploaded bt-a.txt, containing the stack traces from when the crash dump was taken. |
| Comment by Bob Glossman (Inactive) [ 19/Jul/12 ] |
|
There's a suspicion here that this may be an instance of a known bug. To check, try:

    echo 3 > /proc/sys/vm/drop_caches

If that raises the MemFree amount a lot and eliminates the OOMs, then it's probably the known bug. |
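One way to run that check while keeping before/after evidence for the ticket (a sketch only; the file names are arbitrary and not part of the original suggestion):

    sync                                      # write out dirty pages first
    cat /proc/meminfo > meminfo.before
    echo 3 > /proc/sys/vm/drop_caches         # drop pagecache plus dentries and inodes
    cat /proc/meminfo > meminfo.after
    diff meminfo.before meminfo.after         # MemFree and Inactive are the lines to watch

If MemFree jumps substantially after the drop, the memory was held in reclaimable caches rather than leaked.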
| Comment by jason.rappleye@nasa.gov (Inactive) [ 19/Jul/12 ] |
|
That looks promising. I've asked our operations staff to try that and collect /proc/meminfo before and after. I'll report back with the results after the next incident. Thanks! |
| Comment by jason.rappleye@nasa.gov (Inactive) [ 23/Jul/12 ] |
|
lflush + drop caches doesn't work. What's the next step in debugging this problem? I have a crash dump or two that might help, but you'll need to let me know what you need - as per our security policies, I can't send them to you. |
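Since the dump cannot leave the site, one practical option is to run a few commands locally in the crash utility and post only the text output. The following is a sketch of commands commonly used for this kind of hang analysis, not something requested in the thread; the vmlinux path is a placeholder and assumes the matching kernel debuginfo is installed:

    crash /usr/lib/debug/boot/vmlinux-2.6.32.54-0.3.1.20120223-nasa vmcore

    crash> kmem -i            # overall memory usage summary
    crash> ps -l              # tasks with last-run timestamps
    crash> foreach UN bt      # backtraces of all uninterruptible (D state) tasks
    crash> bt <pid>           # full stack of a specific suspect process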
| Comment by Peter Jones [ 27/Mar/13 ] |
|
NASA reports that this no longer seems to be a problem, so it was quite possibly a duplicate; the known issue mentioned above is fixed in the release that is now in production. |