[LU-3897] Hang up in ldlm_pools_shrink under OOM Created: 06/Sep/13 Updated: 05/Dec/13 Resolved: 05/Dec/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Sebastien Buisson (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10187 |
| Description |
|
Hi, Several Bull customers running Lustre 2.1.x have hit a hang in ldlm_pools_shrink on a Lustre client (login node). Each time, we can see a lot of OOM messages in the syslog of the dump files. This issue looks similar to a previously reported one. Thanks, |
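Since the later comments show the Lustre debug buffer was already gone from dmesg by the time the dump was analyzed, here is a hedged sketch of what could be captured on the client the next time the hang is observed; the output file paths are illustrative assumptions, not values from this ticket.

    # Illustrative only: run on the hung client before (or while) taking the crash dump.
    # Dump the Lustre kernel debug buffer to a file so it is not lost with the dmesg ring.
    lctl dk /tmp/lustre-debug.$(date +%s).log
    # Dump all task stack traces into dmesg so they end up in the vmcore as well.
    echo t > /proc/sysrq-trigger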
| Comments |
| Comment by Peter Jones [ 06/Sep/13 ] |
|
Bobijam, what do you recommend here? Thanks, Peter |
| Comment by Zhenyu Xu [ 09/Sep/13 ] |
|
Oleg, would you consider http://review.whamcloud.com/#/c/4954/ ? I don't know whether Vitaly has addressed the issue somewhere. |
| Comment by Oleg Drokin [ 09/Sep/13 ] |
|
I doubt that patch would help. Sebastien, can we get the kernel dmesg log please? I wonder what's there. It seems that something forgot to release the namespace lock on the client (at least I don't see anything running in the area guarded by this lock; everybody seems to be waiting to acquire it). Possibly there's some place that forgets to release it on an error exit path, and here's hoping there's some clue in the dmesg. |
| Comment by Sebastien Buisson (Inactive) [ 09/Sep/13 ] |
|
Hi, This is the dmesg requested by Oleg, taken from the crash dump. Sebastien. |
| Comment by Zhenyu Xu [ 08/Nov/13 ] |
|
Hmm, why doesn't the dmesg have any Lustre log information? I only see call traces and a mem-info report in it. |
| Comment by Sebastien Buisson (Inactive) [ 08/Nov/13 ] |
|
Probably so much information was dumped into dmesg about the OOM issue that it pushed the Lustre logs out. From the dump, do you know how we can access the memory containing the Lustre debug logs? |
| Comment by Zhenyu Xu [ 13/Nov/13 ] |
|
I haven't found relevant info pointing to the root cause. What's the memory usage on the clients? I wonder whether the client uses too much memory for locks or for data cache. What are the "lctl get_param ldlm.namespaces.*.lru_size" values? |
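As a reference for gathering the value being asked about, a minimal sketch of the query on a live client; the namespace names and counts in the example output are invented for illustration.

    # Number of locks currently held in each namespace's LRU on this client.
    lctl get_param ldlm.namespaces.*.lru_size
    # Illustrative output only:
    # ldlm.namespaces.lustre-MDT0000-mdc-ffff88012345a800.lru_size=1200
    # ldlm.namespaces.lustre-OST0001-osc-ffff88012345a800.lru_size=850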
| Comment by Sebastien Buisson (Inactive) [ 14/Nov/13 ] |
|
Hi, It was not too easy to get the lru_size values from the crash dump. All values are 0. HTH, |
| Comment by Zhenyu Xu [ 27/Nov/13 ] |
|
If lru_size is 0, it means locks are not taking much memory. What is the "lctl get_param llite.*.max_cached_mb" output? Can you set it to a smaller value to keep less data cached on the client side? |
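A hedged sketch of the commands being suggested here; the 8192 MB limit is an arbitrary illustrative value, not a recommendation from this ticket.

    # Current ceiling on the Lustre client-side data cache, in MB.
    lctl get_param llite.*.max_cached_mb
    # Lower the ceiling (example value only). Plain set_param takes effect
    # immediately but is not persistent across client remounts.
    lctl set_param llite.*.max_cached_mb=8192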
| Comment by Sebastien Buisson (Inactive) [ 29/Nov/13 ] |
|
Hi, In the crash dump I looked at the value of ll_async_page_max, which is 12365796, so max_cached_mb is around 48 GB, i.e. 3/4 of total node memory. But in any case, remember there is a regression in CLIO in 2.1 that causes this max_cached_mb value to never be used! So there is no way to limit the Lustre data cache. Sebastien. |
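For the arithmetic behind that figure, assuming the usual 4 KB page size (ll_async_page_max is a page count):

    # 12365796 pages * 4096 bytes, converted with integer division.
    echo $(( 12365796 * 4096 / 1024 / 1024 ))        # => 48303 MB
    echo $(( 12365796 * 4096 / 1024 / 1024 / 1024 )) # => 47, i.e. roughly 48 GB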
| Comment by Jinshan Xiong (Inactive) [ 03/Dec/13 ] |
|
Hi Sebastien, can you please show me the output of `slabtop'? |
| Comment by Sebastien Buisson (Inactive) [ 03/Dec/13 ] |
|
Hi, I have no idea how I can run slabtop inside crash :/ Maybe you will be able to get the information you are looking for. Thanks, |
| Comment by Jinshan Xiong (Inactive) [ 03/Dec/13 ] |
|
`kmem -s' will print out slabinfo in this case. Thanks for the crashdump, and I will take a look. |
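A hedged sketch of the workflow being described; the vmlinux and vmcore paths are placeholders, not paths from this ticket.

    # Open the dump with the matching debuginfo vmlinux (placeholder paths).
    crash /usr/lib/debug/lib/modules/<kernel-version>/vmlinux /var/crash/<date>/vmcore
    # Inside the crash session:
    #   kmem -s    # per-slab-cache usage (the slabinfo mentioned above)
    #   kmem -i    # overall memory and swap summary (shown in the next comment)
    #   kmem -V    # vm_stat counters, including NR_ANON_PAGES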
| Comment by Jinshan Xiong (Inactive) [ 03/Dec/13 ] |
|
The system ran out of memory, from the output of `kmem -i':

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  16487729      62.9 GB         ----
         FREE     37266     145.6 MB    0% of TOTAL MEM
         USED  16450463      62.8 GB   99% of TOTAL MEM
       SHARED       401       1.6 MB    0% of TOTAL MEM
      BUFFERS        99       396 KB    0% of TOTAL MEM
       CACHED        57       228 KB    0% of TOTAL MEM
         SLAB     22513      87.9 MB    0% of TOTAL MEM

   TOTAL SWAP    255857     999.4 MB         ----
    SWAP USED    255856     999.4 MB   99% of TOTAL SWAP
    SWAP FREE         1         4 KB    0% of TOTAL SWAP

Both system memory and swap space were used up. Page cache and slab cache were normal, with no excessive usage at all. It turned out that most of the memory was used for anonymous memory mappings:

crash> kmem -V
  VM_STAT:
        NR_FREE_PAGES: 37266
     NR_INACTIVE_ANON: 956162
       NR_ACTIVE_ANON: 15158596
     NR_INACTIVE_FILE: 0
       NR_ACTIVE_FILE: 99
       NR_UNEVICTABLE: 0
             NR_MLOCK: 0
        NR_ANON_PAGES: 16115037

There were 16115037 NR_ANON_PAGES, which is about 61 GB. I guess the application is probably using too much memory or is badly written. I don't think we can help in this case. |
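The 61 GB figure follows from the page count, again assuming 4 KB pages; the crash `ps' step below is a possible follow-up to identify the process holding the anonymous memory, sketched as an assumption rather than something done in this ticket.

    # 16115037 anonymous pages * 4096 bytes, integer division down to GB.
    echo $(( 16115037 * 4096 / 1024 / 1024 / 1024 )) # => 61 GB
    # Inside crash, per-task VSZ/RSS can point at the biggest consumer:
    #   crash> ps
    # (sort the RSS column of the output to find the largest anonymous mappings)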
| Comment by Jinshan Xiong (Inactive) [ 05/Dec/13 ] |
|
Please reopen this ticket if you have more questions |