Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Environment: lola
       build: 2.8.50-6-gf9ca359 ; commit f9ca359284357d145819beb08b316e932f7a3060
    • Severity: 3

    Description

      The error occurred during soak testing of build '20160215' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150215). DNE is enabled.
      The MDTs had been formatted using ldiskfs, the OSTs using ZFS. The MDS nodes are configured in an active-active HA failover configuration.

      Please note that build '20160215' is a vanilla build of the master branch.
      This issue might be addressed by the changes included in build '20160210', as we did not observe it during a two-day test session with that build.

      • 2016-02-15 15:37:51,169:fsmgmt.fsmgmt:INFO triggering fault mds_failover (for lola-11)
      • 2016-02-15 15:44:57,839:fsmgmt.fsmgmt:INFO mds_failover just completed (for lola-11)
      • After that, slab memory consumption increased continuously until all resources were exhausted at 2016-02-15 22:38.
      • Most pages are allocated by the size-1048576 slab. The high-score list (top consumers by allocated bytes) reads as:
        #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
        20160215 22:46:20 size-1048576 29147 30562844672 29147 30562844672 29147 30562844672 29147 30562844672 0 0
        20160215 22:46:20 size-262144 1793 470024192 1793 470024192 1793 470024192 1793 470024192 0 0
        20160215 22:46:20 ptlrpc_cache 399364 306711552 399380 306723840 79873 327159808 79876 327172096 180224 0
        20160215 22:46:20 size-1024 229179 234679296 229188 234688512 57295 234680320 57297 234688512 -24576 0
        20160215 22:46:20 size-512 256540 131348480 258232 132214784 32278 132210688 32279 132214784 86016 0
        20160215 22:46:20 size-192 460848 88482816 460880 88488960 23043 94384128 23044 94388224 28672 0
        20160215 22:46:20 size-8192 5776 47316992 5776 47316992 5776 47316992 5776 47316992 -8192 0
        20160215 22:46:20 size-128 265120 33935360 266250 34080000 8875 36352000 8875 36352000 0 0
        20160215 22:46:20 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        20160215 22:46:20 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        

        (see attached file slab-usage-by-allocation-descending.dat)

      Attached files: the messages and console logs of lola-11, plus the sorted slab usage captured when the oom-killer was started.
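      As a rough illustration of how a slab ranking like the one above can be reproduced on a live node, here is a minimal sketch, not the tool used on lola, that ranks kmem caches by total allocated bytes straight from /proc/slabinfo. It assumes the slabinfo 2.x text format (first four columns: name, active_objs, num_objs, objsize), and its output loosely mirrors the SlabName/ObjInUse/ObjAll/ObjAllB columns above.

        /* slabtop-lite: rank kmem caches by total allocated bytes.
         * Minimal sketch; assumes the /proc/slabinfo 2.x text format where the
         * first four columns are <name> <active_objs> <num_objs> <objsize>.
         * Reading /proc/slabinfo typically requires root.
         */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct slab {
                char name[64];
                unsigned long active_objs;      /* ObjInUse */
                unsigned long num_objs;         /* ObjAll   */
                unsigned long objsize;          /* bytes per object */
                unsigned long long total;       /* ObjAllB = num_objs * objsize */
        };

        static struct slab slabs[512];

        static int cmp_total_desc(const void *a, const void *b)
        {
                const struct slab *x = a, *y = b;

                if (x->total == y->total)
                        return 0;
                return x->total < y->total ? 1 : -1;
        }

        int main(void)
        {
                FILE *f = fopen("/proc/slabinfo", "r");
                char line[512];
                int n = 0, i;

                if (f == NULL) {
                        perror("/proc/slabinfo");
                        return 1;
                }

                while (n < 512 && fgets(line, sizeof(line), f) != NULL) {
                        struct slab *s = &slabs[n];

                        /* skip the "slabinfo - version" and "# name ..." headers */
                        if (line[0] == '#' || strncmp(line, "slabinfo", 8) == 0)
                                continue;
                        if (sscanf(line, "%63s %lu %lu %lu", s->name,
                                   &s->active_objs, &s->num_objs, &s->objsize) != 4)
                                continue;
                        s->total = (unsigned long long)s->num_objs * s->objsize;
                        n++;
                }
                fclose(f);

                qsort(slabs, n, sizeof(slabs[0]), cmp_total_desc);

                printf("%-24s %12s %12s %16s\n",
                       "SlabName", "ObjInUse", "ObjAll", "ObjAllB");
                for (i = 0; i < n && i < 10; i++)
                        printf("%-24s %12lu %12lu %16llu\n", slabs[i].name,
                               slabs[i].active_objs, slabs[i].num_objs,
                               slabs[i].total);
                return 0;
        }

      Run on the affected MDS, this kind of ranking would have shown size-1048576 at roughly 30 GB, matching the table above.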

    Attachments

    Issue Links

    Activity

            [LU-7780] MDS crashed with oom-killer
            pjones Peter Jones added a comment -

            ok thanks Di

            di.wang Di Wang added a comment -

            Yes, this is a duplicate of LU-7836.

            pjones Peter Jones added a comment -

            Moving to 2.9 because it seems that this issue only occurs with multiple MDTs per MDS and does not happen with the more common configuration of a single MDT per MDS. Is this a duplicate of LU-7836?


            heckes Frank Heckes (Inactive) added a comment -

            Debug filter has been changed to: debug_mb=128 debug=+malloc +trace

            adilger Andreas Dilger added a comment -

            Frank or Cliff,
            it might be useful to increase the debug buffer size and turn on trace logging (lctl set_param debug_mb=128 debug=+trace), then let it continue to run for a minute or two and dump the debug log, without having to wait for the crash to be hit. The size-1024 items are being allocated at about 10/sec, so the cause will appear in any log dumped. Similarly, the size-1048576 allocations are done every minute, so the cause should also be in any logs dumped. The size-262144 is only being allocated in chunks every 15 minutes, so it might be related to recovery if that matches the failover frequency; capturing it might need to be timed to coincide with a failover, or the debug log size increased to 512 or 1024 MB.

            Di, the memory is all used in ptlrpc_cache and large (1MB and 256KB) slab allocations. My first guess is that these large allocations relate to striped directory RPCs (i.e. OUT) and recovery, and are probably not being freed except at shutdown. There are 30k 1MB allocations consuming a whopping 30GB of memory, 400k ptlrpc_cache entries consuming 300MB of memory, and 230k 1KB allocations consuming 235MB. All of those slabs are growing continuously from startup, yet none of the object caches that would normally grow while the node is caching a lot of data (ldlm_lock, ldlm_resource, ldiskfs_inode, buffer_head, dentry, etc.) are growing.

            It might be that we are just keeping too many requests in ptlrpc_cache? In that case, we might need to add a slab shrinker for ptlrpc_cache to keep the size of this slab in check, but I'm not sure this is the root problem because it seems all of the ptlrpc_cache items are in use, so it may be that there is a stray reference on the request that isn't being put somewhere? It is also a bit sad that the cache name is ptlrpc_cache but the internal usage is request_cache. The slab cache name should be changed to ptlrpc_request_cache to match the code so that grep can find all of the relevant code at once.

            Is it possible that ptlrpc_history is enabled on these nodes? That would also keep a lot of requests pinned in memory in ptlrpc_server_drop_request(). Normally this wouldn't be a problem, but if there are so many large buffers attached to the RPCs this could be causing the OOM, when smaller (e.g. 32KB) buffers wouldn't be a problem. It also isn't clear that there is a benefit to keeping the request buffers in the request history, since they are never used, so it might also be possible to release the buffers before inserting the request into the history, but I'm not sure if that is required here or not?

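            To make the shrinker suggestion in the comment above concrete, below is a minimal sketch, not Lustre code, of how a shrinker for the request slab could be wired up. The idle-count bookkeeping is purely hypothetical (a real implementation would track and free idle ptlrpc_request objects), and the count_objects/scan_objects interface assumes a 3.12+ kernel; the RHEL 6 era kernels on lola would use the older single .shrink callback instead.

            /*
             * Hedged sketch only -- not Lustre code.  Illustrates how a slab
             * shrinker for the ptlrpc request cache could be wired up so that
             * memory pressure reclaims idle requests instead of letting the
             * slab grow until OOM.
             */
            #include <linux/module.h>
            #include <linux/kernel.h>
            #include <linux/atomic.h>
            #include <linux/shrinker.h>

            /* Hypothetical bookkeeping: a real implementation would track a
             * list of idle ptlrpc_request objects that are safe to free back
             * to the request slab. */
            static atomic_long_t ptlrpc_request_idle_count = ATOMIC_LONG_INIT(0);

            static unsigned long
            ptlrpc_request_shrink_count(struct shrinker *shrink,
                                        struct shrink_control *sc)
            {
                    /* How many objects could be freed if scan were called. */
                    return atomic_long_read(&ptlrpc_request_idle_count);
            }

            static unsigned long
            ptlrpc_request_shrink_scan(struct shrinker *shrink,
                                       struct shrink_control *sc)
            {
                    unsigned long idle = atomic_long_read(&ptlrpc_request_idle_count);
                    unsigned long freed = min(sc->nr_to_scan, idle);

                    /* Placeholder: walk an idle-request list here, drop the
                     * final reference on each request and return it to the
                     * request slab (kmem_cache_free() on what is currently
                     * named "ptlrpc_cache"). */
                    atomic_long_sub(freed, &ptlrpc_request_idle_count);

                    return freed ? freed : SHRINK_STOP;
            }

            static struct shrinker ptlrpc_request_shrinker = {
                    .count_objects  = ptlrpc_request_shrink_count,
                    .scan_objects   = ptlrpc_request_shrink_scan,
                    .seeks          = DEFAULT_SEEKS,
            };

            static int __init ptlrpc_request_shrinker_init(void)
            {
                    return register_shrinker(&ptlrpc_request_shrinker);
            }

            static void __exit ptlrpc_request_shrinker_exit(void)
            {
                    unregister_shrinker(&ptlrpc_request_shrinker);
            }

            module_init(ptlrpc_request_shrinker_init);
            module_exit(ptlrpc_request_shrinker_exit);
            MODULE_LICENSE("GPL");

            Whether a shrinker is the right fix depends on the stray-reference question raised above: if all ptlrpc_cache objects really are in use, a shrinker would have nothing to free.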

            heckes Frank Heckes (Inactive) added a comment -

            Currently '+malloc' has been added to the debug flags on server nodes.
            green Oleg Drokin added a comment -

            Can you please test the tip of b2_8 to see if it's also affected?

            Also on master: if you can, enable memory allocation tracing, reproduce, and then extract the debug log from a crashdump so that we can see where all of those allocations come from.


            heckes Frank Heckes (Inactive) added a comment -

            Three (different) MDSes crashed with the oom-killer being started over the weekend (2016-02-20 - 2016-02-21).
            Performance counters and the event sequence can be provided if necessary.


    People

      Assignee: heckes Frank Heckes (Inactive)
      Reporter: heckes Frank Heckes (Inactive)
      Votes: 0
      Watchers: 7
