Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Environment: lola
       build: 2.8.50-6-gf9ca359 ; commit f9ca359284357d145819beb08b316e932f7a3060
    • Severity: 3

    Description

      The error occurred during soak testing of build '20160215' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150215). DNE is enabled.
      The MDTs had been formatted using ldiskfs, the OSTs using ZFS. The MDS nodes are configured in an active-active HA failover configuration.

      Please note that build '20160215' is a vanilla build of the master branch.
      This issue might be addressed by the changes included in build '20160210', as we did not observe it during a two-day test session with that build.

      • 2016-02-15 15:37:51,169:fsmgmt.fsmgmt:INFO triggering fault mds_failover (for lola-11)
      • 2016-02-15 15:44:57,839:fsmgmt.fsmgmt:INFO mds_failover just completed (for lola-11)
      • After that, slab memory consumption increased continuously until all resources were exhausted at 2016-02-15 22:38.
      • Most pages are allocated by the size-1048576 slab. The high-score list (top consumers by allocated bytes) reads as:
        #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
        20160215 22:46:20 size-1048576 29147 30562844672 29147 30562844672 29147 30562844672 29147 30562844672 0 0
        20160215 22:46:20 size-262144 1793 470024192 1793 470024192 1793 470024192 1793 470024192 0 0
        20160215 22:46:20 ptlrpc_cache 399364 306711552 399380 306723840 79873 327159808 79876 327172096 180224 0
        20160215 22:46:20 size-1024 229179 234679296 229188 234688512 57295 234680320 57297 234688512 -24576 0
        20160215 22:46:20 size-512 256540 131348480 258232 132214784 32278 132210688 32279 132214784 86016 0
        20160215 22:46:20 size-192 460848 88482816 460880 88488960 23043 94384128 23044 94388224 28672 0
        20160215 22:46:20 size-8192 5776 47316992 5776 47316992 5776 47316992 5776 47316992 -8192 0
        20160215 22:46:20 size-128 265120 33935360 266250 34080000 8875 36352000 8875 36352000 0 0
        20160215 22:46:20 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        20160215 22:46:20 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        

        (see attached file slab-usage-by-allocation-descending.dat)

      Attached files: the messages and console logs of lola-11, plus the sorted slab usage captured when the oom-killer was started.
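      As a rough illustration of how a slab ranking like the one above can be reproduced on a live node, here is a minimal sketch, not the tool used on lola, that ranks kmem caches by total allocated bytes straight from /proc/slabinfo. It assumes the slabinfo 2.x text format (first four columns: name, active_objs, num_objs, objsize), and its output loosely mirrors the SlabName/ObjInUse/ObjAll/ObjAllB columns above.

        /* slabtop-lite: rank kmem caches by total allocated bytes.
         * Minimal sketch; assumes the /proc/slabinfo 2.x text format where the
         * first four columns are <name> <active_objs> <num_objs> <objsize>.
         * Reading /proc/slabinfo typically requires root.
         */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct slab {
                char name[64];
                unsigned long active_objs;      /* ObjInUse */
                unsigned long num_objs;         /* ObjAll   */
                unsigned long objsize;          /* bytes per object */
                unsigned long long total;       /* ObjAllB = num_objs * objsize */
        };

        static struct slab slabs[512];

        static int cmp_total_desc(const void *a, const void *b)
        {
                const struct slab *x = a, *y = b;

                if (x->total == y->total)
                        return 0;
                return x->total < y->total ? 1 : -1;
        }

        int main(void)
        {
                FILE *f = fopen("/proc/slabinfo", "r");
                char line[512];
                int n = 0, i;

                if (f == NULL) {
                        perror("/proc/slabinfo");
                        return 1;
                }

                while (n < 512 && fgets(line, sizeof(line), f) != NULL) {
                        struct slab *s = &slabs[n];

                        /* skip the "slabinfo - version" and "# name ..." headers */
                        if (line[0] == '#' || strncmp(line, "slabinfo", 8) == 0)
                                continue;
                        if (sscanf(line, "%63s %lu %lu %lu", s->name,
                                   &s->active_objs, &s->num_objs, &s->objsize) != 4)
                                continue;
                        s->total = (unsigned long long)s->num_objs * s->objsize;
                        n++;
                }
                fclose(f);

                qsort(slabs, n, sizeof(slabs[0]), cmp_total_desc);

                printf("%-24s %12s %12s %16s\n",
                       "SlabName", "ObjInUse", "ObjAll", "ObjAllB");
                for (i = 0; i < n && i < 10; i++)
                        printf("%-24s %12lu %12lu %16llu\n", slabs[i].name,
                               slabs[i].active_objs, slabs[i].num_objs,
                               slabs[i].total);
                return 0;
        }

      Run on the affected MDS, this kind of ranking would have shown size-1048576 at roughly 30 GB, matching the table above.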

    Attachments

    Issue Links

    Activity

            [LU-7780] MDS crashed with oom-killer
            pjones Peter Jones added a comment -

            ok thanks Di

            di.wang Di Wang added a comment -

            Yes, this is a duplicate of LU-7836.

            pjones Peter Jones added a comment -

            Moving to 2.9 because it seems that this issue only occurs with multiple MDTs per MDS and does not happen with the more common configuration of a single MDT per MDS. Is this a duplicate of LU-7836?


            heckes Frank Heckes (Inactive) added a comment -

            Debug filter has been changed to: debug_mb=128 debug=+malloc +trace

            adilger Andreas Dilger added a comment -

            Frank or Cliff,
            it might be useful to increase the debug buffer size and turn on trace logging (lctl set_param debug_mb=128 debug=+trace), then let it continue to run for a minute or two and dump the debug log, without having to wait for the crash to be hit. The size-1024 items are being allocated at about 10/sec, so the cause will appear in any log dumped. Similarly, the size-1048576 allocations are done every minute, so the cause should also be in any logs dumped. The size-262144 is only being allocated in chunks every 15 minutes, so it might be related to recovery if that matches the failover frequency; capturing it might need to be timed to coincide with a failover, or the debug log size increased to 512 or 1024 MB.

            Di, the memory is all used in ptlrpc_cache and large (1MB and 256KB) slab allocations. My first guess is that these large allocations relate to striped directory RPCs (i.e. OUT) and recovery, and are probably not being freed except at shutdown. There are 30k 1MB allocations consuming a whopping 30GB of memory, 400k ptlrpc_cache entries consuming 300MB of memory, and 230k 1KB allocations consuming 235MB. All of those slabs are growing continuously from startup, yet none of the object caches that would normally grow while the node is caching a lot of data (ldlm_lock, ldlm_resource, ldiskfs_inode, buffer_head, dentry, etc.) are growing.

            It might be that we are just keeping too many requests in ptlrpc_cache? In that case, we might need to add a slab shrinker for ptlrpc_cache to keep the size of this slab in check, but I'm not sure this is the root problem because it seems all of the ptlrpc_cache items are in use, so it may be that there is a stray reference on the request that isn't being put somewhere? It is also a bit sad that the cache name is ptlrpc_cache but the internal usage is request_cache. The slab cache name should be changed to ptlrpc_request_cache to match the code so that grep can find all of the relevant code at once.

            Is it possible that ptlrpc_history is enabled on these nodes? That would also keep a lot of requests pinned in memory in ptlrpc_server_drop_request(). Normally this wouldn't be a problem, but if there are so many large buffers attached to the RPCs this could be causing the OOM, when smaller (e.g. 32KB) buffers wouldn't be a problem. It also isn't clear that there is a benefit to keeping the request buffers in the request history, since they are never used, so it might also be possible to release the buffers before inserting the request into the history, but I'm not sure if that is required here or not?

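            To make the shrinker suggestion in the comment above concrete, below is a minimal sketch, not Lustre code, of how a shrinker for the request slab could be wired up. The idle-count bookkeeping is purely hypothetical (a real implementation would track and free idle ptlrpc_request objects), and the count_objects/scan_objects interface assumes a 3.12+ kernel; the RHEL 6 era kernels on lola would use the older single .shrink callback instead.

            /*
             * Hedged sketch only -- not Lustre code.  Illustrates how a slab
             * shrinker for the ptlrpc request cache could be wired up so that
             * memory pressure reclaims idle requests instead of letting the
             * slab grow until OOM.
             */
            #include <linux/module.h>
            #include <linux/kernel.h>
            #include <linux/atomic.h>
            #include <linux/shrinker.h>

            /* Hypothetical bookkeeping: a real implementation would track a
             * list of idle ptlrpc_request objects that are safe to free back
             * to the request slab. */
            static atomic_long_t ptlrpc_request_idle_count = ATOMIC_LONG_INIT(0);

            static unsigned long
            ptlrpc_request_shrink_count(struct shrinker *shrink,
                                        struct shrink_control *sc)
            {
                    /* How many objects could be freed if scan were called. */
                    return atomic_long_read(&ptlrpc_request_idle_count);
            }

            static unsigned long
            ptlrpc_request_shrink_scan(struct shrinker *shrink,
                                       struct shrink_control *sc)
            {
                    unsigned long idle = atomic_long_read(&ptlrpc_request_idle_count);
                    unsigned long freed = min(sc->nr_to_scan, idle);

                    /* Placeholder: walk an idle-request list here, drop the
                     * final reference on each request and return it to the
                     * request slab (kmem_cache_free() on what is currently
                     * named "ptlrpc_cache"). */
                    atomic_long_sub(freed, &ptlrpc_request_idle_count);

                    return freed ? freed : SHRINK_STOP;
            }

            static struct shrinker ptlrpc_request_shrinker = {
                    .count_objects  = ptlrpc_request_shrink_count,
                    .scan_objects   = ptlrpc_request_shrink_scan,
                    .seeks          = DEFAULT_SEEKS,
            };

            static int __init ptlrpc_request_shrinker_init(void)
            {
                    return register_shrinker(&ptlrpc_request_shrinker);
            }

            static void __exit ptlrpc_request_shrinker_exit(void)
            {
                    unregister_shrinker(&ptlrpc_request_shrinker);
            }

            module_init(ptlrpc_request_shrinker_init);
            module_exit(ptlrpc_request_shrinker_exit);
            MODULE_LICENSE("GPL");

            Whether a shrinker is the right fix depends on the stray-reference question raised above: if all ptlrpc_cache objects really are in use, a shrinker would have nothing to free.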

            heckes Frank Heckes (Inactive) added a comment -

            Currently '+malloc' has been added to the debug flags on server nodes.
            green Oleg Drokin added a comment -

            Can you please test the tip of b2_8 to see if it's also affected?

            Also on master: if you can, enable memory allocation tracing, reproduce, and then extract the debug log from a crashdump so that we can see where all of those allocations come from.


            heckes Frank Heckes (Inactive) added a comment -

            Three (different) MDSes crashed with the oom-killer being started over the weekend (2016-02-20 - 2016-02-21).
            Performance counters and the event sequence can be provided if necessary.


    People

      Assignee: heckes Frank Heckes (Inactive)
      Reporter: heckes Frank Heckes (Inactive)
      Votes: 0
      Watchers: 7
