Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • None
    • None
    • Server: 2.1.4, centos 6.3
      Client: 2.1.5, sles11sp1
    • 2
    • 9716

    Description

      We have an ongoing problem of unreclaimable slab memory stuck in Lustre. It differs from LU-2613 in that unmounting the Lustre FS did not release the stuck memory. We also tried lflush as well as the write technique suggested by Niu Yawei in LU-2613 (comment of 15/Jan/13 8:54 AM). Neither worked for us.

      This is an ongoing problem and has caused a lot of trouble on our production systems.

      I will append /proc/meminfo and a 'slabtop' output below. Let me know what other information you need.

      bridge2 /proc # cat meminfo
      MemTotal: 65978336 kB
      MemFree: 4417544 kB
      Buffers: 7804 kB
      Cached: 183036 kB
      SwapCached: 6068 kB
      Active: 101840 kB
      Inactive: 183404 kB
      Active(anon): 83648 kB
      Inactive(anon): 13036 kB
      Active(file): 18192 kB
      Inactive(file): 170368 kB
      Unevictable: 3480 kB
      Mlocked: 3480 kB
      SwapTotal: 2000052 kB
      SwapFree: 1669420 kB
      Dirty: 288 kB
      Writeback: 0 kB
      AnonPages: 92980 kB
      Mapped: 16964 kB
      Shmem: 136 kB
      Slab: 57633936 kB
      SReclaimable: 1029472 kB
      SUnreclaim: 56604464 kB
      KernelStack: 5280 kB
      PageTables: 15928 kB
      NFS_Unstable: 0 kB
      Bounce: 0 kB
      WritebackTmp: 0 kB
      CommitLimit: 34989220 kB
      Committed_AS: 737448 kB
      VmallocTotal: 34359738367 kB
      VmallocUsed: 2348084 kB
      VmallocChunk: 34297775112 kB
      HardwareCorrupted: 0 kB
      HugePages_Total: 0
      HugePages_Free: 0
      HugePages_Rsvd: 0
      HugePages_Surp: 0
      Hugepagesize: 2048 kB
      DirectMap4k: 7104 kB
      DirectMap2M: 67100672 kB
      bridge2 /proc #

      bridge2 ~ # slabtop --once

      Active / Total Objects (% used) : 2291913 / 500886088 (0.5%)
      Active / Total Slabs (% used) : 170870 / 14351991 (1.2%)
      Active / Total Caches (% used) : 151 / 249 (60.6%)
      Active / Total Size (% used) : 838108.56K / 53998141.57K (1.6%)
      Minimum / Average / Maximum Object : 0.01K / 0.11K / 4096.00K

      OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
      133434868 41138 0% 0.04K 1450379 92 5801516K lovsub_page_kmem
      124369720 77440 0% 0.19K 6218486 20 24873944K cl_page_kmem
      115027759 41264 0% 0.05K 1493867 77 5975468K lov_page_kmem
      77597568 41174 0% 0.08K 1616616 48 6466464K vvp_page_kmem
      44004405 38371 0% 0.26K 2933627 15 11734508K osc_page_kmem
      1558690 9106 0% 0.54K 222670 7 890680K radix_tree_node
      1435785 457262 31% 0.25K 95719 15 382876K size-256
      991104 24455 2% 0.50K 123888 8 495552K size-512
      591420 573510 96% 0.12K 19714 30 78856K size-128
      583038 507363 87% 0.06K 9882 59 39528K size-64
      399080 4356 1% 0.19K 19954 20 79816K cred_jar
      112112 81796 72% 0.03K 1001 112 4004K size-32
      106368 106154 99% 0.08K 2216 48 8864K sysfs_dir_cache
      89740 26198 29% 1.00K 22435 4 89740K size-1024
      87018 1601 1% 0.62K 14503 6 58012K proc_inode_cache
      53772 2845 5% 0.58K 8962 6 35848K inode_cache
      44781 44746 99% 8.00K 44781 1 358248K size-8192
      42700 28830 67% 0.19K 2135 20 8540K dentry
      38990 2213 5% 0.79K 7798 5 31192K ext3_inode_cache
      25525 24880 97% 0.78K 5105 5 20420K shmem_inode_cache
      23394 16849 72% 0.18K 1114 21 4456K vm_area_struct
      22340 6262 28% 0.19K 1117 20 4468K filp
      20415 19243 94% 0.25K 1361 15 5444K skbuff_head_cache
      19893 2152 10% 0.20K 1047 19 4188K ll_obdo_cache
      15097 15006 99% 4.00K 15097 1 60388K size-4096
      14076 1837 13% 0.04K 153 92 612K osc_req_kmem
      12696 1448 11% 0.04K 138 92 552K lovsub_req_kmem
      11684 1444 12% 0.04K 127 92 508K lov_req_kmem
      10028 1477 14% 0.04K 109 92 436K ccc_req_kmem
      9750 3000 30% 0.12K 325 30 1300K nfs_page
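      As a rough cross-check computed from the CACHE SIZE column above (this sum is not part of the original report), the five Lustre page caches alone account for almost all of the stuck unreclaimable slab:

      5801516K + 24873944K + 5975468K + 6466464K + 11734508K = 54851900K

      i.e. about 97% of the reported SUnreclaim value of 56604464 kB.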

          Activity

            [LU-3771] stuck 56G of SUnreclaim memory
            pjones Peter Jones added a comment -

            ok - thanks Jay!

            jaylan Jay Lan (Inactive) added a comment -

            I now think the problem was probably caused by a particular application run by certain user(s). For about a week after the crash, about 90% of system memory was in slab. When I checked again last Friday, the slab percentage had dropped to 38%. Today it was 30%.

            We can close this ticket. Should the problem happen again, we will track down the user and help him/her figure out how to address it.

            niu Niu Yawei (Inactive) added a comment -

            > We have not done an umount of the Lustre fs. From past observation, unmount does not free up the slab memory until we unload the Lustre modules. The fact that unloading the Lustre modules frees up the slabs suggests some communication between the kernel and the Lustre modules is not right. Why and how? I do not know.

            Lustre should have freed all of its slab objects (via kmem_cache_free()) after umount, but that does not mean the slab cache frees the memory used by those objects immediately. The slab cache still holds the memory for future use; it is only released when the kernel decides memory is tight or the slab cache is destroyed (unloading the Lustre modules destroys the caches).

            If the slab caches consume so much memory that the system becomes unusable or sluggish, I think there could be a defect in the slab reap mechanism (the slab cache is managed by the kernel, not Lustre). What we can do is reduce slab usage in Lustre; the fix for LU-744 could help.
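            A minimal kernel-module sketch of the lifecycle Niu describes (illustrative only, not Lustre code; the cache and struct names are made up): kmem_cache_free() only returns an object to its cache, while the pages backing the cache are released only when the cache is reaped or destroyed.

            #include <linux/module.h>
            #include <linux/init.h>
            #include <linux/slab.h>

            struct demo_obj {                       /* hypothetical object, ~0.19K like cl_page */
                    char payload[192];
            };

            static struct kmem_cache *demo_cache;   /* hypothetical cache name */

            static int __init demo_init(void)
            {
                    struct demo_obj *obj;

                    /* Pages backing this cache are accounted in SUnreclaim (no reclaim flag). */
                    demo_cache = kmem_cache_create("demo_obj_kmem", sizeof(struct demo_obj),
                                                   0, 0, NULL);
                    if (!demo_cache)
                            return -ENOMEM;

                    obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);  /* cache grows as needed */

                    /*
                     * This only returns the object to the cache's free lists; the backing
                     * pages stay with the cache (and stay counted in Slab/SUnreclaim)
                     * until the kernel reaps the cache or it is destroyed.
                     */
                    if (obj)
                            kmem_cache_free(demo_cache, obj);
                    return 0;
            }

            static void __exit demo_exit(void)
            {
                    /* Module unload: destroying the cache finally frees its pages. */
                    kmem_cache_destroy(demo_cache);
            }

            module_init(demo_init);
            module_exit(demo_exit);
            MODULE_LICENSE("GPL");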

            jaylan Jay Lan (Inactive) added a comment -

            Niu, it is not exactly as you said ("the slab objects are already freed and put back in the slab cache after umount").

            Bridge2 was last rebooted two days ago, on Aug 19 at 04:38. None of the 8 Lustre filesystems has been unmounted since. Here is the 'slabtop' output:

            Active / Total Objects (% used) : 8844277 / 385193960 (2.3%)
            Active / Total Slabs (% used) : 411903 / 9494044 (4.3%)
            Active / Total Caches (% used) : 151 / 249 (60.6%)
            Active / Total Size (% used) : 1729034.95K / 35916957.66K (4.8%)
            Minimum / Average / Maximum Object : 0.01K / 0.09K / 4096.00K

            OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
            124673156 1052956 0% 0.04K 1355143 92 5420572K lovsub_page_kmem
            104933521 1057165 1% 0.05K 1362773 77 5451092K lov_page_kmem
            68093040 1050624 1% 0.08K 1418605 48 5674420K vvp_page_kmem
            53508260 2119860 3% 0.19K 2675413 20 10701652K cl_page_kmem
            12914768 27449 0% 0.50K 1614346 8 6457384K size-512
            11721315 1058247 9% 0.26K 781421 15 3125684K osc_page_kmem
            5503680 81827 1% 0.03K 49140 112 196560K size-32
            1856400 1690306 91% 0.12K 61880 30 247520K size-128
            639760 4178 0% 0.19K 31988 20 127952K cred_jar
            475658 124818 26% 0.06K 8062 59 32248K size-64
            250560 87321 34% 0.25K 16704 15 66816K size-256
            106416 106154 99% 0.08K 2217 48 8868K sysfs_dir_cache
            65527 65103 99% 0.54K 9361 7 37444K radix_tree_node
            46062 46027 99% 8.00K 46062 1 368496K size-8192
            43160 41625 96% 0.19K 2158 20 8632K dentry
            42600 35980 84% 1.00K 10650 4 42600K size-1024
            37962 11173 29% 0.10K 1026 37 4104K buffer_head
            28440 1072 3% 0.12K 948 30 3792K nfs_page
            27798 5355 19% 0.58K 4633 6 18532K inode_cache
            25545 24886 97% 0.78K 5109 5 20436K shmem_inode_cache
            20910 19187 91% 0.25K 1394 15 5576K skbuff_head_cache
            17346 4255 24% 0.62K 2891 6 11564K proc_inode_cache
            15049 14950 99% 4.00K 15049 1 60196K size-4096
            13902 12067 86% 0.18K 662 21 2648K vm_area_struct
            10200 6255 61% 0.19K 510 20 2040K filp
            8408 8192 97% 0.44K 1051 8 4204K ib_mad
            5900 5586 94% 2.00K 2950 2 11800K size-2048
            4445 3003 67% 0.56K 635 7 2540K ldlm_locks
            4032 2810 69% 0.02K 28 144 112K anon_vma
            3696 3360 90% 0.08K 77 48 308K Acpi-State
            3638 2106 57% 0.11K 107 34 428K journal_head
            3498 2873 82% 0.07K 66 53 264K Acpi-Operand
            3312 72 2% 0.02K 23 144 92K journal_handle
            3017 2366 78% 0.50K 431 7 1724K skbuff_fclone_cache

            We have not done an umount of the Lustre fs. From past observation, unmount does not free up the slab memory until we unload the Lustre modules. The fact that unloading the Lustre modules frees up the slabs suggests some communication between the kernel and the Lustre modules is not right. Why and how? I do not know.


            niu Niu Yawei (Inactive) added a comment -

            > LU-2613 found a case where unreclaimable slabs should have been released but were not. We may be hitting another such case? I don't know. But, as Andreas commented, those numbers were unreasonably high.

            LU-2613 is totally different: in LU-2613 Lustre holds the slab objects, whereas in this ticket Lustre doesn't hold any objects (the fs is already unmounted and all objects have been freed), but the kernel doesn't free the memory in the slab cache.

            I think the reason for such a high object count is that the filesystem has been mounted/unmounted and used for a very long time, so lots of objects were created.

            niu Niu Yawei (Inactive) added a comment -

            > Yes, you're right about this. The slab memory should be in SReclaimable but it was in SUnreclaim for an unknown reason.

            The slab memory is accounted in SUnreclaim when the slab cache is created without the SLAB_RECLAIM_ACCOUNT flag; the cl/lov/osc page slabs are created without this flag, so they show up in SUnreclaim. I don't think adding the flag and a shrinker callback would help, because the problem here is that the slab cache isn't being reaped, not that the slab objects aren't being freed.

            > Based on the low number of 'active objs' in slabinfo, it doesn't look like a memory leak problem - was the memory all released after unloading the lustre modules?

            Right, it's not a memory leak problem, and all the slab memory is freed after unloading the lustre modules (see Jay's earlier comment).

            I don't think it's a Lustre problem: the slab objects are already freed and put back in the slab cache after umount, so the problem is that the kernel didn't reap the slab cache for some reason (actually, I don't know how to reap the slab cache proactively in a 2.6 kernel).
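            To illustrate the accounting Niu describes (a sketch with made-up cache names, not the actual Lustre code), the only difference between the two caches below is the SLAB_RECLAIM_ACCOUNT flag, which determines whether the cache's pages are counted under SReclaimable or SUnreclaim in /proc/meminfo:

            #include <linux/errno.h>
            #include <linux/slab.h>

            /* Hypothetical caches; "demo_page_kmem" stands in for the cl/lov/osc page caches. */
            static struct kmem_cache *unreclaim_cache;   /* counted in SUnreclaim   */
            static struct kmem_cache *reclaim_cache;     /* counted in SReclaimable */

            static int demo_create_caches(void)
            {
                    /* As created today: no reclaim flag, so the pages land in SUnreclaim. */
                    unreclaim_cache = kmem_cache_create("demo_page_kmem",
                                                        192, 0, 0, NULL);

                    /*
                     * With SLAB_RECLAIM_ACCOUNT the pages would be counted in
                     * SReclaimable instead; as noted above, this changes only the
                     * accounting and does not by itself make an idle cache get reaped.
                     */
                    reclaim_cache = kmem_cache_create("demo_page_kmem_r",
                                                      192, 0, SLAB_RECLAIM_ACCOUNT, NULL);

                    return (unreclaim_cache && reclaim_cache) ? 0 : -ENOMEM;
            }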

            jay Jinshan Xiong (Inactive) added a comment -

            Yes, you're right about this. The slab memory should be in SReclaimable but it was in SUnreclaim for an unknown reason.

            Based on the low number of 'active objs' in slabinfo, it doesn't look like a memory leak problem - was the memory all released after unloading the lustre modules?

            jaylan Jay Lan (Inactive) added a comment -

            min_slab_ratio defines the threshold at which the kernel will free reclaimable slab. But in our case the slabs held by Lustre were in unreclaimable slab, so changing that value would not help.

            LU-2613 found a case where unreclaimable slabs should have been released but were not. We may be hitting another such case? I don't know. But, as Andreas commented, those numbers were unreasonably high.

            jay Jinshan Xiong (Inactive) added a comment -

            There is /proc/sys/vm/min_slab_ratio in the Linux kernel, with a default of 5; you may set it higher and see if it helps.
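            A small userspace sketch of checking and raising this tunable (the value 10 is only an example; in practice a plain echo into /proc/sys/vm/min_slab_ratio or sysctl -w does the same thing):

            #include <stdio.h>

            int main(void)
            {
                    int ratio = -1;
                    FILE *f = fopen("/proc/sys/vm/min_slab_ratio", "r+");

                    if (!f) {
                            perror("open /proc/sys/vm/min_slab_ratio");
                            return 1;
                    }
                    if (fscanf(f, "%d", &ratio) == 1)
                            printf("current vm.min_slab_ratio = %d\n", ratio);

                    rewind(f);              /* switch the stream from reading to writing */
                    fprintf(f, "10\n");     /* example value; requires root privileges */
                    fclose(f);
                    return 0;
            }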

            jay Jinshan Xiong (Inactive) added a comment -

            Patch 3bffa4d would mitigate the problem a little, because fewer slab data structures are used per page, but that is definitely not a fix. Actually, we can't do much about this, because it is up to the Linux kernel VM management to decide when to free that memory.

            Niu, we should probably take a look at the slab implementation to check whether there are any tunable parameters for this.

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: jaylan Jay Lan (Inactive)
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: