
ior test on nfs over lustre causes OOM in nfs server (lustre client)

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • None
    • Affects Version: Lustre 2.2.0
    • None
    • 3
    • 4233

    Description

      When we run the IOR test on NFS over Lustre, we consistently hit an OOM on the NFS server (Lustre client).

      10:04:52:Lustre: DEBUG MARKER: == sanity test iorssf: iorssf ======================================================================== 10:04:49 (1323367489)
      10:05:17:Lustre: 2881:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323367509/real 0] req@ffff8100048c3800 x1387647840891989/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367516 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:17:Lustre: lustre-OST0000-osc-MDT0000: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:17:Lustre: 2881:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323367509/real 0] req@ffff81002a5f7400 x1387647840891991/t0(0) o400->lustre-OST0002-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367516 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:17:Lustre: lustre-OST0002-osc-MDT0000: Connection to service lustre-OST0002 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:18:Lustre: 2949:0:(ldlm_lib.c:634:target_handle_reconnect()) MGS: db6f2199-0ea2-5617-813c-4e0e918000c0 reconnecting
      10:05:18:Lustre: 2949:0:(ldlm_lib.c:909:target_handle_connect()) MGS: connection from db6f2199-0ea2-5617-813c-4e0e918000c0@10.10.4.187@tcp t0 exp ffff81007617a400 cur 1323367517 last 1323367517
      10:05:18:LustreError: 2949:0:(obd_class.h:509:obd_set_info_async()) obd_set_info_async: dev 0 no operation
      10:05:22:Lustre: 2879:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323367516/real 1323367517] req@ffff81003dd67c00 x1387647840893130/t0(0) o8->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 368/512 e 0 to 1 dl 1323367522 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:22:Lustre: 2879:0:(client.c:1789:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
      10:05:22:Lustre: lustre-OST0005-osc-ffff81004a393800: Connection to service lustre-OST0005 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:22:Lustre: Skipped 5 previous similar messages
      10:05:22:Lustre: 2882:0:(import.c:525:import_select_connection()) lustre-OST0000-osc-MDT0000: tried all connections, increasing latency to 6s
      10:05:22:Lustre: 2882:0:(import.c:525:import_select_connection()) lustre-OST0001-osc-MDT0000: tried all connections, increasing latency to 6s
      10:05:25:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323367516/real 1323367518] req@ffff81004d676c00 x1387647840893137/t0(0) o400->lustre-OST0000-osc-ffff81004a393800@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367524 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      10:05:25:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
      10:05:25:Lustre: lustre-OST0000-osc-ffff81004a393800: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:05:26:Lustre: Skipped 2 previous similar messages
      10:05:29:Lustre: 2949:0:(ldlm_lib.c:634:target_handle_reconnect()) MGS: db6f2199-0ea2-5617-813c-4e0e918000c0 reconnecting
      10:05:29:Lustre: 2949:0:(ldlm_lib.c:909:target_handle_connect()) MGS: connection from db6f2199-0ea2-5617-813c-4e0e918000c0@10.10.4.187@tcp t0 exp ffff81007617a400 cur 1323367528 last 1323367528
      10:05:29:LustreError: 2949:0:(obd_class.h:509:obd_set_info_async()) obd_set_info_async: dev 0 no operation
      10:05:30:LustreError: 11-0: an error occurred while communicating with 10.10.4.187@tcp. The ost_connect operation failed with -16
      10:05:30:LustreError: 11-0: an error occurred while communicating with 10.10.4.187@tcp. The ost_connect operation failed with -16
      INIT: version 2.86 reloading
      10:34:10:automount invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
      10:34:10:
      10:34:10:Call Trace:
      10:34:11: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
      10:34:11: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
      10:34:11: [<ffffffff80012fe4>] __do_page_cache_readahead+0x96/0x179
      10:34:11: [<ffffffff8001391f>] filemap_nopage+0x14c/0x360
      10:34:11: [<ffffffff80008964>] __handle_mm_fault+0x1fb/0x1039
      10:34:12: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
      10:34:12: [<ffffffff80062ff0>] thread_return+0x62/0xfe
      10:34:12: [<ffffffff8005dde9>] error_exit+0x0/0x84
      10:34:12:
      10:34:12:Node 0 DMA per-cpu:
      10:34:12:cpu 0 hot: high 0, batch 1 used:0
      10:34:12:cpu 0 cold: high 0, batch 1 used:0
      10:34:12:Node 0 DMA32 per-cpu:
      10:34:12:cpu 0 hot: high 186, batch 31 used:43
      10:34:12:cpu 0 cold: high 62, batch 15 used:49
      10:34:12:Node 0 Normal per-cpu: empty
      10:34:12:Node 0 HighMem per-cpu: empty
      10:34:12:Free pages: 8648kB (0kB HighMem)
      10:34:12:Active:32 inactive:7830 dirty:0 writeback:987 unstable:0 free:2162 slab:477062 mapped-file:1061 mapped-anon:599 pagetables:905
      10:34:12:Node 0 DMA free:3028kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9728kB pages_scanned:0 all_unreclaimable? yes
      10:34:12:lowmem_reserve[]: 0 2003 2003 2003
      10:34:12:Node 0 DMA32 free:5620kB min:5712kB low:7140kB high:8568kB active:128kB inactive:31320kB present:2052060kB pages_scanned:82448 all_unreclaimable? yes
      10:34:12:lowmem_reserve[]: 0 0 0 0
      10:34:12:Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:12:lowmem_reserve[]: 0 0 0 0
      10:34:13:Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:13:lowmem_reserve[]: 0 0 0 0
      10:34:13:Node 0 DMA: 5*4kB 4*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3028kB
      10:34:13:Node 0 DMA32: 1*4kB 4*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5620kB
      10:34:13:Node 0 Normal: empty
      10:34:13:Node 0 HighMem: empty
      10:34:13:7862 pagecache pages
      10:34:13:Swap cache: add 8522, delete 7923, find 82/144, race 0+0
      10:34:13:Free swap = 4072336kB
      10:34:13:Total swap = 4104596kB
      10:34:13:Out of memory: Killed process 1942, UID 51, (sendmail).
      10:34:15:automount invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
      10:34:15:
      10:34:15:Call Trace:
      10:34:15: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
      10:34:15: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
      10:34:15: [<ffffffff80012fe4>] __do_page_cache_readahead+0x96/0x179
      10:34:16: [<ffffffff8001391f>] filemap_nopage+0x14c/0x360
      10:34:16: [<ffffffff80008964>] __handle_mm_fault+0x1fb/0x1039
      10:34:16: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
      10:34:16: [<ffffffff8005dde9>] error_exit+0x0/0x84
      10:34:16:
      10:34:16:Node 0 DMA per-cpu:
      10:34:16:cpu 0 hot: high 0, batch 1 used:0
      10:34:16:cpu 0 cold: high 0, batch 1 used:0
      10:34:16:Node 0 DMA32 per-cpu:
      10:34:16:cpu 0 hot: high 186, batch 31 used:113
      10:34:16:cpu 0 cold: high 62, batch 15 used:49
      10:34:16:Node 0 Normal per-cpu: empty
      10:34:16:Node 0 HighMem per-cpu: empty
      10:34:16:Free pages: 8648kB (0kB HighMem)
      10:34:16:Active:6 inactive:7856 dirty:0 writeback:987 unstable:0 free:2162 slab:476992 mapped-file:1061 mapped-anon:599 pagetables:905
      10:34:16:Node 0 DMA free:3028kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9728kB pages_scanned:0 all_unreclaimable? yes
      10:34:16:lowmem_reserve[]: 0 2003 2003 2003
      10:34:16:Node 0 DMA32 free:5620kB min:5712kB low:7140kB high:8568kB active:24kB inactive:31424kB present:2052060kB pages_scanned:189044 all_unreclaimable? yes
      10:34:16:lowmem_reserve[]: 0 0 0 0
      10:34:17:Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:17:lowmem_reserve[]: 0 0 0 0
      10:34:17:Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
      10:34:17:lowmem_reserve[]: 0 0 0 0
      10:34:17:Node 0 DMA: 5*4kB 4*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3028kB
      10:34:17:Node 0 DMA32: 1*4kB 4*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5620kB
      10:34:17:Node 0 Normal: empty
      10:34:17:Node 0 HighMem: empty
      10:34:17:7862 pagecache pages
      10:34:17:Swap cache: add 8522, delete 7923, find 82/146, race 0+0
      10:34:17:Free swap = 4072336kB
      10:34:17:Total swap = 4104596kB
      10:34:21:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323369251/real 0] req@ffff8100766e3400 x1387647842440889/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323369258 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      10:34:21:Lustre: lustre-OST0000-osc-MDT0000: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
      10:34:21:audispd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
      10:34:21:
      10:34:21:Call Trace:
      10:34:21: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
      10:34:21: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
      10:34:21: [<ffffffff80032905>] read_swap_cache_async+0x45/0xd8
      10:34:21: [<ffffffff800cf48e>] swapin_readahead+0x60/0xd3
      10:34:21: [<ffffffff800092cb>] __handle_mm_fault+0xb62/0x1039
      10:34:21: [<ffffffff80153e83>] __next_cpu+0x19/0x28
      10:34:21: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
      10:34:21: [<ffffffff8008dc34>] dequeue_task+0x18/0x37
      10:34:21: [<ffffffff80062ff0>] thread_return+0x62/0xfe
      10:34:21: [<ffffffff8005dde9>] error_exit+0x0/0x84
      10:34:21: [<ffffffff8003055f>] unix_poll+0x0/0x99
      10:34:34: [<ffffffff8002f8ee>] do_sys_poll+0x2b7/0x360
      10:34:35: [<ffffffff8002f815>] do_sys_poll+0x1de/0x360
      10:34:35: [<ffffffff8001ecfd>] __pollwait+0x0/0xe2
      10:34:35: [<ffffffff8008e438>] default_wake_function+0x0/0xe
      10:34:35: [<ffffffff8008e438>] default_wake_function+0x0/0xe
      10:34:35: [<ffffffff8006e608>] do_gettimeoffset_kvm+0x9/0x1d
      10:34:35: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
      10:34:35: [<ffffffff8005aa0e>] getnstimeofday+0x10/0x28
      10:34:35: [<ffffffff800a4939>] ktime_get_ts+0x1a/0x4e
      10:34:35: [<ffffffff800bfea4>] delayacct_end+0x5d/0x86
      10:34:35: [<ffffffff8004fca9>] finish_wait+0x32/0x5d
      10:34:35: [<ffffffff80063933>] __wait_on_bit_lock+0x5b/0x66
      10:34:35: [<ffffffff8003ff5b>] __lock_page+0x5e/0x64
      10:34:35: [<ffffffff800a2931>] wake_bit_function+0x0/0x23
      10:34:35: [<ffffffff800d4820>] swap_info_get+0x6a/0xa3
      10:34:35: [<ffffffff800095b9>] __handle_mm_fault+0xe50/0x1039
      10:34:35: [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
      10:34:35: [<ffffffff8004c6c6>] sys_poll+0x2d/0x34
      10:34:35: [<ffffffff8005d116>] system_call+0x7e/0x83
      10:34:35:
      10:34:35:Node 0 DMA per-cpu:

      Attached is the Maloo log. I can't get the report because the test hung due to the OOM.


          Activity


            adilger Andreas Dilger added a comment -

            Close old ticket.
            laisiyao Lai Siyao added a comment - edited

            Oleg, I was busy on other bugs, sorry to reply so late.

            For NFS export, the call sequence on the Lustre client for a normal write looks like this:

            .permission
              -> take LOOKUP|UPDATE dlm lock
            .open
              -> revoke previous lock, take LOOKUP|OPEN lock
            .getattr
              -> revoke previous lock, take LOOKUP|UPDATE lock
            

            This happens for each NFS write, and it shows the open lock doesn't take any effect but actually worsens the situation: the UPDATE lock is not cached at all. So IMO the real problem is that the open lock cache doesn't work.
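            To make the revocation pattern concrete: inodebit locks conflict when their bit masks overlap (given incompatible modes), and every step above shares the LOOKUP bit with the previous one. Below is a minimal userspace sketch, assuming the MDS_INODELOCK_* bit values from lustre_idl.h and deliberately ignoring lock-mode compatibility; it is not Lustre source:

            #include <stdio.h>

            #define MDS_INODELOCK_LOOKUP 0x01
            #define MDS_INODELOCK_UPDATE 0x02
            #define MDS_INODELOCK_OPEN   0x04

            /* simplified conflict rule: real LDLM also checks lock modes */
            static int bits_overlap(int cached, int requested)
            {
                return (cached & requested) != 0;
            }

            int main(void)
            {
                int perm_bits    = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE; /* .permission */
                int open_bits    = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_OPEN;   /* .open */
                int getattr_bits = MDS_INODELOCK_LOOKUP | MDS_INODELOCK_UPDATE; /* .getattr */

                /* every adjacent pair shares LOOKUP, so each step revokes the last */
                printf(".open revokes .permission lock: %d\n", bits_overlap(perm_bits, open_bits));
                printf(".getattr revokes .open lock:    %d\n", bits_overlap(open_bits, getattr_bits));
                return 0;
            }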

            Besides, for open(O_RDONLY) a LOOKUP|OPEN lock with mode LCK_CR is fetched. If another client then calls setattr(LA_MODE), this LCK_CR LOOKUP lock won't be revoked, because setattr takes a lock with mode LCK_PW, which is compatible with LCK_CR. This is wrong! So a LOOKUP lock with LCK_CR mode doesn't make sense. If the OPEN lock is split from the LOOKUP lock, this problem can be solved.
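            The mode-compatibility claim can be checked against the classic VAX DLM matrix that LDLM implements; the table below is a reconstruction for illustration (cf. lockmode_compat() in the Lustre tree), not a copy of the source:

            #include <stdio.h>

            enum mode { EX, PW, PR, CW, CR, NL, NMODES };

            static const int compat[NMODES][NMODES] = {
                /*         EX PW PR CW CR NL */
                /* EX */ {  0, 0, 0, 0, 0, 1 },
                /* PW */ {  0, 0, 0, 0, 1, 1 },
                /* PR */ {  0, 0, 1, 0, 1, 1 },
                /* CW */ {  0, 0, 0, 1, 1, 1 },
                /* CR */ {  0, 1, 1, 1, 1, 1 },
                /* NL */ {  1, 1, 1, 1, 1, 1 },
            };

            int main(void)
            {
                /* setattr's LCK_PW is compatible with the reader's cached
                 * LCK_CR LOOKUP lock, so the stale lock is never revoked */
                printf("CR compatible with PW? %s\n", compat[CR][PW] ? "yes" : "no");
                return 0;
            }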

            Finally, if only the OPEN lock is fetched, the dentry won't be found by other operations. But as I said at the beginning, for NFS .permission is always called before .open, so generally a valid dentry is already there.

            In brief, if the OPEN lock is mainly for NFS export usage, it's best to split the OPEN lock from the LOOKUP lock.

            green Oleg Drokin added a comment -

            If the problem is a lot of uncommitted file opens, then we should attack the real problem and truncate the open requests in the uncommitted queue so that they only include enough data for the current request, not all the empty space reserved in the request buffer "just in case", since we already know the striping at that point.
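            A hypothetical sketch of this idea follows; the structure and helper names are invented for illustration (they are not the Lustre API), and ea_size_for() only stands in for what lov_mds_md_size() computes:

            #include <stdio.h>
            #include <stdlib.h>

            struct open_request {          /* stand-in for a saved replayable request */
                size_t ea_buf_len;         /* reserved for the maximum stripe count */
                void  *ea_buf;
            };

            /* models lov_mds_md_size(): a header plus one object entry per stripe */
            static size_t ea_size_for(unsigned stripe_count)
            {
                return 32 + (size_t)stripe_count * 24;
            }

            static void shrink_open_request(struct open_request *req, unsigned actual_stripes)
            {
                size_t need = ea_size_for(actual_stripes);
                void *tmp;

                if (need >= req->ea_buf_len)
                    return;
                tmp = realloc(req->ea_buf, need);  /* keep only what replay will resend */
                if (tmp != NULL) {
                    req->ea_buf = tmp;
                    req->ea_buf_len = need;
                }
            }

            int main(void)
            {
                struct open_request req = {
                    .ea_buf_len = ea_size_for(160),     /* max-stripe "just in case" reserve */
                    .ea_buf     = malloc(ea_size_for(160)),
                };

                shrink_open_request(&req, 2);           /* actual striping is known by now */
                printf("kept %zu bytes instead of %zu\n", req.ea_buf_len, ea_size_for(160));
                free(req.ea_buf);
                return 0;
            }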

            laisiyao Lai Siyao added a comment -

            sanityn.sh test 13 failed; I found two more issues for NFS support:

            • an open lock is taken for directories.
            • if a file is opened O_RDONLY, an LCK_CR lock is fetched, but LOOKUP|UPDATE with LCK_CR doesn't make sense, because setattr takes LCK_PW and getattr takes LCK_PR.

            The patch is updated.

            laisiyao Lai Siyao added a comment -

            The review is on http://review.whamcloud.com/#change,1921

            Minh, could you help verify the fix?

            laisiyao Lai Siyao added a comment -

            The debug log shows the memory allocated above is uncommitted open requests; after looking further into the code, I found that open only fetches a LOOKUP|OPEN lock, and a later getattr takes a LOOKUP|UPDATE lock, which conflicts with the cached open lock, so the open lock cache for NFS export doesn't take effect at all (the 1.8 code takes an UPDATE lock for the non-create case).

            I've made a fix for this, and will submit it for review after testing.

            laisiyao Lai Siyao added a comment -

            I captured /proc/slabinfo before the OOM:

            ...
            
            size-2048         229724 229724   2048    2    1 : tunables   24   12    8 : slabdata 114862 114862     84
            size-1024(DMA)         0      0   1024    4    1 : tunables   54   27    8 : slabdata      0      0      0
            size-1024         914900 914900   1024    4    1 : tunables   54   27    8 : slabdata 228725 228725      0
            size-512(DMA)          0      0    512    8    1 : tunables   54   27    8 : slabdata      0      0      0
            size-512          229328 229328    512    8    1 : tunables   54   27    8 : slabdata  28666  28666      0
            size-256(DMA)          0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
            size-256            1890   1920    256   15    1 : tunables  120   60    8 : slabdata    128    128      0
            size-128(DMA)          0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
            size-64(DMA)           0      0     64   59    1 : tunables  120   60    8 : slabdata      0      0      0
            size-64           345858 345858     64   59    1 : tunables  120   60    8 : slabdata   5862   5862      0
            size-32(DMA)           0      0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
            size-128           32820  32820    128   30    1 : tunables  120   60    8 : slabdata   1094   1094      0
            size-32           233968 233968     32  112    1 : tunables  120   60    8 : slabdata   2089   2089      0
            

            It shows the generic slab caches consume a lot of memory (1.5G out of 2G total), and if I stop the test and unmount Lustre, this slab memory is freed right away. I'll try to collect some memory debug logs to find where this memory is allocated but not freed.
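            As a rough cross-check of the 1.5G figure (num_objs × objsize from the slabinfo above): size-1024 contributes 914,900 × 1 KiB ≈ 894 MiB, size-2048 contributes 229,724 × 2 KiB ≈ 449 MiB, size-512 contributes 229,328 × 512 B ≈ 112 MiB, size-64 contributes 345,858 × 64 B ≈ 21 MiB, and size-32 plus size-128 add roughly 11 MiB more, for a total of ≈ 1.45 GiB on a 2 GiB node.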

            laisiyao Lai Siyao added a comment -

            Yes, I also observed this. In my test environment I installed the client and the MDS on different nodes, and found the OOM on the client.

            mdiep Minh Diep added a comment -

            Actually, we mount the Lustre client on the MDS and make that node the NFS server, so the OOM relates more to the Lustre client.
            Note: if we reduce the filesize/blocksize to 1G instead of 6G, we don't see the OOM anymore.

            laisiyao Lai Siyao added a comment -

            Let me clarify one thing first: this OOM happened on the MDS, not on a Lustre client. I'll try to reproduce and collect memory usage on the MDS first.


            People

              Assignee: laisiyao Lai Siyao
              Reporter: mdiep Minh Diep
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: