[LU-907] ior test on nfs over lustre causes OOM in nfs server (lustre client) Created: 09/Dec/11  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Minh Diep Assignee: Lai Siyao
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File report.tgz    
Issue Links:
Related
is related to LU-814 automate NFSv3/v4 over Lustre Testing Resolved
Severity: 3
Rank (Obsolete): 4233

 Description   

When we run the IOR test on NFS over Lustre, we consistently hit an OOM on the NFS server / Lustre client.

10:04:52:Lustre: DEBUG MARKER: == sanity test iorssf: iorssf ======================================================================== 10:04:49 (1323367489)
10:05:17:Lustre: 2881:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323367509/real 0] req@ffff8100048c3800 x1387647840891989/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367516 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
10:05:17:Lustre: lustre-OST0000-osc-MDT0000: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
10:05:17:Lustre: 2881:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323367509/real 0] req@ffff81002a5f7400 x1387647840891991/t0(0) o400->lustre-OST0002-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367516 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
10:05:17:Lustre: lustre-OST0002-osc-MDT0000: Connection to service lustre-OST0002 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
10:05:18:Lustre: 2949:0:(ldlm_lib.c:634:target_handle_reconnect()) MGS: db6f2199-0ea2-5617-813c-4e0e918000c0 reconnecting
10:05:18:Lustre: 2949:0:(ldlm_lib.c:909:target_handle_connect()) MGS: connection from db6f2199-0ea2-5617-813c-4e0e918000c0@10.10.4.187@tcp t0 exp ffff81007617a400 cur 1323367517 last 1323367517
10:05:18:LustreError: 2949:0:(obd_class.h:509:obd_set_info_async()) obd_set_info_async: dev 0 no operation
10:05:22:Lustre: 2879:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323367516/real 1323367517] req@ffff81003dd67c00 x1387647840893130/t0(0) o8->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 368/512 e 0 to 1 dl 1323367522 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
10:05:22:Lustre: 2879:0:(client.c:1789:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
10:05:22:Lustre: lustre-OST0005-osc-ffff81004a393800: Connection to service lustre-OST0005 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
10:05:22:Lustre: Skipped 5 previous similar messages
10:05:22:Lustre: 2882:0:(import.c:525:import_select_connection()) lustre-OST0000-osc-MDT0000: tried all connections, increasing latency to 6s
10:05:22:Lustre: 2882:0:(import.c:525:import_select_connection()) lustre-OST0001-osc-MDT0000: tried all connections, increasing latency to 6s
10:05:25:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1323367516/real 1323367518] req@ffff81004d676c00 x1387647840893137/t0(0) o400->lustre-OST0000-osc-ffff81004a393800@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323367524 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
10:05:25:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) Skipped 16 previous similar messages
10:05:25:Lustre: lustre-OST0000-osc-ffff81004a393800: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
10:05:26:Lustre: Skipped 2 previous similar messages
10:05:29:Lustre: 2949:0:(ldlm_lib.c:634:target_handle_reconnect()) MGS: db6f2199-0ea2-5617-813c-4e0e918000c0 reconnecting
10:05:29:Lustre: 2949:0:(ldlm_lib.c:909:target_handle_connect()) MGS: connection from db6f2199-0ea2-5617-813c-4e0e918000c0@10.10.4.187@tcp t0 exp ffff81007617a400 cur 1323367528 last 1323367528
10:05:29:LustreError: 2949:0:(obd_class.h:509:obd_set_info_async()) obd_set_info_async: dev 0 no operation
10:05:30:LustreError: 11-0: an error occurred while communicating with 10.10.4.187@tcp. The ost_connect operation failed with -16
10:05:30:LustreError: 11-0: an error occurred while communicating with 10.10.4.187@tcp. The ost_connect operation failed with -16
INIT: version 2.86 reloading
10:34:10:automount invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
10:34:10:
10:34:10:Call Trace:
10:34:11: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
10:34:11: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
10:34:11: [<ffffffff80012fe4>] __do_page_cache_readahead+0x96/0x179
10:34:11: [<ffffffff8001391f>] filemap_nopage+0x14c/0x360
10:34:11: [<ffffffff80008964>] __handle_mm_fault+0x1fb/0x1039
10:34:12: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
10:34:12: [<ffffffff80062ff0>] thread_return+0x62/0xfe
10:34:12: [<ffffffff8005dde9>] error_exit+0x0/0x84
10:34:12:
10:34:12:Node 0 DMA per-cpu:
10:34:12:cpu 0 hot: high 0, batch 1 used:0
10:34:12:cpu 0 cold: high 0, batch 1 used:0
10:34:12:Node 0 DMA32 per-cpu:
10:34:12:cpu 0 hot: high 186, batch 31 used:43
10:34:12:cpu 0 cold: high 62, batch 15 used:49
10:34:12:Node 0 Normal per-cpu: empty
10:34:12:Node 0 HighMem per-cpu: empty
10:34:12:Free pages: 8648kB (0kB HighMem)
10:34:12:Active:32 inactive:7830 dirty:0 writeback:987 unstable:0 free:2162 slab:477062 mapped-file:1061 mapped-anon:599 pagetables:905
10:34:12:Node 0 DMA free:3028kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9728kB pages_scanned:0 all_unreclaimable? yes
10:34:12:lowmem_reserve[]: 0 2003 2003 2003
10:34:12:Node 0 DMA32 free:5620kB min:5712kB low:7140kB high:8568kB active:128kB inactive:31320kB present:2052060kB pages_scanned:82448 all_unreclaimable? yes
10:34:12:lowmem_reserve[]: 0 0 0 0
10:34:12:Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
10:34:12:lowmem_reserve[]: 0 0 0 0
10:34:13:Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
10:34:13:lowmem_reserve[]: 0 0 0 0
10:34:13:Node 0 DMA: 5*4kB 4*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3028kB
10:34:13:Node 0 DMA32: 1*4kB 4*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5620kB
10:34:13:Node 0 Normal: empty
10:34:13:Node 0 HighMem: empty
10:34:13:7862 pagecache pages
10:34:13:Swap cache: add 8522, delete 7923, find 82/144, race 0+0
10:34:13:Free swap = 4072336kB
10:34:13:Total swap = 4104596kB
10:34:13:Out of memory: Killed process 1942, UID 51, (sendmail).
10:34:15:automount invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
10:34:15:
10:34:15:Call Trace:
10:34:15: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
10:34:15: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
10:34:15: [<ffffffff80012fe4>] __do_page_cache_readahead+0x96/0x179
10:34:16: [<ffffffff8001391f>] filemap_nopage+0x14c/0x360
10:34:16: [<ffffffff80008964>] __handle_mm_fault+0x1fb/0x1039
10:34:16: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
10:34:16: [<ffffffff8005dde9>] error_exit+0x0/0x84
10:34:16:
10:34:16:Node 0 DMA per-cpu:
10:34:16:cpu 0 hot: high 0, batch 1 used:0
10:34:16:cpu 0 cold: high 0, batch 1 used:0
10:34:16:Node 0 DMA32 per-cpu:
10:34:16:cpu 0 hot: high 186, batch 31 used:113
10:34:16:cpu 0 cold: high 62, batch 15 used:49
10:34:16:Node 0 Normal per-cpu: empty
10:34:16:Node 0 HighMem per-cpu: empty
10:34:16:Free pages: 8648kB (0kB HighMem)
10:34:16:Active:6 inactive:7856 dirty:0 writeback:987 unstable:0 free:2162 slab:476992 mapped-file:1061 mapped-anon:599 pagetables:905
10:34:16:Node 0 DMA free:3028kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:9728kB pages_scanned:0 all_unreclaimable? yes
10:34:16:lowmem_reserve[]: 0 2003 2003 2003
10:34:16:Node 0 DMA32 free:5620kB min:5712kB low:7140kB high:8568kB active:24kB inactive:31424kB present:2052060kB pages_scanned:189044 all_unreclaimable? yes
10:34:16:lowmem_reserve[]: 0 0 0 0
10:34:17:Node 0 Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
10:34:17:lowmem_reserve[]: 0 0 0 0
10:34:17:Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
10:34:17:lowmem_reserve[]: 0 0 0 0
10:34:17:Node 0 DMA: 5*4kB 4*8kB 2*16kB 4*32kB 4*64kB 2*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 0*4096kB = 3028kB
10:34:17:Node 0 DMA32: 1*4kB 4*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5620kB
10:34:17:Node 0 Normal: empty
10:34:17:Node 0 HighMem: empty
10:34:17:7862 pagecache pages
10:34:17:Swap cache: add 8522, delete 7923, find 82/146, race 0+0
10:34:17:Free swap = 4072336kB
10:34:17:Total swap = 4104596kB
10:34:21:Lustre: 2880:0:(client.c:1789:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1323369251/real 0] req@ffff8100766e3400 x1387647842440889/t0(0) o400->lustre-OST0000-osc-MDT0000@10.10.4.187@tcp:28/4 lens 192/192 e 0 to 1 dl 1323369258 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
10:34:21:Lustre: lustre-OST0000-osc-MDT0000: Connection to service lustre-OST0000 via nid 10.10.4.187@tcp was lost; in progress operations using this service will wait for recovery to complete.
10:34:21:audispd invoked oom-killer: gfp_mask=0x200d2, order=0, oomkilladj=0
10:34:21:
10:34:21:Call Trace:
10:34:21: [<ffffffff800c96d5>] out_of_memory+0x8e/0x2f3
10:34:21: [<ffffffff8000f625>] __alloc_pages+0x27f/0x308
10:34:21: [<ffffffff80032905>] read_swap_cache_async+0x45/0xd8
10:34:21: [<ffffffff800cf48e>] swapin_readahead+0x60/0xd3
10:34:21: [<ffffffff800092cb>] __handle_mm_fault+0xb62/0x1039
10:34:21: [<ffffffff80153e83>] __next_cpu+0x19/0x28
10:34:21: [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
10:34:21: [<ffffffff8008dc34>] dequeue_task+0x18/0x37
10:34:21: [<ffffffff80062ff0>] thread_return+0x62/0xfe
10:34:21: [<ffffffff8005dde9>] error_exit+0x0/0x84
10:34:21: [<ffffffff8003055f>] unix_poll+0x0/0x99
10:34:34: [<ffffffff8002f8ee>] do_sys_poll+0x2b7/0x360
10:34:35: [<ffffffff8002f815>] do_sys_poll+0x1de/0x360
10:34:35: [<ffffffff8001ecfd>] __pollwait+0x0/0xe2
10:34:35: [<ffffffff8008e438>] default_wake_function+0x0/0xe
10:34:35: [<ffffffff8008e438>] default_wake_function+0x0/0xe
10:34:35: [<ffffffff8006e608>] do_gettimeoffset_kvm+0x9/0x1d
10:34:35: [<ffffffff8006ec4e>] do_gettimeofday+0x40/0x90
10:34:35: [<ffffffff8005aa0e>] getnstimeofday+0x10/0x28
10:34:35: [<ffffffff800a4939>] ktime_get_ts+0x1a/0x4e
10:34:35: [<ffffffff800bfea4>] delayacct_end+0x5d/0x86
10:34:35: [<ffffffff8004fca9>] finish_wait+0x32/0x5d
10:34:35: [<ffffffff80063933>] __wait_on_bit_lock+0x5b/0x66
10:34:35: [<ffffffff8003ff5b>] __lock_page+0x5e/0x64
10:34:35: [<ffffffff800a2931>] wake_bit_function+0x0/0x23
10:34:35: [<ffffffff800d4820>] swap_info_get+0x6a/0xa3
10:34:35: [<ffffffff800095b9>] __handle_mm_fault+0xe50/0x1039
10:34:35: [<ffffffff8006723e>] do_page_fault+0x4fe/0x874
10:34:35: [<ffffffff8004c6c6>] sys_poll+0x2d/0x34
10:34:35: [<ffffffff8005d116>] system_call+0x7e/0x83
10:34:35:
10:34:35:Node 0 DMA per-cpu:

Attached is the Maloo log. I can't get the report because the test hung due to the OOM.



 Comments   
Comment by Peter Jones [ 09/Dec/11 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 26/Dec/11 ]

Let me clarify one thing first: this OOM happened on the MDS, not on the Lustre client. I'll try to reproduce and collect memory usage on the MDS first.

Comment by Minh Diep [ 29/Dec/11 ]

Actually, we mount a Lustre client on the MDS and make it the NFS server, so the OOM relates more to the Lustre client.
Note: if we reduce the file size/block size to 1G instead of 6G, we no longer see the OOM.

Comment by Lai Siyao [ 29/Dec/11 ]

Yes, I also observed this. In my test environment I installed the client and the MDS on different nodes, and found the OOM on the client.

Comment by Lai Siyao [ 30/Dec/11 ]

I captured /proc/slabinfo before the OOM:

...

size-2048         229724 229724   2048    2    1 : tunables   24   12    8 : slabdata 114862 114862     84
size-1024(DMA)         0      0   1024    4    1 : tunables   54   27    8 : slabdata      0      0      0
size-1024         914900 914900   1024    4    1 : tunables   54   27    8 : slabdata 228725 228725      0
size-512(DMA)          0      0    512    8    1 : tunables   54   27    8 : slabdata      0      0      0
size-512          229328 229328    512    8    1 : tunables   54   27    8 : slabdata  28666  28666      0
size-256(DMA)          0      0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
size-256            1890   1920    256   15    1 : tunables  120   60    8 : slabdata    128    128      0
size-128(DMA)          0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
size-64(DMA)           0      0     64   59    1 : tunables  120   60    8 : slabdata      0      0      0
size-64           345858 345858     64   59    1 : tunables  120   60    8 : slabdata   5862   5862      0
size-32(DMA)           0      0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
size-128           32820  32820    128   30    1 : tunables  120   60    8 : slabdata   1094   1094      0
size-32           233968 233968     32  112    1 : tunables  120   60    8 : slabdata   2089   2089      0

It shows the generic slab caches consume a lot of memory (about 1.5G out of 2G total), and if I stop the test and unmount Lustre, this slab memory is freed immediately afterwards. I'll try to collect a memory debug log to find where this memory is allocated and not freed.
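
For reference, the ~1.5G figure can be recovered by summing num_objs * objsize over the generic size-* caches quoted above. A minimal sketch in Python, assuming the standard /proc/slabinfo column layout (name, active_objs, num_objs, objsize, ...); this is only an illustration, not part of the original test tooling:

# Rough estimate of memory held by the generic size-N slab caches.
# Assumes the usual slabinfo layout: name active_objs num_objs objsize obj/slab pages/slab : ...
total_bytes = 0
with open('/proc/slabinfo') as f:
    for line in f:
        fields = line.split()
        if len(fields) < 4 or not fields[0].startswith('size-'):
            continue                      # skip header lines and named caches
        num_objs = int(fields[2])         # total allocated objects
        objsize = int(fields[3])          # object size in bytes
        total_bytes += num_objs * objsize
print('generic slab usage: %.1f MB' % (total_bytes / 1048576.0))

Run against the numbers quoted above, this comes to roughly 1.45 GiB, consistent with the "1.5G against 2G total" observation.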

Comment by Lai Siyao [ 05/Jan/12 ]

The debug log shows the memory allocated above is uncommitted open requests. After looking further into the code, I found that open only fetches a LOOKUP|OPEN lock, while a later getattr takes a LOOKUP|UPDATE lock, which conflicts with the cached open lock; as a result the open lock cache for NFS export doesn't take effect at all (the 1.8 code takes an UPDATE lock for the non-create case).

I've made a fix for this and will push it for review after testing.

Comment by Lai Siyao [ 05/Jan/12 ]

The review is at http://review.whamcloud.com/#change,1921

Minh, could you help verify the fix?

Comment by Lai Siyao [ 08/Jan/12 ]

sanityn.sh test 13 failed, and I found two more issues with NFS support:

  • the open lock is taken for directories.
  • if a file is opened O_RDONLY, an LCK_CR lock is fetched, but LOOKUP|UPDATE with LCK_CR doesn't make sense, because setattr takes LCK_PW and getattr takes LCK_PR.

The patch is updated.

Comment by Oleg Drokin [ 16/Feb/12 ]

If the problem is a lot of uncommitted file opens, then we should attack the real problem and truncate the open requests in the uncommitted queue so that they only include the data needed for the current request, rather than all the empty space kept in the request buffer "just in case", since we already know the striping at that point.

Comment by Lai Siyao [ 24/Feb/12 ]

Oleg, I was busy with other bugs, sorry for the late reply.

For NFS export, the call trace on the Lustre client for a normal write looks like this:

.permission
  -> take LOOKUP|UPDATE dlm lock
.open
  -> revoke previous lock, take LOOKUP|OPEN lock
.getattr
  -> revoke previous lock, take LOOKUP|UPDATE lock

This happens for every NFS write, and it shows that the open lock not only has no effect but actually makes things worse: the UPDATE lock is not cached at all. So IMO the real problem is that the open lock cache doesn't work.

Besides, for open(O_RDONLY) a LOOKUP|OPEN lock with mode LCK_CR is fetched. If another client calls setattr(LA_MODE), this LOOKUP lock with LCK_CR mode won't be revoked, because setattr takes a lock with mode LCK_PW, which is compatible with LCK_CR. This is wrong, so a LOOKUP lock with LCK_CR mode doesn't make sense. If the OPEN lock is split from the LOOKUP lock, this problem can be solved.
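
For context, the compatibility claim above follows from the classic DLM lock-mode matrix; a minimal illustrative sketch below (the textbook matrix, not code copied from ldlm, which encodes the same relation with per-mode compatibility masks):

# Classic DLM compatibility matrix for modes NL, CR, CW, PR, PW, EX.
# Illustrative sketch only, not the Lustre implementation.
COMPAT = {
    'NL': {'NL', 'CR', 'CW', 'PR', 'PW', 'EX'},
    'CR': {'NL', 'CR', 'CW', 'PR', 'PW'},   # CR conflicts only with EX
    'CW': {'NL', 'CR', 'CW'},
    'PR': {'NL', 'CR', 'PR'},
    'PW': {'NL', 'CR'},
    'EX': {'NL'},
}

def compatible(granted, requested):
    return requested in COMPAT[granted]

print(compatible('CR', 'PW'))   # True  -> the cached LCK_CR LOOKUP lock gets no callback
print(compatible('PR', 'PW'))   # False -> an LCK_PR lock would be revoked by the setattr

So a setattr enqueue with LCK_PW never triggers a blocking callback on an LCK_CR LOOKUP lock, which is the behaviour described as wrong above; an LCK_PR LOOKUP lock would be revoked as expected.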

Finally, if only the OPEN lock is fetched, the dentry won't be found by others. But as I said at the beginning, for NFS .permission is always called before .open, so generally a valid dentry is already there.

In brief, if the OPEN lock is mainly for NFS export usage, it's best to split the OPEN lock from the LOOKUP lock.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
