[LU-2139] Tracking unstable pages Created: 14/Sep/12  Updated: 01/Sep/15  Resolved: 09/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: JL, MB
Environment:

2_3_49_92_1-llnl


Attachments: PNG File fio-vulcanio65-ib0-2012-09-20-2.png     File fio-vulcanio65-ib0-2012-09-20-2.tar.gz     PNG File fio-vulcanio65-ib0-2012-09-20-3.png     File fio-vulcanio65-ib0-2012-09-20-3.tar.gz     File isolate_lru_page.diff     File vulcanio10-ib0-1347641093961.lpagecache.bz2     File vulcanio10-ib0-1347641093961.meminfo     File vulcanio10-ib0-1347641093961.vmstat     File vulcanio18-ib0-2012-09-17-0.eps     PNG File vulcanio18-ib0-2012-09-17-0.png     File vulcanio18.tar.gz     PNG File vulcanio3-ib0-2012-09-18-0.png     File vulcanio3-ib0-2012-09-18-0.tar.gz    
Issue Links:
Blocker
is blocked by LU-3274 osc_cache.c:1774:osc_dec_unstable_pag... Resolved
is blocked by LU-3277 LU-2139 may cause the performance reg... Resolved
Duplicate
is duplicated by LU-70 Lustre client OOM with async journal ... Resolved
Related
is related to LU-5483 recovery-mds-scale test failover_mds:... Reopened
is related to LU-2576 Hangs in osc_enter_cache due to dirty... Resolved
is related to LU-3321 2.x single thread/process throughput ... Resolved
is related to LU-3277 LU-2139 may cause the performance reg... Resolved
is related to LU-3910 Interop 2.4.0<->2.5 failure on test s... Resolved
Severity: 3
Rank (Obsolete): 3127

 Description   

We've been seeing strange caching behavior on our PPC IO nodes, eventually resulting in OOM events. This is particularly harmful for us because there are critical system components running in user space on these nodes, forcing us to run with "panic_on_oom" enabled.

We see a large number of "Active File" pages reported by /proc/vmstat and /proc/meminfo, which spikes during Lustre IOR jobs. That is unusual for the test I am running: since I'm not running any executables out of Lustre, it should only be "inactive" IOR data accumulating in the page cache as a result of the Lustre IO. The really strange thing is that, prior to testing the Orion rebased code, "Active Files" would sometimes stay low (in the hundreds-of-MB range) and sometimes grow very large (in the 5G range). It's hard to tell if that variation still exists in the rebased code because the OOM events are hitting more frequently, basically every time I run an IOR.

We also see a large number of "Inactive File" pages, which we believe should be limited by the patch we carry from LU-744, but that doesn't seem to be the case:

commit 98400981e6d6e5707233be2c090e4227a77e2c46
Author: Jinshan Xiong <jinshan.xiong@whamcloud.com>
Date:   Tue May 15 20:11:37 2012 -0700

    LU-744 osc: add lru pages management - new RPC 
    
    Add a cache management at osc layer, this way we can control how much
    memory can be used to cache lustre pages and avoid complex solution
    as what we did in b1_8.
    
    In this patch, admins can set how much memory will be used for caching
    lustre pages per file system. A self-adapative algorithm is used to
    balance those budget among OSCs.
    
    Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
    Change-Id: I76c840aef5ca9a3a4619f06fcaee7de7f95b05f5
    Revision-Id: 21

From what I can tell, Lustre is trying to limit the cache to the value we are setting, 4G. When I dump the Lustre page cache I see roughly 4G worth of pages, but the number of pages listed does not reflect the values seen in vmstat and meminfo.

So I have a few questions which I'd like to get an answer to:

 1. Why are Lustre pages being marked as "referenced" and moved to the 
    Active list in the first place? Without any running executables
    coming from Lustre I would not expect this to happen.

 2. Why are more "Inactive File" pages accumulating on the system past
    the 4G limit we are trying to set within Lustre?

 3. Why are these "Inactive File" pages unable to be reclaimed when we
    hit a low-memory situation, ultimately resulting in an out-of-memory
    event and panic_on_oom triggering? This _might_ be related to (1)
    above.

I added a systemtap script to disable the panic_on_oom flag and dump the Lustre page cache, /proc/vmstat, and /proc/meminfo to try to gain some insight into the problem. I'll upload those files as attachments in case they prove useful.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 14/Sep/12 ]

From the symptoms you described, the root cause of this problem is that the pages are still in the kernel's LRU cache after Lustre drops them. The kernel should be able to move each page from active -> inactive -> free by zone scanning; however, this is done by a single-threaded kernel daemon, so the rate at which free pages are generated is limited.

Can you please tell me the write speed of the IOR job you saw on the node?

One way to fix this problem is to remove those pages out of kernel LRU list and drop them voluntarily. I will work out a patch soon.

Comment by Prakash Surya (Inactive) [ 14/Sep/12 ]

Right, it does appear Lustre is dropping the pages and they are sitting in the kernel page cache. I'm unsure why the kernel cannot then free these pages through its direct reclaim code paths. Does the speed at which we are moving pages affect this? I had assumed that a GFP_KERNEL allocation would block until the direct reclaim paths had run, at which point the kernel should have been able to reclaim all of these "inactive" file pages.

Voluntarily removing pages from the kernel LRU seems like it is just dodging the real issue at hand, which is that these "inactive" pages are not getting reclaimed in the direct memory reclamation code paths. Or perhaps I don't fully understand something?

I believe we've been seeing roughly 1.1 GB/s per IO node as reported by IOR for reads and writes.

Also, voluntarily removing pages doesn't address the fact that, from my understanding, we shouldn't be seeing any "active" file pages from Lustre during this test load.

Comment by Jinshan Xiong (Inactive) [ 14/Sep/12 ]

It's interesting that it couldn't get any free pages even though there were lots of inactive pages in the cache. Can you please monitor the CPU usage and /proc/meminfo when you're running it next time? Also, it would be a good idea to track ll_releasepage() to see the success rate of releasing pages from kswapd (returning 1 means success).

I also worked out a patch to verify whether removing pages from the kernel LRU helps. Please give it a try.

Comment by Jinshan Xiong (Inactive) [ 14/Sep/12 ]

patch to remove pages from kernel LRU voluntarily.

Comment by Prakash Surya (Inactive) [ 14/Sep/12 ]

Sure. I'll add some monitoring when I dig into this some more next week. Also, I don't really like applying arbitrary patches to our branches, so if you don't mind, when I apply the debug patch I'll also push it to Gerrit. That makes it easier for us to track exactly what versions of patches we are running.

Comment by Prakash Surya (Inactive) [ 17/Sep/12 ]

Pushed the debug patch here: http://review.whamcloud.com/4012

Comment by Jinshan Xiong (Inactive) [ 17/Sep/12 ]

Thank you Prakash. Please note that this is a debug patch, which can only help us understand the problem better.

Comment by Prakash Surya (Inactive) [ 17/Sep/12 ]

Jinshan, what are you expecting to learn from the debug patch? I installed it this morning and was still able to hit the OOM. I didn't have any scripts monitoring the CPU or memory usage, though, so the only information about the state of memory when the OOM hit is from the OOM message printed to the console:

2012-09-17 11:10:35.889990 {DefaultControlEventListener} [mmcs]{43}.7.0: active_anon:34693 inactive_anon:4894 isolated_anon:0
2012-09-17 11:10:35.890389 {DefaultControlEventListener} [mmcs]{43}.7.0:  active_file:31999 inactive_file:31929 isolated_file:1632
2012-09-17 11:10:35.890818 {DefaultControlEventListener} [mmcs]{43}.7.0:  unevictable:0 dirty:136 writeback:220 unstable:0
2012-09-17 11:10:35.891211 {DefaultControlEventListener} [mmcs]{43}.7.0:  free:16924 slab_reclaimable:463 slab_unreclaimable:10761
2012-09-17 11:10:35.891629 {DefaultControlEventListener} [mmcs]{43}.7.0:  mapped:187 shmem:6495 pagetables:932 bounce:0
2012-09-17 11:10:35.892047 {DefaultControlEventListener} [mmcs]{43}.7.0: DMA free:1083136kB min:1048576kB low:1310720kB high:1572864kB active_anon:2220352kB inactive_anon:313216kB active_file:2047936kB inactive_file:2043456kB unevictable:0kB isolated(anon):0kB isolated(file):104448kB present:16762816kB mlocked:0kB dirty:8704kB writeback:14080kB mapped:11968kB shmem:415680kB slab_reclaimable:29632kB slab_unreclaimable:688704kB kernel_stack:26544kB pagetables:59648kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:34761 all_unreclaimable? no
2012-09-17 11:10:35.892488 {DefaultControlEventListener} [mmcs]{43}.7.0: lowmem_reserve[]: 0 0 0
2012-09-17 11:10:35.892896 {DefaultControlEventListener} [mmcs]{43}.7.0: DMA: 1774*64kB 136*128kB 16*256kB 6*512kB 5*1024kB 2*2048kB 1*4096kB 3*8192kB 1*16384kB 1*32768kB 1*65536kB 2*131072kB 2*262144kB = 1077120kB
2012-09-17 11:10:35.893322 {DefaultControlEventListener} [mmcs]{43}.7.0: 70831 total pagecache pages
2012-09-17 11:10:35.893671 {DefaultControlEventListener} [mmcs]{43}.7.0: 262143 pages RAM
2012-09-17 11:10:35.894091 {DefaultControlEventListener} [mmcs]{43}.7.0: 914 pages reserved
2012-09-17 11:10:35.894472 {DefaultControlEventListener} [mmcs]{43}.7.0: 148112 pages shared
2012-09-17 11:10:35.894857 {DefaultControlEventListener} [mmcs]{43}.7.0: 175592 pages non-shared

Keep in mind, we're explicitly setting min_free_kbytes to 1G as a temporary workaround for atomic page allocation failures we are seeing elsewhere.

Also, there are many console messages as a result of the debug patch:

2012-09-17 11:10:34.610731 {DefaultControlEventListener} [mmcs]{43}.0.0: BUG: Bad page state in process sysiod  pfn:05564
2012-09-17 11:10:34.610782 {DefaultControlEventListener} [mmcs]{43}.0.0: page:c0000000015eade0 flags:0000000000000040 count:0 mapcount:0 mapping:(null) index:198c (Tainted: G    B      ----------------  )

It prints a backtrace too, but I don't think it's necessary to paste it here. My guess is the message is a result of calling page_cache_release twice on the same page, once in vvp_page_fini and again in vvp_page_fini_common..?

Comment by Jinshan Xiong (Inactive) [ 17/Sep/12 ]

No, that's not due to the page_cache_release() in vvp_page_fini(); as the comment there says, it just frees the page reference taken by isolate_lru_page(). Despite the page state error, it can control the number of LRU pages used for cache right now, as you can see here:

2012-09-17 11:10:35.890389 {DefaultControlEventListener} [mmcs]{43}.7.0: active_file:31999 inactive_file:31929

The total memory used for file cache is around 4G.

Please post the backtrace here, as I want to know if the bad page is seen at the allocation or free stage.

Comment by Jinshan Xiong (Inactive) [ 17/Sep/12 ]

I guess the bad page is because the page's active bit is still set, so I pushed patch set 2 for this. I don't know why I didn't see the problem while I was testing the patch.

Please use vmstat to monitor the memory state and top to monitor who is using most of the CPU time. Thanks.

Comment by Prakash Surya (Inactive) [ 17/Sep/12 ]

Here is the backtrace from the console:

2012-09-17 11:10:34.610731 {DefaultControlEventListener} [mmcs]{43}.0.0: BUG: Bad page state in process sysiod  pfn:05564
2012-09-17 11:10:34.610782 {DefaultControlEventListener} [mmcs]{43}.0.0: page:c0000000015eade0 flags:0000000000000040 count:0 mapcount:0 mapping:(null) index:198c (Tainted: G    B      ----------------  )
2012-09-17 11:10:34.610832 {DefaultControlEventListener} [mmcs]{43}.0.0: Call Trace:
2012-09-17 11:10:34.610883 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee0b0] [c000000000008190] .show_stack+0x7c/0x184 (unreliable)
2012-09-17 11:10:34.610934 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee160] [c00000000009a338] .bad_page+0x148/0x184
2012-09-17 11:10:34.610985 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee1f0] [c00000000009f620] .__put_single_page+0x20/0x38
2012-09-17 11:10:34.611035 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee270] [800000000661a53c] .vvp_page_fini_common+0x3c/0x200 [lustre]
2012-09-17 11:10:34.611086 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee310] [800000000239da44] .cl_page_free+0x164/0x9b0 [obdclass]
2012-09-17 11:10:34.611137 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee3d0] [800000000239e5fc] .cl_page_put+0x36c/0x8e0 [obdclass]
2012-09-17 11:10:34.611188 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee500] [80000000043e2d34] .discard_pagevec+0xb4/0x150 [osc]
2012-09-17 11:10:34.611240 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee5c0] [80000000043e36cc] .osc_lru_shrink+0x8fc/0x1410 [osc]
2012-09-17 11:10:34.611291 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee750] [80000000043e468c] .osc_lru_del+0x3cc/0x780 [osc]
2012-09-17 11:10:34.611341 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee880] [80000000043e6060] .osc_page_delete+0x150/0x550 [osc]
2012-09-17 11:10:34.611392 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ee950] [800000000239c0a0] .cl_page_delete0+0x150/0x820 [obdclass]
2012-09-17 11:10:34.611443 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06eea30] [800000000239c7f4] .cl_page_delete+0x84/0x250 [obdclass]
2012-09-17 11:10:34.611493 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06eeae0] [80000000066015f0] .ll_releasepage+0x170/0x1b0 [lustre]
2012-09-17 11:10:34.611544 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06eeb90] [c00000000009329c] .try_to_release_page+0x68/0x8c
2012-09-17 11:10:34.611595 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06eec00] [c0000000000a30cc] .shrink_page_list.clone.0+0x33c/0x550
2012-09-17 11:10:34.611646 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06eedc0] [c0000000000a36ec] .shrink_inactive_list+0x40c/0x71c
2012-09-17 11:10:34.611697 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06eefa0] [c0000000000a3cd8] .shrink_zone+0x2dc/0x2e0
2012-09-17 11:10:34.611747 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef0d0] [c0000000000a4808] .try_to_free_pages+0x208/0x3d4
2012-09-17 11:10:34.611798 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef230] [c00000000009c110] .__alloc_pages_nodemask+0x42c/0x6ec
2012-09-17 11:10:34.611849 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef3b0] [c000000000094b28] .grab_cache_page_write_begin+0x7c/0xec
2012-09-17 11:10:34.611900 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef460] [8000000006601744] .ll_write_begin+0x94/0x270 [lustre]
2012-09-17 11:10:34.611951 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef530] [c0000000000958a4] .generic_file_buffered_write+0x140/0x354
2012-09-17 11:10:34.612001 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef670] [c000000000096014] .__generic_file_aio_write+0x374/0x3d8
2012-09-17 11:10:34.612052 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef770] [c0000000000960f0] .generic_file_aio_write+0x78/0xe8
2012-09-17 11:10:34.612103 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef820] [800000000661fecc] .vvp_io_write_start+0xfc/0x3e0 [lustre]
2012-09-17 11:10:34.612154 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef8f0] [80000000023acb4c] .cl_io_start+0xcc/0x220 [obdclass]
2012-09-17 11:10:34.612205 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06ef990] [80000000023b4964] .cl_io_loop+0x194/0x2c0 [obdclass]
2012-09-17 11:10:34.612256 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06efa40] [800000000659a028] .ll_file_io_generic+0x498/0x680 [lustre]
2012-09-17 11:10:34.612307 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06efb30] [800000000659a694] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
2012-09-17 11:10:34.612358 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06efc00] [800000000659a9b0] .ll_file_write+0x150/0x320 [lustre]
2012-09-17 11:10:34.612409 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06efce0] [c0000000000cbbec] .vfs_write+0xd0/0x1c4
2012-09-17 11:10:34.612460 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06efd80] [c0000000000cbddc] .SyS_write+0x54/0x98
2012-09-17 11:10:34.612511 {DefaultControlEventListener} [mmcs]{43}.0.0: [c0000003e06efe30] [c000000000000580] syscall_exit+0x0/0x2c
Comment by Prakash Surya (Inactive) [ 17/Sep/12 ]

I guess the bad page is because the page's active bit is still set, so I pushed patch set 2 for this. I don't know why I didn't see the problem while I was testing the patch.

This is part (if not the root cause) of the problem. We shouldn't be seeing the active bit getting set for any Lustre pages for this test load.

Comment by Prakash Surya (Inactive) [ 17/Sep/12 ]
  • vulcanio18.tar.gz - Logs of vmstat and meminfo of vulcanio18 over time. The script dumped those files once per second until it hit the OOM.
  • vulcanio18-ib0-2012-09-17-0.eps - A graph of the free memory, active files, and inactive files from meminfo vs. time.
Comment by Prakash Surya (Inactive) [ 17/Sep/12 ]

I need to set up the job to run on a single IO node in order to effectively watch "top" to view the CPU usage. As an alternative, I was thinking about running oprofile on the IO nodes.

Comment by Jinshan Xiong (Inactive) [ 18/Sep/12 ]

Hi Prakash, I can't open the .eps file and failed to convert it to JPG with an online tool. Did you run the test with patch set 2, and can you still see the bad_page messages?

Comment by Prakash Surya (Inactive) [ 18/Sep/12 ]

Here is the same image in PNG format.

Comment by Prakash Surya (Inactive) [ 18/Sep/12 ]

The attachments are from running with patch set 2, and I did not see the bad_page messages.

Comment by Jinshan Xiong (Inactive) [ 18/Sep/12 ]

From the picture, it looks all right until around the 60-second mark, but the amount of memory in active and inactive file pages was still around 4G. I don't know what happened at that time; maybe the application was requesting a large number of anon pages - there is already a large number of anon pages in the system. Another strange thing is that the total size of the pages in use is far below the system memory size; I don't know where the rest goes.

After checking the kernel source code, it looks like it can't shrink inactive pages if the PTE referenced bit is still set on the page, but that can only happen if those pages are mmapped.

Can you please set llite.*.max_cached_mb to a lower value, say 1G or 2G, and try again? I don't know if my algorithm has a problem on a PPC node; it works great on Ihara's machine.

Comment by Prakash Surya (Inactive) [ 18/Sep/12 ]

From the picture, it looks all right until around the 60-second mark, but the amount of memory in active and inactive file pages was still around 4G. I don't know what happened at that time; maybe the application was requesting a large number of anon pages - there is already a large number of anon pages in the system. Another strange thing is that the total size of the pages in use is far below the system memory size; I don't know where the rest goes.

Well I don't think anything is changing in regards to the anon pages. Looking at the meminfo logs, I don't see a spike in the anon pages throughout the duration of the test. I also noticed the sum of the fields in meminfo did not add up to the total amount of memory in the system. Is this normal?

After checking the kernel source code, it looks like it can't shrink inactive pages if the PTE referenced bit is still set on the page, but that can only happen if those pages are mmapped.

This is the interesting part.. kswapd will set the referenced bit on the page only if page_mapped returns true. It will then move the page to the active list in a later iteration if the reference bit is set.

Can you please set llite.*.max_cached_mb to a lower value, say 1G or 2G, and try again? I don't know if my algorithm has a problem on a PPC node; it works great on Ihara's machine.

Sure. I'm not sure what this will tell us though.

Comment by Prakash Surya (Inactive) [ 18/Sep/12 ]

I added another debug patch which dumps more of the page flags from /proc/fs/lustre/llite/*/dump_page_cache: http://review.whamcloud.com/4035. None of the pages are reporting as being "mapped", so I'm curious how the reference bit is getting set without page_mapped returning true.

Comment by Prakash Surya (Inactive) [ 18/Sep/12 ]

This is the same IOR test but with "max_cached_mb=1024" as opposed to the previous setting of 4096.

  • vulcanio3-ib0-2012-09-18-0.png: A graph of inactive and active file fields from /proc/meminfo vs time.
  • vulcanio3-ib0-2012-09-18-0.tar.gz: The data used to generate the graph, along with vmstat data over time.
Comment by Jinshan Xiong (Inactive) [ 19/Sep/12 ]

Well I don't think anything is changing in regards to the anon pages. Looking at the meminfo logs, I don't see a spike in the anon pages throughout the duration of the test. I also noticed the sum of the fields in meminfo did not add up to the total amount of memory in the system. Is this normal?

I think this is abnormal, as I can usually sum cache and free memory to get roughly the total memory.

I added another debug patch which dumps more of the page flags from /proc/fs/lustre/llite/*/dump_page_cache: http://review.whamcloud.com/4035. None of the pages are reporting as being "mapped", so I'm curious how the reference bit is getting set without page_mapped returning true.

I was just guessing, and that verified my guess was wrong; that's it.

Actually your latest run tells us lots of information. It looks like 6G was used up in 8 seconds (from 1348011717 to 1348011725) and I don't know where it went. Please compare the output of {vmstat|meminfo}_{1348011717|1348011725}.log: most fields are close to each other, the only exception being that free_mem dropped from 7G to 1G. The most interesting thing is that there is NO attempt to free lru pages directly before hitting OOM.

We need help from a VM guru.

Comment by Jinshan Xiong (Inactive) [ 19/Sep/12 ]

Hi Andreas, can you please take a look at the problem I mentioned above?

Comment by Prakash Surya (Inactive) [ 19/Sep/12 ]

I'm not sure if /proc/{meminfo,vmstat} can be trusted. I ran the test again, with the "Remove pages from kernel LRU" patch reverted. The values printed on the console as part of the OOM message are vastly different from what is reported in the proc files.

For example, I installed a systemtap script to dump meminfo, vmstat, and Lustre's page cache when the out_of_memory function is called. Here is a portion of those files compared against the console dump:

/proc/vmstat contents:

vulcanlac1@root:cat vulcanio81-ib0-1348084855213.vmstat | head -n20
nr_free_pages 173578
nr_inactive_anon 4931
nr_active_anon 35493
nr_inactive_file 1070
nr_active_file 29994
nr_unevictable 0
nr_mlock 0
nr_anon_pages 33803
nr_mapped 452 
nr_file_pages 23716
nr_dirty 69
nr_writeback 0
nr_slab_reclaimable 482 
nr_slab_unreclaimable 10505
nr_page_table_pages 925 
nr_kernel_stack 1662
nr_unstable 0
nr_bounce 0
nr_vmscan_write 908 
nr_writeback_temp 0

Console Log:

2012-09-19 13:00:55.386193 {DefaultControlEventListener} [mmcs]{72}.6.0: active_anon:35511 inactive_anon:4931 isolated_anon:0
2012-09-19 13:00:55.386632 {DefaultControlEventListener} [mmcs]{72}.6.0:  active_file:26357 inactive_file:162020 isolated_file:992
2012-09-19 13:00:55.387202 {DefaultControlEventListener} [mmcs]{72}.6.0:  unevictable:0 dirty:1 writeback:65 unstable:0
2012-09-19 13:00:55.387737 {DefaultControlEventListener} [mmcs]{72}.6.0:  free:14709 slab_reclaimable:481 slab_unreclaimable:10587
2012-09-19 13:00:55.388285 {DefaultControlEventListener} [mmcs]{72}.6.0:  mapped:490 shmem:6637 pagetables:946 bounce:0

My assumption was that the values should be roughly the same, but this is not the case. Specifically, "inactive_file" and "nr_inactive_file" are vastly different (two orders of magnitude). There's a similar discrepancy between "free" and "nr_free_pages". Perhaps I'm incorrectly assuming these values are reporting the same information?

A similar discrepancy is found with the meminfo file.

/proc/meminfo contents:

vulcanlac1@root:cat vulcanio81-ib0-1348084855213.meminfo | head -n20
MemTotal:       16718528 kB
MemFree:        11126336 kB
Buffers:               0 kB
Cached:          1519424 kB
SwapCached:            0 kB
Active:          4192896 kB
Inactive:         365248 kB
Active(anon):    2271360 kB
Inactive(anon):   315584 kB
Active(file):    1921536 kB
Inactive(file):    49664 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              3520 kB
Writeback:             0 kB
AnonPages:       2163008 kB
Mapped:            28864 kB
Shmem:            424768 kB

Console log:

2012-09-19 13:00:55.388867 {DefaultControlEventListener} [mmcs]{72}.6.0: DMA free:941376kB min:1048576kB low:1310720kB high:1572864kB active_anon:2272704kB inactive_anon:315584kB active_file:1686848kB inactive_file:10369280kB unevictable:0kB isolated(anon):0kB isolated(file):63488kB present:16762816kB mlocked:0kB dirty:64kB writeback:4160kB mapped:31360kB shmem:424768kB slab_reclaimable:30784kB slab_unreclaimable:677568kB
 kernel_stack:26640kB pagetables:60544kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1320128 all_unreclaimable? yes

The most interesting thing is that there is NO attempt to free lru pages directly before hitting OOM.

I have been assuming the VM was going through the reclaim logic but was just not able to free anything because of the "active" bit being set on the Lustre pages. But I'm no longer so sure this is happening. Perhaps the real issue is that the reclaim logic isn't occurring at all? I should try to verify whether or not this is happening.

Comment by Jinshan Xiong (Inactive) [ 19/Sep/12 ]

... Perhaps I'm incorrectly assuming these values are reporting the same information?

I just checked the code; they should fetch the data from the same source (vmstat.c and __show_free_areas()). Does that mean it changed within that short time?

Comment by Prakash Surya (Inactive) [ 19/Sep/12 ]

I'm not sure if this is relevant, but pages just written to are taking many seconds to be reclaimed when I "echo 3 > /proc/sys/vm/drop_caches" on the system. Is this normal?

For example, I have a systemtap script monitoring ll_releasepage, and that function returns 0 for many seconds before returning 1. My test case is a simple "dd" of a 1M file followed by multiple invocations of drop_caches.

In one shell:

vulcanio65-ib0@root:dd if=/dev/urandom of=/p/lstest/surya1/tmp.file count=16 bs=64K
16+0 records in
16+0 records out
1048576 bytes (1.0 MB) copied, 0.9945 s, 1.1 MB/s
vulcanio65-ib0@root:while true; do echo 3 > /proc/sys/vm/drop_caches; sleep 1; done

And in the other is my systemtap monitoring:

vulcanio65-ib0@root:/g/g0/surya1/seq-oom/ll_releasepage_probe.stp
Starting probe.
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    34 bash(3985):<-ll_releasepage return=0x0
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    30 bash(3985):<-ll_releasepage return=0x0
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    30 bash(3985):<-ll_releasepage return=0x0
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    30 bash(3985):<-ll_releasepage return=0x0
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    29 bash(3985):<-ll_releasepage return=0x0
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    29 bash(3985):<-ll_releasepage return=0x0
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    30 bash(3985):<-ll_releasepage return=0x0
     0 bash(3985):->ll_releasepage vmpage=0xc0000000014edaa0 gfp_mask=0x0
    90 bash(3985):<-ll_releasepage return=0x1

So in this test case, it took nearly 7 seconds for the page to be reclaimed. Should I be alarmed by that? Is it because of network latency, waiting for the page to safely reach the server?

Reads, on the other hand, don't suffer this "lag":

vulcanio65-ib0@root:dd if=/p/lstest/surya1/tmp.file of=/dev/null count=16 bs=64K
16+0 records in
16+0 records out
1048576 bytes (1.0 MB) copied, 0.0125329 s, 83.7 MB/s
vulcanio65-ib0@root:while true; do echo 3 > /proc/sys/vm/drop_caches; sleep 1; done
vulcanio65-ib0@root:/g/g0/surya1/seq-oom/ll_releasepage_probe.stp
Starting probe.
     0 bash(3985):->ll_releasepage vmpage=0xc0000000022b6b40 gfp_mask=0x0
    95 bash(3985):<-ll_releasepage return=0x1
Comment by Jinshan Xiong (Inactive) [ 19/Sep/12 ]

So in this test case, it took nearly 7 seconds for the page to be reclaimed. Should I be alarmed by that? Is it because of network latency, waiting for the page to safely reach the server?

The writeback interval should be 5 seconds by default so the result is not abnormal to me.

We can verify this by running a `sync' before dropping caches, or we can print the page state in the systemtap script. In ll_releasepage(), the page won't be dropped in the following scenarios:
1. The page is dirty or being written back;
2. The page still has extra users, i.e. the refcount is greater than 3 (caller, page private and page cache);
3. We can't allocate the lu_env and keys - but this is unlikely in your case.
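    
As a rough C sketch of the release decision above (hypothetical helper name; the real ll_releasepage() also has to look up the cl_page and lu_env before it can decide):

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /*
     * Hypothetical sketch of the release decision: a page can only be
     * dropped when it is clean, not under writeback, and has no users
     * beyond the caller, the page cache, and ->private.
     */
    static int can_release_page(struct page *page)
    {
            if (PageDirty(page) || PageWriteback(page))
                    return 0;       /* scenario 1: dirty or in writeback */
            if (page_count(page) > 3)
                    return 0;       /* scenario 2: extra references held */
            return 1;               /* otherwise safe to release */
    }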

Did you monitor all pages and just show me the vmpage @ 0xc0000000014edaa0, or could the script only see one page?

To simplify the case, can you please try some other performance benchmark tool, for example iozone or even dd, to see if we can still hit the same issue? Moreover, we can try to revert the LU-744 patch and see how things go. Sorry, I have no idea why the memory was used up in such a short period, so let's try to narrow down the problem.

Comment by Andreas Dilger [ 20/Sep/12 ]

Actually, 7s is the interval for data written asynchronously on the OST to be reported committed on the client, if there is no more IO in flight which would report the commit.

To be specific, data written with async journal commit on the OST will schedule an OBD_PING for 7s after the write is completed, so that if no more RPCs are sent it will send the "short ping" to get back an updated last_committed value from the OST (after the 5s journal commit) and can free the RPC(s) that are waiting in case of replay and pinning the client pages.

However, this 7s interval only applies to data written in the previous few seconds and uncommitted on the OST. Writes should typically commit earlier than this, and will piggyback the last_committed update on other in-flight writes.

If there is only a single client writing to the OST, this can be problematic. We have discussed some potential mechanisms to allow the client to request earlier transaction commits, but nothing has been implemented yet. It does seem that the Sequoia configuration will be running in a mode with only a single client per OST at any time, so this needs to be addressed.

Oleg should be able to point at the existing bug for this issue.

Comment by Prakash Surya (Inactive) [ 20/Sep/12 ]

Do you monitor all pages and just show me the vmpage @ 0xc0000000014edaa0, or the script could only see one page?

I limited the output of the systemtap script to the first page it saw. Otherwise it would show ll_releasepage called on all 16 pages of the 1M file, as expected.

Comment by Prakash Surya (Inactive) [ 20/Sep/12 ]

Actually, 7s is the interval for data written asynchronously on the OST to be reported committed on the client, if there is no more IO in flight which would report the commit.

To be specific, data written with async journal commit on the OST will schedule an OBD_PING for 7s after the write is completed, so that if no more RPCs are sent it will send the "short ping" to get back an updated last_committed value from the OST (after the 5s journal commit) and can free the RPC(s) that are waiting in case of replay and pinning the client pages.

However, this 7s interval only applies to data written in the previous few seconds and uncommitted on the OST. Writes should typically commit earlier than this, and will piggyback the last_committed update on other in-flight writes.

OK. Thanks for the clarification. In that case this doesn't appear to be related, since there will be other IO going on in the OOM case unlike my simple "dd" test above.

Comment by Jinshan Xiong (Inactive) [ 20/Sep/12 ]

Good point on async commit; I realize this may be the root cause of this problem - it means the OSC LRU can't really free those pages, as they are pinned by ptlrpc.

Can you please tell me how many OSTs are involved in your test, and what's the configuration of max pages per rpc?

As a workaround, you can try disabling async commit on the OSS; I would guess your client nodes will then survive.

Comment by Prakash Surya (Inactive) [ 20/Sep/12 ]

Can you please tell me how many OSTs are involved in your test, and what's the configuration of max pages per rpc?

128 OSTs for the mounted filesystem, and 16 pages per RPC.

vulcanio65-ib0@root:cat /proc/fs/lustre/osc/lstest-OST0281-osc-c0000003c802f000/max_pages_per_rpc 
16
vulcanio65-ib0@root:cat /proc/fs/lustre/osc/lstest-OST0281-osc-c0000003c802f000/max_rpcs_in_flight 
8
vulcanio65-ib0@root:ls -1d /proc/fs/lustre/osc/lstest-OST* | wc -l
128
Comment by Prakash Surya (Inactive) [ 20/Sep/12 ]

To simplify the case, can you please try some other performance benchmark tool, for example iozone or even dd, to see if we can still hit the same issue?

I have been able to hit the same issue running FIO on the IO node itself, taking the BGQ IO forwarding "stuff" out of the picture. I also tried "dd", but wasn't able to hit the OOM.

Comment by Prakash Surya (Inactive) [ 20/Sep/12 ]

Here are monitoring dumps from a couple runs of FIO which hit the OOM. Each test successfully ran the first read/write iteration, and hit the OOM during the second iteration (from what I can tell, as it was transitioning to the read phase).

The two tests differ in that, test "2012-09-20-3" had the rev-id 2 of "ORI-781 vvp: Remove pages from kernel LRU" applied, and "2012-09-20-2" had that patch reverted.

As you can see, the "Inactive" file pages varies significantly between the two tests.

It's interesting to see how the first r/w iteration in each test was able to eventually evict memory and get back into a "mostly free" memory state prior to starting the second iteration.

I'm curious if Lustre is just reclaiming too "slowly", maybe a result of the "async commit" mechanics? So..

1. Writes cause pages to accumulate in "inactive" pages
2. Reads start, pushing more pages into pagecache and causing a low memory situation
3. Reclaim starts, but Lustre returns 0 from ll_releasepage for the recently written pages
4. No memory is reclaimed as a result of (3), resulting in an OOM even in direct reclaim code paths.

But that's mere speculation at this point..?

Comment by Andreas Dilger [ 21/Sep/12 ]

It would be interesting to try running the OSTs with a 1s commit interval, to see if this affects the problem on the client. There is a pool tunable for ZFS, and a mount option for ldiskfs.

It is my thought that the difficulty in reclaiming memory (especially if the write load is from a single client) is due to the asynchronous journal commits on the OST. This pins the dirty pages in the client cache for up to 5s at a time.

Cheers, Andreas

Comment by Andreas Dilger [ 21/Sep/12 ]

16 pages per RPC.

Is that correct? Do you have large pages on the PPC clients, or should this be 256 pages per RPC?

Comment by Prakash Surya (Inactive) [ 21/Sep/12 ]

It would be interesting to try running the OSTs with a 1s commit interval, to see if this affects the problem on the client. There is a pool tunable for ZFS, and a mount option for ldiskfs.

It is my thought that the difficulty in reclaiming memory (especially if the write load is from a single client) is due to the asynchronous journal commits on the OST. This pins the dirty pages in the client cache for up to 5s at a time.

Brian and I were thinking the same thing. I should get a chance to run that test later today.

Is that correct? Do you have large pages on the PPC clients, or should this be 256 pages per RPC?

The PPC clients have 64K pages, so in total it's the same (16*64K=1M, 256*4K=1M).

Comment by Jinshan Xiong (Inactive) [ 21/Sep/12 ]

The two tests differ in that, test "2012-09-20-3" had the rev-id 2 of "ORI-781 vvp: Remove pages from kernel LRU" applied, and "2012-09-20-2" had that patch reverted.

As you can see, the "Inactive" file pages varies significantly between the two tests.

From the test results, I think the patch to remove pages from the kernel LRU worked great, as it controls Lustre's memory usage more accurately.

For the OOM problem, this is a problem of async commit. This problem can easily be hit at the end of a test because, if there are no upcoming RPCs to the OSTs, the client has to wait up to 7 seconds until an OBD_PING RPC is sent to the OSTs to piggyback the committed transno and unpin the cached pages. This is why ll_releasepage() couldn't free those pages: ptlrpc took an extra reference on them.

1. Writes cause pages to accumulate in "inactive" pages
2. Reads start, pushing more pages into pagecache and causing a low memory situation
3. Reclaim starts, but Lustre returns 0 from ll_releasepage for the recently written pages
4. No memory is reclaimed as a result of (3), resulting in an OOM even in direct reclaim code paths.

I guess the following happened: at the end of writing, lots of inactive pages had accumulated but they couldn't be freed, as they were still pinned by ptlrpc. After a while, another iteration started, which sent RPCs and piggybacked the committed transno. This is why the free memory was restored at the beginning of the next iteration.

In theory this should be a common problem and should be easily seen in performance tests. I think I'm still missing something in my analysis.

To verify it, can you please try to disable async commit on the OST and see if this can survive the clients?

Is that correct? Do you have large pages on the PPC clients, or should this be 256 pages per RPC?

This is a PPC node using a 64KB page size, IIRC.

Comment by Brian Behlendorf [ 21/Sep/12 ]

> To verify it, can you please try to disable async commit on the OST and see if this can survive the clients?

Yes, we can try this today. However, I expect that making the I/O synchronous will significantly degrade the OSS performance. It might be so bad that we simply don't hit the issue due to the reduced throughput at the client.

I think the more interesting test is to reduce the txg sync frequency from 5 seconds to 1 second. This should minimize the performance impact and reduce the number of pinned pages on the client. The default 5 second interval means we can easily pin 8.5GB (5s * 1.7GB/s = 8.5GB) of clean pages on the client. Dropping this to one second should limit it to 1.7GB (or so). That leaves reclaim on the client with far more eligible inactive pages to work with.

This may even improve performance by reducing the number of ll_releasepage() calls which fail. In fact, it might be best to proactively inform the VM that these pages aren't reclaimable. One way to do this would be for Lustre to automatically move the pages onto the active LRU when they are submitted for writeback. This would prevent kswapd and direct reclaim from initially considering them during reclaim. They could then be moved to the inactive LRU (assuming there are no mmap() users) when Lustre releases the ptlrpc reference. Alternately, Lustre might just set PG_referenced on the page and let kswapd automatically promote it to the active LRU. As a bonus you would be able to glance at Active(file) in /proc/meminfo and get a rough idea of how many pages are in this state. This would even be consistent with the kernel's definition of an active page.

This wouldn't fix this issue but it might be a nice performance improvement.
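
A rough illustration of the PG_referenced idea above (hypothetical hook name; not an actual Lustre function):

    #include <linux/mm.h>
    #include <linux/page-flags.h>

    /*
     * Hypothetical hook called when a page is handed to ptlrpc for an
     * async write RPC.  PG_referenced hints to reclaim that the page
     * should be promoted to the active LRU rather than scanned now;
     * once the covering transno commits, the hint could be cleared so
     * the page becomes an ordinary reclaimable cache page again.
     */
    static void hint_page_pinned(struct page *page)
    {
            SetPageReferenced(page);
    }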

> For the OOM problem, this is a problem of async commit. This problem can easily be hit at the end of a test because, if there are no upcoming RPCs to the OSTs, the client has to wait up to 7 seconds until an OBD_PING RPC is sent to the OSTs to piggyback the committed transno and unpin the cached pages. This is why ll_releasepage() couldn't free those pages: ptlrpc took an extra reference on them.

Yes, and no. It does seem like this is an async commit issue, although the problem doesn't occur at the end of the test. It occurs when IOR is transitioning from the write to the read phase. This is where the inactive file page spike occurs in the fio-vulcanio65-ib0-2012-09-20-2 graph.

The strange thing is that I would have expected the LU-744 changes to prevent this. If we're properly tracking all of the Lustre pages, then the right thing to do is block in ll_file_read() until one of these pinned pages (or another) gets released. We should not be allocating read pages past the defined limit, although we clearly are as shown by the spike.

Jinshan can you review the latest LU-744 patch to determine how this can happen?

Hopefully, in the short term we'll be able to just reduce the txg sync thread frequency until LU-744 can be fixed to limit this case.

Comment by Jinshan Xiong (Inactive) [ 21/Sep/12 ]

Yes, we can try this today. However, I expect that making the I/O synchronous will significantly degrade the OSS performance. It might be so bad that we simply don't hit the issue due to the reduced throughput at the client.

well at least you can use it while we're fixing the issue.

... to reduce the txg sync frequency from 5 seconds to 1 second

Setting the txg sync interval alone won't fix the problem; you also need to reduce the interval of PING_INTERVAL_SHORT to 1 second, for example.

This may even improve performance by reducing the .... This would even be consistent with the kernel's definition of an active page.

This is a good idea. The only way to exclude these pages from kswapd page scanning is to take them out of the kernel LRU and mark them unevictable (so that they won't disappear from page accounting). The only problem right now is that isolate_lru_page() is not exported by the kernel, and it is the only interface for manipulating the kernel LRU.

Jinshan can you review the latest LU-744 patch to determine how this can happen?

http://review.whamcloud.com/4012 is a start toward solving this problem. I don't know why this didn't happen on Ihara's x86 node even though it can write much faster. Do you have an x86 node to run the same test set?

Comment by Brian Behlendorf [ 21/Sep/12 ]

> well at least you can use it while we're fixing the issue.

Absolutely, if we need to!

> Setting the txg sync interval alone won't fix the problem; you also need to reduce the interval of PING_INTERVAL_SHORT to 1 second, for example.

Are you sure? My expectation was that reducing the short ping time wouldn't be needed since we have a constant stream of writes/reads to the server for the last committed ID to piggyback on. This should keep the client memory usage constrained. The OOM happens after the writes complete and we are starting the reads.

> The only way to exclude these pages from kswapd page scanning is to take them out of the kernel LRU and mark them unevictable

That's unfortunate. After looking at the code I do see that if shrink_page_list->try_to_release_page->ll_releasepage() fails, PG_Active will get set on the page. I think this nicely explains the accumulation of pages on the active list we initially reported. So it seems the VFS should only be scanning each of these pages once. Still, it would be better if we could find a way to exempt them entirely from the scan.
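
For reference, a simplified paraphrase of that reclaim step from a 2.6.32-era kernel (details vary between kernel versions; this is not the verbatim code):

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /*
     * Simplified paraphrase of the relevant step in shrink_page_list():
     * if the filesystem's ->releasepage() (ll_releasepage() for Lustre)
     * refuses to drop a page that still has private data, the page is
     * activated instead of freed -- which is where the growing
     * Active(file) count comes from.
     */
    static int reclaim_one_page(struct page *page, gfp_t gfp_mask)
    {
            if (page_has_private(page) &&
                !try_to_release_page(page, gfp_mask)) {
                    SetPageActive(page);    /* the "activate_locked" path */
                    return 0;               /* kept, moved to active LRU */
            }
            return 1;                       /* page can be freed */
    }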

> Do you have an x86 node to run the same test set

Yes, in fact we regularly run the test on our x86_64 machines. The key difference here is those machines have 64GB of total memory and the ppc64 machines only have 16GB. Pinning 8GB on the x86_64 box isn't a huge deal, there's still lots of memory available. On the ppc64 box it's much more significant.

Comment by Brian Behlendorf [ 21/Sep/12 ]

When setting 'sync_journal=1' on our OSSs we hit a kernel BUG; this was filed as ORI-783.

Comment by Prakash Surya (Inactive) [ 21/Sep/12 ]

I was able to roll a build of ZFS with the 'zfs_txg_timeout = 1' setting and still had OOM troubles running both FIO and IOR. I only had a chance to run a single IOR, but I hit the OOM with that one run. Things were a bit better on the single-IO-node FIO test: I was able to run the test for 5 or so iterations, but eventually hit the OOM there too. So this setting might help, judging by the FIO test, but it doesn't appear to be an outright fix. It also didn't seem to hurt performance from what I could tell, which was an open question.

The intent of LU-744 is a good way to tackle this, by limiting the amount of data cached by Lustre, but I don't think the patch is working in its current form. Even with http://review.whamcloud.com/4012, we still see the OOMs; the only change is that "Inactive" pages are no longer showing up in meminfo and vmstat (which makes me think change 4012 only affects the accounting and not the actual memory usage).

Comment by Andreas Dilger [ 22/Sep/12 ]

My expectation was that reducing the short ping time wouldn't be needed since we have a constant stream of writes/reads to the server for the last committed ID to piggyback on. This should keep the client memory usage constrained. The OOM happens after the writes complete and we are starting the reads.

I think this is exactly when the problem will manifest itself. When the writes stop, there will no longer be a stream of write RPCs to update the last_committed value, so those pages just sent will be pinned in the client memory until the first ping (or other RPC) is completed after the commit.

It would be interesting to see if adding a ~10s pause between the read and write phases (for the ping to complete and pages to be freed) avoids the OOM on the client?

Oleg and I had previously discussed a potential workaround for this memory pinning problem on the client. Basically, the client would send some kind of "soft sync" flag to the OST when doing write RPCs as it begins to get low on memory (e.g. 1/2 of memory is pinned). The OST will determine whether to begin a journal commit based on the number of clients requesting the soft sync (e.g. immediately if this is the only client doing IO), and the time since its last commit.

This will dynamically shorten the sync interval on the server to avoid pinning pages longer, and the client must also reduce its ping interval while pages are pinned in its memory.

With a hard 1s commit interval on the OST and a 1s short ping interval, this should simulate the above optimization and determine whether it would actually fix the problem if implemented.
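
A toy sketch of the client-side trigger described above (hypothetical names and thresholds; the actual heuristics, including the server-side decision, would be part of the eventual patches):

    #include <linux/types.h>

    /*
     * Hypothetical client-side check: once more than half of the allowed
     * unstable (sent-but-uncommitted) pages are pinned, subsequent write
     * RPCs would carry a "soft sync" flag asking the OST to start a
     * journal/txg commit sooner than its normal interval.
     */
    static bool should_request_soft_sync(unsigned long unstable_pages,
                                         unsigned long unstable_limit)
    {
            return unstable_pages > unstable_limit / 2;
    }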

Comment by Prakash Surya (Inactive) [ 24/Sep/12 ]

I think this is exactly when the problem will manifest itself. When the writes stop, there will no longer be a stream of write RPCs to update the last_committed value, so those pages just sent will be pinned in the client memory until the first ping (or other RPC) is completed after the commit.

But wouldn't the write RPC piggyback the last_committed value too? Even though the writes stopped, we will then send read RPCs (at least on my write/read FIO test).

It would be interesting to see if adding a ~10s pause between the read and write phases (for the ping to complete and pages to be freed) avoids the OOM on the client?

I tried this and still saw the OOMs. But, with that said, I was able to see the OOMs under a write only test case as well. For some reason (maybe "fsync on close"?), there is a large spike in inactive file pages at the end of the write portion whether reads immediately follow it or not.

With a hard 1s commit interval on the OST, and short ping interval of 1s this should simulate the above optimization to determine if it would actually fix the problem if implemented.

I changed the sync interval in ZFS to 1s and still hit the issue. I didn't change any settings in Lustre though. Are there some tunables I can tweak to set the commit interval on the OSTs (is ZFS sync every 1s enough?) and to set the 1s short ping on the client?

Comment by Jinshan Xiong (Inactive) [ 24/Sep/12 ]

But wouldn't the write RPC piggyback the last_committed value too? Even though the writes stopped, we will then send read RPCs (at least on my write/read FIO test).

As you can see, there is a large spike of free memory between the tests. Sometimes it can fail, really depending on when the read starts. Setting the interval between the write and read phases to 10s or longer can make it easier to hit this issue.

I changed the sync interval in ZFS to 1s and still hit the issue. I didn't change any settings in Lustre though. Are there some tunables I can tweak to set the commit interval on the OSTs (is ZFS sync every 1s enough?) and to set the 1s short ping on the client?

You need to patch the client, but it's really easy: in obd_support.h, change PING_INTERVAL_SHORT to 1 and that's it. I have no idea about the ZFS txg writeback interval.
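
For example, the change could look like this (illustrative only; the stock expression for PING_INTERVAL_SHORT in lustre/include/obd_support.h is derived from PING_INTERVAL and varies between releases):

    /* lustre/include/obd_support.h -- illustrative only; the original
     * definition of PING_INTERVAL_SHORT is derived from PING_INTERVAL
     * and differs between releases. */
    #define PING_INTERVAL_SHORT 1   /* seconds; send the short ping after 1s */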

Comment by Prakash Surya (Inactive) [ 24/Sep/12 ]

Jinshan, in the current state of LU-744, does it remove pages from the LRU that are waiting to be committed on the server? My expectation of the patch is that, given a 4G max, if 4G of write RPCs are issued to the server but uncommitted, the client would not be able to dirty any more pages until the in-flight requests are committed and then removed from the client LRU. Is this not the case? It appears that the in-flight data is removed from the LRU, allowing more pages to be used even though the client's max has already been exceeded and the pages are sitting in an unreclaimable state awaiting commit completion.

Comment by Prakash Surya (Inactive) [ 24/Sep/12 ]

You need to patch the client, but it's really easy: in obd_support.h, change PING_INTERVAL_SHORT to 1 and that's it. I have no idea about the ZFS txg writeback interval.

OK. I'll try this out.

Comment by Jinshan Xiong (Inactive) [ 24/Sep/12 ]

Jinshan, in the current state of LU-744, does it remove pages from the LRU that are waiting to be committed on the server? My expectation of the patch is that, given a 4G max, if 4G of write RPCs are issued to the server but uncommitted, the client would not be able to dirty any more pages until the in-flight requests are committed and then removed from the client LRU. Is this not the case?

No, it isn't. Actually this is not a good way to go if it lacks a way to notify the OSTs to commit the data ASAP because otherwise the client may have to wait up to 7 seconds before it can allocate a new page for IO.

Comment by Ned Bass [ 24/Sep/12 ]

I accidentally hit "i" and assigned the ticket to myself. Giving it back to Jinshan.

Comment by Prakash Surya (Inactive) [ 24/Sep/12 ]

No, it isn't. Actually this is not a good way to go if it lacks a way to notify the OSTs to commit the data ASAP because otherwise the client may have to wait up to 7 seconds before it can allocate a new page for IO.

What benefit does the LU-744 patch provide, then, if the client can have an unlimited amount of uncommitted data in its cache? I guess I'm misunderstanding the purpose of the limit. Sure, we're capping the number of pages that are dirty or in writeback, but this doesn't limit the total amount of data Lustre pins on the system, which leads to the issue we see here.

Comment by Andreas Dilger [ 24/Sep/12 ]

Prakash, a few reasons we haven't seen this problem in the past:

  • async journal commit is relatively new
  • the client couldn't fill up all of the cache in a few seconds
  • most testing is done with multiple clients (less per-client bandwidth, more journal commits)
Comment by Prakash Surya (Inactive) [ 24/Sep/12 ]

async journal commit is relatively new

From my understanding we've essentially been running this way on ZFS the whole time. Not because Lustre specifically supported it, but because of the way ZFS defers its txg sync until the write limit is hit or the timeout is triggered. But of course, Lustre may be doing something new which I'm unaware of, making the situation worse than before.

the client couldn't fill up all of the cache in a few seconds

I think this is the main reason we haven't seen this previously. We saw OOMs while chasing ORI-768 sporadically running the same tests, and could never really nail down a cause for them.

Now on the rebased client code, we're seeing much better rates from a single client which I believe is compounding the issue (previously the best I remember seeing was around 1.7 GB/s per IO node, now I see 2.6-2.9 GB/s as reported by ltop). So that extra 5-7.5 GB of uncommitted data (1-1.5 GB/s increase * 5s for ZFS txg sync interval) is probably stressing the VM in ways it wasn't previously.

I want to try setting the short ping interval and the zfs sync interval each to 1s to see if that alleviates the situation, but I don't think I'll be able to test that out until later this week.

most testing is done with multiple clients (less per-client bandwidth, more journal commits)

Sure. Our internal testing hasn't changed between testing the rebased vs. non-rebased Orion branch, though, so I'm more inclined to pin this issue on the increased per-IO-node bandwidth.

Comment by Johann Lombardi (Inactive) [ 26/Sep/12 ]

FWIW, we used to work on a patch which accounted pages held for replay under NR_UNSTABLE_NFS (might still be useful) and sent OST_SYNC RPCs when short of memory (we probably don't want to do that any more with ZFS). See bugzilla 23529 (the patch is against 1.8) for more information.

Comment by Prakash Surya (Inactive) [ 26/Sep/12 ]

Interesting, thanks for the pointer to that bug. Did any of those patches land on an official 1.8 release? It's hard to tell from the comments.

Comment by Johann Lombardi (Inactive) [ 26/Sep/12 ]

No, none of the patches have been landed.

Comment by Prakash Surya (Inactive) [ 26/Sep/12 ]

Johann, you say we probably wouldn't want to send OST_SYNC RPCs with ZFS. Why is that?

Comment by Brian Behlendorf [ 01/Oct/12 ]

Last week at LAD12 there was some discussion of this issue, which is now well understood. This isn't a new issue, but due to a variety of reasons, such as...

  • Relatively small client memory,
  • Improved client performance
  • Enabled async journal commits

we're now regularly able to reproduce this on Sequoia client I/O nodes. It has existed since the async journal feature was added. My understanding is that the preferred fix goes like this:

  • LU-744 currently only tracks clean pages and doesn't distinguish between unstable and stable pages. Limiting clean pages on a client can be useful and this change should be merged to master. This patch is working as expected for us and others, and it's going to add some of the basic infrastructure the following suggested changes need. However, it doesn't address our issue, which is caused by the accumulation of unstable clean pages on the clients.
  • To resolve the root cause we must limit unstable clean pages on the client. We should put together a patch (or several) which do the following:
    • Track all clean unstable pages. For the moment, it's probably easiest to track these per filesystem as was done with LU-744. However, long term I do feel a global limit is preferable.
    • For increased visibility these pages should additionally be tracked as NR_UNSTABLE_NFS which can be seen in /proc/meminfo
    • A limit on the maximum number of unstable clean pages should be set (perhaps 1/4 of memory by default). All new writes must block until there is cache space available.
    • Prior to reaching this limit (perhaps 1/2 way to the limit) the client should issue a 'soft sync' to all the servers with unstable clean pages. This soft sync can be handled asynchronously and the servers may optionally ignore it if they are already in the process of a sync. The ability to ignore the sync is particularly important for the ZFS OSDs to avoid thrashing.

Further optimization is possible but this is a good start to resolve our OOM issues.
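
A rough sketch of the per-page accounting described in the list above, borrowing the mechanism NFS uses (hypothetical helper names; the real patches also keep a per-filesystem counter used for the limit and the soft-sync trigger):

    #include <linux/mm.h>
    #include <linux/vmstat.h>

    /*
     * Hypothetical helpers: account pages that have been sent in a write
     * RPC but not yet committed on the OST under NR_UNSTABLE_NFS, the
     * same zone counter NFS uses, so they show up as "NFS_Unstable:" in
     * /proc/meminfo and are seen by balance_dirty_pages().
     */
    static void account_page_unstable(struct page *page)
    {
            inc_zone_page_state(page, NR_UNSTABLE_NFS);
    }

    /* Called once the server reports the covering transno as committed. */
    static void unaccount_page_unstable(struct page *page)
    {
            dec_zone_page_state(page, NR_UNSTABLE_NFS);
    }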

Comment by Prakash Surya (Inactive) [ 10/Oct/12 ]

Please see: http://review.whamcloud.com/4245

Comment by Jinshan Xiong (Inactive) [ 16/Nov/12 ]

Hi Prakash, can you please take a look at the LU-744 comment added by Frederik Ferner on 13/Nov/12 8:44 AM? There was strange behavior where IO only lasted for a couple of seconds, then stalled for ~20 seconds, then restarted. I suspect that it is related to our patches.

Comment by Prakash Surya (Inactive) [ 19/Nov/12 ]

Jinshan, please see my comment on LU-744. What is described in your dialogue with Frederik looks normal and expected; details are provided in my comment.

Comment by Jinshan Xiong (Inactive) [ 19/Nov/12 ]

Hi Prakash, Lustre doesn't rely on balance_dirty_pages() to write back dirty pages because it has its own IO engine; also, your patch should work with old servers anyway.

Comment by Prakash Surya (Inactive) [ 19/Nov/12 ]

Lustre doesn't rely on balance_dirty_pages() to write back dirty pages because it has its own IO engine

OK, that's fine. The kernel still calls balance_dirty_pages when IO is submitted though. For example:

    fio           S 00000fffae72633c     0 59338  59283 0x00000000
    Call Trace:
    [c0000003e0deed20] [c0000003e0deede0] 0xc0000003e0deede0 (unreliable)
    [c0000003e0deeef0] [c000000000008e10] .__switch_to+0xc4/0x100
    [c0000003e0deef80] [c00000000042b0e0] .schedule+0x858/0x9c0
    [c0000003e0def230] [c00000000042b7c8] .schedule_timeout+0x1f8/0x240
    [c0000003e0def310] [c00000000042a444] .io_schedule_timeout+0x54/0x98
    [c0000003e0def3a0] [c00000000009ddfc] .balance_dirty_pages+0x294/0x390
    [c0000003e0def520] [c000000000095a2c] .generic_file_buffered_write+0x268/0x354
    [c0000003e0def660] [c000000000096074] .__generic_file_aio_write+0x374/0x3d8
    [c0000003e0def760] [c000000000096150] .generic_file_aio_write+0x78/0xe8
    [c0000003e0def810] [8000000006a7062c] .vvp_io_write_start+0xfc/0x3e0 [lustre]
    [c0000003e0def8e0] [800000000249a81c] .cl_io_start+0xcc/0x220 [obdclass]
    [c0000003e0def980] [80000000024a2634] .cl_io_loop+0x194/0x2c0 [obdclass]
    [c0000003e0defa30] [80000000069ea208] .ll_file_io_generic+0x498/0x670 [lustre]
    [c0000003e0defb30] [80000000069ea864] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
    [c0000003e0defc00] [80000000069eab80] .ll_file_write+0x150/0x320 [lustre]
    [c0000003e0defce0] [c0000000000d1e9c] .vfs_write+0xd0/0x1c4
    [c0000003e0defd80] [c0000000000d208c] .SyS_write+0x54/0x98
    [c0000003e0defe30] [c000000000000580] syscall_exit+0x0/0x2c

By properly accounting these pages under NR_UNSTABLE_NFS, we still run into the dirty page limits in that function.
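For reference, the dirty throttling in that era of kernel folds NR_UNSTABLE_NFS into the dirty page total before comparing it with the dirty threshold, which is why the accounting matters even though the pages are clean from the client's point of view. A toy model of that check (paraphrased, not the exact kernel source):

    /*
     * Toy model of the comparison balance_dirty_pages() makes (paraphrased
     * from the 2.6.32-era logic, not the exact kernel source): pages counted
     * as NR_UNSTABLE_NFS are added to the reclaimable total, so a writer gets
     * throttled even though its pages are clean on the client.
     */
    #include <stdbool.h>
    #include <stdio.h>

    static bool should_throttle(unsigned long nr_dirty, unsigned long nr_unstable,
                                unsigned long nr_writeback, unsigned long dirty_thresh)
    {
        unsigned long nr_reclaimable = nr_dirty + nr_unstable;

        return nr_reclaimable + nr_writeback > dirty_thresh;
    }

    int main(void)
    {
        /* Example numbers only: few dirty pages, but many unstable ones. */
        printf("throttle: %d\n",
               should_throttle(1000, 500000, 2000, 100000));
        return 0;
    }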

also, your patch should work with old servers anyway

And it does work with old servers, AFAIK. It might bump up against the dirty page limits, thus stalling any outstanding IO, but that's better than an OOM event.

Comment by Prakash Surya (Inactive) [ 29/Nov/12 ]

Regarding Brian's comment:

  • Track all clean unstable pages. For the moment, it's probably easiest to track these per filesystem as was done with LU-744. However, long term I feel a global limit is preferable.
  • For increased visibility, these pages should additionally be tracked as NR_UNSTABLE_NFS, which can be seen in /proc/meminfo.
  • A limit on the maximum number of unstable clean pages should be set (perhaps 1/4 of memory by default). All new writes must block until cache space is available.
  • Prior to reaching this limit (perhaps halfway to it) the client should issue a 'soft sync' to all the servers holding unstable clean pages. This soft sync can be handled asynchronously, and the servers may optionally ignore it if they are already in the process of a sync. The ability to ignore the sync is particularly important for the ZFS OSDs to avoid thrashing.

We've done some preliminary testing on Sequoia at scale using my LU-2139 patch stack (4245, 4374, and 4375), and stability and performance are much better as a result. Write workloads are much more sustainable and predictable, and the number of unstable pages on the system due to Lustre fluctuates around 2G (the point at which soft syncs are sent to the OSTs). Prior to the patch stack being applied, the number of unstable pages resulting from Lustre IO would easily grow beyond 10G, causing the OOMs.
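As a side note, since the patches account these pages as NR_UNSTABLE_NFS, the count shows up as the "Unstable:" line in /proc/meminfo, so figures like the ~2G above can be watched during a run without any Lustre-specific interface. A minimal monitor, assuming only that standard field:

    /*
     * Minimal monitor for the "Unstable:" field in /proc/meminfo, which
     * reflects NR_UNSTABLE_NFS on kernels of this era. Handy for watching
     * how many uncommitted pages a client is holding during a write test.
     */
    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("/proc/meminfo", "r");
        char line[128];
        unsigned long unstable_kb = 0;

        if (fp == NULL) {
            perror("fopen /proc/meminfo");
            return 1;
        }

        while (fgets(line, sizeof(line), fp) != NULL) {
            if (sscanf(line, "Unstable: %lu kB", &unstable_kb) == 1) {
                printf("Unstable (NR_UNSTABLE_NFS): %lu kB\n", unstable_kb);
                break;
            }
        }

        fclose(fp);
        return 0;
    }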

Comment by Erich Focht [ 17/Apr/13 ]

Applied the patch stack (4245, 4374, 4375) to a 2.6.32-358.0.1.el6.x86_64 kernel. Getting soft lockups when unmounting. The servers run Lustre 2.1.5.

Pid: 3792, comm: umount Not tainted 2.6.32-358.0.1.el6.x86_64 #1 Supermicro X9DRT/X9DRT
RIP: 0010:[<ffffffffa0b65e2c>]  [<ffffffffa0b65e2c>] ll_put_super+0x10c/0x510 [lustre]
RSP: 0018:ffff881016853d28  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff881016853e28 RCX: ffff881027b6c110
RDX: ffff88102d7d1540 RSI: 000000000000005a RDI: ffff88104a73fc00
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: ffff881049e980c0
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff881016853d98 R14: ffff88106ebfba00 R15: ffff881027afa044
FS:  00002ab82d4c7740(0000) GS:ffff88089c520000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ab82d167360 CR3: 000000102d0ca000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount (pid: 3792, threadinfo ffff881016852000, task ffff88102d7d1540)
Stack:
 ffffffff81fcb700 01ff88086e17b148 ffff88102c932400 ffff88102c932000
<d> ffff88102d7d1540 ffff88106ebfba00 ffff88086e17b350 ffff88086e17b138
<d> ffff881016853d88 ffffffff8119cccf ffff88086e17b138 ffff88086e17b138
Call Trace:
 [<ffffffff8119cccf>] ? destroy_inode+0x2f/0x60
 [<ffffffff8119d19c>] ? dispose_list+0xfc/0x120
 [<ffffffff8119d596>] ? invalidate_inodes+0xf6/0x190
 [<ffffffff8118334b>] ? generic_shutdown_super+0x5b/0xe0
 [<ffffffff81183436>] ? kill_anon_super+0x16/0x60
 [<ffffffffa06e82ea>] ? lustre_kill_super+0x4a/0x60 [obdclass]
 [<ffffffff81183bd7>] ? deactivate_super+0x57/0x80
 [<ffffffff811a1c4f>] ? mntput_no_expire+0xbf/0x110
 [<ffffffff811a26bb>] ? sys_umount+0x7b/0x3a0
 [<ffffffff810863b1>] ? sigprocmask+0x71/0x110
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Code: c0 4c 89 a5 10 ff ff ff 4c 8d b5 40 ff ff ff 48 89 95 20 ff ff ff 49 89 cd 49 89 d4 0f 85 ad 00 00 00 45 85 ff 0f 85 a4 00 00 00 <8b> 83 0c 01 00 00 85 c0 74 f6 4c 89 f7 e8 72 f3 a4 ff 4c 89 f6
Call Trace:
 [<ffffffffa0b65dc7>] ? ll_put_super+0xa7/0x510 [lustre]
 [<ffffffff8119cccf>] ? destroy_inode+0x2f/0x60
 [<ffffffff8119d19c>] ? dispose_list+0xfc/0x120
 [<ffffffff8119d596>] ? invalidate_inodes+0xf6/0x190
 [<ffffffff8118334b>] ? generic_shutdown_super+0x5b/0xe0
 [<ffffffff81183436>] ? kill_anon_super+0x16/0x60
 [<ffffffffa06e82ea>] ? lustre_kill_super+0x4a/0x60 [obdclass]
 [<ffffffff81183bd7>] ? deactivate_super+0x57/0x80
 [<ffffffff811a1c4f>] ? mntput_no_expire+0xbf/0x110
 [<ffffffff811a26bb>] ? sys_umount+0x7b/0x3a0
 [<ffffffff810863b1>] ? sigprocmask+0x71/0x110
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
BUG: soft lockup - CPU#25 stuck for 67s! [umount:3792]

Comment by Prakash Surya (Inactive) [ 19/Apr/13 ]

efocht: please make sure you are using patch-set 30 of http://review.whamcloud.com/4245. The earlier patch-sets had deficiencies, and patch-set 29 specifically had a bug causing umounts to hang in ll_put_super (which looks like the problem you are having).

Comment by Oleg Drokin [ 05/May/13 ]

Hit an assertion introduced by the first patch in the series:
"LustreError: 17404:0:(osc_cache.c:1774:osc_dec_unstable_pages()) ASSERTION( atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0 ) failed"

See LU-3274 for more details.

Comment by Peter Jones [ 22/Sep/13 ]

Lai

Could you please take care of refreshing Jinshan's patches so they are suitable for landing to master?

Thanks

Peter

Comment by Peter Jones [ 02/Oct/13 ]

This remains a support priority to get these patches refreshed and landed, but the patches are not presently ready for consideration for inclusion in 2.5.0.

Comment by Prakash Surya (Inactive) [ 02/Oct/13 ]

I've just refreshed these three patches onto HEAD of master:

1) http://review.whamcloud.com/6284
2) http://review.whamcloud.com/4374
3) http://review.whamcloud.com/4375

It would be nice to get some feedback on them; we've been running previous versions of these patches on Sequoia/Grove for nearly a year now.

Comment by Peter Jones [ 03/Oct/13 ]

Prakash

Thanks for the refresh. Yes, we will review these patches shortly

Peter

Comment by Christopher Morrone [ 12/Nov/13 ]

Remaining:

http://review.whamcloud.com/5935

Comment by Peter Jones [ 09/Dec/13 ]

All patches now landed to master

Comment by Jian Yu [ 07/Nov/14 ]

Here are the back-ported patches for Lustre b2_5 branch:

  1. http://review.whamcloud.com/12604 (from http://review.whamcloud.com/6284)
  2. http://review.whamcloud.com/12605 (from http://review.whamcloud.com/4374)
  3. http://review.whamcloud.com/12606 (from http://review.whamcloud.com/4375)
  4. http://review.whamcloud.com/12612 (from http://review.whamcloud.com/5935)