[LU-3018] small reads occur during bulk writes hurting overall performance Created: 22/Mar/13  Updated: 09/May/14  Resolved: 09/May/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Andrew Perepechko Assignee: Cliff White (Inactive)
Resolution: Won't Fix Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 7341

 Description   

During an obdfilter-survey run on the OSS node, or an IOR write test from a client, small reads occur in the middle of bulk writes, which hurts overall OST performance.
Based on the block trace and dumpe2fs output, these reads fetch ldiskfs block bitmaps.

A patch will be uploaded shortly.
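For context, a minimal sketch of where such reads originate, modelled on the 2.6.32-era ext4/ldiskfs bitmap read path (the helper name below is hypothetical, not the actual ldiskfs code or the posted patch): on a cache miss, the allocator issues a synchronous one-block read of the group bitmap right in the middle of the write path.

/*
 * Sketch only, loosely following ext4_read_block_bitmap():
 * each cold bitmap buffer costs a small synchronous read
 * in the middle of the bulk-write path.
 */
static struct buffer_head *read_group_bitmap(struct super_block *sb,
					     ext4_group_t group)
{
	struct ext4_group_desc *desc = ext4_get_group_desc(sb, group, NULL);
	struct buffer_head *bh;

	if (!desc)
		return NULL;

	bh = sb_getblk(sb, ext4_block_bitmap(sb, desc));
	if (!bh)
		return NULL;

	lock_buffer(bh);
	if (!buffer_uptodate(bh)) {
		/* this is the "small read" visible in the block trace */
		bh->b_end_io = end_buffer_read_sync;
		get_bh(bh);
		submit_bh(READ, bh);
		wait_on_buffer(bh);
		if (!buffer_uptodate(bh)) {
			brelse(bh);
			return NULL;
		}
	} else {
		unlock_buffer(bh);
	}
	return bh;
}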



 Comments   
Comment by Andrew Perepechko [ 22/Mar/13 ]

http://review.whamcloud.com/#change,5811

Comment by Andrew Perepechko [ 22/Mar/13 ]

The logs showing how bitmap pages are evicted:

 [<ffffffff81135c28>] ? __remove_mapping+0xd8/0x160
 [<ffffffff81136b7d>] ? shrink_page_list.clone.0+0x47d/0x5e0
 [<ffffffff81136fd0>] ? shrink_inactive_list+0x2f0/0x730
 [<ffffffffa04e90fd>] ? cfs_hash_rw_unlock+0x1d/0x30 [libcfs]
 [<ffffffffa04e7ac4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [<ffffffff8113821f>] ? shrink_zone+0x38f/0x510
 [<ffffffff8109cc99>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff8113849e>] ? do_try_to_free_pages+0xfe/0x520
 [<ffffffff81138abf>] ? try_to_free_pages+0x9f/0x130
 [<ffffffff81139c40>] ? isolate_pages_global+0x0/0x380
 [<ffffffff811301a7>] ? __alloc_pages_nodemask+0x447/0x920
 [<ffffffff81164e2a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff8111ccf7>] ? __page_cache_alloc+0x87/0x90
 [<ffffffff8111db0f>] ? find_or_create_page+0x4f/0xb0

 [<ffffffff81135c28>] ? __remove_mapping+0xd8/0x160
 [<ffffffff81135cc6>] ? remove_mapping+0x16/0x30
 [<ffffffff81134bf2>] ? invalidate_inode_page+0x82/0xb0
 [<ffffffff81134efa>] ? invalidate_mapping_pages+0xda/0x150
 [<ffffffff814fb8eb>] ? _spin_unlock+0x2b/0x40
 [<ffffffff811a2ec0>] ? shrink_icache_memory+0x1c0/0x2e0
 [<ffffffff811a2f95>] ? shrink_icache_memory+0x295/0x2e0
 [<ffffffff811362ed>] ? shrink_slab+0x14d/0x1b0
 [<ffffffff8113963d>] ? balance_pgdat+0x5ad/0x810
 [<ffffffff81139c40>] ? isolate_pages_global+0x0/0x380
 [<ffffffff811399e4>] ? kswapd+0x144/0x3a0

 [<ffffffff81139deb>] isolate_pages_global+0x1ab/0x380
 [<ffffffff81136d99>] ? shrink_inactive_list+0xb9/0x730
 [<ffffffff81136e42>] shrink_inactive_list+0x162/0x730
 [<ffffffffa04e90fd>] ? cfs_hash_rw_unlock+0x1d/0x30 [libcfs]
 [<ffffffffa04e7ac4>] ? cfs_hash_dual_bd_unlock+0x34/0x60 [libcfs]
 [<ffffffffa04e9c12>] ? cfs_hash_lookup+0x82/0xa0 [libcfs]
 [<ffffffffa06a20f5>] ? cl_env_fetch+0x25/0x80 [obdclass]
 [<ffffffff8113821f>] shrink_zone+0x38f/0x510
 [<ffffffff811397a9>] balance_pgdat+0x719/0x810
 [<ffffffff81139c40>] ? isolate_pages_global+0x0/0x380
 [<ffffffff811399e4>] kswapd+0x144/0x3a0

Note that these paths do not pass through shrink_active_list().

The I_NEW flag is needed to avoid the following code path:

 [<ffffffff81134efa>] ? invalidate_mapping_pages+0xda/0x150
 [<ffffffff814fb8eb>] ? _spin_unlock+0x2b/0x40
 [<ffffffff811a2ec0>] ? shrink_icache_memory+0x1c0/0x2e0
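
For reference, this is roughly why I_NEW helps; the snippet below is a reconstruction of the shape of the 2.6.32-era prune_icache(), not a verbatim excerpt: any inode with a nonzero i_state is skipped, so its pages are never handed to invalidate_mapping_pages() by the icache shrinker.

/* Abridged 2.6.32-style prune_icache() (reconstructed, not verbatim) */
static void prune_icache(int nr_to_scan)
{
...
	inode = list_entry(inode_unused.prev, struct inode, i_list);
	if (inode->i_state || atomic_read(&inode->i_count)) {
		/* I_NEW keeps i_state nonzero, so the bitmap inode is
		 * moved back and never reaches the eviction below */
		list_move(&inode->i_list, &inode_unused);
		continue;
	}
	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
...
		if (remove_inode_buffers(inode))
			reap += invalidate_mapping_pages(&inode->i_data,
							 0, -1);
...
	}
...
}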
Comment by Andrew Perepechko [ 22/Mar/13 ]

Xyratex-bug-id: MRP-691

Comment by James A Simmons [ 22/Mar/13 ]

Would you mind if I updated the patch to support SLES11 SP2 as well?

Comment by Andrew Perepechko [ 22/Mar/13 ]

Hello James!
That would be very nice.
Thank you.

Comment by Andrew Perepechko [ 23/Mar/13 ]

Using mark_page_accessed() is not enough to avoid page eviction.

find_or_create_page() allocates a page and, via add_to_page_cache_lru(), links it into the corresponding per-CPU LRU buffer (pagevec).

struct page *find_or_create_page(struct address_space *mapping,
                pgoff_t index, gfp_t gfp_mask)
{
        struct page *page;
        int err;
repeat:
        page = find_lock_page(mapping, index);
        if (!page) {
                page = __page_cache_alloc(gfp_mask);
                if (!page)
                        return NULL;
                /*
                 * We want a regular kernel memory (not highmem or DMA etc)
                 * allocation for the radix tree nodes, but we need to honour
                 * the context-specific requirements the caller has asked for.
                 * GFP_RECLAIM_MASK collects those requirements.
                 */
                err = add_to_page_cache_lru(page, mapping, index,
                        (gfp_mask & GFP_RECLAIM_MASK));
...
}
void __lru_cache_add(struct page *page, enum lru_list lru)
{
        struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];

        page_cache_get(page);
        if (!pagevec_add(pvec, page))
                ____pagevec_lru_add(pvec, lru);
        put_cpu_var(lru_add_pvecs);
}

Note that ____pagevec_lru_add(), which calls SetPageLRU(), is only invoked once the per-CPU pagevec is full.

mark_page_accessed() activates the page only if it is already on the LRU. Otherwise, the page is merely marked or kept referenced:

void mark_page_accessed(struct page *page)
{
        if (!PageActive(page) && !PageUnevictable(page) &&
                        PageReferenced(page) && PageLRU(page)) {
                activate_page(page);
                ClearPageReferenced(page);
        } else if (!PageReferenced(page)) {
                SetPageReferenced(page);
        }
}

shrink_inactive_list(), however, drains the per-CPU buffer (via lru_add_drain()) and can then evict the pages no matter how many times mark_page_accessed() was called:

static unsigned long shrink_inactive_list(unsigned long max_scan,
                        struct zone *zone, struct scan_control *sc,
                        int priority, int file)
{
        LIST_HEAD(page_list);
        struct pagevec pvec;
        unsigned long nr_scanned = 0;
        unsigned long nr_reclaimed = 0;
        unsigned long nr_dirty = 0;
        unsigned long nr_writeback = 0;
        struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

        while (unlikely(too_many_isolated(zone, file, sc))) {
                congestion_wait(BLK_RW_ASYNC, HZ/10);

                /* We are about to die and free our memory. Return now. */
                if (fatal_signal_pending(current))
                        return SWAP_CLUSTER_MAX;
        }

        pagevec_init(&pvec, 1);

        lru_add_drain();
...
}
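Putting the pieces together, here is a hypothetical illustration (not the patch from the review above) of what a caller would have to do to keep such a page out of shrink_inactive_list()'s reach: drain the per-CPU pagevec so the page actually gets SetPageLRU(), then touch it twice so mark_page_accessed() promotes it to the active list.

/*
 * Hypothetical helper (illustration only, not the change 5811 patch):
 * make sure a freshly created page-cache page ends up on the *active*
 * LRU list, out of shrink_inactive_list()'s immediate reach.
 */
static struct page *grab_and_activate_page(struct address_space *mapping,
					   pgoff_t index)
{
	struct page *page;

	page = find_or_create_page(mapping, index, GFP_NOFS);
	if (!page)
		return NULL;	/* on success the page comes back locked */

	lru_add_drain();	  /* flush this CPU's lru_add_pvecs so
				   * ____pagevec_lru_add() runs SetPageLRU() */
	mark_page_accessed(page); /* 1st call: SetPageReferenced() */
	mark_page_accessed(page); /* 2nd call: PageReferenced + PageLRU
				   * => activate_page() */

	unlock_page(page);
	return page;
}

Note that lru_add_drain() only drains the local CPU's pagevecs, which is sufficient here because find_or_create_page() has just added the page on this CPU.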
Comment by Keith Mannthey (Inactive) [ 25/Mar/13 ]

Can you please post any detailed performance data you have? What environment are you testing in and what results do you see?

Comment by Keith Mannthey (Inactive) [ 21/May/13 ]

Andrew Perepechko, any update?

Comment by Andrew Perepechko [ 22/May/13 ]

Keith Mannthey, there is a lot of ongoing activity on LKML.

Comment by Keith Mannthey (Inactive) [ 22/May/13 ]

That is excellent news.

Comment by Andrew Perepechko [ 27/Sep/13 ]

This ticket should be closed and the long-term solution backported from the vanilla kernel.

Comment by Cliff White (Inactive) [ 09/May/14 ]

Closing ticket per Andrew
