[LU-485] Remove CPT_TRANSIENT pages out of radix tree Created: 05/Jul/11  Updated: 08/Jul/11  Resolved: 08/Jul/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Jinshan Xiong (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Rank (Obsolete): 9722

 Description   

In clio, we currently put both cache and transient pages into radix tree, this makes the code complex and ugly. We should remove transient pages out of radix tree.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 05/Jul/11 ]

patch is at http://review.whamcloud.com/1055.

Comment by Johann Lombardi (Inactive) [ 06/Jul/11 ]

hm, it might conflict with the new grant approach in Sequoia. We consume grant for direct I/O so we might need to keep those pages in the radix tree ... need to think further

Comment by Niu Yawei (Inactive) [ 06/Jul/11 ]

Hi, Johann

I don't quite understand why dio need to consume grant?

Storing those disposable 'transient' pages in the radix tree doesn't make sense to me, and it makes unaligned dio hard to be implemented (in LU-247, I made a patch for unaligned dio, which doesn't associate cl_page with the dio user page, and doesn't maintain the radix tree for the user pages as well, seem it conflict with the new grant approach too. )

Comment by Johann Lombardi (Inactive) [ 06/Jul/11 ]

> I don't quite understand why dio need to consume grant?

To prevent DIOs from failing with ENOSPC while the client which sent the request still has available grants.

> Storing those disposable 'transient' pages in the radix tree doesn't make sense to me

I tend to agree. The problem is that the new RPC engine is supposed to use the radix tree to select pages to put together in a BRW, instead of the LRU list.

> and it makes unaligned dio hard to be implemented (in LU-247, I made a patch for unaligned dio, which doesn't associate cl_page with the dio user page, and doesn't maintain the radix tree for the user pages as well, seem it conflict with the new grant approach too. )

I see, let me think about it ...

Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ]

Actually, I tend to think DIO would be fine with radix tree grant. This is because:
1. for caching page, the grants are decreased when the pages are added, that is in osc_page_cache_add() function, where we should check if the grant has already held by block size. These pages must have OBD_BRW_FROM_GRANT set;
2. for DIO pages, they are submitted via cl_io_submit() function and should not have FROM_GRANT set. Do we really have to consume grant under this case? I mean application is able to see ENOSPC error in this case. Even we need to hold grant for DIO pages, this is easy to handle since DIO pages should be contiguous so that we can just deduct grant by block size directly, this may lose some grants but the maximum losing blocks are 2, so this is fine.

Comment by Johann Lombardi (Inactive) [ 06/Jul/11 ]

> Actually, I tend to think DIO would be fine with radix tree grant. This is because:
> 1. for caching page, the grants are decreased when the pages are added, that is in osc_page_cache_add() function, where we should check if the grant has already held by block size. These pages must have OBD_BRW_FROM_GRANT set;

Right.

> 2. for DIO pages, they are submitted via cl_io_submit() function and should not have FROM_GRANT set.

For DIO, grant space is consumed in osc_io_submit_page() via a call to osc_enter_cache_try().

> Do we really have to consume grant under this case?

I don't think this is a strong requirement.

> I mean application is able to see ENOSPC error in this case. Even we need to hold grant for DIO pages,
> this is easy to handle since DIO pages should be contiguous so that we can just deduct grant by block
> size directly, this may lose some grants but the maximum losing blocks are 2, so this is fine.

Unlike b1_8, CLIO uses the same path to send RPCs for buffered & DIO writes (through osc_oap_to_pending()). The result is that a bulk write can have a mix of pages for direct I/O and pages from buffered writes.

Comment by Jinshan Xiong (Inactive) [ 06/Jul/11 ]

>> Unlike b1_8, CLIO uses the same path to send RPCs for buffered & DIO writes (through osc_oap_to_pending()). The result is that a
>> write can have a mix of pages for direct I/O and pages from buffered writes.

Niu is working on a patch where DIO pages are written independently.

Comment by Niu Yawei (Inactive) [ 07/Jul/11 ]

I think it's possible that there are both 'transient' page and cached page at the same time for a given file offset (imagine there are cocurrent dio and buffered read), but we store the 'transient' page and the cached page in the same radix tree with same key(page offset), then I don't see how can we handle the cocurrent dio and buffered read?

I think it has caused trouble for us (see LU-481), to make it worse, there are two level radix trees (for object and sub-object), race can happen in each level of cl_page_find0().

Comment by Johann Lombardi (Inactive) [ 07/Jul/11 ]

> Niu is working on a patch where DIO pages are written independently.

That's good news actually. As long as those pages are not submitted through osc_oap_to_pending(), it is fine with me. We can handle grants for DIO separately.

> I think it's possible that there are both 'transient' page and cached page at the same time
> for a given file offset (imagine there are cocurrent dio and buffered read), but we store
> the 'transient' page and the cached page in the same radix tree with same key(page offset),

We cannot have a transient & a cached page at the same offset in the radix tree (at least, not with the current CLIO code) because ll_direct_rw_pages() looks up the index in the radix tree and uses the cache page for the transfer if there is one. But what you can have is two contiguous pages, one being transient and the other one cached, both bundled in the same bulk write.

Comment by Jinshan Xiong (Inactive) [ 07/Jul/11 ]

to Niu: actually we cannot have transient and caching pages for the same offset at the same time, with current clio implementation or you're working on(where dio pages should be submitted separately). If a node is doing dio and cache write at the same time, the result is undefined, because it may make cache inconsistent, or stale data to be read.

Comment by Niu Yawei (Inactive) [ 08/Jul/11 ]

Will fix it in LU-481.

Generated at Sat Feb 10 01:07:30 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.