My word, that's quite a thread. I should've gotten involved a few days ago. It's also degraded a bit, since Linus and Dave are ... getting along well ... in their most recent notes.
I'm going to gather my thoughts/explain a bit here, I know some of this should go on the list, but I've got a lot of catching up to do before I can comment there.
So our workload is lots of threads doing buffered i/o to one file*, and we're spending lots of time on spinlocks in two functions:
*I don't know where Kent gets the idea that multiple writers must mean buffered and direct, but that's a sampling from the middle and I have about another hundred messages to read...
(as seen in Ihara's flame graph https://jira.whamcloud.com/secure/attachment/32772/lustre-ssf.svg)
__add_to_page_cache_locked
and:
__set_page_dirty_nobuffers
The particular spinlock is the one that protects the page cache itself. Used to mapping->tree_lock, but it's not called a tree any more, so the lock has a new name... Same lock, though.
These functions are both modifying the radix tree/xarray (the data structure used for the page cache was renamed and re-APIed in the last year [no functional changes]) that holds the pages for this particular file. The first is adding pages, and the second one is tagging radix tree entries as dirty. (Pages have dirty state, but the radix tree/xarray entries are also tagged as dirty, which is a separate thing, and which allows writeback (mostly writeback) to rapidly find dirty pages. It is this tagging that needs to write the tree.)
In upstream, both of these operations are done a single page at a time, going back several calls.
The dirty tagging side is quite simple to do with a pagevec of pages instead, and not release the lock for each page. This is still tagging one page at a time, but at least it holds the lock. That's what my patch does - Hold the lock around the loop where the pages are added. (The latest version takes advantage of Lustre specific behavior to move several operations outside of the lock, an upstream version would look more like the earlier one... But as Ihara proved, that didn't make much difference anyway.)
The xarray/radix tree has the ability to tag pages in batches, but, bizarrely, it's only exposed with a function that does "if have TAG X, add TAG Y", which is used for marking dirty pages for writeback. There is no way to just batch TAG X (and you can't trick the function either). It's a slightly bizarre choice and would be very easy to fix with a modification of that function.
Briefly put, the adding side is notably more complex to do in batches, mostly because there are a bunch of operations that can potentially fail, and also because the write functions deal with one page at a time, so there's no obvious place to pull in the pages.
The core write function in the kernel (generic_perform_write) is a big loop that's basically:
"ask file system for page at offset X"
(this calls .write_begin from the address space ops...)
"copy data to that page"
"tell file system we're done"
(this calls .write_end)
All one page at a time.
This is the thing Kent open-coded. He doesn't use write_begin/write_end (though he's basically got versions of them).
He does each step in a separate loop over a bunch of pages each time, but he still gets his pages in the page cache one at a time - So if he allowed multiple writers to a file (I'm guessing he doesn't based on his comments), he'd crash in to the same wall we're hitting. He might get better behavior because he calls the page getting stuff in a tight loop (I have a patch for that I want to play with, but it's complicated.)
Like I said, I'll have to pore over that thread carefully, but the straightforward things to do would be:
Allow dirtying with a pagevec
Allow bulk tagging in the xarray interface
The basic problem with the rest is code complexity, and then getting anything else to benefit. Without multiple concurrent writers to a single file (so, range locking rather than the inode mutex), there's not really contention on managing the per-file page tracking... Any benefit experienced by anyone without that would be very small.
Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40347
Subject:
LU-9920vvp: dirty pages with pagevecProject: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b5e6d9840b8515de5102a15a11bc66f95bf25a27