The BUG happened because ll_readahead() was called from ll_readpage() with
mapping == NULL, so the line
inode = mapping->host;
caused the oops.
At the beginning of ll_readpage() page->mapping was not NULL, because the line
struct inode *inode = page->mapping->host;
worked fine. Therefore, page->mapping changed to NULL while ll_readpage() was running.
At first sight it looks like the page is kept locked for the whole duration of ll_readpage().
However, the following race seems to be possible:
Process 1:
ll_readpage() is called via handle_mm_fault()->__do_fault()->ll_fault()->filemap_fault():
it starts with locked page P and calls ll_issue_page_read(), which queues the page for I/O.
Once the page is queued, it can be involved in group I/O triggered by another thread.
Process 2:
obd_trigger_group_io()
triggers I/O for a group of queued pages, including page P. Page P gets unlocked and marked up-to-date as a result.
Process 3:
shrink_slab()->ll_shrink_cache()->llap_shrink_cache()->llap_shrink_cache_internal():
kswapd shrinks slabs and calls ll_shrink_cache() (Lustre's registered slab shrinker), which tries to shrink Lustre pages.
llap_shrink_cache_internal() locks page P and calls ll_truncate_complete_page(), which removes page P from the page cache (page->mapping becomes NULL).
llap_shrink_cache_internal() avoids calling ll_truncate_complete_page() on dirty pages, but it does not skip busy pages, i.e. pages with an elevated reference count (a simplified sketch of this pattern follows the code excerpt below).
Process 1 continues with page->mapping == NULL and calls ll_readahead(), passing NULL as the 3rd argument:
ll_readpage()
...
/* We have just requested the actual page we want, see if we can tack
 * on some readahead to that page's RPC before it is sent. */
if (ll_i2sbi(inode)->ll_ra_info.ra_max_pages_per_file)
        ll_readahead(&fd->fd_ras, exp, page->mapping, oig,
                     fd->fd_flags);
ll_readahead()
{
...
inode = mapping->host; <<<== NULL pointer dereference happens here, since mapping is NULL
...
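For clarity, here is a simplified sketch of the shrinker-side pattern described in Process 3. It is an assumption of how the per-page handling looks, not the actual llap_shrink_cache_internal() code; only ll_truncate_complete_page() and the dirty-page check come from the analysis above:

        if (!trylock_page(page))        /* page lock held elsewhere, skip it */
                continue;
        if (PageDirty(page)) {          /* dirty pages are spared */
                unlock_page(page);
                continue;
        }
        /* no check for an elevated page_count() here: a page that Process 1
         * has queued for group I/O is still torn down, and page->mapping
         * becomes NULL under Process 1's feet */
        ll_truncate_complete_page(page);
        unlock_page(page);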
The dump of stack traces provided by HLRN (https://bugzilla.lustre.org/attachment.cgi?id=33148) contains a number of processes:
- processes handling page faults of Lustre pages, looking like:
0xffff8acffbd72600 839153 838417 1 343 R 0xffff8acffbd72c90 *magic.exe
[<ffffffffa0a9dfda>] ll_readahead+0x4a/0x1d20 [lustre]
[<ffffffffa0aa0ebb>] ll_readpage+0x120b/0x1760 [lustre]
[<ffffffff810b5cf1>] filemap_fault+0x241/0x3d0
[<ffffffffa0aaebd7>] ll_fault+0x3b7/0xdd0 [lustre]
[<ffffffff810ceae7>] __do_fault+0x57/0x520
[<ffffffff810d3459>] handle_mm_fault+0x199/0x430
[<ffffffff8139dcaf>] do_page_fault+0x1bf/0x3e0
[<ffffffff8139b5cf>] page_fault+0x1f/0x30
- processes which entered the Lustre shrink code while trying to allocate memory, and kswapd threads shrinking slabs, with a typical trace like:
0xffff881072408680 7239 2 0 965 D 0xffff881072408d10 kswapd56
[<ffffffff81399094>] thread_return+0x0/0x34c
[<ffffffff810cf589>] mapping_sleep+0x9/0x10
[<ffffffff8139986a>] __wait_on_bit_lock+0x4a/0xa0
[<ffffffff81399939>] out_of_line_wait_on_bit_lock+0x79/0xa0
[<ffffffff810d2309>] unmap_mapping_range+0x149/0x310
[<ffffffffa0aae6c7>] ll_teardown_mmaps+0x67/0x1c0 [lustre]
[<ffffffffa0aa2072>] llap_shrink_cache+0x5a2/0xa40 [lustre]
[<ffffffffa0a83763>] ll_shrink_cache+0x33/0x70 [lustre]
[<ffffffff810c3f8c>] shrink_slab+0x12c/0x190
[<ffffffff810c49d8>] balance_pgdat+0x4c8/0x7b0
[<ffffffff810c4de9>] kswapd+0x129/0x180
[<ffffffff81064366>] kthread+0x96/0xa0
which shows that the race described above is possible.
The proposed fix is to avoid calling ll_truncate_complete_page() on pages that have extra references.
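A minimal sketch of that idea, assuming the check goes into llap_shrink_cache_internal() right before the truncate; the exact baseline reference count is an assumption, not the actual patch:

        /* a page held only by the page cache (plus our own reference and,
         * if present, its private data) has a known baseline count;
         * anything above that means another thread still uses the page */
        if (page_count(page) > 2 + !!PagePrivate(page)) {
                unlock_page(page);      /* busy page, leave it in the cache */
                continue;
        }
        ll_truncate_complete_page(page);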
Close old ticket.