[LU-3298] infinite loop in page fault on mmapped file. Created: 08/May/13  Updated: 09/Oct/21  Resolved: 09/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Oleg Drokin
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 8165

 Description   

The story on this one is not exactly clear yet.
I first noticed this problem because my sanityn mmap testing hung:

== sanityn test 18: mmap sanity check =================================== 17:19:48 (1367961588)
mmap test1: basic mmap operation (PASS, 0.040102s)
mmap test2: MAP_PRIVATE not write back (PASS, 0.011313s)
mmap test3: concurrent mmap ops on two nodes (SKIPPED, 0s)
mmap test4: c1 write to f1 from mmapped f2, c2 write to f1 from mmapped f1 (PASS, 2.1596s)
mmap test5: read/write file to/from the buffer which mmapped to just this file (PASS, 0.060313s)

The bug hit in test 6, where the mmap_sanity process kept running and never finished:

11499 pts/0    R+   1288:17 /home/green/git/lustre-release/lustre/tests/../tests/mmap_sanity -d /mnt/lustre -m /mnt/lustre2 -e 3

Investigation of logs uncovered this:

00000080:00008000:7.0:1367964803.444630:0:11499:0:(vvp_io.c:727:vvp_io_fault_start()) llite: fault and truncate race happened!
00000080:00000001:7.0:1367964803.444630:0:11499:0:(vvp_io.c:731:vvp_io_fault_start()) Process leaving via out (rc=1 : 1 : 0x1)

The check behind that message tests whether page->mapping is NULL, which indicates a truncate.

After digging further, I found that after this return we go all the way back to userspace. The page fault then repeats, and we find this exact same page again when doing old_page = vm_normal_page(vma, address, orig_pte); in do_wp_page() in the kernel.
This is most likely a sign of a page that was truncated due to a lock cancellation (the test itself does not do any truncates) but somehow not removed from the mapping, so we always arrive at it again.
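
To make the suspected cycle easier to follow, here is a minimal user-space sketch of it in C. It is only a toy model under stated assumptions, not Lustre or kernel code: every name in it (toy_page, toy_pte, toy_fault_start, toy_handle_fault) is hypothetical, and the max_attempts cap exists only so the sketch terminates, whereas the real fault path appears to have no such limit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; none of these are the real kernel types. */
struct toy_mapping { int unused; };

struct toy_page {
    struct toy_mapping *mapping;   /* NULL once the page has been truncated */
};

struct toy_pte {
    struct toy_page *page;         /* page this page-table entry still points at */
};

enum toy_fault_result { TOY_FAULT_DONE = 0, TOY_FAULT_RETRY = 1 };

/*
 * Models the check described above: a NULL mapping is treated as a
 * fault/truncate race and the fault is bounced back to be retried.
 */
enum toy_fault_result toy_fault_start(const struct toy_page *page)
{
    if (page->mapping == NULL)
        return TOY_FAULT_RETRY;    /* the log shows this path leaving with rc=1 */
    return TOY_FAULT_DONE;
}

/*
 * Models the retry cycle: every fault finds the page through the page-table
 * entry and calls the fault handler. Returns the number of attempts used,
 * or -1 if the fault never completed within max_attempts (hypothetical cap).
 */
int toy_handle_fault(const struct toy_pte *pte, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        const struct toy_page *page = pte->page;  /* same stale page each time */
        if (toy_fault_start(page) == TOY_FAULT_DONE)
            return attempt;
    }
    return -1;
}
```

With a page whose mapping is set, toy_handle_fault() finishes on the first attempt. With a page whose mapping is NULL but which the toy_pte still references, nothing inside the loop changes the state, so every attempt ends the same way. That matches the behaviour seen here, assuming the page table entry really is left stale.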

This is all the information I have on this problem for now; we'll see if it repeats anywhere.



 Comments   
Comment by Andreas Dilger [ 09/May/13 ]

This seems related to the same stack that I posted in Skype?

Comment by Jinshan Xiong (Inactive) [ 09/May/13 ]

We did some investigation on this issue: one page was truncated but not removed from the page table. So the fault fell into an infinite loop, where it could find the page through the page table but then returned early in vvp_io_fault_start() because page->mapping is NULL.

I'll work out a debug patch.
