Details
Bug
Resolution: Cannot Reproduce
Major
None
Lustre 2.4.0
None
3
8165
Description
The story on this one is not exactly clear yet.
I first noticed this problem because my sanityn mmap testing hung:
== sanityn test 18: mmap sanity check =================================== 17:19:48 (1367961588)
mmap test1: basic mmap operation (PASS, 0.040102s)
mmap test2: MAP_PRIVATE not write back (PASS, 0.011313s)
mmap test3: concurrent mmap ops on two nodes (SKIPPED, 0s)
mmap test4: c1 write to f1 from mmapped f2, c2 write to f1 from mmapped f1 (PASS, 2.1596s)
mmap test5: read/write file to/from the buffer which mmapped to just this file (PASS, 0.060313s)
The hang hit in test 6; the test process was still in the running state:
11499 pts/0 R+ 1288:17 /home/green/git/lustre-release/lustre/tests/../tests/mmap_sanity -d /mnt/lustre -m /mnt/lustre2 -e 3
Investigation of logs uncovered this:
00000080:00008000:7.0:1367964803.444630:0:11499:0:(vvp_io.c:727:vvp_io_fault_start()) llite: fault and truncate race happened!
00000080:00000001:7.0:1367964803.444630:0:11499:0:(vvp_io.c:731:vvp_io_fault_start()) Process leaving via out (rc=1 : 1 : 0x1)
The check behind that message tests whether page->mapping is NULL, which indicates that the page has been truncated.
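As a rough illustration of that check, here is a minimal user-space sketch. The `sketch_page` and `sketch_mapping` types and the function name are hypothetical stand-ins invented for this example; this is not the actual code in vvp_io_fault_start() nor the kernel's struct page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's address_space. */
struct sketch_mapping {
    int unused;
};

/* Hypothetical stand-in for struct page: only the field the check reads. */
struct sketch_page {
    struct sketch_mapping *mapping; /* NULL once the page has been truncated */
};

/*
 * Returns true when the fault handler should give up on this page and
 * let the fault be retried, i.e. the "fault and truncate race" case.
 */
static bool sketch_fault_raced_with_truncate(const struct sketch_page *page)
{
    return page->mapping == NULL;
}
```

In this model a page that still belongs to a mapping passes the check, while a page whose mapping pointer has been cleared is reported as raced and the fault is abandoned.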
After digging into it some more, I found that after this return we go all the way back to userspace; then the page fault repeats and we find this exact same page again when do_wp_page in the kernel does old_page = vm_normal_page(vma, address, orig_pte);.
This is most likely a sign of a page that was truncated due to a lock cancellation (we do not have any truncates as part of the test) but somehow not removed from the mapping, so we always arrive at it anyway.
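The suspected livelock can be modelled in a few lines. This is only a toy user-space model under the assumption stated above (a truncated page that is still reachable from the page table entry); every `model_` name is hypothetical, and the retry cap exists only so the model terminates, whereas the real fault would repeat indefinitely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not kernel types. */
struct model_mapping {
    int unused;
};

struct model_page {
    struct model_mapping *mapping; /* NULL means the page was truncated */
};

struct model_pte {
    struct model_page *page; /* NULL means nothing is mapped at this address */
};

/* One fault attempt; returns true when the fault is resolved. */
static bool model_handle_fault(struct model_pte *pte, struct model_page *fresh)
{
    if (pte->page == NULL)
        pte->page = fresh; /* nothing mapped: install a fresh, valid page */

    /* A truncated page fails the check and the fault must be retried. */
    return pte->page->mapping != NULL;
}

/*
 * Retry the fault as userspace would by re-executing the faulting access.
 * Returns the number of attempts needed, or -1 if the cap was reached.
 */
static int model_fault_until_resolved(struct model_pte *pte,
                                      struct model_page *fresh,
                                      int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (model_handle_fault(pte, fresh))
            return attempt;
    }
    return -1;
}
```

With the entry still pointing at the truncated page, every retry finds the same page and fails the same check, which would match the observed hang. If the entry had been cleared along with the truncation, the first retry would install a fresh page and succeed.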
This is basically all the info I have on this problem for now; we'll see if it repeats anywhere.