Details
Bug
Resolution: Fixed
Blocker
Lustre 2.0.0
None
3
5034
Description
Hi,
When running parallel applications that map their binary or some of their dynamic libraries from Lustre, CEA frequently encounters Lustre client crashes on Tera-100, with the following sample stack:
=====================================
crash_kexec()
oops_end()
no_context()
__bad_area_nosemaphore()
bad_area()
do_page_fault()
page_fault()
[exception RIP: cl_page_put+29]
vvp_io_fault_fini()
cl_io_fini()
ll_fault()
__do_fault()
handle_pte_fault()
handle_mm_fault()
do_page_fault()
page_fault()
=====================================
Further crash dump analysis clearly indicates that, in the vvp_io_fault_fini() routine, io->u.ci_fault.ft_page is found non-NULL and is therefore passed to cl_page_put(). The problem is that this pointer is not a valid address but a plain integer (or perhaps a timestamp), even though ci_type == CIT_FAULT.
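To illustrate why a NULL check alone does not protect cl_page_put() here, the following is a minimal user-space sketch, not the Lustre code. The struct layout, the enum, and the helper names are hypothetical simplifications. The address bound assumes an x86_64 client with 4-level paging, where kernel addresses lie in the upper canonical half. ft_page is modeled as an integer so that the bogus value is never dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the real Lustre types. */
enum io_type { IO_READ, IO_WRITE, IO_FAULT };

struct fault_io {
    /* A pointer in the real code; an integer here so it is never dereferenced. */
    uint64_t ft_page;
};

struct io {
    enum io_type    ci_type;
    struct fault_io ci_fault;
};

/* Assumption: x86_64, 4-level paging. Kernel addresses start here. */
#define KERNEL_HALF_START UINT64_C(0xffff800000000000)

/* Heuristic only: says whether a value could be a kernel address at all. */
static bool looks_like_kernel_pointer(uint64_t v)
{
    return v >= KERNEL_HALF_START;
}

/* Models the check described in the analysis: only NULL is filtered out. */
static bool fini_would_put_page(const struct io *io)
{
    return io->ci_type == IO_FAULT && io->ci_fault.ft_page != 0;
}

/* True when the finalizer would release a value that cannot be a kernel address. */
static bool fini_would_release_bogus_page(const struct io *io)
{
    return fini_would_put_page(io) &&
           !looks_like_kernel_pointer(io->ci_fault.ft_page);
}
```

With ft_page set to a small integer such as a Unix timestamp (a made-up value like 1300000000), fini_would_put_page() returns true and fini_would_release_bogus_page() flags it. A value in the kernel half of the address space is not flagged. The same range test can serve as a quick sanity check when inspecting ft_page values in a crash dump.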
Note that the customer is running with the fix from LU-122.
This problem is quite disruptive, as it disturbs regular cluster production by preventing normal job launches.
Sebastien.