[LU-300] Oops in cl_page_put() during execve()/page-fault on a binary mapped from a Lustre-filesystem and executed by a parallel application Created: 10/May/11  Updated: 31/May/11  Resolved: 31/May/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.0.0
Fix Version/s: Lustre 2.1.0

Type: Bug Priority: Blocker
Reporter: Sebastien Buisson (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 5034

 Description   

Hi,

While running parallel applications that map either their binary or some of their dynamic libraries from Lustre, CEA frequently encounters Lustre client crashes on Tera-100, with the following sample stack:

=====================================
crash_kexec()
oops_end()
no_context()
__bad_area_nosemaphore()
bad_area()
do_page_fault()
page_fault()
[exception RIP: cl_page_put+29]
vvp_io_fault_fini()
cl_io_fini()
ll_fault()
__do_fault()
handle_pte_fault()
handle_mm_fault()
do_page_fault()
page_fault()
=====================================

Further crash dump analysis clearly indicates that, in the vvp_io_fault_fini() routine, io->u.ci_fault.ft_page is found to be non-NULL and is therefore passed to cl_page_put(). The problem is that this pointer is not a valid address but a plain integer (or possibly a timestamp), even though ci_type == CIT_FAULT.

I should add that the customer is running with the fix from LU-122.
This problem is quite disruptive, as it interferes with regular cluster production by preventing normal job launches.

Sebastien.



 Comments   
Comment by Peter Jones [ 10/May/11 ]

Oleg

Could you please look into this issue?

Thanks

Peter

Comment by Peter Jones [ 10/May/11 ]

Jinshan will take a look at this one

Comment by Jinshan Xiong (Inactive) [ 10/May/11 ]

This problem is due to an uninitialized cl_io in the page fault path. Please try this patch: http://review.whamcloud.com/530
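
For readers unfamiliar with this class of bug, here is a minimal sketch of the failure mode described above. It does not reproduce the patch at the URL. All demo_* names are hypothetical stand-ins, not the real Lustre definitions, and zeroing the structure in the init helper is an assumed remedy used only for illustration. The point is that a teardown routine which tests a union member for non-NULL is only safe if initialization cleared that member.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; NOT the real Lustre types. */
struct demo_page {
    int refcount;
};

enum demo_io_type { DEMO_CIT_READ, DEMO_CIT_FAULT };

struct demo_io {
    enum demo_io_type ci_type;
    union {
        struct { long pos; long count; } ci_rw;          /* read/write state */
        struct { struct demo_page *ft_page; } ci_fault;  /* page-fault state */
    } u;
};

/* Drop one reference and return the remaining count. */
static int demo_page_put(struct demo_page *page)
{
    assert(page != NULL);
    return --page->refcount;
}

/*
 * Zero the whole structure before setting the type, so that no union
 * member carries stale bytes left over from an earlier use of the
 * same memory (for example an old ci_rw.pos read back as ft_page).
 */
static void demo_io_init(struct demo_io *io, enum demo_io_type type)
{
    memset(io, 0, sizeof(*io));
    io->ci_type = type;
}

/*
 * Teardown for a fault io. It can trust ft_page only because
 * demo_io_init() guarantees it is NULL unless explicitly assigned.
 * Returns 1 if a page reference was released, 0 otherwise.
 */
static int demo_io_fault_fini(struct demo_io *io)
{
    if (io->ci_type == DEMO_CIT_FAULT && io->u.ci_fault.ft_page != NULL) {
        demo_page_put(io->u.ci_fault.ft_page);
        io->u.ci_fault.ft_page = NULL;
        return 1;
    }
    return 0;
}
```

If demo_io_init() skipped the memset, a reused structure could reach demo_io_fault_fini() with leftover bytes in ft_page. That would match the symptom reported in the description, where an integer-like value was passed to cl_page_put().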

Comment by Sebastien Buisson (Inactive) [ 11/May/11 ]

Hi Jinshan,

Thanks for the quick answer. As the customer cluster is in production, I would need at the very least one positive inspection of your patch before I can deliver an emergency fix to CEA.

TIA,
Sebastien.

Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/lclient/glimpse.c
  • lustre/lclient/lcommon_misc.c
  • lustre/llite/rw.c
  • lustre/llite/llite_mmap.c
  • lustre/include/lclient.h
  • lustre/lclient/lcommon_cl.c
  • lustre/llite/file.c
  • lustre/liblustre/rw.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/llite_mmap.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_cl.c
  • lustre/lclient/lcommon_misc.c
  • lustre/liblustre/rw.c
  • lustre/llite/file.c
  • lustre/lclient/glimpse.c
  • lustre/include/lclient.h
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » i686,client,el5,ofa #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/lclient/lcommon_misc.c
  • lustre/include/lclient.h
  • lustre/llite/file.c
  • lustre/lclient/glimpse.c
  • lustre/liblustre/rw.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_cl.c
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/llite_mmap.c
  • lustre/lclient/lcommon_misc.c
  • lustre/include/lclient.h
  • lustre/liblustre/rw.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_cl.c
  • lustre/lclient/glimpse.c
  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/file.c
  • lustre/lclient/glimpse.c
  • lustre/llite/llite_mmap.c
  • lustre/liblustre/rw.c
  • lustre/include/lclient.h
  • lustre/lclient/lcommon_misc.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_cl.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/file.c
  • lustre/llite/llite_mmap.c
  • lustre/include/lclient.h
  • lustre/lclient/lcommon_cl.c
  • lustre/lclient/glimpse.c
  • lustre/liblustre/rw.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_misc.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/rw.c
  • lustre/llite/file.c
  • lustre/lclient/lcommon_misc.c
  • lustre/lclient/glimpse.c
  • lustre/llite/llite_mmap.c
  • lustre/liblustre/rw.c
  • lustre/lclient/lcommon_cl.c
  • lustre/include/lclient.h
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/lclient/lcommon_misc.c
  • lustre/llite/file.c
  • lustre/lclient/glimpse.c
  • lustre/liblustre/rw.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_cl.c
  • lustre/include/lclient.h
  • lustre/llite/llite_mmap.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/file.c
  • lustre/llite/rw.c
  • lustre/llite/llite_mmap.c
  • lustre/lclient/lcommon_misc.c
  • lustre/lclient/glimpse.c
  • lustre/lclient/lcommon_cl.c
  • lustre/include/lclient.h
  • lustre/liblustre/rw.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/lclient/glimpse.c
  • lustre/liblustre/rw.c
  • lustre/llite/rw.c
  • lustre/llite/llite_mmap.c
  • lustre/include/lclient.h
  • lustre/lclient/lcommon_misc.c
  • lustre/lclient/lcommon_cl.c
  • lustre/llite/file.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » i686,server,el5,ofa #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/llite_mmap.c
  • lustre/liblustre/rw.c
  • lustre/llite/file.c
  • lustre/lclient/glimpse.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_cl.c
  • lustre/include/lclient.h
  • lustre/lclient/lcommon_misc.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/include/lclient.h
  • lustre/llite/llite_mmap.c
  • lustre/lclient/lcommon_cl.c
  • lustre/lclient/glimpse.c
  • lustre/llite/rw.c
  • lustre/llite/file.c
  • lustre/liblustre/rw.c
  • lustre/lclient/lcommon_misc.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/liblustre/rw.c
  • lustre/include/lclient.h
  • lustre/llite/llite_mmap.c
  • lustre/llite/file.c
  • lustre/lclient/lcommon_misc.c
  • lustre/lclient/glimpse.c
  • lustre/llite/rw.c
  • lustre/lclient/lcommon_cl.c
Comment by Build Master (Inactive) [ 18/May/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #122
LU-300: Oops in cl_page_put of page fault path

Oleg Drokin : 15ac26cb2fc0b9b4c6c4507d8cdab683b9b40b7e
Files :

  • lustre/llite/file.c
  • lustre/liblustre/rw.c
  • lustre/include/lclient.h
  • lustre/llite/llite_mmap.c
  • lustre/lclient/glimpse.c
  • lustre/lclient/lcommon_misc.c
  • lustre/lclient/lcommon_cl.c
  • lustre/llite/rw.c
Comment by Peter Jones [ 30/May/11 ]

Sebastien

How has this patch fared running in production at CEA?

Thanks

Peter

Comment by Sebastien Buisson (Inactive) [ 30/May/11 ]

As far as I know, there has been no new occurrence of this bug since last Tuesday. We will have more news from CEA by the end of the week.

Sebastien.

Comment by Peter Jones [ 31/May/11 ]

Thanks Sebastien. In that case I will mark this ticket as resolved for now. The patch has been on master for almost two weeks now with no observed side effects, and this issue was appearing daily at CEA before the patch was applied. If a problem is found with the patch at CEA, the ticket can simply be reopened.
