[LU-600] Stack overflow: numerous crashes/Oopses in __switch_to() - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels: None

Severity: 2
Rank (Obsolete): 6569

Description

Stack overflow: numerous crashes/Oopses in __switch_to()

Hi,

At the CEA customer site, compute nodes are experiencing many system crashes with "invalid_op", which appear to be caused by a kernel stack overflow.

I have attached to this JIRA ticket the traces from the crash analysis done by our on-site support. The first two are "invalid_op" in __switch_to() and the other three are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair().

I found in Bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or a new one.

Below is the detailed description provided by on-site support:

#context: 
On tera-100 we are seeing multiple compute-node crashes/Oopses for "invalid opcode" in the __switch_to() routine
during a task/thread context switch. It seems to happen only on nodes that always run the same MPI application across a
large number of nodes in parallel ...

#consequences: 
#details:

Since the crash occurs in the middle of a context switch from one task of the MPI application to another, crash-dump
analysis has been a bit tricky, but here is the whole story:

          _ the failing instruction (the invalid opcode) is an "xsave", and it fails because the CR4 register
does not have the OSXSAVE bit set.

          _ the "xsave" instruction is executed only because the routines involved in the preceding
context-save path found that the corresponding (struct thread_info *)->status has the TS_XSAVE
bit set.

          _ finally, I found that this bit is set because the "struct thread_info", located at the lowest
address of the kernel stack area, has been corrupted by a stack overflow.
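
The chain of events described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: the stack size, the offset of the status field, the TS_XSAVE value and all function names are hypothetical and chosen for illustration. The only architectural fact relied on is that OSXSAVE is bit 18 of CR4 and that executing xsave with that bit clear raises an invalid-opcode exception.

```python
import struct

THREAD_SIZE = 8192       # assumed size of the kernel stack area (two 4 KiB pages)
STATUS_OFFSET = 0        # hypothetical offset of thread_info.status in this model
TS_XSAVE = 0x0010        # assumed flag value, for illustration only
CR4_OSXSAVE = 1 << 18    # CR4 bit 18 (OSXSAVE) on x86


def new_stack():
    """Return a zeroed stack area; the modelled thread_info sits at offset 0."""
    return bytearray(THREAD_SIZE)


def read_status(stack):
    """Read the modelled 32-bit thread_info.status field."""
    return struct.unpack_from("<I", stack, STATUS_OFFSET)[0]


def push_frame(stack, sp, size, fill=0xFF):
    """Simulate a call that uses `size` bytes of stack, growing downward.

    Nothing stops the frame from running into thread_info, which is the
    hazard being modelled.
    """
    new_sp = sp - size
    if new_sp < 0:
        raise ValueError("frame does not fit in the modelled stack area")
    stack[new_sp:sp] = bytes([fill]) * size
    return new_sp


def context_switch_faults(stack, cr4):
    """True if the save path would run xsave while CR4.OSXSAVE is clear."""
    wants_xsave = bool(read_status(stack) & TS_XSAVE)
    return wants_xsave and not (cr4 & CR4_OSXSAVE)


if __name__ == "__main__":
    stack = new_stack()
    sp = push_frame(stack, THREAD_SIZE, 4096)   # normal stack usage
    print(context_switch_faults(stack, cr4=0))
    sp = push_frame(stack, sp, 4096)            # overflow reaches thread_info
    print(context_switch_faults(stack, cr4=0))
```

In this model the first frame leaves the status field untouched, while the second frame overwrites it with garbage that happens to include the TS_XSAVE bit, so the context-switch path would attempt xsave on a CPU where the operating system never enabled it.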

Attachments

Issue Links

duplicates

LU-969 2.1 client stack overruns

Resolved

Activity

Peter Jones made changes - 29/Mar/12 9:06 AM

Resolution		New: Duplicate [ 3 ]
Status	Original: Open [ 1 ]	New: Resolved [ 5 ]

Andreas Dilger made changes - 23/Jan/12 1:24 PM

Link

New: This issue duplicates LU-969 [ LU-969 ]

Peter Jones made changes - 13/Oct/11 10:45 AM

Assignee

Original: Oleg Drokin [ green ]

New: Alex Zhuravlev [ bzzz ]

Peter Jones made changes - 19/Aug/11 8:29 AM

Severity	Original: 3	New: 2
Description	Original: Stack overflow: numerous crashes/Oopses in __switch_to() Hi, At CEA customer site, they experience a lot of system crashes with "invalid_op" on compute nodes, which seem to be caused by a kernel stack owerflow. I attach to this JIRA ticket the traces of the crash analysis done by our on site support. The two first one are "invalid_op" in __switch_to() and the three other are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair(). I found in bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or it's a new one. Below is the detailled description provided by on site support: {noformat} #context: At tera-100 we experience multiple compute nodes crashes/Oopses for "invalid opcode" in __switch_to() routine during a task/thread context-switch. It seems to only happen on nodes always running the same MPI-application with a lot of nodes in // ... #consequences: #details: Since we are in-between a context switch from one task of the MPI-application and some other, crash-dump analysis has been a bit tricky, but here is the whole story : _ the failing instruction/invalid opcode is a "xsave" and the failure is caused by CR4 register not having the OSXSAVE bit set. _ if the "xsave" instruction is to be executed, it is because in the different routines involved in previous context save algorithm , it is established that the corresponding (struct thread_info *)->status as TS_XSAVE bit set !! _ finally I found that this is because the "struct thread_info" on top of the Kernel-stack has been corrupted due to stack overflow !!! {\noformat}	New: Stack overflow: numerous crashes/Oopses in __switch_to() Hi, At CEA customer site, they experience a lot of system crashes with "invalid_op" on compute nodes, which seem to be caused by a kernel stack owerflow. 
I attach to this JIRA ticket the traces of the crash analysis done by our on site support. The two first one are "invalid_op" in __switch_to() and the three other are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair(). I found in bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or it's a new one. Below is the detailled description provided by on site support: {noformat} #context: At tera-100 we experience multiple compute nodes crashes/Oopses for "invalid opcode" in __switch_to() routine during a task/thread context-switch. It seems to only happen on nodes always running the same MPI-application with a lot of nodes in // ... #consequences: #details: Since we are in-between a context switch from one task of the MPI-application and some other, crash-dump analysis has been a bit tricky, but here is the whole story : _ the failing instruction/invalid opcode is a "xsave" and the failure is caused by CR4 register not having the OSXSAVE bit set. _ if the "xsave" instruction is to be executed, it is because in the different routines involved in previous context save algorithm , it is established that the corresponding (struct thread_info *)->status as TS_XSAVE bit set !! _ finally I found that this is because the "struct thread_info" on top of the Kernel-stack has been corrupted due to stack overflow !!! {\noformat}

Peter Jones made changes - 18/Aug/11 6:59 PM

Assignee

Original: Robert Read [ rread ]

New: Oleg Drokin [ green ]

Patrick Valentin (Inactive) created issue - 17/Aug/11 12:09 PM

People

Assignee: Alex Zhuravlev

Reporter: Patrick Valentin (Inactive)

Votes: 0

Watchers: 5

Dates

Created: 17/Aug/11 12:09 PM

Updated: 29/Mar/12 9:06 AM

Resolved: 29/Mar/12 9:06 AM