Details
-
Bug
-
Resolution: Duplicate
-
Minor
-
None
-
None
-
None
-
2
-
6569
Description
Stack overflow: numerous crashes/Oopses in __switch_to()
Hi,
At the CEA customer site, they are experiencing a lot of system crashes with "invalid_op" on compute nodes, which seem to be caused by a kernel stack overflow.
I attach to this JIRA ticket the traces of the crash analysis done by our on-site support. The first two are "invalid_op" in __switch_to() and the other three are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair().
I found in Bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or a new one.
Below is the detailed description provided by on-site support:
#context:
At tera-100 we experience multiple compute-node crashes/Oopses for "invalid opcode" in the __switch_to() routine during a task/thread context switch. It seems to happen only on nodes that always run the same MPI application across a lot of nodes in parallel ...
#consequences:
#details:
Since the crash occurs in the middle of a context switch between one task of the MPI application and another, crash-dump analysis has been a bit tricky, but here is the whole story:
_ the failing instruction (the invalid opcode) is an "xsave", and the failure is caused by the CR4 register not having the OSXSAVE bit set.
_ the "xsave" instruction gets executed because the routines involved in the preceding context-save path established that the corresponding (struct thread_info *)->status has the TS_XSAVE bit set.
_ finally, I found that this is because the "struct thread_info" on top of the kernel stack has been corrupted by a stack overflow.
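The chain described above can be illustrated with a small user-space sketch. It is not kernel code: it only models a kernel stack as one aligned buffer with a simplified thread_info at its lowest address, and shows how frames that grow down past the end of the stack can set a stray TS_XSAVE bit. The struct layout, the helper names, THREAD_SIZE = 8192 and TS_XSAVE = 0x0010 are assumptions modeled on a 2.6.32-era x86_64 kernel and should be checked against the actual kernel tree; only the position of CR4.OSXSAVE (bit 18) is taken from the architecture definition.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed values, modeled on a 2.6.32-era x86_64 kernel; verify in your tree. */
#define THREAD_SIZE      8192u        /* assumed: two 4 KiB pages per kernel stack */
#define TS_XSAVE         0x0010u      /* assumed value of the thread_info status flag */
#define X86_CR4_OSXSAVE  (1ul << 18)  /* CR4 bit 18 enables xsave/xrstor */

/* Hypothetical, simplified stand-in for the kernel's struct thread_info. */
struct thread_info_sketch {
    uint32_t flags;
    uint32_t status;
};

/*
 * The stack is THREAD_SIZE-aligned and thread_info sits at its lowest
 * address, so masking any pointer inside the stack yields thread_info.
 */
static struct thread_info_sketch *thread_info_from_sp(void *sp)
{
    uintptr_t mask = ~((uintptr_t)THREAD_SIZE - 1);
    return (struct thread_info_sketch *)((uintptr_t)sp & mask);
}

/* Allocate one zeroed, THREAD_SIZE-aligned "kernel stack" (NULL on failure). */
static unsigned char *alloc_stack(void)
{
    unsigned char *base = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
    if (base != NULL)
        memset(base, 0, THREAD_SIZE);
    return base;
}

/*
 * Simulate call frames using `depth` bytes, growing downward from the
 * highest address of the stack. A depth larger than
 * THREAD_SIZE - sizeof(struct thread_info_sketch) overwrites thread_info.
 */
static void consume_stack(unsigned char *base, size_t depth, unsigned char fill)
{
    assert(base != NULL);
    if (depth > THREAD_SIZE)
        depth = THREAD_SIZE;
    memset(base + THREAD_SIZE - depth, fill, depth);
}

/*
 * The context-switch path takes the xsave route when TS_XSAVE is set;
 * with CR4.OSXSAVE clear, xsave raises an invalid-opcode exception.
 */
static int would_raise_invalid_opcode(const struct thread_info_sketch *ti,
                                      unsigned long cr4)
{
    return (ti->status & TS_XSAVE) != 0 && (cr4 & X86_CR4_OSXSAVE) == 0;
}
```

In this model, a stack that stays within THREAD_SIZE minus the size of the thread_info leaves status untouched, while a deeper one fills status with garbage; any garbage value with the TS_XSAVE bit set on a CPU where CR4.OSXSAVE is clear reproduces the "invalid opcode" condition reported in __switch_to().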
Attachments
Issue Links
- duplicates: LU-969 2.1 client stack overruns (Resolved)