[LU-600] Stack overflow: numerous crashes/Oopses in __switch_to() Created: 17/Aug/11 Updated: 29/Mar/12 Resolved: 29/Mar/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Valentin (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 6569 |
| Description |
|
Hi, At the CEA customer site, they experience a lot of system crashes with "invalid_op" on compute nodes, which seem to be caused by a kernel stack overflow. I attach to this JIRA ticket the traces of the crash analysis done by our on-site support. The first two are "invalid_op" in __switch_to() and the other three are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair(). I found in bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or a new one. Below is the detailed description provided by on-site support:
#context:
At tera-100 we experience multiple compute-node crashes/Oopses with "invalid opcode" in the __switch_to() routine
during a task/thread context switch. It seems to happen only on nodes that are always running the same MPI
application in parallel across a large number of nodes...
#consequences:
#details:
Since we are in the middle of a context switch from one task of the MPI application to another, crash-dump analysis has
been a bit tricky, but here is the whole story:
_ the failing instruction (invalid opcode) is an "xsave", and the failure is caused by the CR4 register
not having the OSXSAVE bit set.
_ the "xsave" instruction is executed because, in the various routines involved
in the preceding context-save path, the corresponding (struct thread_info *)->status is found to have the TS_XSAVE
bit set!
_ finally, I found that this is because the "struct thread_info" at the base (lowest address) of the kernel stack has been
corrupted by a stack overflow!
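For illustration, here is a minimal standalone C sketch of that failure mode (not kernel code; THREAD_SIZE and the TS_XSAVE bit value are illustrative, modeled on the 2.6-era x86_64 layout in which struct thread_info occupies the lowest addresses of each task's kernel stack):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define THREAD_SIZE 8192        /* typical kernel stack size   */
    #define TS_XSAVE    0x0010      /* illustrative bit value only */

    struct thread_info {
        uint32_t status;            /* TS_* flags, e.g. TS_XSAVE   */
    };

    int main(void)
    {
        /* One simulated kernel stack: thread_info at the base (lowest
         * addresses), call frames growing downward from the top. */
        static union {
            unsigned char bytes[THREAD_SIZE];
            struct thread_info ti;
        } stack;

        stack.ti.status = 0;        /* CPU without OSXSAVE: bit clear */

        /* A deep call chain consumes the whole stack; the next frame's
         * writes land below the old bottom, i.e. inside thread_info. */
        memset(stack.bytes, 0xff, sizeof(struct thread_info));

        if (stack.ti.status & TS_XSAVE)
            printf("status corrupted: the FPU-save path would now "
                   "execute xsave\n");
        return 0;
    }

Nothing about the overwritten bytes looks invalid to the kernel itself; the bogus TS_XSAVE bit simply steers the context-save path to xsave on a CPU whose CR4.OSXSAVE is clear, which raises exactly the invalid-opcode trap seen in __switch_to().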
|
| Comments |
| Comment by Oleg Drokin [ 17/Aug/11 ] |
|
Can you please provide the backtraces? Without them it's not really possible to know how the stack overflow occurred, or to see whether the two patches you referenced would even help you. |
| Comment by Patrick Valentin (Inactive) [ 18/Aug/11 ] |
|
Oleg, yesterday it was not possible to attach files to the ticket. The trace files corresponding to "invalid_op" in __switch_to() are:
The trace files corresponding to "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair() are:
|
| Comment by Alexandre Louvet [ 18/Aug/11 ] |
|
One more from an MDS: gaia14-MDS.txt. There might be several different corruptions, since they were triggered on various node types (MDS/OSS/client). Alex. |
| Comment by Oleg Drokin [ 18/Aug/11 ] |
|
I essentially see two patterns. The MDS and OSS traces look different from what we have seen so far, I believe. I am still digging in that direction. The OSS crash in particular does not appear to have an enormous stack trace, but I guess some stack frames might just be big. |
| Comment by Oleg Drokin [ 01/Sep/11 ] |
|
Alex will look into the theory that mballoc is causing overly large stack allocations. |
| Comment by Alex Zhuravlev [ 06/Oct/11 ] |
|
The stack on the MDS was quite big, and I wouldn't be that surprised if this were a stack overflow case. But that is quite unexpected on the OST, where the code and logic are much simpler. Anyway, I'm checking this idea. |
| Comment by Andreas Dilger [ 04/Nov/11 ] |
|
We have done some work on the orion development branch to reduce stack usage, and it improved things a fair amount. One change was to prevent GCC from inlining static functions: when GCC inlines several such functions into one caller, their stack frames are added together in the caller's frame, instead of the caller needing only the largest single frame at any one time. Preventing this has helped noticeably: http://review.whamcloud.com/#change,1614 These patches will not apply directly to your tree, but they could be a starting point for trying to resolve the stack overflows on the server (a small sketch of the effect follows below). As for the client, quite a number of fixes were made to the client IO code in the 2.1 release, so it wouldn't surprise me if this problem were fixed in the newer 2.1.0 release. Are you planning to upgrade to Lustre 2.1.0 any time soon?
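To make the inlining effect concrete, here is a small hedged sketch (the function and buffer names are invented for this example, not taken from the patch). With the noinline attribute each helper keeps its own frame, which is live only for the duration of its call, so the caller's peak stack usage is the largest single frame rather than the sum of all of them:

    #include <string.h>

    #define BUFSZ 512

    /* Without noinline, older GCC may inline both helpers into
     * handler() and allocate both tmp[] buffers in handler()'s own
     * frame, summing their stack usage. */
    static __attribute__((noinline)) int step_one(void)
    {
        char tmp[BUFSZ];        /* frame exists only during the call */
        memset(tmp, 0, sizeof(tmp));
        return tmp[0];
    }

    static __attribute__((noinline)) int step_two(void)
    {
        char tmp[BUFSZ];        /* reuses the same stack region */
        memset(tmp, 1, sizeof(tmp));
        return tmp[0];
    }

    int handler(void)
    {
        /* Peak extra stack here is one BUFSZ-sized frame, not two. */
        return step_one() + step_two();
    }

    int main(void) { return handler(); }
|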
| Comment by Andreas Dilger [ 23/Jan/12 ] |
|
The stack reduction patches are being tracked in |
| Comment by Peter Jones [ 29/Mar/12 ] |
|
Duplicate of |