LU-600: Stack overflow: numerous crashes/Oopses in __switch_to()

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description


      Hi,

      At the CEA customer site, they are experiencing a lot of system crashes with "invalid_op" on compute nodes, which appear to be caused by a kernel stack overflow.

      I am attaching to this JIRA ticket the traces of the crash analysis done by our on-site support. The first two are "invalid_op" in __switch_to() and the other three are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair().

      I found in Bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or a new one.

      Below is the detailed description provided by on-site support:

      #context:
      At Tera-100 we experience multiple compute node crashes/Oopses for "invalid opcode" in the __switch_to() routine
      during a task/thread context switch. It seems to only happen on nodes always running the same MPI application
      across a lot of nodes in parallel...
      
      #consequences: 
      #details:
      
      Since we are in between a context switch from one task of the MPI application to another, crash-dump analysis
      has been a bit tricky, but here is the whole story:

      • The failing instruction/invalid opcode is an "xsave", and the failure is caused by the CR4 register not
      having the OSXSAVE bit set.

      • If the "xsave" instruction is executed, it is because the routines involved in the previous context-save
      algorithm established that the corresponding (struct thread_info *)->status has the TS_XSAVE bit set!!

      • Finally, I found that this is because the "struct thread_info" at the top of the kernel stack has been
      corrupted due to a stack overflow!!!
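
      For readers reconstructing the failure chain: on x86_64, executing "xsave" while the CR4.OSXSAVE bit is clear raises #UD (the "invalid opcode" trap), and in 2.6-era kernels struct thread_info sits at the very bottom of the task's kernel stack, so a deep overflow scribbles over it first. Below is a minimal userspace sketch of that layout and of canary-based overflow detection; the stack_area type, the field set, and the TS_XSAVE value are simplified illustrations, not the actual kernel source:

          #include <stddef.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>

          #define STACK_SIZE      8192            /* typical x86_64 kernel stack of that era */
          #define STACK_END_MAGIC 0x57AC6E9DUL    /* canary value used by Linux */
          #define TS_XSAVE        0x0010U         /* illustrative bit, not the real definition */

          struct thread_info {
              uint32_t status;                    /* context-switch flags, e.g. TS_XSAVE */
          };

          /* Layout sketch: thread_info sits at the lowest address of the
           * stack area, a canary sits just above it, and the stack proper
           * grows DOWN toward both of them. */
          struct stack_area {
              struct thread_info ti;
              unsigned long end_magic;
              unsigned char stack[STACK_SIZE - sizeof(struct thread_info)
                                             - sizeof(unsigned long)];
          };

          static int stack_overflowed(const struct stack_area *sa)
          {
              return sa->end_magic != STACK_END_MAGIC;
          }

          int main(void)
          {
              struct stack_area sa = { .ti = { .status = 0 },
                                       .end_magic = STACK_END_MAGIC };

              /* Simulate a call chain growing past the stack bottom and
               * scribbling over the canary and thread_info. */
              memset(&sa, 0xff, offsetof(struct stack_area, stack));

              if (stack_overflowed(&sa) && (sa.ti.status & TS_XSAVE))
                  puts("corrupted thread_info: TS_XSAVE appears set, so "
                       "__switch_to() would execute xsave");
              return 0;
          }

      Once the canary is gone, nothing in thread_info can be trusted, which is exactly how a garbage status word can make __switch_to() believe TS_XSAVE is set on a CPU whose CR4 never enabled XSAVE.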
      
      

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Duplicate of LU-969


            adilger Andreas Dilger added a comment -

            The stack reduction patches are being tracked in LU-969.

            adilger Andreas Dilger added a comment -

            We have done some work on the orion development branch to reduce the stack usage, and it improved things a fair amount. One change was to stop GCC from inlining static functions: if GCC inlines these functions, it adds their stack usage together instead of using only the largest single function's stack, and avoiding this has helped noticeably:

            http://review.whamcloud.com/#change,1614
            http://review.whamcloud.com/#change,1599
            http://review.whamcloud.com/#change,1598

            These patches will not apply directly to your tree, but they could be a starting point for trying to resolve the problem of stack overflows on the server.

            As for the client, quite a number of fixes were made to the client IO code in the 2.1 release, so it wouldn't surprise me if this problem is fixed in the newer 2.1.0 release.

            Are you planning to upgrade to Lustre 2.1.0 any time soon?
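
            For context on the inlining point above: when GCC inlines several static helpers into one caller, each inlined body's locals can get their own slot in the caller's frame, so the frame sizes add up; marking the helpers noinline keeps each one in its own short-lived frame. A hypothetical sketch of the pattern (function and buffer names invented, not taken from the patches):

                #include <stdio.h>
                #include <string.h>

                /* If GCC inlined both helpers into handle_request(), the
                 * caller's frame could hold both 512-byte buffers at once.
                 * With noinline, each buffer is live only for the duration
                 * of its own call. */
                static __attribute__((noinline)) int parse_header(const char *msg)
                {
                    char hdr[512];                  /* hypothetical scratch buffer */
                    strncpy(hdr, msg, sizeof(hdr) - 1);
                    hdr[sizeof(hdr) - 1] = '\0';
                    return hdr[0] != '\0';          /* "valid" if non-empty */
                }

                static __attribute__((noinline)) int format_reply(char *out, size_t outlen)
                {
                    char reply[512];                /* hypothetical scratch buffer */
                    snprintf(reply, sizeof(reply), "reply: ok");
                    strncpy(out, reply, outlen - 1);
                    out[outlen - 1] = '\0';
                    return 0;
                }

                int handle_request(const char *msg, char *out, size_t outlen)
                {
                    if (!parse_header(msg))
                        return -1;
                    return format_reply(out, outlen);
                }

                int main(void)
                {
                    char out[32];
                    return handle_request("hello", out, sizeof(out));
                }

            With noinline, the deepest point of the call chain costs roughly the largest single helper frame rather than the sum of all of them, which is the effect the review.whamcloud.com patches were after.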

            bzzz Alex Zhuravlev added a comment -

            The stack on the MDS was quite big, and I wouldn't be that surprised if this is a stack overflow case. But that is quite unexpected on an OST, where the code and logic are much simpler... anyway, I'm checking this idea.
            green Oleg Drokin added a comment -

            Alex will look into the possible theory of mballoc causing too-big stack allocations.
            Hopefully soon.
            green Oleg Drokin added a comment -

            I essentially see two patterns.
            The one on the client side is similar to the ones you referenced, and I think those are already fixed in 2.1.

            The MDS and OSS traces look different from what we have seen so far, I believe.
            I was a bit worried by schedule_bug, but it turns out to be a warning about scheduling while atomic, which very well might happen if the thread_info was overwritten.

            I am still digging in that direction. The OSS crash in particular does not look to have an enormous stack trace, but I guess some stack frames might just be big.
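
            To illustrate why a clobbered thread_info produces that exact warning: in 2.6-era kernels, preempt_count is a field of struct thread_info at the base of the kernel stack, and schedule() complains when it is nonzero. A simplified, runnable model (the field layout and message text are illustrative, not the kernel's):

                #include <stdio.h>

                /* Simplified: preempt_count lives in thread_info; zero means
                 * the task may sleep, nonzero means it is in atomic context. */
                struct thread_info {
                    int preempt_count;
                };

                static void schedule_check(const struct thread_info *ti)
                {
                    /* A stack overflow that rewrites thread_info leaves garbage
                     * here, so this fires even though nobody actually disabled
                     * preemption or held a spinlock. */
                    if (ti->preempt_count != 0)
                        fprintf(stderr, "BUG: scheduling while atomic: preempt_count=%#x\n",
                                (unsigned)ti->preempt_count);
                }

                int main(void)
                {
                    /* Pretend an overflow wrote random data over the field. */
                    struct thread_info ti = { .preempt_count = 0x3a2f91cc };
                    schedule_check(&ti);
                    return 0;
                }

            Any nonzero garbage written over that field makes a perfectly sleepable thread look atomic to the scheduler, which matches the schedule_bug noise seen in these dumps.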

            louveta Alexandre Louvet (Inactive) added a comment -

            One more from an MDS: gaia14-MDS.txt

            There might be various corruptions, since they were triggered on various node types (MDS/OSS/client).
            The node type has been put in the .txt file name.

            Alex.

            patrick.valentin Patrick Valentin (Inactive) added a comment -

            Oleg,

            Yesterday it was not possible to attach files to LU-600, and it was not possible either to send them to ftp.whamcloud.com.
            This morning it is still impossible to attach files, but I was able to upload them to ftp.whamcloud.com, under uploads/LU-600.

            The trace files corresponding to "invalid_op" in __switch_to() are:

            • cartan1000_crash_debug.txt
            • cartan1329_crash_debug.txt

            The trace files corresponding to "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair() are:

            • lascaux205_OSS_crash_debug.txt
            • cartan111_MDS_crash_at_shine_stop.txt
            • cartan1411_crash_debug.txt
            green Oleg Drokin added a comment -

            Can you please provide the backtraces? Without them, it's not really possible to know how the stack overflow occurred, or to see whether the two patches you referenced would even help you.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: patrick.valentin Patrick Valentin (Inactive)
              Votes: 0
              Watchers: 5
