LU-600: Stack overflow: numerous crashes/Oopses in __switch_to()

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor

    Description


      Hi,

      At the CEA customer site, they are experiencing a lot of system crashes with "invalid_op" on compute nodes, which appear to be caused by a kernel stack overflow.

      I am attaching to this JIRA ticket the traces of the crash analysis done by our on-site support. The first two are "invalid_op" in __switch_to() and the other three are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair().

      I found in Bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or a new one.

      Below is the detailed description provided by on-site support:

      #context:
      At Tera-100 we experience multiple compute node crashes/Oopses for "invalid opcode" in the __switch_to() routine
      during a task/thread context switch. It seems to only happen on nodes always running the same MPI application
      across a lot of nodes in parallel...
      
      #consequences: 
      #details:
      
      Since we are in between a context switch from one task of the MPI application to another, crash-dump analysis
      has been a bit tricky, but here is the whole story:

      • The failing instruction/invalid opcode is an "xsave", and the failure is caused by the CR4 register not
      having the OSXSAVE bit set.

      • If the "xsave" instruction is executed, it is because the routines involved in the previous context-save
      algorithm established that the corresponding (struct thread_info *)->status has the TS_XSAVE bit set!!

      • Finally, I found that this is because the "struct thread_info" at the top of the kernel stack has been
      corrupted due to a stack overflow!!!
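
      For readers reconstructing the failure chain: on x86_64, executing "xsave" while the CR4.OSXSAVE bit is clear raises #UD (the "invalid opcode" trap), and in 2.6-era kernels struct thread_info sits at the very bottom of the task's kernel stack, so a deep overflow scribbles over it first. Below is a minimal userspace sketch of that layout and of canary-based overflow detection; the stack_area type, the field set, and the TS_XSAVE value are simplified illustrations, not the actual kernel source:

          #include <stddef.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>

          #define STACK_SIZE      8192            /* typical x86_64 kernel stack of that era */
          #define STACK_END_MAGIC 0x57AC6E9DUL    /* canary value used by Linux */
          #define TS_XSAVE        0x0010U         /* illustrative bit, not the real definition */

          struct thread_info {
              uint32_t status;                    /* context-switch flags, e.g. TS_XSAVE */
          };

          /* Layout sketch: thread_info sits at the lowest address of the
           * stack area, a canary sits just above it, and the stack proper
           * grows DOWN toward both of them. */
          struct stack_area {
              struct thread_info ti;
              unsigned long end_magic;
              unsigned char stack[STACK_SIZE - sizeof(struct thread_info)
                                             - sizeof(unsigned long)];
          };

          static int stack_overflowed(const struct stack_area *sa)
          {
              return sa->end_magic != STACK_END_MAGIC;
          }

          int main(void)
          {
              struct stack_area sa = { .ti = { .status = 0 },
                                       .end_magic = STACK_END_MAGIC };

              /* Simulate a call chain growing past the stack bottom and
               * scribbling over the canary and thread_info. */
              memset(&sa, 0xff, offsetof(struct stack_area, stack));

              if (stack_overflowed(&sa) && (sa.ti.status & TS_XSAVE))
                  puts("corrupted thread_info: TS_XSAVE appears set, so "
                       "__switch_to() would execute xsave");
              return 0;
          }

      Once the canary is gone, nothing in thread_info can be trusted, which is exactly how a garbage status word can make __switch_to() believe TS_XSAVE is set on a CPU whose CR4 never enabled XSAVE.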
      
      

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Duplicate of LU-969


            adilger Andreas Dilger added a comment -

            The stack reduction patches are being tracked in LU-969.

            adilger Andreas Dilger added a comment -

            We have done some work on the orion development branch to reduce the stack usage, and it improved things a fair amount. One change was to stop GCC from inlining static functions: if GCC inlines these functions, it adds their stack usage together instead of using only the largest single function's stack, and avoiding this has helped noticeably:

            http://review.whamcloud.com/#change,1614
            http://review.whamcloud.com/#change,1599
            http://review.whamcloud.com/#change,1598

            These patches will not apply directly to your tree, but they could be a starting point for trying to resolve the problem of stack overflows on the server.

            As for the client, quite a number of fixes were made to the client IO code in the 2.1 release, so it wouldn't surprise me if this problem is fixed in the newer 2.1.0 release.

            Are you planning to upgrade to Lustre 2.1.0 any time soon?
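
            For context on the inlining point above: when GCC inlines several static helpers into one caller, each inlined body's locals can get their own slot in the caller's frame, so the frame sizes add up; marking the helpers noinline keeps each one in its own short-lived frame. A hypothetical sketch of the pattern (function and buffer names invented, not taken from the patches):

                #include <stdio.h>
                #include <string.h>

                /* If GCC inlined both helpers into handle_request(), the
                 * caller's frame could hold both 512-byte buffers at once.
                 * With noinline, each buffer is live only for the duration
                 * of its own call. */
                static __attribute__((noinline)) int parse_header(const char *msg)
                {
                    char hdr[512];                  /* hypothetical scratch buffer */
                    strncpy(hdr, msg, sizeof(hdr) - 1);
                    hdr[sizeof(hdr) - 1] = '\0';
                    return hdr[0] != '\0';          /* "valid" if non-empty */
                }

                static __attribute__((noinline)) int format_reply(char *out, size_t outlen)
                {
                    char reply[512];                /* hypothetical scratch buffer */
                    snprintf(reply, sizeof(reply), "reply: ok");
                    strncpy(out, reply, outlen - 1);
                    out[outlen - 1] = '\0';
                    return 0;
                }

                int handle_request(const char *msg, char *out, size_t outlen)
                {
                    if (!parse_header(msg))
                        return -1;
                    return format_reply(out, outlen);
                }

                int main(void)
                {
                    char out[32];
                    return handle_request("hello", out, sizeof(out));
                }

            With noinline, the deepest point of the call chain costs roughly the largest single helper frame rather than the sum of all of them, which is the effect the review.whamcloud.com patches were after.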

            bzzz Alex Zhuravlev added a comment -

            The stack on the MDS was quite big, and I wouldn't be that surprised if this is a stack overflow case. But that is quite unexpected on an OST, where the code and logic are much simpler... anyway, I'm checking this idea.
            green Oleg Drokin added a comment -

            Alex will look into the possible theory of mballoc causing too-big stack allocations.
            Hopefully soon.
            green Oleg Drokin added a comment -

            I essentially see two patterns.
            The one on the client side is similar to the ones you referenced, and I think those are already fixed in 2.1.

            The MDS and OSS traces look different from what we have seen so far, I believe.
            I was a bit worried by schedule_bug, but it turns out to be a warning about scheduling while atomic, which very well might happen if the thread_info was overwritten.

            I am still digging in that direction. The OSS crash in particular does not look to have an enormous stack trace, but I guess some stack frames might just be big.
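
            To illustrate why a clobbered thread_info produces that exact warning: in 2.6-era kernels, preempt_count is a field of struct thread_info at the base of the kernel stack, and schedule() complains when it is nonzero. A simplified, runnable model (the field layout and message text are illustrative, not the kernel's):

                #include <stdio.h>

                /* Simplified: preempt_count lives in thread_info; zero means
                 * the task may sleep, nonzero means it is in atomic context. */
                struct thread_info {
                    int preempt_count;
                };

                static void schedule_check(const struct thread_info *ti)
                {
                    /* A stack overflow that rewrites thread_info leaves garbage
                     * here, so this fires even though nobody actually disabled
                     * preemption or held a spinlock. */
                    if (ti->preempt_count != 0)
                        fprintf(stderr, "BUG: scheduling while atomic: preempt_count=%#x\n",
                                (unsigned)ti->preempt_count);
                }

                int main(void)
                {
                    /* Pretend an overflow wrote random data over the field. */
                    struct thread_info ti = { .preempt_count = 0x3a2f91cc };
                    schedule_check(&ti);
                    return 0;
                }

            Any nonzero garbage written over that field makes a perfectly sleepable thread look atomic to the scheduler, which matches the schedule_bug noise seen in these dumps.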

            louveta Alexandre Louvet (Inactive) added a comment -

            One more from an MDS: gaia14-MDS.txt

            There might be various corruptions, since they were triggered on various node types (MDS/OSS/client).
            The node type has been put in the .txt file name.

            Alex.

            patrick.valentin Patrick Valentin (Inactive) added a comment -

            Oleg,

            Yesterday it was not possible to attach files to LU-600, and it was not possible either to send them to ftp.whamcloud.com.
            This morning it is still impossible to attach files, but I was able to upload them to ftp.whamcloud.com, under uploads/LU-600.

            The trace files corresponding to "invalid_op" in __switch_to() are:

            • cartan1000_crash_debug.txt
            • cartan1329_crash_debug.txt

            The trace files corresponding to "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair() are:

            • lascaux205_OSS_crash_debug.txt
            • cartan111_MDS_crash_at_shine_stop.txt
            • cartan1411_crash_debug.txt
            green Oleg Drokin added a comment -

            Can you please provide the backtraces? Without them, it's not really possible to know how the stack overflow occurred, or to see whether the two patches you referenced would even help you.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: patrick.valentin Patrick Valentin (Inactive)
              Votes: 0
              Watchers: 5
