Stack overflow: numerous crashes/Oopses in __switch_to()

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • None
    • None
    • 2
    • 6569

    Description

      Stack overflow: numerous crashes/Oopses in __switch_to()

      Hi,

      At the CEA customer site, they are experiencing many system crashes with "invalid_op" on compute nodes, which seem to be caused by a kernel stack overflow.

      I attach to this JIRA ticket the traces of the crash analysis done by our on-site support. The first two are "invalid_op" in __switch_to() and the other three are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair().

      I found in bugzilla that bugs #21128 and #22244 are related to stack overflow, but we do not have the corresponding patches in our Lustre 2.0.0. I don't know whether this is a similar problem or it's a new one.

      Below is the detailed description provided by on-site support:

      #context: 
      On tera-100 we experience multiple compute-node crashes/Oopses for "invalid opcode" in the __switch_to() routine
      during a task/thread context switch. It seems to happen only on nodes always running the same MPI application with a
      lot of nodes in parallel ...
      
      #consequences: 
      #details:
      
      Since we are in the middle of a context switch between one task of the MPI application and another, crash-dump analysis has
      been a bit tricky, but here is the whole story:
      
                _ the failing instruction/invalid opcode is an "xsave", and the failure is caused by the CR4 register
      not having the OSXSAVE bit set.
      
                _ the "xsave" instruction is executed because the routines involved in the preceding context save
      found that the corresponding (struct thread_info *)->status has the TS_XSAVE bit set !!
      
                _ finally I found that this is because the "struct thread_info" on top of the kernel stack has been
      corrupted by a stack overflow !!!
      
      

          Activity

            [LU-600] Stack overflow: numerous crashes/Oopses in __switch_to()
            pjones Peter Jones made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            adilger Andreas Dilger made changes -
            Link New: This issue duplicates LU-969 [ LU-969 ]
            pjones Peter Jones made changes -
            Assignee Original: Oleg Drokin [ green ] New: Alex Zhuravlev [ bzzz ]
            pjones Peter Jones made changes -
            Severity Original: 3 New: 2
            pjones Peter Jones made changes -
            Assignee Original: Robert Read [ rread ] New: Oleg Drokin [ green ]
            patrick.valentin Patrick Valentin (Inactive) created issue -

            People

              bzzz Alex Zhuravlev
              patrick.valentin Patrick Valentin (Inactive)