  Lustre / LU-600

Stack overflow: numerous crashes/Oopses in __switch_to()

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • 2
    • 6569

    Description

      Hi,

      At the CEA customer site, compute nodes are experiencing a lot of system crashes with "invalid_op", which seem to be caused by a kernel stack overflow.

      I am attaching to this JIRA ticket the traces from the crash analysis done by our on-site support. The first two are "invalid_op" in __switch_to(), and the other three are "page_fault" in fair_dequeue_task_fair() and fair_pick_next_task_fair().

      I found in Bugzilla that bugs #21128 and #22244 are related to stack overflows, but our Lustre 2.0.0 does not include the corresponding patches. I don't know whether this is the same problem or a new one.

      Below is the detailed description provided by on-site support:

      #context: 
      At tera-100 we experience multiple compute-node crashes/Oopses for "invalid opcode" in the __switch_to() routine
      during a task/thread context switch. It seems to happen only on nodes that always run the same MPI application
      across a lot of nodes in parallel ...
      
      #consequences: 
      #details:
      
      Since the crash happens in the middle of a context switch from one task of the MPI application to another,
      crash-dump analysis has been a bit tricky, but here is the whole story:
      
                _ The failing instruction (the invalid opcode) is an "xsave", and it fails because the CR4 register
      does not have the OSXSAVE bit set.
      
                _ The "xsave" instruction is only reached because the routines involved in the preceding
      context-save path found that the corresponding (struct thread_info *)->status has the TS_XSAVE
      bit set!
      
                _ Finally, I found that this is because the "struct thread_info" on top of the kernel stack has been
      corrupted by a stack overflow!
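
      The chain described above can be illustrated with a small user-space simulation. This is only a
      sketch, not kernel code. It assumes that struct thread_info shares the kernel stack allocation and
      sits at the end the stack grows toward, and that the stack is 8 KiB. All sim_* names and the
      SIM_TS_XSAVE value are hypothetical stand-ins chosen for illustration. The only architectural fact
      used is that OSXSAVE is bit 18 of CR4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_THREAD_SIZE 8192u        /* assumed size of the stack allocation */
#define SIM_TS_XSAVE    0x0010u      /* illustrative "use xsave" status flag */
#define SIM_CR4_OSXSAVE (1u << 18)   /* CR4.OSXSAVE is bit 18 */

/* Hypothetical, heavily reduced stand-in for the kernel's thread_info. */
struct sim_thread_info {
    uint32_t flags;
    uint32_t status;
};

/* One allocation holds both thread_info (at offset 0) and the stack. */
struct sim_stack {
    unsigned char mem[SIM_THREAD_SIZE];
    size_t sp;                       /* stack pointer offset, grows toward 0 */
};

static void sim_init(struct sim_stack *s)
{
    struct sim_thread_info ti = { 0u, 0u };

    memset(s->mem, 0, sizeof s->mem);
    memcpy(s->mem, &ti, sizeof ti);  /* clean thread_info, status == 0 */
    s->sp = SIM_THREAD_SIZE;         /* empty stack starts at the high end */
}

/* Push a frame of len bytes. Nothing protects thread_info from the frame. */
static void sim_push_frame(struct sim_stack *s, size_t len, unsigned char fill)
{
    assert(len <= s->sp);            /* stay inside the simulated allocation */
    s->sp -= len;
    memset(s->mem + s->sp, fill, len);
}

static uint32_t sim_status(const struct sim_stack *s)
{
    struct sim_thread_info ti;

    memcpy(&ti, s->mem, sizeof ti);
    return ti.status;
}

/*
 * Returns 1 when the context-save path would execute xsave (status has the
 * xsave flag) although CR4.OSXSAVE is clear, i.e. the invalid-opcode case.
 */
static int sim_switch_would_fault(const struct sim_stack *s, uint32_t cr4)
{
    int wants_xsave = (sim_status(s) & SIM_TS_XSAVE) != 0;
    int os_enabled = (cr4 & SIM_CR4_OSXSAVE) != 0;

    return wants_xsave && !os_enabled;
}
```

      Pushing frames that stop short of thread_info leaves status untouched, while a frame that reaches
      offset 0 overwrites it with stack data. If that data happens to have the xsave flag bit set and
      CR4.OSXSAVE is clear, sim_switch_would_fault() reports the invalid-opcode condition.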
      
      

            People

              bzzz Alex Zhuravlev
              patrick.valentin Patrick Valentin (Inactive)
