Lustre / LU-300

Oops in cl_page_put() during execve()/page fault on a binary mapped from a Lustre filesystem and executed by a parallel application


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.0.0
    • Labels: None
    • Severity: 3
    • Rank: 5034

    Description

      Hi,

      While running parallel applications that map their binary or some of their dynamic libraries from Lustre, CEA frequently sees Lustre client crashes on Tera-100, with the following sample stack:

      =====================================
      crash_kexec()
      oops_end()
      no_context()
      __bad_area_nosemaphore()
      bad_area()
      do_page_fault()
      page_fault()
      [exception RIP: cl_page_put+29]
      vvp_io_fault_fini()
      cl_io_fini()
      ll_fault()
      __do_fault()
      handle_pte_fault()
      handle_mm_fault()
      do_page_fault()
      page_fault()
      =====================================

      Further crash dump analysis shows that, in vvp_io_fault_fini(), io->u.ci_fault.ft_page is non-NULL and is therefore passed to cl_page_put(). However, the value is not a valid address: it looks like a plain integer (possibly a timestamp), even though ci_type == CIT_FAULT.
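
To illustrate why a non-NULL test alone does not protect the put, here is a minimal user-space sketch. It is not the Lustre source: every model_* name and the rd_stamp field are hypothetical stand-ins, and the idea that the bad value is left over from an earlier use of the same storage is only an assumption for the sake of the example, not a diagnosis. It assumes a mainstream platform where an all-zero bit pattern is a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for illustration; NOT the real Lustre definitions. */
enum model_io_type { MODEL_CIT_READ, MODEL_CIT_FAULT };

struct model_page { int refcount; };

struct model_io {
    enum model_io_type ci_type;
    union {
        /* Hypothetical other user of the same storage. */
        struct { uintptr_t rd_stamp; } ci_rd;
        /* Field named after the one seen in the crash dump. */
        struct { struct model_page *ft_page; } ci_fault;
    } u;
};

/*
 * Start an I/O of the given type. When clear is zero, whatever bytes a
 * previous use left in the union are kept as they are.
 */
static void model_io_init(struct model_io *io, enum model_io_type type,
                          int clear)
{
    assert(io != NULL);
    if (clear)
        memset(&io->u, 0, sizeof(io->u));
    io->ci_type = type;
}

/*
 * Mirrors the check described above: for a fault I/O, a non-NULL ft_page
 * is treated as a page whose reference must be dropped.
 * Returns 1 if the put would be attempted, 0 otherwise.
 */
static int model_fault_fini_would_put(const struct model_io *io)
{
    return io->ci_type == MODEL_CIT_FAULT &&
           io->u.ci_fault.ft_page != NULL;
}
```

In this model, storing an integer such as 1300000000 in u.ci_rd.rd_stamp and then calling model_io_init(&io, MODEL_CIT_FAULT, 0) makes model_fault_fini_would_put() return 1, so the integer would be dereferenced as a page. With clear set to 1 it returns 0.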

      Note that the customer is already running with the fix from LU-122.
      This problem has a significant impact: it disrupts regular cluster production by preventing normal job launch.

      Sebastien.

          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: sebastien.buisson Sebastien Buisson (Inactive)