Lustre / LU-7564

(out_handler.c:854:out_tx_end()) ... rc = -524

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Environment: lola
      build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)
    • Severity: 3

    Description

      The error occurs during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214).

      Approximately 10% of all batch jobs running simul crash with the following typical error message:

      ...
      ...
      03:01:31: Running test #10(iter 42): mkdir, shared mode.
      03:01:31: Running test #10(iter 43): mkdir, shared mode.
      03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
      with errorcode 1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      In: PMI_Abort(1, N/A)
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      [lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      srun: error: lola-32: task 3: Killed
      srun: Terminating job step 393832.0
      srun: error: lola-29: task 2: Killed
      srun: error: lola-27: task 0: Exited with exit code 1
      srun: error: lola-27: task 1: Killed
      

      Each job crash correlates exactly in time with an error logged on an MDS:

      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
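
      The correlation described above could be automated along the following lines. This is only a minimal sketch: the function names, regular expressions, and one-second tolerance are hypothetical and not part of the soak test framework, and the year must be supplied because syslog timestamps omit it. For reference, 524 is the Linux kernel-internal errno ENOTSUPP ("Operation is not supported").

```python
import re
from datetime import datetime

# Linux kernel-internal errno (include/linux/errno.h): operation not supported.
ENOTSUPP = 524

# Kill notice written by slurmd into the job output.
JOB_RE = re.compile(r"KILLED AT (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
# MDS syslog line emitted from out_tx_end(); captures time, host and rc.
MDS_RE = re.compile(
    r"([A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: LustreError: "
    r".*out_tx_end\(\)\).*rc = (-?\d+)"
)


def job_failure_times(job_output):
    """Return the set of distinct kill timestamps found in a job's output."""
    return {
        datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
        for m in JOB_RE.finditer(job_output)
    }


def mds_events(syslog_text, year):
    """Return (timestamp, host, rc) for every out_tx_end() error line."""
    events = []
    for line in syslog_text.splitlines():
        m = MDS_RE.search(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2), int(m.group(3))))
    return events


def correlate(job_output, syslog_text, year, tolerance_s=1):
    """Map each job kill time to the MDS events within tolerance_s seconds."""
    events = mds_events(syslog_text, year)
    return {
        t: [e for e in events if abs((e[0] - t).total_seconds()) <= tolerance_s]
        for t in job_failure_times(job_output)
    }


if __name__ == "__main__":
    # Sample lines taken from this ticket; the source path is shortened here.
    job = ("slurmd[lola-27]: *** STEP 393832.0 KILLED AT "
           "2015-12-16T03:01:31 WITH SIGNAL 9 ***")
    mds = ("lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:"
           "(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for "
           "lustre/target/out_handler.c:385: rc = -524")
    for when, hits in correlate(job, mds, year=2015).items():
        print(when.isoformat(), hits)
```

      A job whose kill time maps to an empty list has no matching MDS event within the tolerance and would need to be investigated separately.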
      

      Attachments

        1. dangling-files-after-restart-build-20160126
          2 kB
          Frank Heckes
        2. dangling-files-before-restart-build-20160126
          16 kB
          Frank Heckes
        3. lola-10-lustre-log.for-LU-7564.20151217T0035.bz2
          5 kB
          Frank Heckes
        4. lola-11-lustre-log-LU-7565-20151217-0125.bz2
          0.3 kB
          Frank Heckes
        5. lustre-log.1453899092.15929.bz2
          3.36 MB
          Frank Heckes
        6. lustre-log.1453899093.15184.bz2
          80 kB
          Frank Heckes

          People

            Assignee: Di Wang (di.wang)
            Reporter: Frank Heckes (heckes, inactive)