(out_handler.c:854:out_tx_end()) ... rc = -524

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Environment: lola
      build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)
    • Severity: 3

    Description

      The error happens during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214)

      Approximately 10% of all batch jobs running simul crash with the following typical error message:

      ...
      ...
      03:01:31: Running test #10(iter 42): mkdir, shared mode.
      03:01:31: Running test #10(iter 43): mkdir, shared mode.
      03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
      with errorcode 1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      In: PMI_Abort(1, N/A)
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      [lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      srun: error: lola-32: task 3: Killed
      srun: Terminating job step 393832.0
      srun: error: lola-29: task 2: Killed
      srun: error: lola-27: task 0: Exited with exit code 1
      srun: error: lola-27: task 1: Killed
      

      Each job crash correlates exactly in time with an error event on an MDS:

      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
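
The timestamp correlation described above can be checked mechanically. The following is a minimal sketch, not the procedure used in the soak run: the function names, the regular expressions, and the one-second matching window are all assumptions made for illustration. It also decodes the return code. 524 is ENOTSUPP in the Linux kernel's include/linux/errno.h, a kernel-internal value that userspace errno tables do not list.

```python
import errno
import re
from datetime import datetime, timedelta

# Kernel-internal error codes are absent from userspace errno tables.
# 524 is ENOTSUPP in the Linux kernel's include/linux/errno.h.
KERNEL_ONLY_ERRNO = {524: "ENOTSUPP"}

# Hypothetical patterns, written against the log lines quoted in this ticket.
KILL_RE = re.compile(r"KILLED AT (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
MDS_RE = re.compile(
    r"(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: LustreError: "
    r".*\(out_handler\.c:\d+:out_tx_end\(\)\).*rc = (-?\d+)"
)


def errno_name(rc):
    """Map a negative kernel return code such as -524 to a symbolic name."""
    code = abs(rc)
    return KERNEL_ONLY_ERRNO.get(code) or errno.errorcode.get(code, "unknown")


def kill_times(job_output):
    """Distinct timestamps at which slurmd reported killing a job step."""
    return sorted({datetime.strptime(t, "%Y-%m-%dT%H:%M:%S")
                   for t in KILL_RE.findall(job_output)})


def mds_events(syslog, year):
    """(timestamp, host, rc) per out_tx_end() error; syslog lines carry no year."""
    events = []
    for line in syslog.splitlines():
        m = MDS_RE.search(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((stamp, m.group(2), int(m.group(3))))
    return events


def correlate(job_output, syslog, year, window=timedelta(seconds=1)):
    """Pair each job kill time with MDS errors logged within `window` of it."""
    return [(k, host, rc, errno_name(rc))
            for k in kill_times(job_output)
            for stamp, host, rc in mds_events(syslog, year)
            if abs(stamp - k) <= window]
```

Calling `correlate(job_output, mds_log, 2015)` on the simul output and the lola-10 log lines above should pair the kill time with the out_tx_end() error on lola-10. The "Skipped 3 previous similar messages" line carries no rc value and is ignored.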
      

      Attachments

        Activity

          [LU-7564] (out_handler.c:854:out_tx_end()) ... rc = -524
          adilger Andreas Dilger made changes -
          Labels Original: DNE2 soak New: dne2 soak
          jgmitter Joseph Gmitter (Inactive) made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: In Progress [ 3 ] New: Resolved [ 5 ]

          jgmitter Joseph Gmitter (Inactive) added a comment - Patch has landed for 2.8

          gerrit Gerrit Updater added a comment -
          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18206/
          Subject: LU-7564 osp: Do not match the lock for OSP
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: beab72b475c6006f53d5cab628cfdbe6dca09b32
          di.wang Di Wang made changes -
          Affects Version/s New: Lustre 2.8.0 [ 11113 ]
          di.wang Di Wang made changes -
          Labels Original: soak New: DNE2 soak

          gerrit Gerrit Updater added a comment -
          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18206
          Subject: LU-7564 osp: lock remote object exclusively
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 468a6d9ee4854740353cc41c04aa70ab6155e069
          heckes Frank Heckes (Inactive) made changes -
          Attachment New: lustre-log.1453899092.15929.bz2 [ 20213 ]
          heckes Frank Heckes (Inactive) made changes -
          Attachment New: lustre-log.1453899093.15184.bz2 [ 20207 ]
          heckes Frank Heckes (Inactive) made changes -

          People

            Assignee: di.wang Di Wang
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: