
LU-7564: (out_handler.c:854:out_tx_end()) ... rc = -524

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Environment: lola
      build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)
    • Severity: 3

    Description

      The error happens during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214)

      Approximately 10% of the batch jobs using simul crash with the 'typical' error message:

      ...
      ...
      03:01:31: Running test #10(iter 42): mkdir, shared mode.
      03:01:31: Running test #10(iter 43): mkdir, shared mode.
      03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
      with errorcode 1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      In: PMI_Abort(1, N/A)
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      [lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      srun: error: lola-32: task 3: Killed
      srun: Terminating job step 393832.0
      srun: error: lola-29: task 2: Killed
      srun: error: lola-27: task 0: Exited with exit code 1
      srun: error: lola-27: task 1: Killed
      

      Each job crash correlates perfectly in time with an event on an MDS:

      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
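
      For reference on the return code: in the Linux kernel, errno 524 is the kernel-internal ENOTSUPP (include/linux/errno.h), so "rc = -524" means an undo callback reported the rollback as unsupported. Below is a minimal userspace sketch of that mapping (ENOTSUPP is defined locally because the kernel does not export it to userspace); the snippet only illustrates the value, it is not the out_tx_end() code.

      #include <stdio.h>

      /* Kernel-internal value from include/linux/errno.h, mirrored here for
       * the demo; it is not visible through <errno.h> in userspace. */
      #define ENOTSUPP 524

      int main(void)
      {
              int rc = -ENOTSUPP; /* the value the out_tx_end() message logs */

              printf("rc = %d (-ENOTSUPP: undo/rollback not supported)\n", rc);
              return 0;
      }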
      

      Attachments

        Activity


          gerrit Gerrit Updater added a comment -

          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18165
          Subject: LU-7564 llog: separate llog creation with initialization
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 734c40e674eb14055648d26af99aca9de4d0fd4f
          di.wang Di Wang added a comment -

          Another possible reason is that the soak test performs double MDT failover without COS, which might cause corruption during failover. Consider the following scenario:

          1. Client1 sends an operation (Op1) to MDT1, and MDT1 distributes the updates of Op1 to MDT2; after finishing Op1, MDT1 sends the reply to Client1.
          2. Client2 sends an operation (Op2) to MDT3, and MDT3 distributes the updates of Op2 to MDT2. (Note: Op2 depends on Op1.)
          3. Before Op1 is committed on MDT1 and MDT2, both MDT1 and MDT2 reboot.
          4. After MDT1 restarts, Client1 resends Op1 to MDT1, and MDT1 again distributes the updates of Op1 to MDT2, but with a different xid and transno 0.
          5. After MDT2 restarts and recovers, it ignores the updates of Op1 because of their transno 0; instead it receives the replay updates from MDT3 (Op2), which then fail.

          If we had COS here, Op1 would be committed before Op2 starts, which would help a lot. I included the COS patch in 16838; let's see how it goes.
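
          To make the ordering hazard in step 5 concrete, here is a toy model in plain C (all names are hypothetical; this is not Lustre code) of the recovery pass on MDT2: the resent Op1 carries transno 0 and is skipped by replay, so the replayed Op2, which depends on it, fails.

          #include <stdio.h>
          #include <string.h>

          struct update {
                  const char *op;        /* operation this update belongs to */
                  unsigned long transno; /* 0 => resent request, not a replay */
                  const char *depends;   /* operation it depends on, or NULL */
          };

          /* Has an operation already been applied on this (toy) MDT? */
          static int applied(const char *done[], int n, const char *op)
          {
                  for (int i = 0; i < n; i++)
                          if (strcmp(done[i], op) == 0)
                                  return 1;
                  return 0;
          }

          int main(void)
          {
                  /* What MDT2 sees after the double failover: Op1 was resent
                   * with transno 0, Op2 is a genuine replay from MDT3. */
                  struct update queue[] = {
                          { "Op1", 0,  NULL  }, /* resent: skipped by replay */
                          { "Op2", 42, "Op1" }, /* replayed: depends on Op1 */
                  };
                  const char *done[2];
                  int n = 0;

                  for (int i = 0; i < 2; i++) {
                          if (queue[i].transno == 0) {
                                  printf("%s: transno 0, ignored during recovery\n",
                                         queue[i].op);
                                  continue;
                          }
                          if (queue[i].depends != NULL &&
                              !applied(done, n, queue[i].depends)) {
                                  printf("%s: replay fails, dependency %s was never applied\n",
                                         queue[i].op, queue[i].depends);
                                  continue;
                          }
                          done[n++] = queue[i].op;
                  }
                  return 0;
          }

          With COS, Op1 would already be committed before Op2 started, so the window modeled above would not exist.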

          di.wang Di Wang added a comment -

          Pushed a patch http://review.whamcloud.com/16838, which includes all of the DNE fixes and some fixes for the remote llog (in llog_cat_new_log()) that might otherwise cause recovery failure.

          Please try this one.
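
          Purely as a schematic sketch of the patch subject above ("LU-7564 llog: separate llog creation with initialization"), not the actual Lustre code: if creation and initialization are split into separate, independently retriable steps, a failure between them leaves a created-but-uninitialized object that a later pass can finish, instead of requiring an in-transaction undo of the kind that fails with -ENOTSUPP above. All identifiers below are invented for illustration.

          #include <stdlib.h>

          struct log_obj { int created; int initialized; };

          /* Step 1: create the object only; idempotent and safe to replay. */
          static struct log_obj *log_create(void)
          {
                  struct log_obj *o = calloc(1, sizeof(*o));

                  if (o != NULL)
                          o->created = 1;
                  return o;
          }

          /* Step 2: initialize separately, so a crash between the steps can be
           * recovered by rerunning initialization rather than by undoing the
           * creation inside the same transaction. */
          static int log_init(struct log_obj *o)
          {
                  if (o == NULL || !o->created)
                          return -1;
                  o->initialized = 1;
                  return 0;
          }

          int main(void)
          {
                  struct log_obj *o = log_create(); /* phase 1 */
                  int rc = log_init(o);             /* phase 2, retriable */

                  free(o);
                  return rc;
          }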


          heckes Frank Heckes (Inactive) added a comment -

          The debug mask is set to the default. Please let me know if you would like it set to another, more appropriate value.

          heckes Frank Heckes (Inactive) added a comment -

          Attached the debug logs taken on the two MDSes after the event occurred. To show the timing sequence, here are the
          error messages for each event and node:
          lola-10

          lola-10.log:Dec 17 00:29:51 lola-10 kernel: LustreError: 5454:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524

          lola-11

          lola-11.log:Dec 17 01:22:34 lola-11 kernel: LustreError: 6293:0:(out_handler.c:854:out_tx_end()) soaked-MDT0007-osd: undo for /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6.6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524

          heckes Frank Heckes (Inactive) added a comment -

          The errors happen during normal operation, i.e. at times when no fault was injected.

          People

            Assignee: di.wang Di Wang
            Reporter: heckes Frank Heckes (Inactive)
