  Lustre / LU-7564

(out_handler.c:854:out_tx_end()) ... rc = -524

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Environment: lola
      build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)
    • Severity: 3

    Description

      The error happens during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214)

      Approximately 10% of the batch jobs using simul crash with the 'typical' error message:

      ...
      ...
      03:01:31: Running test #10(iter 42): mkdir, shared mode.
      03:01:31: Running test #10(iter 43): mkdir, shared mode.
      03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
      with errorcode 1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      In: PMI_Abort(1, N/A)
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      [lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      srun: error: lola-32: task 3: Killed
      srun: Terminating job step 393832.0
      srun: error: lola-29: task 2: Killed
      srun: error: lola-27: task 0: Exited with exit code 1
      srun: error: lola-27: task 1: Killed
      

      Each job crash correlates perfectly in time with an event on an MDS:

      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
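
      For reference, the rc values that recur in this ticket decode to kernel errno values: -17 is the standard EEXIST, while -524 corresponds to ENOTSUPP, a kernel-internal errno (defined in include/linux/errno.h) that is not exported to userspace. A minimal decoder sketch; the K_ENOTSUPP constant is hard-coded here because userspace headers do not define it:

      ```c
      /* Decode the rc values seen in these logs.  -17 maps to the
       * standard EEXIST; 524 (ENOTSUPP) is a kernel-internal errno
       * from include/linux/errno.h, invisible to strerror(), so it
       * is handled explicitly. */
      #include <stdio.h>
      #include <string.h>

      #define K_ENOTSUPP 524 /* kernel-internal: operation is not supported */

      static const char *decode_rc(int rc)
      {
          if (-rc == K_ENOTSUPP)
              return "ENOTSUPP (kernel-internal)";
          return strerror(-rc);
      }

      int main(void)
      {
          printf("rc = -524 -> %s\n", decode_rc(-524));
          printf("rc = -17  -> %s\n", decode_rc(-17));
          return 0;
      }
      ```

      Compiled and run, this prints the ENOTSUPP decode for -524 and "File exists" for -17.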
      

      Attachments

        Activity


          jgmitter Joseph Gmitter (Inactive) added a comment -

          Patch has landed for 2.8

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18206/
          Subject: LU-7564 osp: Do not match the lock for OSP
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: beab72b475c6006f53d5cab628cfdbe6dca09b32


          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18206
          Subject: LU-7564 osp: lock remote object exclusively
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 468a6d9ee4854740353cc41c04aa70ab6155e069


          heckes Frank Heckes (Inactive) added a comment -

          The error is present for build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126)

          The test was executed without any fault being injected (i.e. no MDS restart/failover, no OSS failover).

          After remounting the FS, the previously dangling files do not disappear (see list in attached file 'dangling-files-before-restart-build-20160126').
          New dangling files are also left over from job crashes (see list in attached file 'dangling-files-after-restart-build-20160126').
          Each job crash can be correlated with an rc == -524 event and, in the end, with the dangling file(s). E.g.:
          JOB 420993:
          From the job output file:

           
          01/27/2016 04:51:32: Process 5(lola-33.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
          01/27/2016 04:51:32: Process 3(lola-31.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
          01/27/2016 04:51:32: Process 2(lola-30.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
          slurmd[lola-34]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
          srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
          slurmd[lola-32]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
          

          Server error:

          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) error during execution of #8 from /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:503: rc = -17
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) Skipped 4 previous similar messages
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:387: rc = -524
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) Skipped 17 previous similar messages
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899092.15929
          lola-10.log:Jan 27 04:51:33 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899093.15184
          

          Dangling files in the FS:

          [root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/
          ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/dir.mdtest.0.61: No such file or directory
          total 0
          d????????? ? ? ? ?            ? dir.mdtest.0.61
          [root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0
          ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0/dir.mdtest.5.61: No such file or directory
          total 0
          d????????? ? ? ? ?            ? dir.mdtest.5.61
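
          The d????????? lines above are what ls prints when readdir() still returns a directory entry but stat() on the name fails, so every stat-derived field (mode, owner, size, date) is unknown. A minimal illustration of that mechanism (not the actual ls source; the question-mark formatting is imitated, not exact):

          ```c
          /* Sketch of why dangling entries render as "d????????? ? ? ...":
           * readdir() supplies the name, but if stat() fails there is
           * nothing to fill the mode/owner/size columns with. */
          #include <stdio.h>
          #include <dirent.h>
          #include <sys/stat.h>

          static void list_dir(const char *path)
          {
              DIR *d = opendir(path);
              struct dirent *e;
              char buf[4096];

              if (!d)
                  return;
              while ((e = readdir(d)) != NULL) {
                  struct stat st;

                  snprintf(buf, sizeof(buf), "%s/%s", path, e->d_name);
                  if (stat(buf, &st) < 0)
                      /* stat-derived fields unknown: print placeholders */
                      printf("?????????? ? ? ? ? %s  <- stat() failed\n",
                             e->d_name);
                  else
                      printf("mode=%06o size=%lld %s\n",
                             (unsigned)st.st_mode,
                             (long long)st.st_size, e->d_name);
              }
              closedir(d);
          }

          int main(void)
          {
              list_dir(".");
              return 0;
          }
          ```

          On a healthy directory every entry stats cleanly; against the dangling entries above, stat() would fail with ENOENT and the question-mark branch would fire, matching the listing.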
          

          The attached files contain the lists of dangling files and the debug log files mentioned in the MDT error message above.


          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18165
          Subject: LU-7564 llog: separate llog creation with initialization
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 734c40e674eb14055648d26af99aca9de4d0fd4f

          di.wang Di Wang added a comment -

          Another possible reason is that the soak test is doing double MDT failover without COS, which might cause corruption during failover. Consider the following scenario:

          1. Client1 sends operation Op1 to MDT1; MDT1 distributes the updates of Op1 to MDT2 and, after finishing Op1, sends a reply to Client1.
          2. Client2 sends operation Op2 to MDT3, and MDT3 distributes the updates of Op2 to MDT2. (Note: Op2 depends on Op1.)
          3. Before Op1 is committed on MDT1 and MDT2, both MDT1 and MDT2 reboot.
          4. After MDT1 restarts, Client1 resends Op1 to MDT1, and MDT1 distributes the updates of Op1 to MDT2, but with a different xid and a transno of 0.
          5. After MDT2 restarts and recovers, it will of course ignore the updates of Op1 because of their 0 transno; instead it will receive the replayed updates of Op2 from MDT3, which will then of course fail.

          If we had COS here, Op1 would be committed before Op2 starts, which would help a lot. I included the COS patch in 16838; let's see how it goes.
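
          The failure mode in steps 1-5 can be sketched as a toy model (hypothetical names and structures, not Lustre code): during recovery MDT2 skips the resent Op1 because its transno is 0, so the dependent replay of Op2 finds its prerequisite missing:

          ```c
          /* Toy model of the replay ordering problem described above.
           * All names here are invented for illustration only. */
          #include <stdio.h>
          #include <stdbool.h>

          struct update {
              const char *name;
              long long   transno;    /* 0 => resent after restart, not a replay */
              const char *depends_on; /* NULL if independent */
          };

          int main(void)
          {
              /* What MDT2 sees during recovery in step 5: Op1 was resent
               * by MDT1 with transno 0, while Op2 is a genuine replay
               * from MDT3 that depends on Op1 having been applied. */
              struct update queue[] = {
                  { "Op1 (resent by MDT1)",   0,   NULL  },
                  { "Op2 (replayed by MDT3)", 101, "Op1" },
              };
              bool op1_applied = false;

              for (int i = 0; i < 2; i++) {
                  struct update *u = &queue[i];

                  if (u->transno == 0) {
                      /* Recovery ignores non-replay updates. */
                      printf("%s: ignored during recovery (transno 0)\n",
                             u->name);
                      continue;
                  }
                  if (u->depends_on && !op1_applied) {
                      printf("%s: FAILS, dependency %s missing\n",
                             u->name, u->depends_on);
                      continue;
                  }
                  printf("%s: applied\n", u->name);
              }
              return 0;
          }
          ```

          With COS, Op1 would have been forced to commit before Op2 started, so the resend in step 4 (and hence the skipped update) would never occur.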

          di.wang Di Wang added a comment -

          Pushed patch http://review.whamcloud.com/16838, which includes all of the DNE fixes plus some fixes to remote llog handling (in llog_cat_new_log()) that might otherwise cause recovery failure.

          Please try this one.


          heckes Frank Heckes (Inactive) added a comment -

          The debug mask is set to the default. Please let me know if you would like it set to another, more appropriate value.

          heckes Frank Heckes (Inactive) added a comment -

          Attached are the debug logs taken on the two MDSes after the event occurred. For the timing sequence, here are the
          error messages for each event and node:
          lola-10

          lola-10.log:Dec 17 00:29:51 lola-10 kernel: LustreError: 5454:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
          

          lola-11

          lola-11.log:Dec 17 01:22:34 lola-11 kernel: LustreError: 6293:0:(out_handler.c:854:out_tx_end()) soaked-MDT0007-osd: undo for /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6.6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
          

          heckes Frank Heckes (Inactive) added a comment -

          The errors happen during normal operation, i.e. at times when no fault was injected.

          People

            Assignee: di.wang Di Wang
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 8
