during failover test run, mdtest job fails, numerous stat failures 'No such file or directory'

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.13.0
    • None
    • 3

    Description

      While running a failover test (random failover of OSSs plus mdtest), an mdtest job failed with stat failures. This looked similar to LU-11760, except that this time all the files were actually present and valid after the job failure.

      V-1: Entering create_remove_items_helper...
      V-1: Entering unique_dir_access...
      V-1: Entering mdtest_stat...
      08/19/2018 07:15:43: Process 10(nid00265): FAILED in mdtest_stat, unable to stat file: No such file or directory
      08/19/2018 07:15:43: Process 15(nid00279): FAILED in mdtest_stat, unable to stat file: No such file or directory
      Rank 10 [Sun Aug 19 07:15:43 2018] [c1-0c0s4n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 10
      Rank 15 [Sun Aug 19 07:15:43 2018] [c1-0c0s4n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
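
      The failing ranks and nodes can be pulled out of mdtest output like the above with a short filter. This is a minimal sketch, assuming the log lines keep the `Process N(nid): FAILED in mdtest_stat` shape shown here; the function name `failed_stat_ranks` and the log file name in the usage line are hypothetical.

```shell
#!/bin/sh
# Print "rank node" for every mdtest_stat failure found in an mdtest log.
# Assumes lines shaped like the excerpt above, e.g.:
#   08/19/2018 07:15:43: Process 10(nid00265): FAILED in mdtest_stat, ...
# $1 is the path to the saved mdtest output (hypothetical file name).
failed_stat_ranks() {
	sed -n 's/.*Process \([0-9][0-9]*\)(\([^)]*\)): FAILED in mdtest_stat.*/\1 \2/p' "$1" |
		sort -n | uniq
}

# Usage (file name is illustrative):
#   failed_stat_ranks mdtest.out
```

      The resulting node list is a starting point for checking, on each affected client, whether the files really are present after the job has aborted.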

      Details of the failure reason can be found in the patch commit message. I will upload the patch shortly.

        Activity

          [LU-11765] during failover test run, mdtest job fails, numerous stat failures 'No such file or directory'
          spitzcor Cory Spitz added a comment -

          This is resolved with L2.13.0.

          spitzcor Cory Spitz added a comment -

          What work remains here?


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33836/
          Subject: LU-11765 ofd: return EAGAIN during 1st CLEANUP_ORPHAN
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: dc52a88cde1e7cea093b25fc9a15509fe0ac527a

          scherementsev Sergey Cheremencev added a comment -

          Below is an example of how to run replay-ost-single_12 on local virtual machines. Note that OST1 should be on a separate node.

          LOAD_MODULES_REMOTE=true PDSH="/usr/local/bin/pdsh -S -R ssh -w" POWER_UP="" FAILURE_MODE=HARD POWER_DOWN="pdsh -S -R ssh -w dhcppc4 echo c > /proc/sysrq-trigger&" NOFORMAT=yes mds_HOST=dhcppc3 mgs_HOST=dhcppc3 OSTCOUNT=1 ost1_HOST=dhcppc4 ost2_HOST=dhcppc3 ONLY=12 bash /root/src/lustre-wc-rel/lustre/tests/replay-ost-single.sh

          If the modules on the second node are not loaded automatically, apply this test-framework.sh (t-f) patch:

          [root@dhcppc3 tests]# git diff test-framework.sh
          diff --git a/lustre/tests/test-framework.sh b/lustre/tests/test-framework.sh
          index b42bc9c..2c2daae 100755
          --- a/lustre/tests/test-framework.sh
          +++ b/lustre/tests/test-framework.sh
          @@ -1902,7 +1902,7 @@ mount_facet() {
                  local devicelabel
                  local dm_dev=${!dev}
           
          -       module_loaded lustre || load_modules
          +       load_modules
           
                  case $fstype in
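
          The one-line invocation above can be easier to adjust when the settings are spelled out one per line. The sketch below only rearranges the variables from that command. The `run_replay_ost_single_12` function name, the `MDS_NODE`, `OSS_NODE` and `LUSTRE_TESTS` helper variables, and the `DRY_RUN` switch are hypothetical additions, and the host names and paths are the ones from this particular setup.

```shell
#!/bin/sh
# Sketch: run replay-ost-single test 12 with the settings from the command above.
# MDS_NODE, OSS_NODE, LUSTRE_TESTS and DRY_RUN are illustrative helpers, not
# part of the Lustre test framework.
MDS_NODE=${MDS_NODE:-dhcppc3}   # MGS/MDS node (also ost2_HOST in the original command)
OSS_NODE=${OSS_NODE:-dhcppc4}   # separate node hosting OST1 (target of POWER_DOWN)
LUSTRE_TESTS=${LUSTRE_TESTS:-/root/src/lustre-wc-rel/lustre/tests}

run_replay_ost_single_12() {
	# Build the command as positional parameters so quoting is preserved.
	set -- env \
		LOAD_MODULES_REMOTE=true \
		PDSH="/usr/local/bin/pdsh -S -R ssh -w" \
		POWER_UP="" \
		FAILURE_MODE=HARD \
		POWER_DOWN="pdsh -S -R ssh -w $OSS_NODE echo c > /proc/sysrq-trigger&" \
		NOFORMAT=yes \
		mds_HOST="$MDS_NODE" \
		mgs_HOST="$MDS_NODE" \
		OSTCOUNT=1 \
		ost1_HOST="$OSS_NODE" \
		ost2_HOST="$MDS_NODE" \
		ONLY=12 \
		bash "$LUSTRE_TESTS/replay-ost-single.sh"

	if [ -n "${DRY_RUN:-}" ]; then
		printf '%s\n' "$@"   # list the settings, one per line, without running
	else
		"$@"
	fi
}
```

          Calling `DRY_RUN=1 run_replay_ost_single_12` prints the settings for review; calling it without `DRY_RUN` starts the test, which needs the two-node Lustre setup described above.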

          gerrit Gerrit Updater added a comment -

          Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/33836
          Subject: LU-11765 ofd: return EAGAIN during 1st CLEANUP_ORPHAN
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: c13e591a85bfab7df8a341137a160e5690175220

          People

            scherementsev Sergey Cheremencev
            scherementsev Sergey Cheremencev
            Votes: 0
            Watchers: 4