
[LU-10553] d23b.replay-dual: Directory not empty, FAIL: remove sub-test dirs failed

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Severity: 3

    Description

      This issue was created by maloo for Cliff White <cliff.white@intel.com>

      This issue relates to the following test suite run:

      On multiple runs, we see permission errors when cleaning up after the tests; the files shown in the error reports appear to be artifacts from previous (replay-*) test suites.
      Examples:

      == sanity-pfl test complete, duration 777 sec ======================================================== 02:29:35 (1516357775)
      rm: cannot remove '/mnt/lustre/d23b.replay-dual': Directory not empty
      ....
      == sanity-pfl test complete, duration 773 sec ======================================================== 00:38:53 (1516437533)
      rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
       sanity-pfl : @@@@@@ FAIL: remove sub-test dirs failed 
      ...
      == sanity-pfl test complete, duration 773 sec ======================================================== 12:42:04 (1516567324)
      rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
       sanity-pfl : @@@@@@ FAIL: remove sub-test dirs failed 
      ...
      

    Activity

            adilger Andreas Dilger added a comment -

            I don't think the problem is that test-framework.sh is not trying to delete the test directories; rather, this looks like a defect in Lustre/DNE where the directory simply cannot be deleted because it contains a file that, for some reason, is not visible on the client.
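
            For reference, a minimal diagnostic sketch (not from the test logs) of what could be collected on the client when the rm fails, assuming a standard /mnt/lustre mount and the lfs utility; the path is just the example from this report:

                # Illustrative only: gather information about a directory that rm
                # reports as non-empty even though the client shows nothing in it.
                DIR=/mnt/lustre/d23b.replay-dual        # example path from the failure

                ls -la $DIR                             # what the client can actually see
                lfs getdirstripe $DIR                   # which MDT(s) hold the directory
                lfs path2fid $DIR                       # FID, useful for MDT-side debugging
                rm -rf $DIR || echo "rm failed with rc=$?"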

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36747
            Subject: LU-10553 tests: create and cleanup test specific working dir
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b3a63eb54d40d414be2d03303ce63ab4832440cc
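
            The patch itself is on Gerrit; purely as a sketch of the idea (not the actual change), a test that creates its own working directory and guarantees cleanup might look like this in test-framework.sh style, assuming the usual $DIR/$tdir convention and helpers such as error, stack_trap and run_test:

                # Sketch only: per-test working directory that is always removed,
                # so a failed test cannot leave entries behind for later suites.
                test_100() {
                        mkdir -p $DIR/$tdir || error "mkdir $DIR/$tdir failed"
                        stack_trap "rm -rf $DIR/$tdir" EXIT     # cleanup even on failure

                        touch $DIR/$tdir/f100 || error "create $DIR/$tdir/f100 failed"
                        # ... actual test steps would go here ...
                }
                run_test 100 "create and clean up a test-specific working dir (sketch)"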

            adilger Andreas Dilger added a comment -

            I guess we need to update the test suite to reboot (reformat?) in such cases, so at least we only have one test script failing instead of a whole series.
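
            As a rough sketch of that idea (hypothetical, not an existing test-framework.sh option), the end-of-suite cleanup could record the failure so the harness reformats, or at least remounts, before starting the next suite:

                # Hypothetical sketch: if the final cleanup cannot remove the sub-test
                # files, flag the filesystem so the harness reformats (or remounts)
                # before the next suite, instead of letting every later suite fail.
                if ! rm -rf $MOUNT/[df]*.*; then
                        echo "cleanup of $MOUNT failed; marking filesystem for reformat"
                        touch $TMP/need_reformat        # hypothetical flag checked by the harness
                fi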

            yujian Jian Yu added a comment - edited

            After replay-dual test 23b failed as follows:

            CMD: onyx-40vm2 mkdir /mnt/lustre2/d23b.replay-dual/remote_dir
            onyx-40vm2: mkdir: cannot create directory `/mnt/lustre2/d23b.replay-dual/remote_dir': File exists
             replay-dual test_23b: @@@@@@ FAIL: Remote creation failed 1 
            

            The following test suites failed with:

            rm: cannot remove `/mnt/lustre/d23b.replay-dual': Directory not empty
            

            https://testing.hpdd.intel.com/test_sessions/b7b66042-d0a6-4a0a-8c96-d3dc4110333f

            jamesanunez James Nunez (Inactive) added a comment - edited

            I started to open a new ticket before I saw this one. Here is a little more detail on what we see in Maloo for these failed test sessions.

            Lustre test suites fail because “rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted”

            We have many cases where a Lustre test suite has a FAIL status but, when you look at the subtests, all of them PASS. Looking at the end of the suite_log for the failed test suite, you will see an error when the suite tries to clean up the file system. For example (https://testing.hpdd.intel.com/test_sets/14461086-0359-11e8-bd00-52540065bddc), a recent insanity test suite failure:

            == insanity test complete, duration 1255 sec ========================================================= 19:55:20 (1517025320)
            rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
             insanity : @@@@@@ FAIL: remove sub-test dirs failed 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
              = /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()
            

            In the example above, we can’t clean up (rm) the file system because a file from a previous suite remains. Yet, I don’t know why we would get an “Operation not permitted” error when trying to delete a file. When one test suite completes and another starts, there should not be any tasks from previous test suites still running. The solution may be related/similar to the patch for LU-6609: https://review.whamcloud.com/#/c/14843.
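
            One quick check that would rule out an obvious cause is whether the file picked up the immutable attribute, since unlinking an immutable file also fails with “Operation not permitted”. Illustrative only, assuming lsattr/chattr work on the client mount:

                # Illustrative check: EPERM on unlink is often the immutable flag
                # rather than ordinary permissions or ownership.
                F=/mnt/lustre/f4h.replay-vbr
                ls -l $F
                lsattr $F       # look for the 'i' (immutable) flag
                # chattr -i $F  # would clear it, if that turns out to be the cause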

            In the same test session referenced above, sanity-quota cannot clean up the file system, and sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey fail due to the f4h.replay-vbr file.

            Looking at the replay-vbr results, we see that replay-vbr test 4h did fail and, looking at the replay-vbr suite_log, we see that replay-vbr couldn’t remove that file:

            == replay-vbr test complete, duration 890 sec ======================================================== 19:34:23 (1517024063)
            replay-vbr: FAIL: test_1b trevis-7vm9 not evicted
            replay-vbr: FAIL: test_2b trevis-7vm9 not evicted
            replay-vbr: FAIL: test_3b trevis-7vm9 not evicted
            replay-vbr: FAIL: test_4c trevis-7vm9 not evicted
            replay-vbr: FAIL: test_4d trevis-7vm9 not evicted
            replay-vbr: FAIL: test_4f trevis-7vm9 not evicted
            replay-vbr: FAIL: test_4h trevis-7vm9 not evicted
            replay-vbr: FAIL: test_5b trevis-7vm9 not evicted
            replay-vbr: FAIL: test_5c trevis-7vm9 not evicted
            replay-vbr: FAIL: test_6c trevis-7vm9 not evicted
            replay-vbr: FAIL: test_6d trevis-7vm9 not evicted
            replay-vbr: FAIL: test_7a Test 7a.1 failed
            replay-vbr: FAIL: test_7b Test 7b.1 failed
            replay-vbr: FAIL: test_7c Test 7c.1 failed
            replay-vbr: FAIL: test_7e Test 7e.1 failed
            replay-vbr: FAIL: test_7f Test 7f.1 failed
            replay-vbr: FAIL: test_7h Test 7h.1 failed
            replay-vbr: FAIL: test_7i Test 7i.1 failed
            replay-vbr: FAIL: test_10b trevis-7vm9:/mnt/lustre not evicted
            replay-vbr: FAIL: test_12a test_12a failed with 4
            rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
             replay-vbr : @@@@@@ FAIL: remove sub-test dirs failed 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
              = /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()
            

            Lustre test suites fail because “rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty”

            We have many cases where a Lustre test suite FAILs but, when you look at the subtests, all of them PASS. Looking at the end of the suite_log for the failed test suite, you will see an error when the suite tries to clean up the file system. For example (https://testing.hpdd.intel.com/test_sets/294da78c-0363-11e8-a10a-52540065bddc), a recent insanity test suite failure:

            == insanity test complete, duration 2289 sec ========================================================= 20:59:08 (1517029148)
            rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty
             insanity : @@@@@@ FAIL: remove sub-test dirs failed 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
              = /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()
            

            Looking at the replay-single results, test 81d did fail and we get the same error message when trying to clean up the file system at the end of the test suite:

            == replay-single test complete, duration 7320 sec ==================================================== 20:20:35 (1517026835)
            replay-single: FAIL: test_0c File exists and it shouldn't
            replay-single: FAIL: test_44c unliked after fail abort
            replay-single: FAIL: test_80d /usr/bin/lfs getstripe -M /mnt/lustre/d80d.replay-single/remote_dir failed
            replay-single: FAIL: test_81d rmdir failed
            replay-single: FAIL: test_120 dir-0 still exists
            rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty
             replay-single : @@@@@@ FAIL: remove sub-test dirs failed 
            

            In the same test session referenced above, recovery-small, replay-ost-single, replay-dual, replay-vbr, sanity-quota, sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey are all unable to remove the f4h.replay-vbr file, and some of those tests fail solely due to this.


            adilger Andreas Dilger added a comment -

            Typically, scripts like sanity.sh will clean up test files at the start to avoid issues like this. Also, most test scripts should only be accessing files that they created, so there may be some cleanup work needed in these sanity-pfl tests.
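
            As an illustration of that approach (a sketch, not a quote from sanity.sh), a suite can remove leftovers from earlier suites before running its own tests, and report anything it cannot delete up front:

                # Sketch: clean up stale [df]*.<suite> entries from earlier suites at
                # startup, so they cannot fail this suite's end-of-run cleanup instead.
                leftovers=$(ls -d $MOUNT/[df]*.* 2>/dev/null)
                if [ -n "$leftovers" ]; then
                        echo "removing leftover test files: $leftovers"
                        rm -rf $leftovers || echo "WARNING: could not remove some leftover files"
                fi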

            People

              Assignee: WC Triage
              Reporter: Maloo
              Votes: 0
              Watchers: 6
