[LU-10553] d23b.replay-dual: Directory not empty, FAIL: remove sub-test dirs failed Created: 23/Jan/18 Updated: 27/Mar/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | easy, tests | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Cliff White <cliff.white@intel.com>

This issue relates to the following test suite run:

On multiple runs, we see errors when cleaning up after the test. The files named in the error report appear to be artifacts from previous (replay-*) tests.

== sanity-pfl test complete, duration 777 sec ======================================================== 02:29:35 (1516357775)
rm: cannot remove '/mnt/lustre/d23b.replay-dual': Directory not empty
....
== sanity-pfl test complete, duration 773 sec ======================================================== 00:38:53 (1516437533)
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
sanity-pfl : @@@@@@ FAIL: remove sub-test dirs failed
...
== sanity-pfl test complete, duration 773 sec ======================================================== 12:42:04 (1516567324)
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
sanity-pfl : @@@@@@ FAIL: remove sub-test dirs failed
... |
| Comments |
| Comment by Andreas Dilger [ 24/Jan/18 ] |
|
Typically, scripts like sanity.sh will clean up test files at the start to avoid issues like this. Also, most test scripts should only be accessing files that they created, so there may be some cleanup work needed in these sanity-pfl tests. |
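As a rough illustration of the "clean up at the start" idea, the sketch below lists entries left behind by other suites. It assumes the naming pattern visible in the logs of this ticket (`d23b.replay-dual`, `f4h.replay-vbr`), where the suite name follows the first dot. The function name `list_foreign_leftovers` is hypothetical and is not part of test-framework.sh.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of test-framework.sh.
# Prints the names of entries directly under DIR that look like per-test
# files or directories ("f4h.replay-vbr", "d23b.replay-dual") and whose
# suite suffix differs from the suite that is about to run.
list_foreign_leftovers() {
    local dir=$1 suite=$2 entry name
    for entry in "$dir"/[df]*.*; do
        [ -e "$entry" ] || continue        # the glob matched nothing
        name=${entry##*/}                  # strip the directory part
        # Assumption: the suite name is everything after the first dot.
        [ "${name#*.}" = "$suite" ] && continue
        echo "$name"
    done
}
```

A suite could call something like `list_foreign_leftovers /mnt/lustre sanity-pfl` before its first subtest and then decide whether to remove the listed entries or simply report them.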
| Comment by James Nunez (Inactive) [ 29/Jan/18 ] |
|
I started to open a new ticket until I saw this one. Here is a little more detail on what we see in Maloo for these failed test sessions.

Lustre test suites fail because “rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted”

We have many cases of a Lustre test suite having a FAIL status even though, when you look at the subtests, all of them PASS. At the end of the suite_log for the failed test suite, you will see an error when the suite tries to clean up the file system. For example (https://testing.hpdd.intel.com/test_sets/14461086-0359-11e8-bd00-52540065bddc), a recent insanity test suite failure:

== insanity test complete, duration 1255 sec ========================================================= 19:55:20 (1517025320)
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
insanity : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5336:error()
= /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()

In the example above, we can’t clean up (rm) the files in the file system because a file remains. Yet, I don’t know why we would get an “Operation not permitted” when trying to delete a file. When one test suite completes and another starts, there should not be any tasks running from previous test suites. The solution may be related/similar to the patch for

In the same test session referenced above, sanity-quota cannot clean up the file system, and sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey fail due to the f4h.replay-vbr file.
Looking at the replay-vbr results, we see that replay-vbr test 4h did fail and, looking at the replay-vbr suite_log, we see that replay-vbr couldn’t remove that file:

== replay-vbr test complete, duration 890 sec ======================================================== 19:34:23 (1517024063)
replay-vbr: FAIL: test_1b trevis-7vm9 not evicted
replay-vbr: FAIL: test_2b trevis-7vm9 not evicted
replay-vbr: FAIL: test_3b trevis-7vm9 not evicted
replay-vbr: FAIL: test_4c trevis-7vm9 not evicted
replay-vbr: FAIL: test_4d trevis-7vm9 not evicted
replay-vbr: FAIL: test_4f trevis-7vm9 not evicted
replay-vbr: FAIL: test_4h trevis-7vm9 not evicted
replay-vbr: FAIL: test_5b trevis-7vm9 not evicted
replay-vbr: FAIL: test_5c trevis-7vm9 not evicted
replay-vbr: FAIL: test_6c trevis-7vm9 not evicted
replay-vbr: FAIL: test_6d trevis-7vm9 not evicted
replay-vbr: FAIL: test_7a Test 7a.1 failed
replay-vbr: FAIL: test_7b Test 7b.1 failed
replay-vbr: FAIL: test_7c Test 7c.1 failed
replay-vbr: FAIL: test_7e Test 7e.1 failed
replay-vbr: FAIL: test_7f Test 7f.1 failed
replay-vbr: FAIL: test_7h Test 7h.1 failed
replay-vbr: FAIL: test_7i Test 7i.1 failed
replay-vbr: FAIL: test_10b trevis-7vm9:/mnt/lustre not evicted
replay-vbr: FAIL: test_12a test_12a failed with 4
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
replay-vbr : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5336:error()
= /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()

Lustre test suites fail because “rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty”

We have many cases of a Lustre test suite failing even though, when you look at the subtests, all of them PASS. At the end of the suite_log for the failed test suite, you will see an error when the suite tries to clean up the file system.
For example (https://testing.hpdd.intel.com/test_sets/294da78c-0363-11e8-a10a-52540065bddc), a recent insanity test suite failure:

== insanity test complete, duration 2289 sec ========================================================= 20:59:08 (1517029148)
rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty
insanity : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5336:error()
= /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()

Looking at replay-single, test 81d does fail and we get the same error message when trying to clean up the file system at the end of the test suite:

== replay-single test complete, duration 7320 sec ==================================================== 20:20:35 (1517026835)
replay-single: FAIL: test_0c File exists and it shouldn't
replay-single: FAIL: test_44c unliked after fail abort
replay-single: FAIL: test_80d /usr/bin/lfs getstripe -M /mnt/lustre/d80d.replay-single/remote_dir failed
replay-single: FAIL: test_81d rmdir failed
replay-single: FAIL: test_120 dir-0 still exists
rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty
replay-single : @@@@@@ FAIL: remove sub-test dirs failed

In the same test session referenced above, recovery-small, replay-ost-single, replay-dual, replay-vbr, sanity-quota, sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey all are unable to remove the f4h.replay-vbr file and some of those tests fail solely due to this. |
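The manual tracing above (from the `rm: cannot remove` line back to the suite and subtest that created the entry) can be sketched as a small log filter. This is only an illustration: `leftover_origin` is a hypothetical name, and the mapping from file name to subtest assumes the `f<test>.<suite>` / `d<test>.<suite>` pattern seen in the logs quoted in this ticket.

```shell
#!/bin/bash
# Hypothetical helper: reads a suite_log on stdin and, for every
# "rm: cannot remove '<path>': <reason>" line, prints the suite and
# subtest that the leftover name points to.
leftover_origin() {
    # Accept both quoting styles seen in the logs: '...' and `...'
    sed -n "s/^.*rm: cannot remove [\`']\([^']*\)': \(.*\)\$/\1|\2/p" |
    while IFS='|' read -r path reason; do
        name=${path##*/}           # e.g. f4h.replay-vbr
        suite=${name#*.}           # e.g. replay-vbr
        test_id=${name%%.*}        # e.g. f4h
        test_id=${test_id#[df]}    # e.g. 4h (drop the f/d prefix)
        echo "$suite test_$test_id: $reason"
    done
}

# Example:
#   echo "rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted" | leftover_origin
#   prints: replay-vbr test_4h: Operation not permitted
```

Running this over the suite_log of each failed suite in a session would show at a glance that the failures share one origin rather than being independent bugs.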
| Comment by Jian Yu [ 06/Feb/18 ] |
|
After replay-dual test 23b failed as follows:

CMD: onyx-40vm2 mkdir /mnt/lustre2/d23b.replay-dual/remote_dir
onyx-40vm2: mkdir: cannot create directory `/mnt/lustre2/d23b.replay-dual/remote_dir': File exists
replay-dual test_23b: @@@@@@ FAIL: Remote creation failed 1

the following test suites failed with:

rm: cannot remove `/mnt/lustre/d23b.replay-dual': Directory not empty

https://testing.hpdd.intel.com/test_sessions/b7b66042-d0a6-4a0a-8c96-d3dc4110333f |
| Comment by Andreas Dilger [ 15/Sep/18 ] |
|
I guess we need to update the test suite to reboot (reformat?) in such cases, so at least we only have one test script failing instead of a whole series. |
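One possible shape for that idea, as a sketch only: run the suite, attempt cleanup, and reset the file system if cleanup fails so that the next suite starts clean. The three commands are passed in as arguments, and all names here (`run_with_reset` and its parameters) are hypothetical. What the reset step would actually do (reboot, reformat, remount) is left to the caller.

```shell
#!/bin/bash
# Hypothetical wrapper, NOT part of test-framework.sh.
# Usage: run_with_reset RUN_CMD CLEANUP_CMD RESET_CMD
# Returns the suite's own status, 1 if only the cleanup failed,
# or 2 if the reset itself failed.
run_with_reset() {
    local run_cmd=$1 cleanup_cmd=$2 reset_cmd=$3 rc=0
    "$run_cmd" || rc=$?
    if ! "$cleanup_cmd"; then
        echo "cleanup failed, resetting file system" >&2
        "$reset_cmd" || return 2
        # Still report a failure for this suite, but only this one.
        [ "$rc" -eq 0 ] && rc=1
    fi
    return "$rc"
}
```

The point of the design is containment: the suite that left the file system dirty is still marked FAIL, but the suites that follow it no longer inherit the leftover entry.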
| Comment by Gerrit Updater [ 13/Nov/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36747 |
| Comment by Andreas Dilger [ 14/Nov/19 ] |
|
I don't think the problem is that test-framework.sh is not trying to delete the test directories. Rather, it looks like a defect in Lustre/DNE where the directory simply cannot be deleted because it contains a file that, for some reason, is not visible on the client. |
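To tell this case apart from an ordinary leftover file, a check along the following lines might help. It is a sketch under the assumption that the symptom is "the client lists no entries, yet rmdir still fails". The function name `classify_undeletable_dir` is hypothetical, and note that it does remove the directory when removal succeeds.

```shell
#!/bin/bash
# Hypothetical diagnostic, NOT part of test-framework.sh.
# Prints one of:
#   not-a-directory        the path is missing or not a directory
#   has-visible-entries    the client can list at least one entry
#   removed                the directory was empty and rmdir succeeded
#   empty-but-undeletable  nothing is listed, yet rmdir fails
classify_undeletable_dir() {
    local dir=$1
    if [ ! -d "$dir" ]; then
        echo "not-a-directory"
        return 1
    fi
    if [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "has-visible-entries"
    elif rmdir "$dir" 2>/dev/null; then
        echo "removed"
    else
        echo "empty-but-undeletable"
    fi
}
```

An `empty-but-undeletable` result for something like `/mnt/lustre/d23b.replay-dual` would be consistent with the explanation above, although it would not by itself prove where the hidden entry lives.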