[LU-10553] d23b.replay-dual: Directory not empty, FAIL: remove sub-test dirs failed Created: 23/Jan/18  Updated: 27/Mar/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: easy, tests

Issue Links:
Duplicate
is duplicated by LU-15710 runtests: FAIL: remove sub-test dirs ... Resolved
Related
is related to LU-9827 sanityn fail “remove sub-test dirs fa... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Cliff White <cliff.white@intel.com>

This issue relates to the following test suite run:

On multiple runs, we see errors ("Directory not empty" or "Operation not permitted") when cleaning up after the test. The files named in the error reports appear to be artifacts from previous (replay-*) tests.
Examples:

== sanity-pfl test complete, duration 777 sec ======================================================== 02:29:35 (1516357775)
rm: cannot remove '/mnt/lustre/d23b.replay-dual': Directory not empty
....
== sanity-pfl test complete, duration 773 sec ======================================================== 00:38:53 (1516437533)
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
 sanity-pfl : @@@@@@ FAIL: remove sub-test dirs failed 
...
== sanity-pfl test complete, duration 773 sec ======================================================== 12:42:04 (1516567324)
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
 sanity-pfl : @@@@@@ FAIL: remove sub-test dirs failed 
...


 Comments   
Comment by Andreas Dilger [ 24/Jan/18 ]

Typically, scripts like sanity.sh will clean up test files at the start to avoid issues like this. Also, most test scripts should only be accessing files that they created, so there may be some cleanup work needed in these sanity-pfl tests.
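
As a rough illustration of the start-of-suite cleanup described above, here is a minimal bash sketch. The function name cleanup_stale_test_files is hypothetical (it is not taken from test-framework.sh), and the glob assumes leftovers follow the d<test>.<suite> / f<test>.<suite> naming seen in this ticket (for example d23b.replay-dual and f4h.replay-vbr).

```shell
# Hypothetical helper, not part of test-framework.sh.
# Removes leftover d<test>.<suite> / f<test>.<suite> entries under a mount
# point before a suite starts and reports anything that cannot be removed.
cleanup_stale_test_files() {
    local mnt=$1
    local rc=0
    local entry

    # Assumed naming: "d" or "f", a digit, then ".<suite>".
    for entry in "$mnt"/[df][0-9]*.*; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        if ! rm -rf -- "$entry" 2>/dev/null; then
            echo "stale test entry could not be removed: $entry" >&2
            rc=1
        fi
    done
    return $rc
}
```

A non-zero return here would flag the stale entry before the new suite runs, rather than at the end of an otherwise passing suite.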

Comment by James Nunez (Inactive) [ 29/Jan/18 ]

I started to open a new ticket until I saw this ticket. Here is just a little more detail on what we see in Maloo for these failed test sessions.

Lustre test suites fail because “rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted”

We have many cases of a Lustre test suite having a FAIL status even though, when you look at the subtests, all of them PASS. At the end of the suite_log for the failed test suite, you will see an error from when the suite tries to clean up the file system. For example (https://testing.hpdd.intel.com/test_sets/14461086-0359-11e8-bd00-52540065bddc), a recent insanity test suite failure:

== insanity test complete, duration 1255 sec ========================================================= 19:55:20 (1517025320)
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
 insanity : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()

In the example above, we can’t clean up (rm) the files in the file system because a file remains. Yet, I don’t know why we would get an “Operation not permitted” when trying to delete a file. When one test suite completes and another starts, there should not be any tasks running from previous test suites. The solution may be related/similar to the patch for LU-6609; https://review.whamcloud.com/#/c/14843.

In the same test session referenced above, sanity-quota cannot clean up the file system, and sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey fail due to the f4h.replay-vbr file.

Looking at the replay-vbr results, we see that replay-vbr test 4h did fail and, looking at the replay-vbr suite_log, we see that replay-vbr couldn’t remove that file:

== replay-vbr test complete, duration 890 sec ======================================================== 19:34:23 (1517024063)
replay-vbr: FAIL: test_1b trevis-7vm9 not evicted
replay-vbr: FAIL: test_2b trevis-7vm9 not evicted
replay-vbr: FAIL: test_3b trevis-7vm9 not evicted
replay-vbr: FAIL: test_4c trevis-7vm9 not evicted
replay-vbr: FAIL: test_4d trevis-7vm9 not evicted
replay-vbr: FAIL: test_4f trevis-7vm9 not evicted
replay-vbr: FAIL: test_4h trevis-7vm9 not evicted
replay-vbr: FAIL: test_5b trevis-7vm9 not evicted
replay-vbr: FAIL: test_5c trevis-7vm9 not evicted
replay-vbr: FAIL: test_6c trevis-7vm9 not evicted
replay-vbr: FAIL: test_6d trevis-7vm9 not evicted
replay-vbr: FAIL: test_7a Test 7a.1 failed
replay-vbr: FAIL: test_7b Test 7b.1 failed
replay-vbr: FAIL: test_7c Test 7c.1 failed
replay-vbr: FAIL: test_7e Test 7e.1 failed
replay-vbr: FAIL: test_7f Test 7f.1 failed
replay-vbr: FAIL: test_7h Test 7h.1 failed
replay-vbr: FAIL: test_7i Test 7i.1 failed
replay-vbr: FAIL: test_10b trevis-7vm9:/mnt/lustre not evicted
replay-vbr: FAIL: test_12a test_12a failed with 4
rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
 replay-vbr : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()

Lustre test suites fail because “rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty”

We have many cases of a Lustre test suite failing even though, when you look at the subtests, all of them PASS. At the end of the suite_log for the failed test suite, you will see an error from when the suite tries to clean up the file system. For example (https://testing.hpdd.intel.com/test_sets/294da78c-0363-11e8-a10a-52540065bddc), a recent insanity test suite failure:

== insanity test complete, duration 2289 sec ========================================================= 20:59:08 (1517029148)
rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty
 insanity : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4830:check_and_cleanup_lustre()

Looking at replay-single, test 81d does fail and we get the same error message when trying to clean up the file system at the end of the test suite:

== replay-single test complete, duration 7320 sec ==================================================== 20:20:35 (1517026835)
replay-single: FAIL: test_0c File exists and it shouldn't
replay-single: FAIL: test_44c unliked after fail abort
replay-single: FAIL: test_80d /usr/bin/lfs getstripe -M /mnt/lustre/d80d.replay-single/remote_dir failed
replay-single: FAIL: test_81d rmdir failed
replay-single: FAIL: test_120 dir-0 still exists
rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty
 replay-single : @@@@@@ FAIL: remove sub-test dirs failed 

In the same test session referenced above, recovery-small, replay-ost-single, replay-dual, replay-vbr, sanity-quota, sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey all are unable to remove the f4h.replay-vbr file and some of those tests fail solely due to this.
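
The leftover names in these logs point back at the suite and subtest that created them: f4h.replay-vbr for replay-vbr test 4h, and d81d.replay-single for replay-single test 81d. A minimal bash sketch of that triage step follows, assuming the d<test>.<suite> / f<test>.<suite> naming holds in general. Both function names are hypothetical and not part of Maloo or test-framework.sh.

```shell
# Hypothetical triage helpers, not part of Maloo or test-framework.sh.

# Read a suite_log on stdin and print each path that rm failed to remove.
# The "." before and after the path matches either quote style seen in the
# logs ('path' or `path').
leftovers_in_log() {
    sed -n 's/^rm: cannot remove .\(.*\).: .*$/\1/p' | sort -u
}

# Map a leftover path to the suite and subtest that presumably created it.
# leftover_origin /mnt/lustre/f4h.replay-vbr   prints "replay-vbr test_4h"
leftover_origin() {
    local name=${1##*/}        # strip the directory part
    local suite=${name#*.}     # text after the first dot
    local tag=${name%%.*}      # text before the first dot, e.g. f4h
    local subtest=${tag#[df]}  # drop the leading d or f
    echo "$suite test_$subtest"
}
```

Feeding a failed suite_log through leftovers_in_log and then leftover_origin would name the earlier suite to inspect, instead of the suite that merely tripped over the leftover.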

Comment by Jian Yu [ 06/Feb/18 ]

After replay-dual test 23b failed as follows:

CMD: onyx-40vm2 mkdir /mnt/lustre2/d23b.replay-dual/remote_dir
onyx-40vm2: mkdir: cannot create directory `/mnt/lustre2/d23b.replay-dual/remote_dir': File exists
 replay-dual test_23b: @@@@@@ FAIL: Remote creation failed 1 

The following test suites failed with:

rm: cannot remove `/mnt/lustre/d23b.replay-dual': Directory not empty

https://testing.hpdd.intel.com/test_sessions/b7b66042-d0a6-4a0a-8c96-d3dc4110333f

Comment by Andreas Dilger [ 15/Sep/18 ]

I guess we need to update the test suite to reboot (reformat?) in such cases, so at least we only have one test script failing instead of a whole series.
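
One way to picture this suggestion is a small wrapper around the end-of-suite cleanup. This is only a sketch: cleanup_or_rebuild and both of its arguments are hypothetical names, and the actual reboot or reformat step is left to whatever command the caller passes in.

```shell
# Hypothetical wrapper; the names are illustrative, not test-framework.sh API.
# $1: command that cleans up the file system after a suite
# $2: command that rebuilds (reboots/reformats) the file system
cleanup_or_rebuild() {
    local cleanup_cmd=$1
    local rebuild_cmd=$2

    if "$cleanup_cmd"; then
        return 0
    fi
    echo "cleanup failed, rebuilding file system before the next suite" >&2
    "$rebuild_cmd" || return 2   # rebuild itself failed
    return 1                     # the current suite still reports the failure
}
```

With this shape, the suite that left the file system dirty still fails, but later suites start from a rebuilt file system.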

Comment by Gerrit Updater [ 13/Nov/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36747
Subject: LU-10553 tests: create and cleanup test specific working dir
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b3a63eb54d40d414be2d03303ce63ab4832440cc
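
The patch contents are not reproduced in this ticket. As a general illustration of the idea named in the subject line (a working directory created for one test and removed afterwards), a bash sketch might look like the following. The function name with_test_workdir and the testdir.XXXXXX template are assumptions for illustration, not taken from the patch.

```shell
# Hypothetical helper illustrating a per-test working directory.
# $1: base directory (e.g. the client mount point); remaining args: test body.
with_test_workdir() {
    local base=$1
    shift
    local workdir rc

    workdir=$(mktemp -d "$base/testdir.XXXXXX") || return 1
    # Run the test body in a subshell so the caller's cwd is unchanged.
    ( cd "$workdir" && "$@" )
    rc=$?
    if ! rm -rf -- "$workdir"; then
        echo "cannot remove $workdir" >&2
        return 1
    fi
    return $rc
}
```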

Comment by Andreas Dilger [ 14/Nov/19 ]

I don't think the problem is that test-framework.sh is not trying to delete the test directories. Rather, it looks like a defect in Lustre/DNE where the directory simply cannot be deleted because it has a file in it that is not visible on the client for some reason.

Generated at Sat Feb 10 02:36:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.