I started to open a new ticket until I saw this ticket. Here is just a little more detail on what we see in Maloo for these failed test sessions.
Lustre test suites fail because “rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted”
We have many cases of a Lustre test suite with a FAIL status where, when you look at the subtests, every one of them has a PASS status. At the end of the suite_log for the failed test suite, you will see an error from when the suite tries to clean up the file system. For example (https://testing.hpdd.intel.com/test_sets/14461086-0359-11e8-bd00-52540065bddc), a recent insanity test suite failure:
In the example above, we can’t clean up (rm) the files in the file system because a file remains. Yet, I don’t know why we would get an “Operation not permitted” when trying to delete a file. When one test suite completes and another starts, there should not be any tasks running from previous test suites. The solution may be related/similar to the patch for LU-6609; https://review.whamcloud.com/#/c/14843.
In the same test session referenced above, sanity-quota cannot clean up the file system, and sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey fail due to the f4h.replay-vbr file.
Looking at the replay-vbr results, we see that replay-vbr test 4h did fail and, looking at the replay-vbr suite_log, we see that replay-vbr couldn’t remove that file:
Lustre test suites fail because “rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty”
We have many cases of a Lustre test suite with a FAIL status where, when you look at the subtests, every one of them has a PASS status. At the end of the suite_log for the failed test suite, you will see an error from when the suite tries to clean up the file system. For example (https://testing.hpdd.intel.com/test_sets/294da78c-0363-11e8-a10a-52540065bddc), a recent insanity test suite failure:
Looking at replay-single, test 81d does fail and we get the same error message when trying to clean up the file system at the end of the test suite:
In the same test session referenced above, recovery-small, replay-ost-single, replay-dual, replay-vbr, sanity-quota, sanity-pfl, lustre-rsync-test, metadata-updates, ost-pools, mds-survey, performance-sanity, parallel-scale, large-scale, and obdfilter-survey are all unable to remove the f4h.replay-vbr file, and some of those suites fail solely because of this.
I don't think the problem is that test-framework.sh fails to try to delete the test directories. It looks more like a defect in Lustre/DNE: the directory cannot be deleted because it contains a file that, for some reason, is not visible on the client.
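For anyone triaging similar sessions, here is a rough sketch of how one might scan the text of a suite_log for these cleanup errors. It is not part of test-framework.sh or Maloo; the function name find_cleanup_failures is hypothetical, and the only thing it assumes about the log is that the rm errors appear in the same form as the two messages quoted in this comment.

```python
import re

# Matches rm failures in the form quoted in this ticket, e.g.
#   rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted
RM_ERROR = re.compile(r"rm: cannot remove '(?P<path>[^']+)': (?P<reason>.+)")


def find_cleanup_failures(log_text):
    """Return a (path, reason) pair for every rm failure found in log_text.

    Hypothetical helper for illustration only.
    """
    return [
        (m.group("path"), m.group("reason").strip())
        for m in RM_ERROR.finditer(log_text)
    ]


# Sample input built from the two error messages quoted in this ticket.
sample = (
    "rm: cannot remove '/mnt/lustre/f4h.replay-vbr': Operation not permitted\n"
    "rm: cannot remove '/mnt/lustre/d81d.replay-single': Directory not empty\n"
)

for path, reason in find_cleanup_failures(sample):
    print(f"{path}: {reason}")
```

Running this over the suite_log of each suite in a session should show which leftover file or directory is causing the failures and which earlier suite it came from, since the suite name is part of the file name.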