[LU-5810] sanity: rm: cannot remove `/mnt/lustre/d0.tar-shadow-23vm5/etc/init.d/rc3.d': Directory not empty Created: 27/Oct/14  Updated: 11/Sep/20  Resolved: 11/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-5064 sanity-scrub test_13: ls should fail Resolved
Severity: 3
Rank (Obsolete): 16296

 Description   

This issue was created by maloo for John Hammond <john.hammond@intel.com>

-----============= acceptance-small: sanity ============----- Sat Oct 25 17:04:33 UTC 2014
Running: bash /usr/lib64/lustre/tests/sanity.sh
== sanity test complete, duration -o sec == 17:04:34 (1414256674)
CMD: shadow-23vm10.shadow.whamcloud.com,shadow-23vm9 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh check_config_client /mnt/lustre 
shadow-23vm9: Checking config lustre mounted on /mnt/lustre
shadow-23vm10: Checking config lustre mounted on /mnt/lustre
Checking servers environments
CMD: shadow-23vm11 running=\$(grep -c /mnt/ost1' ' /proc/mounts);
mpts=\$(mount | grep -c /mnt/ost1' ');
if [ \$running -ne \$mpts ]; then
    echo \$(hostname) env are INSANE!;
    exit 1;
fi
...
CMD: shadow-23vm12 lctl get_param -n timeout
Using TIMEOUT=20
CMD: shadow-23vm12 lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
CMD: shadow-23vm10.shadow.whamcloud.com lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
disable quota as required
CMD: shadow-23vm11,shadow-23vm12,shadow-23vm8,shadow-23vm9 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/bin:/bin:/usr/sbin:/sbin::/sbin:/sbin:/bin:/usr/sbin: NAME=autotest_config sh rpc.sh set_default_debug \"vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck\" \"all -lnet -lnd -pinger\" 4 
CMD: shadow-23vm11,shadow-23vm12,shadow-23vm8 /usr/sbin/lctl set_param 				 osd-ldiskfs.track_declares_assert=1 || true
osd-ldiskfs.track_declares_assert=1
osd-ldiskfs.track_declares_assert=1
osd-ldiskfs.track_declares_assert=1
rm: cannot remove `/mnt/lustre/d0.tar-shadow-23vm5/etc/init.d/rc3.d': Directory not empty
status        script            Total(sec) E(xcluded) S(low) 
------------------------------------------------------------------------------------
test-framework exiting on error

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6ad49578-5c8e-11e4-b08a-5254006e85c2.



 Comments   
Comment by Andreas Dilger [ 28/Oct/14 ]

It is strange that there is a non-empty shadow-23vm5 directory when, according to the config for this test session, the nodes involved are shadow-23vm[8-12]. This implies that the shadow-23vm5 node mounted the wrong filesystem for some reason and proceeded to write there.
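
One way to test the cross-mount theory would be to ask every node, including the suspect, whether it currently has the shared filesystem mounted. A minimal sketch, assuming passwordless ssh and the hostnames from this session; this is an illustration, not part of the test framework:

for host in shadow-23vm5 shadow-23vm8 shadow-23vm9 \
            shadow-23vm10 shadow-23vm11 shadow-23vm12; do
    # A node that reports a mount here but is not part of this test
    # session is the likely writer of the stray directory.
    if ssh "$host" "grep -q ' /mnt/lustre ' /proc/mounts"; then
        echo "$host has /mnt/lustre mounted"
    fi
done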

Comment by Andreas Dilger [ 28/Oct/14 ]

A similar configuration problem appeared in LU-5064 and LU-5076, and possibly others, where nodes that were not part of the test configuration were accessing the filesystem.

They have been marked duplicates of TEI-1993.

Comment by Minh Diep [ 28/Oct/14 ]

After researching this, I doubt this is a problem where we cross-mounted.
The file /mnt/lustre/d0.tar-$hostname comes from the script run_tar.sh, which is only called from recovery-[mds|double|random]-scale.sh. I have checked reports from around the time of this failure (up to a few days) and cannot find any shadow-23vm5 report running recovery-[mds|double|random]-scale tests. It's possible that someone, or some script, accidentally logged in to shadow-23vm5, ran the test, and left the directory behind.
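
For context, the naming convention implied by the error message ties the directory to its creator. The following is only a hypothetical sketch of that convention, not the verbatim run_tar.sh; TESTDIR and TAR_DIR are illustrative names:

# The working directory embeds the client's hostname, so
# d0.tar-shadow-23vm5 can only have been created by a process
# running on shadow-23vm5.
TESTDIR=${TESTDIR:-/mnt/lustre}          # assumed mount point
TAR_DIR=$TESTDIR/d0.tar-$(hostname)
mkdir -p "$TAR_DIR" && cd "$TAR_DIR" || exit 1
while true; do
    tar cf - /etc 2>/dev/null | tar xf -   # copy /etc into Lustre
    rm -rf etc                             # then delete it, repeatedly
done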

Comment by Andreas Dilger [ 28/Oct/14 ]

Can you please check whether shadow-23vm5 is reserved for some user job, or whether some stuck or forgotten process is still running there? That also happened with LU-5064.

Comment by Minh Diep [ 29/Oct/14 ]

shadow-23vm5 has always been in the autotest pool. Around the time of the failure, shadow-23vm5 wasn't running recovery-mds-scale.

Comment by James Nunez (Inactive) [ 19/Dec/14 ]

I've experienced a similar problem on the OpenSFS cluster; here the test framework can't remove a directory left by a previous test suite, rather than by another node/VM. If you think this is a different problem, I can open a new ticket for it.

Results are at https://testing.hpdd.intel.com/test_sessions/f13ba544-8618-11e4-ac52-5254006e85c2

replay-dual had several tests fail, including 22a and 22c. When replay-vbr starts up, no tests run because the directory removal at the top of the script fails with:

rm: cannot remove `/lustre/scratch/d22a.replay-dual': Directory not empty
rm: cannot remove `/lustre/scratch/d22c.replay-dual': Directory not empty
status        script            Total(sec) E(xcluded) S(low) 
------------------------------------------------------------------------------------
test-framework exiting on error

replay-vbr is marked as FAIL with 0/0 subtests passed.

Then insanity starts and runs 16 tests. The suite is marked as FAIL even though no subtest actually fails, so the remove during test cleanup must be what triggered the failure. In the test logs, we see the trace below; a simplified sketch of the cleanup step follows it:

== insanity test complete, duration 2157 sec == 14:50:03 (1418770203)
rm: cannot remove `/lustre/scratch/d22a.replay-dual': Directory not empty
rm: cannot remove `/lustre/scratch/d22c.replay-dual': Directory not empty
 insanity : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4665:error_noexit()
  = /usr/lib64/lustre/tests/test-framework.sh:4696:error()
  = /usr/lib64/lustre/tests/test-framework.sh:4210:check_and_cleanup_lustre()
  = /usr/lib64/lustre/tests/insanity.sh:781:main()
Dumping lctl log to /tmp/test_logs/2014-12-15/220919/insanity..*.1418770204.log
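
The trace points at check_and_cleanup_lustre() in test-framework.sh. A simplified sketch consistent with the trace above, not the verbatim function; the rm pattern and variable names ($MOUNT, $DIR) follow the framework's conventions but the exact body is an assumption:

check_and_cleanup_lustre() {
    if is_mounted $MOUNT; then
        # Leftover directories from any earlier suite (e.g.
        # d22a.replay-dual) are swept here; if rm fails with
        # ENOTEMPTY, the *current* suite is the one failed.
        rm -rf $DIR/[Rdfs][0-9]* ||
            error "remove sub-test dirs failed"
    fi
}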
Comment by Andreas Dilger [ 19/May/16 ]

Debug patch for this:

LU-5810 tests: add client hostname to lctl mark

Improve debug messages to include the originating hostname.

Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Change-Id: I441bf8294c38135276a5a0f0853dbebf4358c563
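
The idea, roughly: the test framework broadcasts markers into each node's debug log via "lctl mark", and the patch prefixes the marker with the client's hostname so a mark identifies its originating node. A hedged sketch of the approach, not the merged diff; the helpers (do_nodes, comma_list, nodes_list, $LCTL) are test-framework.sh conventions, but this body is an assumption:

log() {
    # Prefix the mark with the local hostname so the combined debug
    # logs show which client emitted it.
    local msg="$(hostname -s): $*"
    echo "$msg"
    do_nodes $(comma_list $(nodes_list)) \
        "$LCTL mark \"$msg\" 2>/dev/null || true"
}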
Comment by Gerrit Updater [ 27/May/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13113/
Subject: LU-5810 tests: add client hostname to lctl mark
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9c4156e6fc146a198bb342e28eb246f1076889bd

Comment by Gerrit Updater [ 21/Jun/16 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/20894
Subject: Revert "LU-5810 tests: add client hostname to lctl mark"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dc25382d26a409c12b79d0c9a82ba5a0fa7c521c

Comment by Gerrit Updater [ 22/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20894/
Subject: Revert "LU-5810 tests: add client hostname to lctl mark"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d700bd76aadb7b3ae8fda27dec1d58723b9b95fe
