[LU-15140] recovery-random-scale: No sub tests failed in this test set, FAIL: remove sub-test dirs failed Created: 21/Oct/21 Updated: 09/Jan/24 Resolved: 06/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.8, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Elena Gryaznova |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run:

== recovery-random-scale test complete, duration 85944 sec =========================================== 19:34:45 (1634672085)
rm: cannot remove '/mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com/etc/selinux/targeted/active/modules/100': Directory not empty
recovery-random-scale : @@@@@@ FAIL: remove sub-test dirs failed
Stopping clients: trevis-68vm1.trevis.whamcloud.com,trevis-68vm3,trevis-68vm4 /mnt/lustre (opts:)
while umount /mnt/lustre 2>&1 | grep -q busy; do
echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi
Stopping client trevis-68vm1.trevis.whamcloud.com /mnt/lustre opts:
Stopping client trevis-68vm3.trevis.whamcloud.com /mnt/lustre opts:
Stopping client trevis-68vm4.trevis.whamcloud.com /mnt/lustre opts:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
run_tar.s 2934 root cwd DIR 1273,181606 11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
tar 3222 root cwd DIR 1273,181606 11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
tar 3223 root cwd DIR 1273,181606 11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
tar 3223 root 3w REG 1273,181606 5156 144117587637177621 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com/etc/selinux/targeted/active/modules/100/gpg/cil
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
run_dd.sh 2747 root cwd unknown 1273,181606 /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com
dd 2772 root cwd unknown 1273,181606 /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com
dd 2772 root 1w REG 1273,181606 1160388608 144117486973878288 /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com/dd-file
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
:
:
Stopping clients: trevis-68vm1.trevis.whamcloud.com,trevis-68vm3,trevis-68vm4 /mnt/lustre2 (opts:)
So it looks like the client filesystem is eventually unmounted correctly once the running jobs complete. Judging by the jobs still running, "tar" may still be writing into that directory tree at the time "rm -r" is called, which is why the directory is not empty. It would make sense to ensure that the running jobs are stopped before trying to delete the directory tree. |
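For illustration, a minimal sketch of the ordering suggested above, written against helpers the recovery-*-scale scripts already use (stop_client_loads() and error() from test-framework.sh, the NODES_TO_USE client list); the call site, arguments and grace period here are assumptions for the sketch, not the landed patch:

    # Sketch only: stop the background load jobs (run_tar.sh, run_dd.sh, ...)
    # before removing the per-client sub-test directories, so "rm -r" no
    # longer races with tar/dd still writing into the tree.
    stop_client_loads $NODES_TO_USE    # kill the run_*.sh wrappers and their children
    sleep 5                            # hypothetical grace period for the loads to exit
    rm -rf "$MOUNT"/d0.tar-* "$MOUNT"/d0.dd-* ||
            error "remove sub-test dirs failed"

With the loads stopped first, the later unmount should also no longer have to loop on "/mnt/lustre is still busy".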
| Comments |
| Comment by Alena Nikitenko [ 03/Dec/21 ] |
|
Very similar issues can be observed in recovery-double-scale, recovery-mds-scale, recovery-double-scale test sets on 2.12.8:
https://testing.whamcloud.com/test_sets/700f5bee-22e3-4ea7-b49d-b42ce30895f5
https://testing.whamcloud.com/test_sets/66753f7e-69b3-43a7-bd3e-2d4b415a204d
https://testing.whamcloud.com/test_sets/226e5c40-191f-4929-b0d8-a41775b12ecd
Subtests are either skipped or passed, but the test set is marked as failed. Very similar logs can be found in these test runs, for example:
== recovery-double-scale test complete, duration 1846 sec ============================================ 19:24:04 (1637436244)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com/etc/selinux/targeted/active/modules/100': Directory not empty
recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5919:error()
= /usr/lib64/lustre/tests/test-framework.sh:5404:check_and_cleanup_lustre()
= /usr/lib64/lustre/tests/recovery-double-scale.sh:309:main()
Dumping lctl log to /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..*.1637436272.log
CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /usr/sbin/lctl dk > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..debug_log.\$(hostname -s).1637436272.log; dmesg > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..dmesg.\$(hostname -s).1637436272.log
CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 rsync -az /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..*.1637436272.log onyx-64vm1.onyx.whamcloud.com:/autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e/
Resetting fail_loc on all nodes...CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null
done.
Stopping clients: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /mnt/lustre (opts:)
CMD: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 running=\$(grep -c /mnt/lustre' ' /proc/mounts); if [ \$running -ne 0 ] ; then echo Stopping client \$(hostname) /mnt/lustre opts:; lsof /mnt/lustre || need_kill=no; if [ x != x -a x\$need_kill != xno ]; then pids=\$(lsof -t /mnt/lustre | sort -u); if [ -n \"\$pids\" ]; then kill -9 \$pids; fi fi; while umount /mnt/lustre 2>&1 | grep -q busy; do echo /mnt/lustre is still busy, wait one second && sleep 1; done; fi
Stopping client onyx-64vm4.onyx.whamcloud.com /mnt/lustre opts:
Stopping client onyx-64vm3.onyx.whamcloud.com /mnt/lustre opts:
Stopping client onyx-64vm1.onyx.whamcloud.com /mnt/lustre opts:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
run_tar.s 2826 root cwd DIR 1273,181606 4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
tar 2938 root cwd DIR 1273,181606 4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
tar 2939 root cwd DIR 1273,181606 4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
run_dd.sh 2920 root cwd unknown 1273,181606 /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com
dd 3353 root cwd unknown 1273,181606 /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com
dd 3353 root 1w REG 1273,181606 2920480768 144116681667510276 /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com/dd-file
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
...
Stopping clients: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /mnt/lustre2 (opts:) |
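The lsof output above shows exactly which processes keep the directory tree busy, so the same information could be used defensively at cleanup time: reap any surviving writers under the sub-test directory and retry the removal before declaring failure. A rough sketch only, reusing lsof the way the stop-clients command already does; the directory name, the kill-and-retry step, and the use of $MOUNT and error() are illustrative assumptions, not necessarily how the issue was ultimately fixed:

    # Illustrative only: kill anything still holding files open under the
    # sub-test directory, then try the removal again.
    dir=$MOUNT/d0.tar-$(hostname)
    pids=$(lsof -t +D "$dir" 2>/dev/null | sort -u)
    [ -n "$pids" ] && kill -9 $pids && sleep 1
    rm -rf "$dir" || error "remove sub-test dirs failed"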
| Comment by Gerrit Updater [ 10/Dec/21 ] |
|
"Elena Gryaznova <elena.gryaznova@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45824 |
| Comment by Gerrit Updater [ 06/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45824/ |
| Comment by Peter Jones [ 06/Jan/22 ] |
|
Landed for 2.15 |