[LU-15140] recovery-random-scale: No sub tests failed in this test set, FAIL: remove sub-test dirs failed Created: 21/Oct/21  Updated: 09/Jan/24  Resolved: 06/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.8, Lustre 2.15.0
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Elena Gryaznova
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9602 recovery-random-scale test_fail_clien... In Progress
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite runs:
https://testing.whamcloud.com/test_sets/dfbec373-e4f6-416f-83b9-2265475a3b80
https://testing.whamcloud.com/test_sets/0d2d3b02-4871-4160-a995-a51a74e4cd3b
https://testing.whamcloud.com/test_sets/f060ce64-9e1e-4546-b4bd-2f740787c589
https://testing.whamcloud.com/test_sets/2640a2ac-2fff-48d5-82a6-cab054add322
https://testing.whamcloud.com/test_sets/530acf01-f304-4dfa-94aa-16b5c1280cc5
https://testing.whamcloud.com/test_sets/902b1bb9-ed8b-400b-895a-baa328d4b7c5

== recovery-random-scale test complete, duration 85944 sec =========================================== 19:34:45 (1634672085)
rm: cannot remove '/mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com/etc/selinux/targeted/active/modules/100': Directory not empty
 recovery-random-scale : @@@@@@ FAIL: remove sub-test dirs failed 
Stopping clients: trevis-68vm1.trevis.whamcloud.com,trevis-68vm3,trevis-68vm4 /mnt/lustre (opts:)
while umount  /mnt/lustre 2>&1 | grep -q busy; do
    echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi
Stopping client trevis-68vm1.trevis.whamcloud.com /mnt/lustre opts:
Stopping client trevis-68vm3.trevis.whamcloud.com /mnt/lustre opts:
Stopping client trevis-68vm4.trevis.whamcloud.com /mnt/lustre opts:
COMMAND    PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
run_tar.s 2934 root  cwd    DIR 1273,181606    11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
tar       3222 root  cwd    DIR 1273,181606    11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
tar       3223 root  cwd    DIR 1273,181606    11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
tar       3223 root    3w   REG 1273,181606     5156 144117587637177621 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com/etc/selinux/targeted/active/modules/100/gpg/cil
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
COMMAND    PID USER   FD      TYPE      DEVICE   SIZE/OFF               NODE NAME
run_dd.sh 2747 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com
dd        2772 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com
dd        2772 root    1w      REG 1273,181606 1160388608 144117486973878288 /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com/dd-file
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
:
:
Stopping clients: trevis-68vm1.trevis.whamcloud.com,trevis-68vm3,trevis-68vm4 /mnt/lustre2 (opts:)

So it looks like the client filesystem is eventually unmounted correctly once the running jobs complete. Judging by the processes still running, "tar" may still be writing into that directory tree at the time "rm -r" is called, which is why the directory is not empty.

It would make sense to ensure that the running jobs are stopped before trying to delete the directory tree.
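A minimal sketch of that ordering, for illustration only (this is not the landed patch, and CLIENTS, MOUNT and stop_loads_then_cleanup are hypothetical names, not the actual test-framework.sh interface): kill the load wrappers on each client, wait until nothing still holds files open under the per-client test directories, and only then remove them.

#!/bin/bash
# Illustrative sketch only: stop the client load jobs before the cleanup
# step removes the d0.* sub-test directories, so "rm -r" does not race
# with tar/dd still writing into the tree.

CLIENTS="client1 client2 client3"   # hypothetical client list
MOUNT=/mnt/lustre

stop_loads_then_cleanup() {
    local client

    for client in $CLIENTS; do
        # Ask the load wrappers (run_dd.sh, run_tar.sh, ...) to stop.
        ssh "$client" 'pkill -f "run_(dd|tar|iozone)\.sh"' || true
    done

    for client in $CLIENTS; do
        # Wait until nothing on this client still has files open under
        # the per-client sub-test directories.
        while ssh "$client" "lsof $MOUNT 2>/dev/null | grep -q '$MOUNT/d0\\.'"; do
            echo "$client: loads still busy under $MOUNT, waiting 1 second"
            sleep 1
        done
    done

    # Only now remove the per-client sub-test directories.
    rm -rf "$MOUNT"/d0.*
}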



 Comments   
Comment by Alena Nikitenko [ 03/Dec/21 ]

Very similar issues can be observed in the recovery-double-scale and recovery-mds-scale test sets on 2.12.8:

https://testing.whamcloud.com/test_sets/700f5bee-22e3-4ea7-b49d-b42ce30895f5

https://testing.whamcloud.com/test_sets/66753f7e-69b3-43a7-bd3e-2d4b415a204d

https://testing.whamcloud.com/test_sets/226e5c40-191f-4929-b0d8-a41775b12ecd

Subtests are either skipped or passed, but the test set is marked as failed. Very similar logs can be found in these test runs, for example:

== recovery-double-scale test complete, duration 1846 sec ============================================ 19:24:04 (1637436244)
rm: cannot remove '/mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com/etc/selinux/targeted/active/modules/100': Directory not empty
 recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5919:error()
  = /usr/lib64/lustre/tests/test-framework.sh:5404:check_and_cleanup_lustre()
  = /usr/lib64/lustre/tests/recovery-double-scale.sh:309:main()
Dumping lctl log to /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..*.1637436272.log
CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /usr/sbin/lctl dk > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..debug_log.\$(hostname -s).1637436272.log;
         dmesg > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..dmesg.\$(hostname -s).1637436272.log
CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 rsync -az /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..*.1637436272.log onyx-64vm1.onyx.whamcloud.com:/autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e/
Resetting fail_loc on all nodes...CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
done.
Stopping clients: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /mnt/lustre (opts:)
CMD: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
if [ \$running -ne 0 ] ; then
echo Stopping client \$(hostname) /mnt/lustre opts:;
lsof /mnt/lustre || need_kill=no;
if [ x != x -a x\$need_kill != xno ]; then
    pids=\$(lsof -t /mnt/lustre | sort -u);
    if [ -n \"\$pids\" ]; then
             kill -9 \$pids;
    fi
fi;
while umount  /mnt/lustre 2>&1 | grep -q busy; do
    echo /mnt/lustre is still busy, wait one second && sleep 1;
done;
fi
Stopping client onyx-64vm4.onyx.whamcloud.com /mnt/lustre opts:
Stopping client onyx-64vm3.onyx.whamcloud.com /mnt/lustre opts:
Stopping client onyx-64vm1.onyx.whamcloud.com /mnt/lustre opts:
COMMAND    PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
run_tar.s 2826 root  cwd    DIR 1273,181606     4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
tar       2938 root  cwd    DIR 1273,181606     4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
tar       2939 root  cwd    DIR 1273,181606     4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
COMMAND    PID USER   FD      TYPE      DEVICE   SIZE/OFF               NODE NAME
run_dd.sh 2920 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com
dd        3353 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com
dd        3353 root    1w      REG 1273,181606 2920480768 144116681667510276 /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com/dd-file
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
/mnt/lustre is still busy, wait one second
... 
Stopping clients: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /mnt/lustre2 (opts:)
Comment by Gerrit Updater [ 10/Dec/21 ]

"Elena Gryaznova <elena.gryaznova@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45824
Subject: LU-15140 tests: cleanup of recovery-*-scale tests fails
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c53265963c38501247c1d5063490164838f967dd

Comment by Gerrit Updater [ 06/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45824/
Subject: LU-15140 tests: cleanup of recovery-*-scale tests fails
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f252abc6690247ee9608dbde80238add0ecaed8c

Comment by Peter Jones [ 06/Jan/22 ]

Landed for 2.15
