
[LU-15140] recovery-random-scale: No sub tests failed in this test set, FAIL: remove sub-test dirs failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: Lustre 2.12.8, Lustre 2.15.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite runs:
      https://testing.whamcloud.com/test_sets/dfbec373-e4f6-416f-83b9-2265475a3b80
      https://testing.whamcloud.com/test_sets/0d2d3b02-4871-4160-a995-a51a74e4cd3b
      https://testing.whamcloud.com/test_sets/f060ce64-9e1e-4546-b4bd-2f740787c589
      https://testing.whamcloud.com/test_sets/2640a2ac-2fff-48d5-82a6-cab054add322
      https://testing.whamcloud.com/test_sets/530acf01-f304-4dfa-94aa-16b5c1280cc5
      https://testing.whamcloud.com/test_sets/902b1bb9-ed8b-400b-895a-baa328d4b7c5

      == recovery-random-scale test complete, duration 85944 sec =========================================== 19:34:45 (1634672085)
      rm: cannot remove '/mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com/etc/selinux/targeted/active/modules/100': Directory not empty
       recovery-random-scale : @@@@@@ FAIL: remove sub-test dirs failed 
      Stopping clients: trevis-68vm1.trevis.whamcloud.com,trevis-68vm3,trevis-68vm4 /mnt/lustre (opts:)
      while umount  /mnt/lustre 2>&1 | grep -q busy; do
          echo /mnt/lustre is still busy, wait one second && sleep 1;
      done;
      fi
      Stopping client trevis-68vm1.trevis.whamcloud.com /mnt/lustre opts:
      Stopping client trevis-68vm3.trevis.whamcloud.com /mnt/lustre opts:
      Stopping client trevis-68vm4.trevis.whamcloud.com /mnt/lustre opts:
      COMMAND    PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
      run_tar.s 2934 root  cwd    DIR 1273,181606    11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
      tar       3222 root  cwd    DIR 1273,181606    11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
      tar       3223 root  cwd    DIR 1273,181606    11264 144116446786489609 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com
      tar       3223 root    3w   REG 1273,181606     5156 144117587637177621 /mnt/lustre/d0.tar-trevis-68vm4.trevis.whamcloud.com/etc/selinux/targeted/active/modules/100/gpg/cil
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      COMMAND    PID USER   FD      TYPE      DEVICE   SIZE/OFF               NODE NAME
      run_dd.sh 2747 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com
      dd        2772 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com
      dd        2772 root    1w      REG 1273,181606 1160388608 144117486973878288 /mnt/lustre/d0.dd-trevis-68vm3.trevis.whamcloud.com/dd-file
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      /mnt/lustre is still busy, wait one second
      :
      :
      Stopping clients: trevis-68vm1.trevis.whamcloud.com,trevis-68vm3,trevis-68vm4 /mnt/lustre2 (opts:)
      

      So it looks like the client filesystem is eventually unmounted correctly once the running jobs complete. Judging by the processes still shown by lsof, "tar" may still be writing into that directory tree at the time "rm -r" is called, which is why the directory is not empty.

      It would make sense to ensure that the running jobs are stopped before trying to delete the directory tree.
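
      As a rough sketch of that ordering (this is not the patch that eventually landed; do_nodes and $CLIENTS follow the test-framework.sh conventions visible in the logs above, while the helper name and pkill patterns below are made up for illustration), the cleanup could stop the load wrappers on every client and only then remove the d0.* directories:

      # Hedged sketch only: make sure the client load jobs have exited before
      # removing the per-client sub-test directories.
      stop_loads_then_cleanup() {
          local clients=$1
          local mnt=${2:-/mnt/lustre}

          # Stop the load wrapper scripts (run_dd.sh, run_tar.sh, ...) gently,
          # then force-kill anything still writing under $mnt.
          do_nodes "$clients" "pkill -TERM -f 'run_(dd|tar|iozone)\.sh' || true"
          sleep 5
          do_nodes "$clients" "pkill -KILL -f 'run_(dd|tar|iozone)\.sh|^dd |^tar ' || true"

          # Only delete the sub-test directories once nothing is writing into them.
          rm -rf "$mnt"/d0.*
      }

      # e.g. stop_loads_then_cleanup "$CLIENTS"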


          Activity

            pjones Peter Jones added a comment -

            Landed for 2.15


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45824/
            Subject: LU-15140 tests: cleanup of recovery-*-scale tests fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: f252abc6690247ee9608dbde80238add0ecaed8c


            "Elena Gryaznova <elena.gryaznova@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45824
            Subject: LU-15140 tests: cleanup of recovery-*-scale tests fails
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: c53265963c38501247c1d5063490164838f967dd


            anikitenko Alena Nikitenko (Inactive) added a comment - edited

            Very similar issues can be observed in the recovery-double-scale and recovery-mds-scale test sets on 2.12.8:

            https://testing.whamcloud.com/test_sets/700f5bee-22e3-4ea7-b49d-b42ce30895f5

            https://testing.whamcloud.com/test_sets/66753f7e-69b3-43a7-bd3e-2d4b415a204d

            https://testing.whamcloud.com/test_sets/226e5c40-191f-4929-b0d8-a41775b12ecd

            Subtests are either skipped or passed, but the test set is marked as failed. Very similar logs can be found in these test runs, for example:

            == recovery-double-scale test complete, duration 1846 sec ============================================ 19:24:04 (1637436244)
            rm: cannot remove '/mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com/etc/selinux/targeted/active/modules/100': Directory not empty
             recovery-double-scale : @@@@@@ FAIL: remove sub-test dirs failed 
              Trace dump:
              = /usr/lib64/lustre/tests/test-framework.sh:5919:error()
              = /usr/lib64/lustre/tests/test-framework.sh:5404:check_and_cleanup_lustre()
              = /usr/lib64/lustre/tests/recovery-double-scale.sh:309:main()
            Dumping lctl log to /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..*.1637436272.log
            CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /usr/sbin/lctl dk > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..debug_log.\$(hostname -s).1637436272.log;
                     dmesg > /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..dmesg.\$(hostname -s).1637436272.log
            CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 rsync -az /autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e//recovery-double-scale..*.1637436272.log onyx-64vm1.onyx.whamcloud.com:/autotest/autotest-1/2021-11-20/lustre-b2_12_failover-part-1_150_1_40_13db9919-f21e-4132-8be9-3d11b4f5908e/
            Resetting fail_loc on all nodes...CMD: onyx-109vm9,onyx-24vm8,onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
            done.
            Stopping clients: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /mnt/lustre (opts:)
            CMD: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 running=\$(grep -c /mnt/lustre' ' /proc/mounts);
            if [ \$running -ne 0 ] ; then
            echo Stopping client \$(hostname) /mnt/lustre opts:;
            lsof /mnt/lustre || need_kill=no;
            if [ x != x -a x\$need_kill != xno ]; then
                pids=\$(lsof -t /mnt/lustre | sort -u);
                if [ -n \"\$pids\" ]; then
                         kill -9 \$pids;
                fi
            fi;
            while umount  /mnt/lustre 2>&1 | grep -q busy; do
                echo /mnt/lustre is still busy, wait one second && sleep 1;
            done;
            fi
            Stopping client onyx-64vm4.onyx.whamcloud.com /mnt/lustre opts:
            Stopping client onyx-64vm3.onyx.whamcloud.com /mnt/lustre opts:
            Stopping client onyx-64vm1.onyx.whamcloud.com /mnt/lustre opts:
            COMMAND    PID USER   FD   TYPE      DEVICE SIZE/OFF               NODE NAME
            run_tar.s 2826 root  cwd    DIR 1273,181606     4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
            tar       2938 root  cwd    DIR 1273,181606     4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
            tar       2939 root  cwd    DIR 1273,181606     4096 144116614575426332 /mnt/lustre/d0.tar-onyx-64vm4.onyx.whamcloud.com
            COMMAND    PID USER   FD      TYPE      DEVICE   SIZE/OFF               NODE NAME
            run_dd.sh 2920 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com
            dd        3353 root  cwd   unknown 1273,181606                               /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com
            dd        3353 root    1w      REG 1273,181606 2920480768 144116681667510276 /mnt/lustre/d0.dd-onyx-64vm3.onyx.whamcloud.com/dd-file
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            /mnt/lustre is still busy, wait one second
            ... 
            Stopping clients: onyx-64vm1.onyx.whamcloud.com,onyx-64vm3,onyx-64vm4 /mnt/lustre2 (opts:)
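
            The race itself is not Lustre-specific: if another process keeps creating entries under a directory tree while "rm -r" is walking it, the final rmdir() of a subdirectory can fail with ENOTEMPTY. A minimal standalone illustration (timing-dependent, so it may take a few attempts; /tmp/rmrace is a made-up scratch path, not anything used by the tests):

            # Writer keeps creating files in the subdirectory until it disappears.
            dir=/tmp/rmrace
            mkdir -p $dir/sub
            ( i=0; while touch $dir/sub/f$((i++)) 2>/dev/null; do :; done ) &
            writer=$!

            sleep 0.2
            # rm -r unlinks the entries it saw, then rmdir()s "sub"; if the writer
            # slipped in a new file meanwhile, rmdir fails with "Directory not empty".
            rm -r $dir 2>&1 | grep 'Directory not empty' && echo "reproduced the race"

            kill $writer 2>/dev/null; wait $writer 2>/dev/null
            rm -rf $dir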

            People

              Assignee: Elena Gryaznova
              Reporter: Maloo