Lustre / LU-9827

sanityn fail “remove sub-test dirs failed d80b.sanityn/migrate_dir”

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.0, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.10.6
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      The sanityn test suite fails even though no subtest fails. The end of the suite_stdout log shows that one of the directories created by test_80b cannot be deleted:

      03:30:00:== sanityn test complete, duration 3688 sec ========================================================== 03:29:56 (1501730996)
      03:30:11:rm: cannot remove '/mnt/lustre/d80b.sanityn/migrate_dir/link_file': Stale file handle
      03:30:11: sanityn : @@@@@@ FAIL: remove sub-test dirs failed 
      03:30:11:  Trace dump:
      03:30:11:  = /usr/lib64/lustre/tests/test-framework.sh:4980:error()
      03:30:11:  = /usr/lib64/lustre/tests/test-framework.sh:4499:check_and_cleanup_lustre()
      03:30:11:  = /usr/lib64/lustre/tests/sanityn.sh:4012:main()
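
When triaging similar runs, it can help to pull the affected paths out of a saved log instead of reading it by eye. The following is a minimal sketch, not part of the Lustre test framework: `stale_paths` is a hypothetical helper name, and it assumes the log uses the same `rm: cannot remove '<path>': Stale file handle` wording shown above.

```shell
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh): print each distinct
# path that rm reported as "Stale file handle" in the given log file.
stale_paths() {
    local log=$1

    # Keep only the rm error text, dropping the timestamp prefix,
    # then strip the fixed wording around the quoted path.
    grep -o "rm: cannot remove '[^']*': Stale file handle" "$log" |
        sed "s/^rm: cannot remove '//; s/': Stale file handle\$//" |
        sort -u
}
```

Usage, assuming the suite output was saved locally as `suite_stdout.log` (an illustrative file name): `stale_paths suite_stdout.log`. The helper prints nothing if the log contains no such errors.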
      

      Test 80b does not fail, but the test_log shows the following:

      16:02:42:== sanityn test 80b: Accessing directory during migration ============================================ 16:02:39 (1501516959)
      16:02:42:start migration thread 11920
      16:02:42:accessing the migrating directory for 5 minutes...
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:02:53:...10 seconds
      16:02:53:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      …
      16:03:41:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:41:...60 seconds
      16:03:41:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:45:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:45:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:46:rm: cannot remove '/mnt/lustre2/d80b.sanityn/migrate_dir/link_file': Stale file handle
      16:03:46:diff: /mnt/lustre2/d80b.sanityn/migrate_dir/file1: No such file or directory
      16:03:46:access file1 fails
      16:03:46:Resetting fail_loc on all nodes.../usr/lib64/lustre/tests/test-framework.sh: line 3146: 11920 Killed                  ( while true; do
      16:03:46:    mdt_idx=$((RANDOM % MDSCOUNT)); $LFS migrate -m $mdt_idx $migrate_dir1 &>/dev/null || rc=$?; [ $rc -ne 0 -o $rc -ne 16 ] || break;
      16:03:46:done )
      16:03:46:CMD: trevis-3vm1.trevis.hpdd.intel.com,trevis-3vm2,trevis-3vm3,trevis-3vm4,trevis-3vm8 lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
      16:03:46:done.
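
A side note on the migration loop echoed in the log above: the condition `[ $rc -ne 0 -o $rc -ne 16 ]` holds for every integer `rc`, because no value equals both 0 and 16. Once `rc` is set, `|| break` therefore cannot trigger, which is consistent with the thread being reported as `Killed` rather than exiting on its own. The sketch below only demonstrates that shell behaviour. `loop_continues` is a hypothetical name, and whether this matters for the stale file handle errors is not established here.

```shell
#!/bin/bash
# Hypothetical demo: evaluate the same condition the migration loop uses.
# Returns 0 (true) when the loop would keep iterating for the given rc.
loop_continues() {
    local rc=$1
    [ "$rc" -ne 0 -o "$rc" -ne 16 ]
}

# Try success (0), a generic error (1), 16 (EBUSY on Linux), and 255.
for rc in 0 1 16 255; do
    if loop_continues "$rc"; then
        echo "rc=$rc: keep looping"
    else
        echo "rc=$rc: break"
    fi
done
```

Every value tested reports "keep looping". If the intent were to stop on any error other than 16, a condition joined with `-a` would express that, but that reading of the test's intent is an assumption.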
      

      We’ve seen this test suite failure on Lustre setups with DNE:
      https://testing.hpdd.intel.com/test_sets/e98fd2a4-7826-11e7-9a7b-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/e5ffaca8-763d-11e7-bbe0-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/42bd9ada-7289-11e7-bb95-5254006e85c2

            People

              Assignee: laisiyao Lai Siyao
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 11
