
sanity test_123a test_123b: rm: no such file or directory

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • None
    • Fix Version/s: Lustre 2.7.0
    • 3
    • 16184

Description

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

In sanity.sh test_123a and test_123b a large number of errors are being reported when "rm -r" is running:

      rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity2': No such file or directory
      rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity5': No such file or directory
      rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity8': No such file or directory
      rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity11': No such file or directory
      rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity12': No such file or directory
      rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity16': No such file or directory
      rm: cannot remove `/mnt/lustre/d123a.sanity/f123a.sanity18': No such file or directory
      

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/186592e2-5577-11e4-8542-5254006e85c2

Info required for matching: sanity 123a
Info required for matching: sanity 123b
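
The failure boils down to a populate/readdir/remove sequence; a minimal sketch of that pattern is below (illustrative only, not the actual test_123a/test_123b code; the file count and exact steps are assumptions, and the paths simply mirror the names in the log above):

    #!/bin/bash
    # Illustrative sketch of the failing pattern (not the real sanity.sh code).
    # Create many entries, do a readdir pass with ls, then remove the tree.
    # "No such file or directory" from rm would mean readdir handed userspace
    # entries that no longer resolve by the time the unlink happens.
    MNT=/mnt/lustre                      # client mount point, as in the log above
    TESTDIR=$MNT/d123a.sanity
    mkdir -p $TESTDIR
    for i in $(seq 1 1000); do           # file count is an assumption
        touch $TESTDIR/f123a.sanity$i
    done
    ls -l $TESTDIR > /dev/null           # readdir pass
    rm -r $TESTDIR 2>&1 | grep -c "No such file or directory"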

Attachments

Issue Links

Activity


adilger Andreas Dilger added a comment -
Haven't seen this again for the past 4 weeks. Closing it again. Maybe LU-6101 will fix the final trigger.

adilger Andreas Dilger added a comment -
This might also be related to LU-6101, which also has a patch.

adilger Andreas Dilger added a comment -
Saw this again on a recent patch run:
https://testing.hpdd.intel.com/test_sets/e55f7412-881b-11e4-aa28-5254006e85c2
https://testing.hpdd.intel.com/test_sets/c69c83ca-87f9-11e4-a70f-5254006e85c2

adilger Andreas Dilger added a comment -
Is it possible that this problem is the same as LU-3573 and was fixed by http://review.whamcloud.com/12904 ?

The most recent failure https://testing.hpdd.intel.com/test_sets/97ea0fd0-84f6-11e4-a60f-5254006e85c2 was on a patch based on a tree that doesn't contain the LU-3573 fix.

Before that, the most recent test failure was https://testing.hpdd.intel.com/test_sets/863b9434-fcb2-11e2-9222-52540035b04c on 2014-08-03.

isaac Isaac Huang (Inactive) added a comment -
In one test today, test_123a took 18127 seconds to complete (more than 100x the usual time on the same hardware), and there were 11575754 "no such file" errors in the test log.

isaac Isaac Huang (Inactive) added a comment -
With a build on master, I'm now able to reproduce it almost 100% of the time. I'll look into it.

yong.fan nasf (Inactive) added a comment -
Isaac, another possible factor is that our Maloo test clusters run a lot of VMs on a single node, so the system load on the Maloo clusters should be much higher than in our personal test environments. If we can simulate a test environment similar to the Maloo clusters, it may help to reproduce the issue locally.
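
One rough way to approximate that kind of contention locally, assuming the stress utility is available on the client node (illustrative only; the real Maloo load comes from co-located VMs rather than a single tool):

    # Sketch: add CPU/IO/memory pressure in the background, then run the subtests.
    stress --cpu 4 --io 2 --vm 2 --vm-bytes 256M --timeout 1800 &
    STRESS_PID=$!
    ONLY="123a 123b" bash sanity.sh
    kill $STRESS_PID 2>/dev/null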

adilger Andreas Dilger added a comment -
Isaac, I think you should be testing with master and ZFS 0.6.3, not b2_5, since all of the failures I've seen so far have been on master.

Maybe a patch should be landed to sanity.sh test_123[ab] to help diagnose the problem if it happens again under review testing? For example, doing an "ls" of the directory after the test is done, and always running "rm" under strace and logging it to a file that is attached to Maloo, so that we can see whether the problem is in the readdir() data returned to userspace, etc. There have been 8 failures in the last 165 review-zfs tests in the past week, so I think that if a patch is landed to sanity.sh on master it should be possible to get more information within a day or two. I don't think just running review testing on the patch itself is likely to hit the problem, unless we run sanity.sh 20x in a loop with Test-Parameters:.
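
A rough sketch of what such a diagnostic change could look like is below (a sketch only, assuming the usual sanity.sh conventions such as $DIR, $tdir, $TMP and error(); the exact patch, and how Maloo would collect the extra files, is left open):

    # Sketch of extra diagnostics for the test_123[ab] cleanup (illustrative only):
    # dump the directory after the test body, and run the final rm under strace
    # so the attached log shows whether the stale names came from readdir.
    ls -la $DIR/$tdir > $TMP/d123.entries 2>&1
    strace -f -o $TMP/d123.rm.strace \
        -e trace=getdents,getdents64,unlink,unlinkat \
        rm -r $DIR/$tdir ||
            error "rm -r failed; see $TMP/d123.rm.strace and $TMP/d123.entries"

The strace log would show whether the stale names are coming back from getdents()/getdents64(), i.e. a readdir-level problem, or whether something else goes wrong inside rm itself.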

isaac Isaac Huang (Inactive) added a comment -
Another 50 repetitions, still can't reproduce, although the eagle VMs I used all had just 1 CPU.

            I repeated "sanity --only 123a,123b" for 100 times, and couldn't reproduce it. I was using build lustre-b2_5/96/ with 1 OSS (2 OSTs) 1 MDS and 1 client, any other build/configuration I should try?

            isaac Isaac Huang (Inactive) added a comment - I repeated "sanity --only 123a,123b" for 100 times, and couldn't reproduce it. I was using build lustre-b2_5/96/ with 1 OSS (2 OSTs) 1 MDS and 1 client, any other build/configuration I should try?
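
For reference, the repetition presumably amounts to a loop along these lines (a sketch with assumed paths; it relies on sanity.sh honouring the ONLY environment variable and on the test configuration already being set up):

    # Sketch: repeat only test_123a/test_123b many times from lustre/tests.
    cd /usr/lib64/lustre/tests           # assumed location of the test scripts
    for i in $(seq 1 100); do
        ONLY="123a 123b" bash sanity.sh || { echo "failed on iteration $i"; break; }
    done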

People

Assignee: isaac Isaac Huang (Inactive)
Reporter: maloo Maloo
Votes: 0
Watchers: 8
