
[LU-7118] sanity-scrub: No sub tests failed in this test set

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.8.0
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      I've seen a lot of sanity-scrub instances fail entirely lately. It looks like some kind of TEI (test environment issue) to me, since it shows up in test runs on many different and unrelated mods. No logs are collected, and the summary always says:

      Failed subtests
      
      No sub tests failed in this test set.
      
      All subtests
      
      This test set does not have any sub tests.
      

      Maybe something really bad landed that blocks any sanity-scrub from running.

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/aaf64806-5682-11e5-a9bc-5254006e85c2.

          Activity

            pjones Peter Jones added a comment -

            The original, regularly occurring failure is fixed. The less frequent, occasional failure is being dealt with under LU-7193.

            ys Yang Sheng made changes -
            Link New: This issue is related to LU-7193 [ LU-7193 ]
            ys Yang Sheng made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]

            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/16483
            Subject: LU-7118 tests: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ba6612cda149851fff0eea6ee548a99afb95dacf


            di.wang Di Wang (Inactive) added a comment -

            Looks like it is still the same problem: there are 7 OSTs, but only 4 OSTs are stopped.

            CMD: shadow-49vm3 grep -c /mnt/mds1' ' /proc/mounts
            Stopping /mnt/mds1 (opts:-f) on shadow-49vm3
            CMD: shadow-49vm3 umount -d -f /mnt/mds1
            CMD: shadow-49vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            CMD: shadow-49vm7 grep -c /mnt/mds2' ' /proc/mounts
            Stopping /mnt/mds2 (opts:-f) on shadow-49vm7
            CMD: shadow-49vm7 umount -d -f /mnt/mds2
            CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            CMD: shadow-49vm7 grep -c /mnt/mds3' ' /proc/mounts
            Stopping /mnt/mds3 (opts:-f) on shadow-49vm7
            CMD: shadow-49vm7 umount -d -f /mnt/mds3
            CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            CMD: shadow-49vm7 grep -c /mnt/mds4' ' /proc/mounts
            Stopping /mnt/mds4 (opts:-f) on shadow-49vm7
            CMD: shadow-49vm7 umount -d -f /mnt/mds4
            CMD: shadow-49vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            CMD: shadow-49vm4 grep -c /mnt/ost1' ' /proc/mounts
            Stopping /mnt/ost1 (opts:-f) on shadow-49vm4
            CMD: shadow-49vm4 umount -d -f /mnt/ost1
            CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            CMD: shadow-49vm4 grep -c /mnt/ost2' ' /proc/mounts
            Stopping /mnt/ost2 (opts:-f) on shadow-49vm4
            CMD: shadow-49vm4 umount -d -f /mnt/ost2
            CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            CMD: shadow-49vm4 grep -c /mnt/ost3' ' /proc/mounts
            Stopping /mnt/ost3 (opts:-f) on shadow-49vm4
            CMD: shadow-49vm4 umount -d -f /mnt/ost3
            CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            CMD: shadow-49vm4 grep -c /mnt/ost4' ' /proc/mounts
            Stopping /mnt/ost4 (opts:-f) on shadow-49vm4
            CMD: shadow-49vm4 umount -d -f /mnt/ost4
            CMD: shadow-49vm4 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
            

            I guess we need to find out where OSTCOUNT has been changed, or perhaps stopall should use "lov.xxxxx.numobd" instead of the environment variable?
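
            A side note on the stopall/OSTCOUNT idea above (not from the ticket): a minimal sketch of querying the live OST count from the LOV instead of trusting the environment variable, assuming a client mount is still up and the test-framework "stop" helper is available; parameter paths and helper names may differ by version.

                # Query the real number of OSTs from the LOV rather than trusting
                # $OSTCOUNT, which sanity-scrub caps at 4 while this config has 7.
                actual_osts=$(lctl get_param -n lov.*.numobd 2>/dev/null | head -n1)
                actual_osts=${actual_osts:-$OSTCOUNT}   # fall back to the env var

                for num in $(seq "$actual_osts"); do
                    stop ost$num -f    # test-framework helper; unmounts /mnt/ost$num here
                done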

            ys Yang Sheng made changes -
            Resolution Original: Fixed [ 1 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]
            ys Yang Sheng added a comment -

            https://testing.hpdd.intel.com/test_sets/bdf2ffa6-5cfb-11e5-945a-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/9cdbc984-5c91-11e5-b8a8-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/94b10788-5c19-11e5-9dac-5254006e85c2
            https://testing.hpdd.intel.com/test_sets/e7fd2a3e-5bd8-11e5-96c9-5254006e85c2

            It looks like the last patch still has not resolved this issue, but it occurs far less frequently now.
            jgmitter Joseph Gmitter (Inactive) made changes -
            Fix Version/s New: Lustre 2.8.0 [ 11113 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Landed for 2.8.


            adilger Andreas Dilger added a comment -

            This definitely seems like a regression that landed recently on master, since sanity-scrub has had the [ $OSTCOUNT -gt 4 ] && OSTCOUNT=4 line since commit v2_5_58_0-41-g1dbba32 and there are no failures in b2_7 testing. It might be useful to run a series of tests with different commits, going back one per day, running review-ldiskfs multiple times (if this is possible):

            Test-Parameters: fortestonly testgroup=review-ldiskfs
            Test-Parameters: fortestonly testgroup=review-ldiskfs
            Test-Parameters: fortestonly testgroup=review-ldiskfs
            Test-Parameters: fortestonly testgroup=review-ldiskfs
            

            Since sanity-scrub only fails about 50% of the time, four runs would give us roughly a 94% chance (1 - 0.5^4 ≈ 0.94) of catching the regression patch at each stage. The earliest failures I see outside RHEL7.1 testing are from 2015-09-08 (http://review.whamcloud.com/16315, based on commit 01ca8993247383 "LU-7079 ptlrpc: imp_peer_committed_transno should increase"), and it starts hitting hard on 2015-09-09, so I suspect something that landed on 2015-09-08 caused it.
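
            For reference (not part of the original comment), a rough sketch of how one might pick one master commit per day in the suspect window to retest with the fortestonly Test-Parameters above; the date range and remote name are assumptions.

                # List the last commit on master for each day in the window, so each
                # can be pushed as a test-only build carrying the Test-Parameters lines.
                for day in 2015-09-04 2015-09-05 2015-09-06 2015-09-07 2015-09-08; do
                    commit=$(git rev-list -1 --before="$day 23:59" origin/master)
                    echo "$day $commit"
                done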


            People

              Assignee: Yang Sheng (ys)
              Reporter: Maloo
              Votes: 0
              Watchers: 13
