LU-6920: sanity test_205 failed with old jobstats not expired

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Labels: None

    Description

      Test 205 was executed for 5 iterations

      Using JobID environment variable FAKE_JOBID=id.205.mkdir.24449
      sanity test_205: @@@@@@ FAIL: old jobstats not expired
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:4732:error_noexit()
      = /usr/lib64/lustre/tests/test-framework.sh:4763:error()
      = /usr/lib64/lustre/tests/sanity.sh:11638:test_205()
      = /usr/lib64/lustre/tests/test-framework.sh:5010:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5047:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:4864:run_test()
      = /usr/lib64/lustre/tests/sanity.sh:11660:main()
      Dumping lctl log to /tmp/test_logs/1438010993/sanity.test_205.*.1438011013.log
      fre0106: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.
      fre0105: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.
      fre0108: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.
      mdt.lustre-MDT0000.job_cleanup_interval=600
      fre0105: warning: 'lctl conf_param' is deprecated, use 'lctl set_param -P' instead
      Waiting 90 secs for update
      Updated after 9s: wanted 'procname_uid' got 'procname_uid'
      lustre-MDT0000: Deregistered changelog user 'cl4'
      FAIL 205 (24s)
      sanity: FAIL: test_205 old jobstats not expired
      debug=super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck

    Attachments

    Issue Links

    Activity


            maximus Ashish Purkar (Inactive) added a comment -

            > Given that the fix that did land was in the already released 2.8 release, it would probably be clearer to open a new ticket linked to this one

            OK, I'll create a new ticket.
            pjones Peter Jones added a comment -

            Ashish

            Given that the fix that did land was in the already released 2.8 release, it would probably be clearer to open a new ticket linked to this one.

            Peter

            maximus Ashish Purkar (Inactive) added a comment - - edited
            • This should be reopened, as sanity test_205 is still intermittently failing on master.
              > It may be that increasing the timeout for jobid expiry by a second or two would be enough. The most efficient way might be to use wait_update() with a maximum of (left + 5) seconds so that it doesn't wait longer than needed.
            • I'll push a patch with the changes suggested by Andreas.

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Landed for 2.8

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16753/
            Subject: LU-6920 test: add some slack to jobstats expiry in test_205
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 13e34c1d0e5472759d1350b62fa0663bbcd59fa0

            gerrit Gerrit Updater added a comment -

            Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/16753
            Subject: LU-6920 test: try to reproduce failure on sanity test_205
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 376ea32167938839f209772e12b68a7acd96cc29

            adilger Andreas Dilger added a comment -

            It may be that increasing the timeout for jobid expiry by a second or two would be enough. The most efficient way might be to use wait_update() with a maximum of (left + 5) seconds so that it doesn't wait longer than needed.
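            The polling scheme Andreas suggests above can be sketched as a small shell loop. This is an illustrative sketch only, not the actual wait_update() helper from test-framework.sh; the names wait_for_expiry, cmd, expect, and max are hypothetical, and in the test the cap would be something like $((left + 5)) so the wait ends as soon as the jobstats entry expires rather than after a fixed 90 seconds.

```shell
#!/bin/sh
# Sketch of a wait_update()-style poll: run a status command once per
# second until its output matches the expected value, giving up after
# a caller-supplied cap in seconds. All names here are illustrative.
wait_for_expiry() {
    cmd=$1        # command whose output we poll
    expect=$2     # output that signals the old jobstats entry is gone
    max=$3        # maximum seconds to wait, e.g. $((left + 5))
    elapsed=0
    while [ "$elapsed" -lt "$max" ]; do
        got=$($cmd)
        if [ "$got" = "$expect" ]; then
            # Mirrors the "Updated after Ns" style seen in the test log.
            echo "Updated after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

            The point of the (left + 5) cap is that the loop returns immediately once the condition holds, so adding a few seconds of slack costs nothing in the common case and only extends the wait on slow nodes.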

            adilger Andreas Dilger added a comment -

            Bob, assigning this to you for further investigation. It seems to be failing regularly for the SLES11 SP3/SP4 tests, and even if we could get those patches to pass once, this test is failing in more runs than it is passing, so we would just be introducing a regression by landing those patches.

            adilger Andreas Dilger added a comment -

            It looks like this test is only failing on SLES11 SP3/SP4 and not others.
            bogl Bob Glossman (Inactive) added a comment -

            Another sles11sp4 failure on master: https://testing.hpdd.intel.com/test_sets/5f9b33a8-6cdf-11e5-a8d6-5254006e85c2

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: aditya.pandit@seagate.com Aditya Pandit (Inactive)
              Votes: 0
              Watchers: 12
