[LU-6920] sanity test_205 failed with old jobstats not expired Created: 28/Jul/15  Updated: 24/Aug/16  Resolved: 28/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Minor
Reporter: Aditya Pandit (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File 205__3.lctl.tgz    
Issue Links:
Related
is related to LU-7151 sanity test_205:No jobstats for id.20... Resolved
is related to LU-7200 kernel update [SLES11 SP3 3.0.101-0.4... Resolved
is related to LU-6889 new kernel [SLES11 SP4 3.0.101-65] Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Test 205 was executed for 5 iterations

Using JobID environment variable FAKE_JOBID=id.205.mkdir.24449
sanity test_205: @@@@@@ FAIL: old jobstats not expired
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4732:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4763:error()
= /usr/lib64/lustre/tests/sanity.sh:11638:test_205()
= /usr/lib64/lustre/tests/test-framework.sh:5010:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5047:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4864:run_test()
= /usr/lib64/lustre/tests/sanity.sh:11660:main()
Dumping lctl log to /tmp/test_logs/1438010993/sanity.test_205.*.1438011013.log
fre0106: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

fre0105: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

fre0108: Warning: Permanently added 'fre0107,192.168.101.7' (RSA) to the list of known hosts.

mdt.lustre-MDT0000.job_cleanup_interval=600
fre0105: warning: 'lctl conf_param' is deprecated, use 'lctl set_param -P' instead
Waiting 90 secs for update
Updated after 9s: wanted 'procname_uid' got 'procname_uid'
lustre-MDT0000: Deregistered changelog user 'cl4'
FAIL 205 (24s)
sanity: FAIL: test_205 old jobstats not expired
debug=super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck



 Comments   
Comment by Ashish Purkar [ 08/Sep/15 ]

It's easily reproducible on master

Lustre: DEBUG MARKER: Test: mkdir /mnt/lustre/d205.sanity.expire
Lustre: DEBUG MARKER: Using JobID environment variable FAKE_JOBID=id.205.mkdir.18733
Lustre: DEBUG MARKER: sanity test_205: @@@@@@ FAIL: old jobstats not expired
Lustre: lustre-MDD0000: changelog off
Lustre: DEBUG MARKER: == sanity test complete, duration 34 sec == 19:52:57 (1441191177)
Comment by Bob Glossman (Inactive) [ 04/Oct/15 ]

another one seen with sles11sp4 on master:
https://testing.hpdd.intel.com/test_sets/a1bab5ba-6a0f-11e5-b8d9-5254006e85c2

Comment by Bob Glossman (Inactive) [ 07/Oct/15 ]

another sles11sp4 on master:
https://testing.hpdd.intel.com/test_sets/5f9b33a8-6cdf-11e5-a8d6-5254006e85c2

Comment by Andreas Dilger [ 07/Oct/15 ]

It looks like this test is only failing on SLES11 SP3/SP4 and not others.

Comment by Andreas Dilger [ 07/Oct/15 ]

Bob, assigning this to you for further investigation. The test fails regularly on SLES11 SP3/SP4. Even if we could get those patches to pass once, it fails in more test runs than it passes, so landing those patches would just introduce a regression.

Comment by Andreas Dilger [ 07/Oct/15 ]

It may be that increasing the timeout for jobid expiry by a second or two would be enough. The most efficient way might be to use wait_update() with a maximum of (left + 5) seconds so that it doesn't wait longer than needed.
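
The bounded wait suggested above can be sketched in plain bash. This is only a minimal illustration of the idea, not the patch that landed and not the real wait_update() from test-framework.sh. The function name wait_jobid_expired, its arguments, and the sample file path in the usage note are hypothetical names introduced here.

```shell
#!/bin/bash
# Hypothetical helper (NOT part of test-framework.sh): poll a command once per
# second until its output no longer mentions the given job ID, giving up after
# max_wait seconds.
wait_jobid_expired() {
	local check_cmd=$1   # command that prints the current jobstats (assumed)
	local jobid=$2       # job ID that is expected to disappear
	local max_wait=$3    # upper bound in seconds, e.g. $((left + 5))
	local waited=0

	# Loop while the job ID is still present in the command output.
	while eval "$check_cmd" | grep -q -F -- "$jobid"; do
		if [ "$waited" -ge "$max_wait" ]; then
			return 1   # still present after max_wait seconds
		fi
		sleep 1
		waited=$((waited + 1))
	done
	return 0   # job ID is gone
}
```

With left set to the remaining expiry time, a call such as wait_jobid_expired "cat /tmp/fake_jobstats" "id.205.mkdir.24449" $((left + 5)) would return as soon as the ID disappears and would report failure only after (left + 5) seconds, so the test does not wait longer than needed.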

Comment by Gerrit Updater [ 07/Oct/15 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/16753
Subject: LU-6920 test: try to reproduce failure on sanity test_205
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 376ea32167938839f209772e12b68a7acd96cc29

Comment by Gerrit Updater [ 28/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16753/
Subject: LU-6920 test: add some slack to jobstats expiry in test_205
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 13e34c1d0e5472759d1350b62fa0663bbcd59fa0

Comment by Joseph Gmitter (Inactive) [ 28/Oct/15 ]

Landed for 2.8

Comment by Ashish Purkar (Inactive) [ 24/Aug/16 ]
  • This should be reopened, as sanity test_205 is still failing intermittently on master.
    > It may be that increasing the timeout for jobid expiry by a second or two would be enough. The most efficient way might be to use wait_update() with a maximum of (left + 5) seconds so that it doesn't wait longer than needed.
  • I'll push a patch with the changes suggested by Andreas.
Comment by Peter Jones [ 24/Aug/16 ]

Ashish

Given that the fix that did land was in the already released 2.8 release, it would probably be clearer to open a new ticket linked to this one.

Peter

Comment by Ashish Purkar (Inactive) [ 24/Aug/16 ]

> Given that the fix that did land was in the already released 2.8 release it would probably be clearer to open a new ticket linked to this one
Ok, I'll create a new ticket.
