[LU-7418] sanity test_29 failed with 'CURRENT: 7 > 6'
| Created: | 11/Nov/15 | Updated: | 14/Jul/17 | Resolved: | 24/Jul/16 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | autotest |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test 29 failed in review-zfs-part-1 with 'CURRENT: 7 > 6'. Logs are at https://testing.hpdd.intel.com/test_sets/cb93e408-87c7-11e5-b19e-5254006e85c2

The test_log contains:

== sanity test 29: IT_GETATTR regression ============================== 11:01:22 (1447153282)
first d29
total 1
-rw-r--r-- 1 root root 0 Nov 10 11:01 foo
second d29
total 1
-rw-r--r-- 1 root root 0 Nov 10 11:01 foo
done
 sanity test_29: @@@@@@ FAIL: CURRENT: 7 > 6

There is nothing interesting in the dmesg and console logs on any of the nodes. This is the only failure of this test with this error message this year. |
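For context, test_29 lists the same directory twice and checks that the second listing does not increase the number of cached MDC DLM locks; the "CURRENT: 7 > 6" message means the sampled lock count grew between the two listings. A minimal sketch of that style of check, assuming the ldlm.namespaces.*mdc*.lock_count parameter and the test framework's error() helper (not the verbatim sanity.sh code):

    # Hypothetical sketch of the lock-count check; parameter path and the error()
    # helper come from the Lustre test framework, the rest is illustrative only.
    mdc_lock_count() {
        lctl get_param -n "ldlm.namespaces.*mdc*.lock_count" 2>/dev/null |
            awk '{sum += $1} END {print sum + 0}'
    }
    mkdir -p $DIR/d29 && touch $DIR/d29/foo
    ls -l $DIR/d29                  # "first d29": populates the client lock cache
    before=$(mdc_lock_count)
    ls -l $DIR/d29                  # "second d29": should be served from cached locks
    after=$(mdc_lock_count)
    [ $after -gt $before ] && error "CURRENT: $after > $before"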
| Comments |
| Comment by Sarah Liu [ 18/Feb/16 ] |
|
Also hit this after a rolling upgrade of the system from 2.5.5 RHEL6.6 ldiskfs to b2_8/#2 RHEL6.7. |
| Comment by nasf (Inactive) [ 06/Jun/16 ] |
|
Another failure instance on master: |
| Comment by Andreas Dilger [ 06/Jun/16 ] |
|
This is currently the #3 cause of review test failures. |
| Comment by Minh Diep [ 08/Jun/16 ] |
|
hit today https://testing.hpdd.intel.com/test_sessions/ff196568-2d95-11e6-bbf5-5254006e85c2 |
| Comment by Emoly Liu [ 12/Jun/16 ] |
|
Hit on master: |
| Comment by Jian Yu [ 12/Jun/16 ] |
|
More on master branch: https://testing.hpdd.intel.com/test_sets/7931ae9c-2f6b-11e6-a0ce-5254006e85c2 |
| Comment by Oleg Drokin [ 13/Jun/16 ] |
|
Looking at the logs in https://testing.hpdd.intel.com/test_sets/b2a207c4-30e4-11e6-acf3-5254006e85c2 as a sample:

00000100:00100000:0.0:1465735321.625309:0:28110:0:(client.c:1589:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc kworker/u4:1:097ecaa0-b8ab-67bd-1d7b-908b373c173b:28110:1536932700233984:10.2.4.153@tcp:101

i.e. a previous test created a file and left the pages in the cache, apparently. Then the kernel decided to write out the pages and called ll_writepages:

00000100:00000040:1.0:1465735321.484485:0:30811:0:(client.c:2477:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff88004220db00 x1536932700233968/t30064876329(30064876329) o35->lustre-MDT0001-mdc-ffff88007af6e800@10.2.4.149@tcp:23/10 lens 512/424 e 0 to 0 dl 1465735328 ref 1 fl Complete:R/4/0 rc 0/0

But the second ls had not started yet:

00000001:02000400:1.0:1465735321.769332:0:30927:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: onyx-36vm1.onyx.hpdd.intel.com: /usr/sbin/lctl mark onyx-36vm1.onyx.hpdd.intel.com: second d29

Apparently we sampled the lock count between the ls finishing and the writeout. I imagine that to fix this we need to completely clean the cache at the start of this test (and any other tests that sample lock counts like this). |
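Oleg's suggested mitigation, completely cleaning the client cache at the start of the test, could look roughly like the following. This is only a sketch assuming the test framework's cancel_lru_locks helper; the actual patch is the one referenced in the comments below.

    # Hypothetical start-of-test cleanup so leftover pages and locks from earlier
    # tests cannot be written back in the middle of the lock-count sampling.
    sync                       # flush dirty pages left behind by earlier tests
    cancel_lru_locks mdc       # start test_29 with an empty MDC lock LRU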
| Comment by Gerrit Updater [ 13/Jun/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20741 |
| Comment by Gerrit Updater [ 13/Jun/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20741/ |
| Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ] |
|
The patch has landed to master for 2.9.0. |
| Comment by nasf (Inactive) [ 21/Jun/16 ] |
|
Hit it again on master: |
| Comment by Emoly Liu [ 21/Jun/16 ] |
|
Hit again on master: |
| Comment by Oleg Drokin [ 21/Jun/16 ] |
|
The latest failures have the same cause, a background write (from https://testing.hpdd.intel.com/test_sets/bd38a7d4-36e7-11e6-80b9-5254006e85c2):

00000001:02000400:0.0:1466403207.184612:0:17958:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: onyx-36vm1.onyx.hpdd.intel.com: second d29
00000001:00000001:0.0:1466403207.187480:0:17958:0:(debug.c:336:libcfs_debug_mark_buffer()) ***************************************************
...
00000080:00000001:1.0:1466403207.223716:0:28269:0:(rw.c:991:ll_writepages()) Process entered
...
00000080:00010000:1.0:1466403207.223772:0:28269:0:(file.c:4127:ll_layout_refresh_locked()) ### lustre: requeue layout lock for file [0x200000bd0:0x6873:0x0](ffff88004c474098)
...
00000100:00100000:1.0:1466403207.223890:0:28269:0:(client.c:1620:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc kworker/u4:1:00a2d979-bc2c-75ed-91f3-0366d99422fc:28269:1537633944611264:10.2.4.153@tcp:101 |
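If someone wants to check a future failure for the same pattern, one way to pull the relevant events out of a client's debug log is simply to dump and grep it (a sketch; the autotest logs quoted above were collected by the test harness, not necessarily this way):

    # Dump the kernel debug buffer and look for the writeout / lock-enqueue sequence.
    lctl dk /tmp/lustre-debug.txt
    grep -E 'll_writepages|ll_layout_refresh|ptlrpc_send_new_req' /tmp/lustre-debug.txt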
| Comment by Andreas Dilger [ 21/Jun/16 ] |
|
I guess that the cancel_lru_locks call will not cancel locks that are actively being held. It might be worthwhile to add sync; sleep 1; sync before this call to ensure that the dirty pages are flushed before the locks are cancelled. |
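In the test setup that suggestion would look roughly like this (a sketch of the idea, not the verbatim follow-up patch that went through Gerrit in the next comments):

    # Andreas' point: cancel_lru_locks skips locks that are still in use, e.g. by an
    # in-flight writeout, so flush writeback first and only then cancel the locks.
    sync; sleep 1; sync        # let background writeback finish on the client
    cancel_lru_locks mdc       # the now-idle locks can actually be cancelled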
| Comment by Gerrit Updater [ 22/Jun/16 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20929 |
| Comment by Gerrit Updater [ 24/Jun/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20929/ |
| Comment by Bob Glossman (Inactive) [ 24/Jun/16 ] |
|
another on master: |
| Comment by Peter Jones [ 24/Jul/16 ] |
|
The fix seems to have landed in 2.9. The most recent report probably did not include the fix yet. |