[LU-7418] sanity test_29 failed with 'CURRENT: 7 > 6' Created: 11/Nov/15  Updated: 14/Jul/17  Resolved: 24/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Major
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None
Environment:

autotest


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 29 failed review-zfs-part-1 with

 'CURRENT: 7 > 6' 

With logs at https://testing.hpdd.intel.com/test_sets/cb93e408-87c7-11e5-b19e-5254006e85c2

The test_log contains:

== sanity test 29: IT_GETATTR regression  ============================== 11:01:22 (1447153282)
first d29
total 1
-rw-r--r-- 1 root root 0 Nov 10 11:01 foo
second d29
total 1
-rw-r--r-- 1 root root 0 Nov 10 11:01 foo
done
 sanity test_29: @@@@@@ FAIL: CURRENT: 7 > 6 

There’s nothing interesting in the dmesg and console logs on all nodes.

This is the only failure of this test with this error message this year.



 Comments   
Comment by Sarah Liu [ 18/Feb/16 ]

also hit this after rolling upgrade system from 2.5.5 RHEL6.6 ldiskfs to b2_8/#2 RHEL6.7

Comment by nasf (Inactive) [ 06/Jun/16 ]

Another failure instance on master:
https://testing.hpdd.intel.com/test_sets/6dbb2b0e-2b21-11e6-80b9-5254006e85c2

Comment by Andreas Dilger [ 06/Jun/16 ]

This is currently the #3 cause of review test failures.

Comment by Minh Diep [ 08/Jun/16 ]

hit today https://testing.hpdd.intel.com/test_sessions/ff196568-2d95-11e6-bbf5-5254006e85c2

Comment by Emoly Liu [ 12/Jun/16 ]

Hit on master:
https://testing.hpdd.intel.com/test_sets/4581ce42-2eb7-11e6-bbf5-5254006e85c2

Comment by Jian Yu [ 12/Jun/16 ]

More on master branch: https://testing.hpdd.intel.com/test_sets/7931ae9c-2f6b-11e6-a0ce-5254006e85c2

Comment by Oleg Drokin [ 13/Jun/16 ]

Looking at the logs in https://testing.hpdd.intel.com/test_sets/b2a207c4-30e4-11e6-acf3-5254006e85c2 as a sample:

00000100:00100000:0.0:1465735321.625309:0:28110:0:(client.c:1589:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc kworker/u4:1:097ecaa0-b8ab-67bd-1d7b-908b373c173b:28110:1536932700233984:10.2.4.153@tcp:101

i.e. a previous test created a file and left the pages in the cache apparently. Then kernel decided to write out the pages and called ll_writepages
and that fetched a layout lock.
I clearly see from the log that this unfortunate request came after "first" ls, because I see final o35 from that ls earlier:

00000100:00000040:1.0:1465735321.484485:0:30811:0:(client.c:2477:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff88004220db00 x1536932700233968/t30064876329(30064876329) o35->lustre-MDT0001-mdc-ffff88007af6e800@10.2.4.149@tcp:23/10 lens 512/424 e 0 to 0 dl 1465735328 ref 1 fl Complete:R/4/0 rc 0/0

But second ls did not start yet:

00000001:02000400:1.0:1465735321.769332:0:30927:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: onyx-36vm1.onyx.hpdd.intel.com: /usr/sbin/lctl mark onyx-36vm1.onyx.hpdd.intel.com: second d29

apparently we sampled the lock count in between the ls finish and writeout.

I imagine to fix this we need to totally clean the cache at the start of this test (and any other tests that sample lock counts like this).

Comment by Gerrit Updater [ 13/Jun/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20741
Subject: LU-7418 tests: cancel all locks in sanity test_29
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eeb676c9c6922e73a4e4061edeaed3e971bc880a

Comment by Gerrit Updater [ 13/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20741/
Subject: LU-7418 tests: cancel all locks in sanity test_29
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e14c5ba62b4bc317960417e88586ba335719f876

Comment by Joseph Gmitter (Inactive) [ 15/Jun/16 ]

patch has landed to master for 2.9.0

Comment by nasf (Inactive) [ 21/Jun/16 ]

Hit is again on master:
https://testing.hpdd.intel.com/test_sets/87dd677a-36d1-11e6-acf3-5254006e85c2

Comment by Emoly Liu [ 21/Jun/16 ]

Hit again on master:
https://testing.hpdd.intel.com/test_sets/bd38a7d4-36e7-11e6-80b9-5254006e85c2

Comment by Oleg Drokin [ 21/Jun/16 ]

the latest failures are the same wrt the cause - background write (from https://testing.hpdd.intel.com/test_sets/bd38a7d4-36e7-11e6-80b9-5254006e85c2) :

00000001:02000400:0.0:1466403207.184612:0:17958:0:(debug.c:335:libcfs_debug_mark_buffer()) DEBUG MARKER: onyx-36vm1.onyx.hpdd.intel.com: second d29
00000001:00000001:0.0:1466403207.187480:0:17958:0:(debug.c:336:libcfs_debug_mark_buffer()) ***************************************************
...
00000080:00000001:1.0:1466403207.223716:0:28269:0:(rw.c:991:ll_writepages()) Process entered
...
00000080:00010000:1.0:1466403207.223772:0:28269:0:(file.c:4127:ll_layout_refresh_locked()) ### lustre: requeue layout lock for file [0x200000bd0:0x6873:0x0](ffff88004c474098)
...
00000100:00100000:1.0:1466403207.223890:0:28269:0:(client.c:1620:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc kworker/u4:1:00a2d979-bc2c-75ed-91f3-0366d99422fc:28269:1537633944611264:10.2.4.153@tcp:101
Comment by Andreas Dilger [ 21/Jun/16 ]

I guess that the cancel_lru_locks call will not cancel locks that are actively being held. It might be worthwhile to add sync; sleep 1; sync before this call to ensure that the dirty pages are flushed before the locks are cancelled.

Comment by Gerrit Updater [ 22/Jun/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/20929
Subject: LU-7418 tests: avoid background writeout in sanity 29
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1136d753db5a5306236bb2f33a042a6780f9db7c

Comment by Gerrit Updater [ 24/Jun/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20929/
Subject: LU-7418 tests: avoid background writeout in sanity 29
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a84250b902515e7f5c9c4cd44885ca98b738b392

Comment by Bob Glossman (Inactive) [ 24/Jun/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/2925361c-3a38-11e6-bbf5-5254006e85c2

Comment by Peter Jones [ 24/Jul/16 ]

Fix seems to have landed in 2.9. Most recent report probably did not include the fix yet

Generated at Sat Feb 10 02:08:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.